TCAM-based forwarding engines

Overview

Ternary Content Addressable Memory, or TCAM, is a critical component of a modern router: a powerful, fast hardware lookup engine for IP prefixes.  It is also complex, expensive and power hungry.  Not surprisingly, there never seems to be enough of it on whatever system you use.

TCAM has historically been used to perform hardware table lookups for access-list, NetFlow or QoS tables in routers and switches.   Most of these implementations from your favorite brand-name vendors use one or more external TCAM chips to perform these lookups.

Modern switch-on-chip ASICs such as the Broadcom Trident or Trident+ also use TCAM, but normally the TCAM is on-chip.   Having TCAM internal to the ASIC burns a lot of transistors, so TCAM size is limited.  This keeps costs low for the ASIC manufacturer, but it also introduces engineering constraints for the router which can catch you out if you don’t pay attention.

Commodity ASICs also use TCAM within the forwarding engine for Longest Prefix Match routing lookups.  The rest of this post will describe how TCAM works and the operation of TCAM in the forwarding engine.

A quick look at CAM

Before we dive into TCAM, let’s take a look at how CAM works.  Content Addressable Memory (CAM) allows you to input a search ‘word’ (a binary IP address in our example) and search the entire memory in a single operation, returning one or more matches.  By contrast, standard RAM takes a memory address as an input and returns the contents of the cells at that address.

CAM is a special type of RAM which combines both storage cells and comparison logic gates.  This makes CAM much more complex and expensive to produce than standard RAM.  CAM performs ‘exact’ matches of input words and excels at performing lookups for one-to-one mappings, like IP-to-MAC and MAC-to-interface lookup tables.

However, CAM isn’t well suited to prefix lookups, as it can only perform exact matches.  This causes headaches when you want to look up an IP address against a prefix.  Your OS could expand all elements of a subnet into individual IP addresses before programming your CAM, but entering all of these exact matches would quickly exhaust your limited CAM capacity and negate many of the benefits of resource conservation.
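To see why expanding a prefix into exact matches doesn’t scale, consider a quick back-of-the-envelope sketch in Python (the function name is mine, just for illustration):

```python
# A CAM can only do exact matches, so covering one prefix means
# expanding every host address under it into its own CAM entry.
def exact_match_entries(prefix_len: int, addr_bits: int = 32) -> int:
    """Number of exact-match CAM entries needed to cover one prefix."""
    return 2 ** (addr_bits - prefix_len)

# A single IPv4 /16 would consume 65,536 CAM slots on its own.
print(exact_match_entries(16))   # 65536
print(exact_match_entries(8))    # 16777216 -- a /8 is hopeless
```

With a full routing table of hundreds of thousands of prefixes, this expansion is clearly a non-starter.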

TCAM overview

Ternary Content Addressable Memory (TCAM) is a more flexible type of CAM and allows you to store a full prefix/mask.

The ‘Ternary’ in TCAM refers to the three states which can exist in a TCAM cell.  Regular (binary) CAM stores two states, ‘1’ and ‘0’.  TCAM can also store a ‘don’t-care’ bit, and requires twice the number of logic gates to handle this more complex state and three-way decision.
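A common way to model a ternary word in software is as a (value, mask) pair, where a mask bit of 1 means “care” and 0 means “don’t care” (the X state). This is a minimal sketch of that idea, not how any particular ASIC stores it:

```python
# Model a ternary entry as a (value, mask) pair:
# mask bit 1 = compare this bit, mask bit 0 = don't-care (X).
def tcam_match(key: int, value: int, mask: int) -> bool:
    """True if `key` matches the stored ternary word."""
    return (key & mask) == (value & mask)

# Entry 101X: value=0b1010, mask=0b1110 (the last bit is don't-care)
print(tcam_match(0b1011, 0b1010, 0b1110))  # True: the X absorbs bit 0
print(tcam_match(0b1111, 0b1010, 0b1110))  # False: bit 2 differs
```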

TCAM Operation

Let’s work through a simplified example.   The example below is for a 4-entry TCAM, with each entry storing a 4-bit word. We’ll use 4-bit pseudo-IP addresses in this example.  The subnet bits are programmed into TCAM and the host bits are programmed as ‘X’, as we ‘don’t care’ about matching them.  The last entry is all Xs to represent a default route.   We’ll look up pseudo-IP 1011 for routing in this example.

Pfx/mask    TCAM format
101/3       101X
111/3       111X
10/2        10XX
0/0         XXXX

The diagram below shows 4 cells programmed with the 4-bit words:

The match lines are all pre-charged with a ‘high’ voltage before the lookup begins. The search line drivers encode the search word onto the search lines as ‘1’s and ‘0’s. It is this need to charge every match line, and waste most of that energy on non-matches, that makes TCAM so power hungry.    Each cell compares its stored value with the corresponding bit of the search word.

A single-bit miss will pull down the entire match line and the result is 0 volts out (no match).  A ‘don’t-care’ cell will always match, i.e. it will never pull down its match line. A sense amp on the right signals a ‘match’ when it senses a high voltage.

So the result of our lookup is multiple matches.  When the encoder sees multiple matches, it picks the first match line (line 00 in this case) and outputs an encoded version of that line number.  The encoder’s output can be used to address another forwarding table and complete the forwarding task.  So the final outcome is that the TCAM/encoder duo ‘picked’ the longest prefix from the three matches it found.  Cool!
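The whole lookup can be sketched in a few lines of Python. Each line of the example table is stored as a (value, mask) pair, and a first-match scan stands in for the priority encoder; in hardware all lines compare in parallel, the loop here is just for illustration:

```python
# The 4-entry example table: mask bit 0 marks a don't-care (X) position.
TABLE = [
    (0b1010, 0b1110),  # line 0: 101X  (101/3)
    (0b1110, 0b1110),  # line 1: 111X  (111/3)
    (0b1000, 0b1100),  # line 2: 10XX  (10/2)
    (0b0000, 0b0000),  # line 3: XXXX  (default)
]

def tcam_lookup(key):
    """Return (first matching line, all matching lines).
    Hardware matches every line in parallel; the priority
    encoder then reports the lowest-numbered match."""
    matches = [i for i, (v, m) in enumerate(TABLE) if key & m == v & m]
    return (matches[0] if matches else None), matches

line, all_matches = tcam_lookup(0b1011)
print(line, all_matches)  # 0 [0, 2, 3]
```

Looking up 1011 matches lines 0, 2 and 3, and the encoder reports line 0, the 101/3 entry.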

But….. the encoder didn’t know that 101/3 was the longest, only that it was the match with the lowest match line number; the first match.  If you review the original prefixes you’ll see that the deck was stacked.  That is, the prefixes were pre-sorted, with the longest prefixes programmed into the TCAM first and the default programmed in last.

Takeaway

CAM still has a role in network devices for fast exact-match lookups, but isn’t well suited to prefix lookups.   TCAM excels at prefix lookups, and the encoder plays an important role in choosing a single match.

The crucial point, however, is that TCAM and the encoder can only perform a ‘first match’ against a given prefix. But… if you pre-sort the forwarding table by prefix length before TCAM programming, the resultant lookup becomes a Longest Prefix Match (LPM).
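That pre-sort step can be sketched too. This minimal Python model (function names are mine) programs the example prefixes longest-first, so the encoder’s first match is automatically the longest prefix:

```python
# Prefixes as (bit-string, length), deliberately given here unsorted.
prefixes = [("0", 0), ("10", 2), ("111", 3), ("101", 3)]

def program_tcam(prefixes, width=4):
    """Sort by descending prefix length, then build (value, mask) entries."""
    entries = []
    for bits, length in sorted(prefixes, key=lambda p: -p[1]):
        value = int(bits, 2) << (width - length) if length else 0
        mask = ((1 << length) - 1) << (width - length)
        entries.append((value, mask))
    return entries

def lookup(entries, key):
    """First match wins, like the priority encoder."""
    for line, (v, m) in enumerate(entries):
        if key & m == v & m:
            return line
    return None

tcam = program_tcam(prefixes)
print(lookup(tcam, 0b1011))  # hits the 101/3 entry, not 10/2 or the default
```

Because the /3 entries were programmed before the /2 and the default, the first match is also the longest match.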

Open Question

I want to learn more about how the sorting requirement impacts router performance.  Does the router OS pack all the sorted prefixes tightly in the table with no slots free? What would happen if there was a flapping /30 in your network?  Would that mean that all prefixes from /30 and shorter have to be constantly re-sorted and re-inserted into the TCAM? That seems a little inefficient to me, but carving the TCAM into fixed-size banks seems inflexible also.

Does any reader know how this forwarding-table TCAM sorting/programming problem is sorted in the real world?

Further reading

I borrowed heavily from http://www.pagiamtzis.com/cam/camintro/ when preparing this post.  It goes into a lot more detail on TCAM, and is worth bookmarking.


15 thoughts on “TCAM-based forwarding engines”

  1. Hi John,

    Do you have complete knowledge of the CAM/TCAM architecture?
    I mean, internally what happens when we do a read/write/deletion of a MAC address?
    And also, what sorting algorithms are mostly used?
    And what if duplicate entries are present?
    What is a MAC move?

    Thanks in advance.

    with regards,
    Dhanasekar R

    • Sorry Dhanasekar,
      I don’t have those details on the TCAM architecture. I see you’re coming from Cisco address space, so hopefully someone internally can get you the answers you seek.

      Regards,
      John Harrington.

    • You don’t sort a TCAM table; the entire table is matched in a single clock cycle. Sorting isn’t relevant for TCAM operations.

      MAC move relates to BCAM updates. Most CAMs have limited update performance.

  2. Hi John,

    Really interesting! In the real world, the SDK manages the prefix sorting and inserting. I haven’t fully understood its algorithm, but only a small number of entries need to be moved around to perform an L3 LPM entry insert/update/delete. All of this is performed by the CPU.

    Regards,

    -Guohan

    • Hey Guohan,
      Thanks for the comment. I’m interested to see if the FIB programming algorithm packs the prefixes loosely or tightly. If you packed the prefixes loosely, you would be able to re-shuffle only a few prefixes after a new prefix was added. If packed tightly, every prefix with a longer mask than the new prefix would need to be shifted down in order to make room. Maybe there’s a ‘shift’ method exposed in the SDK to allow the CPU to shift existing entries, make a free slot and insert the new prefix directly into that slot. Would love to hear what you learn about the sort/insert algorithm.

      /John.

  3. Awesome post! I’m writing a blog article on the theoretical performance possible with software based routing and switching as compared to hardware (a concern that dawned on me during an NSX training class and VMWare refused to acknowledge).

    This article is a big help in my research on it. Thanks!!

    • Hi John,
      If you make any progress I’d love to see your workings. I started looking at the non-TCAM switching latencies (e.g. RLDRAM etc) but got bogged down in memory speeds and number of clock cycles per prefix lookup (tries, etc) and abandoned it. Thanks for the feedback.

  4. Hi! Thank you! I really found this post helpful.
    One more thing that I found is that the address is broadcast into the CAM/TCAM, so you don’t have to go across the entire table. Rather, multiple outputs about matching or not matching a particular entry are sent to something called a priority encoder. That priority encoder gives you an actual output after choosing the best match, according to its best-match logic. One good explanation here: _Link_Removed_
    Sorry for any grammar mistakes in comment. English is not my native language.

    • Hi Gmu,
      Thanks for the comment, the priority encoder is the angled box to the right of the diagram above; I just labelled it the ‘encoder’. Priority is implicitly provided by the encoder. Have another read through this section above which starts as:
      “But….. the encoder didn’t know that 101/3 was the longest, only that it was the match with the lowest match line number…. ”

      Regards,
      John Harrington.

  5. Can you tell me what a table with 16 entries for IPv4 addresses should look like? What is the minimum number of signals for this table?
