TCAM-based forwarding engines
Ternary Content Addressable Memory, or TCAM, is a critical component of a modern router. It is a powerful and fast hardware lookup engine for IP Prefixes. It is also complex, expensive and power hungry. Not surprisingly, there never seems to be never enough on whatever system you use.
TCAM has historically been used to perform hardware-table lookups of Access-list, Netflow or QoS tables in routers and switches. Most of these implementations from your favorite brand-name vendors use one or more external TCAM chips to perform these lookups.
Modern switch-on-chip ASICS such as the Broadcom Trident or Trident+ also use TCAM but normally the TCAM space is on-chip. Having TCAM internal to the ASIC makes lookup exceedingly fast, but burns a lot of transistors.
There is an overall transistor limit per ASIC, so TCAM size is limited. This introduces engineering constraints for the router which can catch you out if you don’t pay attention.
Commodity ASICs also use TCAM within the forwarding engine for Longest Prefix Match routing lookups. This rest of this post will describe how TCAM works and the operation of TCAM in the forwarding engine.
A quick look at CAM
Before we dive into TCAM, let’s take a look at how CAM works. Content Addressable Memory (CAM) allows you to input a search ‘word’ (or binary IP address in our example) and search the entire memory in a single operation returning one (or more) matches.
In contrast, standard RAM takes a memory location as an input and returns the contents of the memory cells at those addresses.
CAM a special type of RAM which combines both storage cells and the comparison logic-gates. This makes CAM much more complex and expensive to produce than standard RAM. CAM performs ‘exact’ matches of input words and excels at performing lookups for one-to-one mappings, like IP-to-MAC and MAC-to-Interface lookup tables.
However, CAM isn’t well suited to prefix lookups, as it can only perform exact matches. This causes headaches when you want to lookup an ip address against a prefix. Your network OS could expand all elements of a subnet into individual IP addresses before programming your CAM.
However entering all these exact matches would quickly exhaust your limited CAM capacity, and would negate many of the benefits of resource conservation.
Ternary Content Addressable Memory (TCAM) is a more flexible type of CAM and allows you to store a full prefix/mask.
The ‘Ternary’ in TCAM refers to the third state which can exist in a TCAM cell. Regular (akay binary) CAM stores two states ‘1’ and ‘0’. However TCAMs can also store a ‘don’t-care’ bit and require twice the number of logic gates to handle this more complex state and the 3-way decision.
Let’s work through a simplified example. The example below is for a 4 entry TCAM, with each entry storing a 4-bit word. We’ll use 4-bit pseudo-IP addresses in this example. The subnet bits are programmed into TCAM and the host-bits get programmed as ‘X’ as we ‘don’t-care’ about matching them. The last entry is all X’s to represent a default route.
We’ll pretend we’re looking to lookup pseudo-IP 1011 for routing in this example.
The diagram below shows 4 cells programmed with the 4-bit words:
A single bit miss will pull down the entire match line and the results is 0 volts out (no match). If there is a don’t care entry in the cell it will always match, i.e. it will not pull down supply lines. The sense amps on the right signal a ‘match’ when it senses a high voltage.
So the result is multiple matches. If the encoder has multiple matches, it will pick the first match line (00) in this case, and will output an encoded version of that line number. The encoders output can be used to address (or index) another forwarding table and complete the forwarding task. So the final outcome was that the TCAM/Encoder duo ‘picked’ the longest prefix from the three matches it found. Cool!
But….. the encoder didn’t know that 101/3 was the longest. It could only tell that it was the match with the lowest match line number; the first match.
If you review the original prefixes you’ll see the the deck was stacked. That is – the prefixes were pre-sorted, with the longest prefixes programmed into the TCAM first and the default programmed in last.
CAM still has a role in network devices for fast exact-match lookups, but isn’t well suited to prefix-lookups. TCAM excels at prefix lookups and the encoder plays an important role in choosing a single match.
The crucial point however is that TCAM and the encoder can only perform a ‘first-match’ against a given prefix. But… if you pre-sort the forwarding table by prefix-length before TCAM programming, the resultant lookup becomes a Longest Prefix Match (LPM).
I want to learn more about how the sorting requirement impacts router performance. Does the Router OS pack all the sorted prefixes tightly in the table with no slots free?
What would happen if there was a flapping /30 in your network. Would that would mean that all prefixes from /30 and shorter have to be constantly sorted and re-inserted into the TCAM? That seems a little inefficient to me, but carving the TCAM into banks based on TCAM size seems inflexible also.
Does any reader know how this forwarding-table TCAM sorting/programming problem is sorted in the real world?
I borrowed heavily from: http://www.pagiamtzis.com/cam/camintro/ when preparing this post. It goes into a lot more detail on TCAM, and is worth bookmarking.
[amazon_link asins=’0471360902,1482207117,0471330361,1441963081′ template=’ProductCarousel’ store=’seamlnetwo-20′ marketplace=’US’ link_id=’d8240af3-f3f4-11e7-a1c4-b1421334aa32′]
17 thoughts on “TCAM-based forwarding engines”
Have you got the complete knowledge of CAM/TCAM architechure?
I mean internally what happens when we say a read/ write/ Deletion of a MAC address is happening in it?
And also wt are sorting algorithms mostly used?
And what if duplicate entries are present?
What is MAC MOVE?
Thanks in advance.
I don’t have those details on the TCAM architecture. I see you’re coming from Cisco address space, so hopefully someone internally can get you the answers you seek.
You don’t sort a TCAM table, the entire table is matched in a single clock cycle. Sorting isn’t relevant for TCAM operations.
MAC Move relates to BCAM updates. Most CAM have limited updated performance.
According to the blog post, having the prefixes sorted by mask length is critical to achieving a shortest prefix match (since the encoder will pick the “first” match).
Correct, the entries are sorted ‘before’ being programmed into the TCAM.
In the reply I think there is an error it should be “Longest prefix match” instead it is “Shortest prefix match”.
Really interesting! In the real world, the SDK manages the prefix sorting and inserting. I haven’t fully understood its algorithm. But small number of entries need to be moved around for performing L3 LPM entry insert/update/delete. All of them are performed by CPU.
Thanks for the comment. I’m interested to see if the fib programming algorithm packs the prefixes loosely or tightly. If you packed the prefixes loosely, you would be able to re-shuffle only a few prefixes after a new prefix was added. If packed tightly every prefix with a longer mask than the new prefix would need to be shifted down in order to make room. Maybe there’s a ‘shift’ method exposed in the SDK to allow the CPU to shift existing entries, make a free slot and insert new prefix directly into that slot. Would love to hear what you learn about the sort/insert algorithm.
Excellent post, John!
Hey thanks Jag!
Awesome post! I’m writing a blog article on the theoretical performance possible with software based routing and switching as compared to hardware (a concern that dawned on me during an NSX training class and VMWare refused to acknowledge).
This article is a big help in my research on it. Thanks!!
If you make any progress I’d love to see your workings. I started looking at the non-TCAM switching latencies (e.g. RLDRAM etc) but got bogged down in memory speeds and number of clock cycles per prefix lookup (tries, etc) and abandoned it. Thanks for the feedback.
Hi! Thank you! I really found this post helpful.
One more thing that i found is that address is broadcasted into CAM/TCAM, so you don’t have to go across entire table. Rather multiple outputs about matching or not matching particular entry are sent to something called priority encoder. That priority encoder gives you an actual output after choosing the best match, according to its best match logic. One good explanation here: _Link_Removed_
Sorry for any grammar mistakes in comment. English is not my native language.
Thanks for the comment, the priority encoder is the angled box to the right of the diagram above; I just labelled it the ‘encoder’. Priority is implicitly provided by the encoder. Have another read through this section above which starts as:
“But….. the encoder didn’t know that 101/3 was the longest, only that it was the the match with the lowest match line number…. ”
Can you tell me what should table with 16 entries for IPv4 addresses look like? What is the minimum signals for this table?
There are areas like this one which I never got.
I loved your explanation under TCAM operation and also the way you started at the very beginning Best regards,
Very nice intro abt TCAM.