
Performance Limitations of the Tofino Software Model in Mininet #64

Open
lorepap opened this issue Jan 31, 2025 · 5 comments

Comments

@lorepap

lorepap commented Jan 31, 2025

Hi,

I'm conducting a stress test using Mininet and the Tofino software model on a high-performance server with 32 CPU cores and 256GB RAM, running Ubuntu 20.04.

For my experiment, I set up two hosts and established an iperf session over links configured at 100Mbps. However, the Tofino software model seems to struggle beyond ~40Mbps of traffic.
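
In case it helps, the topology is roughly equivalent to the Mininet sketch below (names are illustrative; the Tofino software model itself is not wired in here -- the default Mininet switch is only a stand-in, just to show the link shaping and the iperf run):

```python
#!/usr/bin/env python3
# Rough sketch of the setup described above: two hosts, links shaped to
# 100 Mbps with TCLink, and an iperf run between the hosts. The Tofino
# software model is NOT attached here; the default Mininet switch is only
# a stand-in. Requires root and a working Mininet installation.
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.link import TCLink

class TwoHostTopo(Topo):
    def build(self):
        s1 = self.addSwitch('s1')
        h1 = self.addHost('h1')
        h2 = self.addHost('h2')
        self.addLink(h1, s1, bw=100)   # bw is in Mbit/s when TCLink is used
        self.addLink(h2, s1, bw=100)

if __name__ == '__main__':
    net = Mininet(topo=TwoHostTopo(), link=TCLink)
    net.start()
    h1, h2 = net.get('h1', 'h2')
    print(net.iperf((h1, h2)))         # TCP iperf between the two hosts
    net.stop()
```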

I'm aware that the software model is meant for testing/debugging purposes, but I was wondering if there is any documented information on the throughput limitations. I haven't been able to find clear benchmarks online regarding its expected performance.

Any insights would be greatly appreciated!

@jafingerhut
Contributor

I am not aware of any documented information on throughput limitations of the Tofino model.

I would guess that on today's highest-performance CPU cores, several thousand packets per second is pretty close to its maximum limit. As you say, it is intended for testing and debugging. If you are thinking of using it in a production network for processing packets, you would be spending far more CPU cycles per packet than other techniques on a general-purpose CPU would require, e.g. writing a C or Golang program to process the packets.

@vgurevich
Contributor

Dear @lorepap ,

First of all, I am somewhat surprised by the numbers you report. Even if you use jumbo frames (9216 bytes), 40Mbps of traffic would be equivalent to about 550pps (packets per second), which is quite a bit more than I would expect. And I assume you didn't disable logging, right?

Here are some facts about the model performance (and performance measurements in general) that might help you to make the right decisions.

The Tofino model is designed for very accurate simulation of the actual Tofino hardware, not for performance. In fact, it is quite typical that even incorrectly compiled code will produce the same results on both the model and the ASIC. That's how accurate it is.

Tofino model performance depends on:

  1. The CPU performance
  2. The amount of logging
  3. The actual P4 program
  4. The specific set of headers in a given packet.

Tofino model performance does not depend on:

  1. The packet's payload length
    a. P4 programs do not process the payload
  2. The number of CPUs in the system, as long as the model can have one CPU to itself
    a. The processing within a single pipe is definitely single-threaded by design
    b. There was an effort to allow each processing pipe to use a separate thread, but I do not remember if that was put in or not.
    c. Even if that code was put in, you will not see any positive effect unless you simultaneously inject packets into ports that belong to separate pipes.
  3. The amount of memory in the system, as long as there is no swapping

Given that the size of the payload does not affect P4 program performance on Tofino (as long as the average packet size is within the spec), it is customary (and more practical) to measure the performance in packets-per-second rather than in bits-per-second, since the former number does not depend on the packet length.
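
As a quick illustration of why pps is the more meaningful unit, here is a small conversion sketch (the 40Mbps and 9216-byte figures are the ones from above; the other frame sizes are just examples):

```python
# The model's cost is per packet, so the same pps budget corresponds to
# very different bit rates depending on the frame size.
def bps_to_pps(bits_per_second, frame_bytes):
    """Packet rate needed to sustain a given bit rate at a given frame size."""
    return bits_per_second / (frame_bytes * 8)

def pps_to_bps(packets_per_second, frame_bytes):
    """Bit rate produced by a given packet rate at a given frame size."""
    return packets_per_second * frame_bytes * 8

print(bps_to_pps(40e6, 9216))   # ~542 pps: 40 Mbps of jumbo frames
print(bps_to_pps(40e6, 1500))   # ~3333 pps: the same load at a 1500-byte MTU
print(pps_to_bps(100, 9216))    # ~7.4e6 bit/s (~7 Mbps): 100 pps of jumbo frames
```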

Note also that even though the model's performance is not very high (I typically quote 100pps, although I haven't personally measured it in a while), it can be used to simulate the processing of flows of any bandwidth. The "secret" is that it has a special feature called "manual time advancement" that acts like a very sophisticated time machine. This is something that is covered in the courses I teach.

@lorepap
Author

lorepap commented Jan 31, 2025

Thank you for your insights @vgurevich

Yes, I have disabled logging, so the reported numbers reflect the model's raw performance without logging overhead. For reference, I observed approximately 579 pps (~43Mbps with UDP packets).

Regarding your point on "manual time advancement," I am curious whether this feature could be leveraged to approximate the performance I might expect under high-rate traffic conditions. Could you provide some guidance on how to use it effectively? Specifically, can it be used to infer the expected throughput on actual Tofino hardware, or is it only useful for logical validation of pipeline behavior?

Looking forward to your insights.

@vgurevich
Contributor

@lorepap -- great to hear that you disabled logging. How big is your program? The number I quoted above (50-100pps) was measured on a pretty complex one (switch.p4_16).

Manual time advancement is, indeed, designed to accurately simulate the time-dependent aspects of the pipeline functionality.

As for Tofino performance in general, I can say the following:

  1. In the absence of congestion, it is entirely predictable.
    a. Each match-action pipeline processes one packet per clock, meaning that when you run the device at 1.22GHz, each pipeline processes 1.22E+09 packets per second (there are more details to that, but this is the gist; see the back-of-the-envelope sketch after this list)
    b. The packet processing latency in the match-action pipeline is determined by the program and output by the compiler
    c. Parser latency depends on both the program and the specific set of headers in a given packet and is also calculated and output by the compiler
    d. All other latencies are constant. They might not be explicitly published, but they are known and can be measured
  2. In the presence of congestion, all bets are off. There is no magical formula, and there would be too many input variables even if one existed.
    a. Having said that, Tofino TM algorithms are known, so it is possible to reason about the specific behavior(s). Unfortunately, Intel has not yet released anyone from their NDAs, so further discussion is better had on the ICRP Forum (assuming you have access to it).
    b. In any case, the model is not going to help you to predict anything since it does not model the traffic manager. This is something that needs to be measured on the real device.
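
Here is a back-of-the-envelope sketch of point 1a above. The 1.22GHz clock is the figure from the list; the frame size is only an illustrative assumption, and for larger frames the port bandwidth, not the pipeline, becomes the limiting factor.

```python
# Point 1a above: with no congestion, each match-action pipeline handles one
# packet per clock cycle, so its packet rate equals the clock rate.
clock_hz = 1.22e9            # clock rate quoted in the list above
pps_per_pipe = clock_hz      # one packet per clock per pipeline

# Illustrative only (not a published spec): what that rate would mean in
# bits per second for minimum-size frames.
frame_bytes = 64
print(pps_per_pipe)                            # 1.22e9 packets per second
print(pps_per_pipe * frame_bytes * 8 / 1e9)    # ~625 Gbit/s of 64-byte frames per pipe
```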

@vgurevich
Contributor

I just did a quick measurement of the model performance on my VM (AWS m5a.xlarge instance) using an absolutely minimal P4 program (attached), with logs disabled, etc., and I got about 400pps. /proc/cpuinfo shows 4398.93 BogoMIPS and 2.1-2.5GHz. So, this seems to match what you are seeing.

That minimal program minprog.p4.txt does not even use the match-action pipeline per se -- everything is done inside one parser state.

On a more complicated program that used all 12 stages of the ingress pipeline (albeit minimally), the performance dropped to about 170pps.

I would not be surprised to see the performance drop by another factor of two if the packet had to undergo both ingress and egress processing. If I used more complicated processing (e.g. involving more parser states, more logical tables, stateful externs, more hash calculations, etc.), it would surely go down even more.

So, I'd say that my initial number (50-100pps) is still a good rule of thumb for most practical purposes, whereas it looks like it is possible to quote a performance of "up to" 500 (or maybe even more) pps :)
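
To put those packet rates into the bit rates an iperf test in Mininet would show, here is a small sketch (the frame sizes are just examples):

```python
# Translate the model's rule-of-thumb packet rates into the bit rates an
# iperf run in Mininet could be expected to report, for example frame sizes.
def pps_to_mbps(pps, frame_bytes):
    return pps * frame_bytes * 8 / 1e6

for pps in (100, 500):                 # rule-of-thumb vs. the "up to" figure
    for frame_bytes in (1500, 9216):   # standard MTU vs. jumbo frames
        print(f"{pps} pps @ {frame_bytes}B ~ {pps_to_mbps(pps, frame_bytes):.1f} Mbit/s")
```

For what it's worth, 500pps at 9216-byte frames works out to roughly 37Mbps, which is in the same ballpark as the ~579pps / ~43Mbps reported above.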
