Hacker News
Reducing UDP Latency (medium.com/deryugin.denis)
64 points by abondarev on March 20, 2020 | hide | past | favorite | 16 comments


I'm glad he solved his problem, and I guess it's good to post "I solved this problem" information.

But this seems painfully obvious. TL;DR: "I had a demanding real-time requirement for a Linux process, I fixed it by using the Linux real-time scheduler for my real-time process."

Let's look at this in more detail. He's trying to implement time-critical functions with Linux and has a demanding real-time requirement ("Maximum acceptable latency is 0.1ms while basic Linux solution could only provide 0.5ms."). He's even struggling to get this latency with an RTOS (!). He's not getting the real-time latency he needs with the default Linux settings. He also can't get it by setting the "nice" value, but the "nice" value isn't relevant for real-time processes... which makes it clear that he's trying to get real-time performance without using the real-time schedulers.

His solution:

    chrt --rr 99 ./client

That asks Linux to use the SCHED_RR (round-robin scheduling) scheduler, one of its real-time policies, instead of the default (non-real-time) scheduler.

In short: If you're doing real-time work, you need to use a real-time scheduler for it.
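The same request can be made from inside the process via sched_setscheduler(2); the chrt invocation is just a command-line wrapper around that call. A minimal Python sketch (real-time policies need CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, so an unprivileged run simply leaves the policy unchanged):

```python
import os

# Programmatic equivalent of `chrt --rr 99 ./client`, run by the process itself.
before = os.sched_getscheduler(0)        # 0 = the calling process
try:
    os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(99))
except PermissionError:
    pass                                 # not privileged: policy stays as it was
after = os.sched_getscheduler(0)
print("policy before:", before, "after:", after)
```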

Technically this doesn't completely solve his problem; if he were serious about 0.1 msec, SCHED_RR isn't achieving it, as shown by the posted bar chart. But he seemed happy enough with it, so perhaps his requirements are really more probabilistic ("X% of the time, UDP must be delivered in 0.1 msec").

Again, I'm glad he solved his problem, but I hope most developers already know that they need to use real-time schedulers for real-time work.


I interpreted the Linux thing as just making the Linux host a better analysis tool, the main focus was the embedded system side.

I would probably use Wireshark to find the packet timing; who knows what happens inside an OS as complex as Linux. You can also use something like Wireshark to extract debug payloads such as ISR or DMA timing on the embedded system, although admittedly a script on the Linux side is easier.


I hope too that most developers already know that they need to use real-time schedulers for real-time work.

But I'm not sure you fully understood the problem and the solution.

chrt --rr 99 ./client

That's only on the test machine, to make sure the de0_nano_fpga part works correctly. Linux on the de0_nano_fpga, of course, runs with real-time features. The times (Min: 0.74ms) are for Linux with real-time schedulers.


This is interesting, but I would have thought that gigabit Ethernet would have higher latency than 100 megabit, because with spread-spectrum communication, latency tends to go up as bandwidth goes up. So Ethernet must use a more discrete encoding.

I started looking up info related to this, but I'd be curious to hear opinions first. I'm having trouble finding what the potential bandwidth for Ethernet would be if it went to spread-spectrum, and how much that might increase latency.


For 512-byte packets, transmission over 100Mbit takes ~50us by itself, so the OP's target latency is difficult to reach on 100Mbit even under ideal circumstances.
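The ~50us figure checks out from the line rate alone. A quick back-of-the-envelope calculation, counting the usual per-frame overhead (UDP + IPv4 headers, Ethernet header, FCS, preamble, and inter-frame gap):

```python
# Serialization delay for a 512-byte UDP payload on the wire.
PAYLOAD = 512
OVERHEAD = 8 + 20 + 14 + 4 + 8 + 12   # UDP + IPv4 + Eth hdr + FCS + preamble + IFG
bits = (PAYLOAD + OVERHEAD) * 8        # 4624 bits on the wire

us_100m = bits / 100e6 * 1e6           # ~46 us at 100 Mbit/s
us_1g   = bits / 1e9   * 1e6           # ~4.6 us at gigabit
print(f"100M: {us_100m:.1f} us, 1G: {us_1g:.1f} us")
```

So nearly half of the 0.1ms budget is eaten by serialization alone on 100Mbit, before the NIC, driver, or scheduler even enter the picture.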

Though I don't think most of the difference the OP found is caused by the link-speed difference; it's more likely the hardware or driver being better after the swap.


Got UDP send down to 800 nanoseconds for blocking calls with small payloads on Solarflare cards using ef_vi.


Sending in 0.8 µs is scary fast; I can't even get my clock synced to that with a (crappy) PPS. But that would be on a 10G board, I guess, which is probably not going to be available on an embedded system with resource constraints.


For the right combination of budget, application, and team, you can get a custom solution with much lower latency. Back in the day my team beat Solarflare hands down, but only for a single application. I don't remember the total number, but each additional 4 bytes of payload added 5-7ns, IIRC.


Yes, of course. If you have a big budget and a great team, you can optimize your application a lot. But in Embox the user applications stay the same, and customization was very easy!


I mean, if you really want to reduce latency the classic solution is to use DPDK: skip the kernel entirely and run in polling mode.

Not great if you care about power consumption, but your latency will be low.


I agree, but in Embox you can use ordinary applications. The kernel-bypass approach is rather difficult.


Wouldn't you want to take your process (and threads, if appropriate) out of scheduling entirely using CPU and thread/process pinning? Using the taskset command, for instance.
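Pinning can also be done from inside the process with sched_setaffinity(2), the call that taskset wraps. A minimal sketch (pinning to CPU 0 here is an arbitrary choice; pairing this with an `isolcpus=` kernel parameter would keep other tasks off that core entirely):

```python
import os

# Programmatic equivalent of `taskset -c 0 ./client`: restrict this process
# to CPU 0 only. Unlike real-time policies, this needs no special privileges.
os.sched_setaffinity(0, {0})            # 0 = the calling process
print("now allowed on CPUs:", os.sched_getaffinity(0))
```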


Off-topic: code formatting on Medium is terrible.


Yes, I agree. There is no syntax highlighting.


That's not that bad; what's worse is that code blocks don't have a horizontal scrollbar, so long lines simply wrap in a mobile browser.


Yes, sure!



