A Quick Look at AWS Scalable Reliable Datagram Protocol
One of the most exciting announcements from the last AWS re:Invent was the Elastic Network Adapter (ENA) Express functionality that uses the Scalable Reliable Datagram (SRD) protocol as the transport protocol for the overlay virtual networks. AWS claims ENA Express can push up to 25 Gbps over a single TCP flow and that SRD improves the tail latency (99.9th percentile) for high-throughput workloads by up to 85%.
Ignoring the “DPUs could change the network forever” blogosphere reactions (hint: they won’t), let’s see what could be happening behind the scenes and why SRD improves TCP throughput and tail latency.
AWS developed SRD as a transport protocol for Elastic Fabric Adapters (the networking part of their HPC implementation). There’s no magic behind SRD; it’s just another data point in the transport protocol solution space:
- Like UDP, it provides a datagram transport.
- Unlike UDP, the datagram delivery is reliable.
- Unlike TCP (another reliable delivery protocol), SRD can reorder packets in transit and deliver them out-of-order.
SRD consumers are supposed to deal with packet reordering (unlike some UDP consumers that drop reordered packets).
Dropping the “in-order delivery” requirement allows AWS to send SRD packets in parallel over all alternate paths. That approach also reduces link congestion – instead of a long burst of packets landing on a single link, the packet burst is spread across many parallel links.
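To make the congestion argument concrete, here is a minimal Python sketch comparing classic per-flow hashing with per-packet spraying. The round-robin spraying, the CRC32 hash, and the function names are all illustrative assumptions; AWS has not published SRD's path-selection logic at this level of detail.

```python
import zlib
from collections import Counter


def single_path(flow_id: str, n_packets: int, n_links: int) -> Counter:
    """ECMP-style hashing: the flow is hashed once, so the whole burst lands on one link."""
    link = zlib.crc32(flow_id.encode()) % n_links
    return Counter({link: n_packets})


def sprayed(n_packets: int, n_links: int) -> Counter:
    """Per-packet spraying, modeled as simple round-robin (an assumption, not SRD's real algorithm)."""
    return Counter(seq % n_links for seq in range(n_packets))


burst, links = 64, 8
# Worst-case number of packets queued on any single link:
print(max(single_path("10.0.0.1:443->10.0.0.2:5000", burst, links).values()))  # 64
print(max(sprayed(burst, links).values()))  # 8
```

The price of the flatter per-link load is that packets taking different paths can arrive out of order, which is exactly the requirement SRD drops.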
Next step: ENA Express uses SRD instead of GRE (or VXLAN or GENEVE) to transport Ethernet frames between hypervisor hosts, resulting in faster, reliable delivery. According to AWS documentation, they even reorder incoming SRD packets into the proper sequence before passing them to the VM TCP/IP stack:
Handles some tasks directly in the network layer, such as packet reordering on the receiving end, and most retransmits that are needed. This frees up the application layer for other work.
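The receive-side reordering mentioned in that quote could look something like the following sketch: a small resequencing buffer that holds early arrivals until the gap in front of them is filled. The per-packet sequence numbers and the function name are assumptions for illustration; this is not the ENA Express implementation.

```python
from typing import Dict, Iterable, Iterator, Tuple


def resequence(packets: Iterable[Tuple[int, bytes]], first_seq: int = 0) -> Iterator[bytes]:
    """Yield payloads in sequence order, buffering any that arrive early."""
    pending: Dict[int, bytes] = {}
    next_seq = first_seq
    for seq, payload in packets:
        if seq < next_seq or seq in pending:
            continue  # duplicate, for example a retransmit that raced the original
        pending[seq] = payload
        # Release everything that is now contiguous with what was already delivered.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


arrivals = [(1, b"B"), (0, b"A"), (3, b"D"), (2, b"C"), (2, b"C")]
print(b"".join(resequence(arrivals)))  # b'ABCD'
```

A real implementation would also need a bound on the buffer size and a timer to request missing packets; both are omitted here.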
It’s obvious why SRD increases the throughput of a single TCP session – packets from a single session can be sent over multiple links[1] – but why does it decrease the tail latency?
I know just enough about TCP to have incorrect opinions, but (based on what people who should know better told me) there are several reasons for variability in TCP throughput and latency:
- Packet drops could be a Really Bad Thing if your TCP stack uses a drop-sensitive congestion avoidance algorithm. Having reliable underlay transport solves this one.
- Early TCP implementations could interpret reordered packets as a packet drop of the intermediate packets (see above). I was told this was a solved problem and should have disappeared in recent TCP implementations after the TCP Selective ACK was implemented.
- Packet reordering also kills hardware-based Receive Segment Coalescing, but it looks like AWS was more than willing to sacrifice the CPU cycles needed to sort the packets in software to get better performance.
- Losing the last packet of a packet burst is a killer, even if your TCP stack uses Selective ACK. The receiver can’t acknowledge the lost packet or send Selective ACK because there’s no subsequent packet, forcing the sender to wait for the retransmission timeout. The real SNAFU: the minimum TCP retransmission timeout on most operating systems is measured in milliseconds (the Linux default is 200 ms), while the fabric transit times are measured in microseconds. No wonder the tail latency is through the roof and can be fixed with reliable transport of IP datagrams.
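The tail-loss problem from the last bullet can be illustrated with a deliberately simplified model: a lost segment is recovered by fast retransmit only if at least three later segments arrive and trigger duplicate (or selective) ACKs; otherwise the sender waits for the retransmission timeout. The sketch ignores newer mitigations such as tail loss probes, and the 200 ms minimum RTO (the Linux default) and 50 µs fabric round-trip time are assumed numbers chosen for illustration.

```python
from typing import Dict, Set

DUPACK_THRESHOLD = 3    # classic fast-retransmit trigger
MIN_RTO_US = 200_000    # assumption: Linux default minimum RTO (200 ms)
FABRIC_RTT_US = 50      # assumption: illustrative intra-fabric round-trip time


def recovery(burst_size: int, lost: Set[int]) -> Dict[int, str]:
    """Map each lost segment to the way the sender finds out about the loss."""
    result = {}
    for seq in sorted(lost):
        # Every later segment that does arrive makes the receiver send a duplicate/selective ACK.
        later_arrivals = sum(1 for s in range(seq + 1, burst_size) if s not in lost)
        result[seq] = "fast retransmit" if later_arrivals >= DUPACK_THRESHOLD else "timeout"
    return result


print(recovery(10, {4}))            # {4: 'fast retransmit'}
print(recovery(10, {9}))            # {9: 'timeout'}
print(MIN_RTO_US // FABRIC_RTT_US)  # 4000
```

With these assumed numbers, a single lost tail packet stalls the flow for the equivalent of 4000 fabric round trips, which is the kind of outlier that dominates the 99.9th percentile.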
Now that you understand some of the reasons why ENA Express improves performance and tail latency, let’s quickly deal with the inevitable hype:
- Is this a new idea? Of course not. We’ve used reliable frame transport for ages on media that were too noisy (or drop-prone) for TCP, starting with X.25 and analog modems using V.42 error correction and continuing with radio networks.
- Do you need a DPU to implement something like ENA Express? Absolutely not; you could implement it in the virtual switch like GRE, VXLAN, or GENEVE. It’s just a question of which CPU cycles you’d like to burn.
- Could someone else do something similar? Of course, but it would require a focus on customer performance, a deep understanding of transport protocols, and engineering prowess. Charging the customers for services they consume also helps to focus your thinking. Is it fair to expect any of that from a company that needed years to add LACP to its virtual switch?
Want to know more about networking in AWS? Watch the Amazon Web Services Networking webinar ;)
- One of my readers politely pointed out that ENA Express reorders incoming SRD packets if needed. Fixed the relevant bits in the blog post.
[1] Brocade did something similar ages ago in the VCS Fabric.
Wouldn't it make sense to implement SRD in, for example, the Linux kernel and get rid of TCP for a lot of applications? I don't see it replacing UDP. But SRD for all web traffic? Wouldn't be that bad. I read somewhere a few days ago that we should replace TCP in the datacenter. Ha, I found it again: https://arxiv.org/pdf/2210.00714.pdf They're talking about "Homa". I don't know enough application and transport stuff to know if the Homa idea is good or bad, or how it relates to SRD.
In the AWS case, SRD is used instead of VXLAN/GENEVE to transport customer data, so it's still Ethernet/IP/TCP above it.
Replacing TCP with SRD is hard because the original designers of the socket API wrote QDS code before thinking it through... https://blog.ipspace.net/2009/08/what-went-wrong-socket-api.html
As for "replacing TCP in data center", that article is so full of bullshit that I postponed writing a reply till next year ;)
I thought using SRD as a replacement for overlays was your idea for a next step because you wrote "Next step". Sorry.
But could it replace TCP for real? I’m not talking overlays, just plain TCP.
Could SRD replace TCP? Of course not. Applications consuming TCP streams expect in-order delivery. You could add in-order delivery to SRD, but then you just reinvented TCP ;)
When I first read about AWS ENA and SRD, I wondered whether AWS had “rolled their own” in a manner like RTP, EIGRP’s own protocol for reliably sending data to all "nodes", with piggybacked ACKs etc., using multicast and unicast methods.