SDN Router @ Spotify on Software Gone Wild
Imagine you need a data center WAN edge router with multiple 10GE uplinks. You’d probably go for an ASR- or MX-series router, right? How about using a 2 Tbps ToR switch and an SDN solution to make it work with the full Internet routing table?
If you happen to have iTunes on your computer, please spend 10 seconds rating the podcast before you start listening to it. Thank you!
David Barroso from Spotify decided to do exactly that. He figured out he needed only a small fraction of the functionality of a reassuringly expensive WAN edge router. Then he listened to the NANOG presentation by Elisa Jasinska, said “that’s exactly how I could figure out what I really need in my forwarding tables”, and implemented a good-enough solution that solves his WAN connectivity challenges with an Arista ToR switch.
You’ll find more details in Episode 19 of Software Gone Wild; and here’s even more background information:
- David’s presentation from SDN meetup in Stockholm
- SDN Internet Router documentation
- SDN Internet Router project at Github
- SDN Internet Router - Part 1 (what is Internet and other stories)
- SDN Internet Router - Part 2 (technical details and deployment experience)
- SDN Internet Router In Production - followup podcast with David Barroso
- Toolsmith @ Netflix podcast with Elisa Jasinska
- Pmacct podcast with Paolo Lucente
- NANOG presentation from Elisa Jasinska and Paolo Lucente
The real crux is how the BGP controller programs the upstream forwarding plane devices.
On platforms like Cumulus Linux or even the Cisco Nexus 3K you may be able to do some fancy stuff, such as running something like ExaBGP in a container so the routes never even hit the device’s RIB. Your BGP controller then has another session to the transit devices, where it populates each device only with the routes it needs, with specific next-hop addresses. I2RS may be another option in the future.
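Something along these lines could work for the ExaBGP side. This is only a rough sketch: the prefixes, next hops and file path are made up, and it assumes ExaBGP’s text API, where a helper process writes announcements to stdout and ExaBGP turns them into BGP updates toward the transit device:

```python
#!/usr/bin/env python3
# Sketch of an ExaBGP "process" script: announce only the prefixes the
# controller decided to keep, each with an explicit next hop, so the
# switch FIB gets populated without a full table ever touching its RIB.
import sys
import time

# Illustrative data; in practice this would come from the controller's
# route-selection logic (e.g. pmacct-derived "busiest prefixes").
SELECTED_ROUTES = [
    ("203.0.113.0/24", "192.0.2.1"),   # steer via transit A
    ("198.51.100.0/24", "192.0.2.5"),  # steer via peer B
]

def main():
    # ExaBGP reads announcements from this process's stdout.
    for prefix, next_hop in SELECTED_ROUTES:
        sys.stdout.write(f"announce route {prefix} next-hop {next_hop}\n")
        sys.stdout.flush()
    # Keep the process alive so ExaBGP does not withdraw the routes.
    while True:
        time.sleep(60)

if __name__ == "__main__":
    main()
```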
http://www.metaswitch.com/sites/default/files/lean-switch-white-paper-final-1.pdf
Metaswitch’s lean-switch concept had the same basic premise: proxy the BGP sessions from the transit/peer-connected device through to the controller, so the transit device doesn’t even have a RIB and runs no control-plane protocols except the OF agent. Definitely a purist solution. :)
To me, installing specific routes on the border devices doesn’t seem all that advantageous. What would be more interesting is, say, putting a switch on each transit connection with just a default route, then manipulating the BGP routes sent downstream to the ToR switches or servers to send traffic out a specific switch based on some constraint.
Regarding "manipulating BGP routes sent downstream to TOR or servers to send traffic out a specific switch", that should be doable with this software and some overlay like MPLS or GRE tunnels. Potentially you could use pmacct to gather flow statistics and some other tool to test your peers/transit networks, then use this tool to correlate information from both tools, choose routes based on your policies and the metrics gathered and install them on the FIB or send them to ToR/hosts.
The real question is: do you really need that? 60k routes seem enough for most networks, and if you want to steer certain flows via a specific peer, you could easily extend this tool to install PBR/OF entries for those specific flows. That would let you use the routes in the FIB as the default path and steer traffic for certain applications using PBR/OF.
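As an illustration of the “default in the FIB, specific flows steered via PBR/OF” idea, a helper could generate flow entries like the ones below. The flow definitions, port numbers and OVS-style match syntax are made up for readability; a real deployment would push equivalent PBR or OpenFlow entries through the switch’s own API:

```python
#!/usr/bin/env python3
# Sketch: express "steer selected application flows to a chosen egress
# port, let everything else follow the default route" as OpenFlow-style
# entries. Everything here is illustrative, not a real switch config.
STEERED_FLOWS = [
    # (description, match fields, egress port)
    ("video CDN traffic",   {"nw_dst": "203.0.113.0/24", "tp_dst": 443}, 49),
    ("backup replication",  {"nw_dst": "198.51.100.0/24", "tp_dst": 873}, 50),
]

def to_flow_entry(match, port, priority=200):
    # Higher priority than the default forwarding entry so these flows win.
    fields = ",".join(f"{k}={v}" for k, v in match.items())
    return f"priority={priority},tcp,{fields},actions=output:{port}"

if __name__ == "__main__":
    for desc, match, port in STEERED_FLOWS:
        print(f"# {desc}")
        print(to_flow_entry(match, port))
```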
In fact, the tool is just a framework that could potentially let you do any sort of traffic engineering based on flow statistics, metrics, company policies and so on. You could even run it peering with an ASR/MX and use it to choose your egress peer instead of limiting the number of routes. All the functionality is built in the form of plugins.
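To give a feel for the plugin idea, here is a hypothetical plugin interface. This is not the project’s actual plugin API, just an illustration of how route-selection logic could be swapped in and out:

```python
#!/usr/bin/env python3
# Hypothetical plugin interface (NOT the actual SIR plugin API): each
# plugin looks at the candidate routes plus whatever metrics you gathered
# and returns the subset it wants installed in the FIB.
from abc import ABC, abstractmethod

class RouteSelectionPlugin(ABC):
    @abstractmethod
    def select(self, candidate_routes, metrics):
        """Return the routes this plugin wants in the FIB."""

class KeepBusiestPrefixes(RouteSelectionPlugin):
    """Keep only the prefixes carrying the most traffic."""

    def __init__(self, max_routes=60000):
        self.max_routes = max_routes

    def select(self, candidate_routes, metrics):
        # metrics: {prefix: bytes}, gathered e.g. from pmacct
        ranked = sorted(candidate_routes,
                        key=lambda p: metrics.get(p, 0), reverse=True)
        return ranked[:self.max_routes]
```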
In summary, when building a solution you have to think about your use case, build something that is as simple as possible, and iterate over it. And always keep the hardware in mind: whitebox switches might look attractive because they are cheap, but their form factor might not suit certain use cases, so I would rather stick with a generic solution I can run everywhere than one that forces me onto a specific platform.
The key is definitely being open and extensible. There are many use cases out there, and of course not every solution works for everyone. My Python skills aren’t great, but I will take a look at the software and see if there is anything I could add plugin-wise. If you see Nic@Spotify, tell him Phil says hi. :)
Their solution uses FPGAs, which provide programmable hardware; adding new hardware features becomes much easier than a forklift upgrade.
They also have relatively weak CPUs to keep costs down. That’s the main hardware differentiator between Juniper’s QFX5100 and their ONIE-compatible white-label switch.
Well, if you are peering with someone, the normal behavior is for them to announce longer (more specific) prefixes to you through the peering connection than what you would receive through transit.
That alone would already steer the traffic from your AS back to them through the peering connection and ‘offload’ the transit connections.
Your destinations would be reachable via either transit or peers.
In that case, assuming your peers are not masochistic SOBs who prefer to use their transit connections instead of their peering connections, wouldn’t it be simpler to just accept default routes from your upstream providers and naturally steer the traffic using the BGP announcement policies of your peers?
Sorry for the dumb question.