SDN Router @ Spotify on Software Gone Wild

Imagine you need a data center WAN edge router with multiple 10GE uplinks. You’d probably go for an ASR or an MX-series router, right? How about using a 2 Tbps ToR switch and an SDN solution to make it work with the full Internet routing table?

If you happen to have iTunes on your computer, please spend 10 seconds rating the podcast before you start listening to it. Thank you!

David Barroso from Spotify decided to do exactly that. He figured out he needs only a small fraction of the functionality of a reassuringly expensive WAN edge router. Then he listened to the NANOG presentation by Elisa Jasinska, said “that’s exactly how I could figure out what I really need in my forwarding tables”, and implemented a good-enough solution that solves his WAN connectivity challenges with an Arista ToR switch.
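The core idea of keeping only the routes that actually carry traffic can be sketched in a few lines of Python. This is a minimal illustration, not David's actual implementation: the function names, the sample prefixes, and the flow records are all hypothetical, and it assumes traffic to any prefix that is not selected simply follows a default route.

```python
import ipaddress
from collections import Counter


def longest_match(addr, networks):
    """Return the most specific network containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    best = None
    for net in networks:
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def select_fib_prefixes(bgp_prefixes, flows, max_routes):
    """Pick up to max_routes BGP prefixes that carry the most bytes.

    bgp_prefixes: iterable of prefix strings learned via BGP
    flows: iterable of (destination_ip, byte_count) tuples, e.g. from a flow collector
    """
    networks = [ipaddress.ip_network(p) for p in bgp_prefixes]
    bytes_per_prefix = Counter()
    for dst, nbytes in flows:
        net = longest_match(dst, networks)
        if net is not None:
            bytes_per_prefix[net] += nbytes
    # Highest-volume prefixes first; everything else would use the default route.
    return [str(net) for net, _ in bytes_per_prefix.most_common(max_routes)]


# Hypothetical sample data (documentation address ranges).
bgp_prefixes = [
    "192.0.2.0/24",
    "198.51.100.0/24",
    "203.0.113.0/24",
    "203.0.113.128/25",
]
flows = [
    ("192.0.2.10", 500),
    ("198.51.100.7", 9000),
    ("203.0.113.200", 4000),
    ("203.0.113.5", 100),
    ("192.0.2.99", 700),
]

selected = select_fib_prefixes(bgp_prefixes, flows, max_routes=2)
print(selected)  # ['198.51.100.0/24', '203.0.113.128/25']
```

The linear scan in `longest_match` keeps the sketch short; with a full Internet table you would want a radix tree or similar structure instead.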

You’ll find more details in Episode 19 of Software Gone Wild; and here’s even more background information:


  1. It's funny, I built something similar some time ago, but it used an intermediate NetFlow probe since I was using an OpenFlow "transit" device. The transit "device" was really just OVS on a server with ExaBGP receiving transit/peering routes. What I was really trying to test was realtime route aggregation to reduce ultimate RIB/FIB size.

    The real crux is how the BGP controller programs the upstream forwarding plane devices.

    You may be able to do some fancy stuff on things like Cumulus Linux or even the Cisco Nexus 3K around running something like ExaBGP in a container so the routes do not even hit the RIB of the device. Your BGP controller then has another session to the transit devices where it populates the device only with routes it needs with specific NH addresses. I2RS may be another option in the future.

    1. Somewhere in the podcast I asked David why he didn't use OpenFlow. His answer was along the lines of: "I wanted to use something that I could make work with minimum effort" ;)
  2. Sorry I didn't listen to the podcast. :) I'm definitely not a big OF proponent, I don't think it has a real future, but it's not really very difficult to get working these days.

    Metaswitch had the same basic premise by proxying the BGP sessions from the transit/peer connected device through to the controller, so the transit device doesn't even have a RIB and runs no control plane protocols except the OF agent. Definitely a purist solution. :)

    To me, installing specific routes on the borders doesn't seem to be too advantageous. What would be more interesting is, say, putting a switch on each transit connection, with a default route, then manipulating BGP routes sent downstream to ToR or servers to send traffic out a specific switch using some constraint.
    1. The main problem with OF is that you can install only a couple of thousand OF entries, while if you use your FIB you can easily install more than 60,000 routes.

      Regarding "manipulating BGP routes sent downstream to TOR or servers to send traffic out a specific switch", that should be doable with this software and some overlay like MPLS or GRE tunnels. Potentially you could use pmacct to gather flow statistics and some other tool to test your peers/transit networks, then use this tool to correlate information from both tools, choose routes based on your policies and the metrics gathered and install them on the FIB or send them to ToR/hosts.

      The real question is, do you really need that? 60k routes seems enough for most networks and if you want to steer certain flows via a specific peer you could easily extend this tool to install PBR/OF entries for those specific flows. That would allow you to use the routes in the FIB as default and to steer traffic for certain applications using PBR/OF.

      The tool is in fact just a framework that could potentially allow you to do any sort of traffic engineering based on flow statistics, metrics, company policies, etc. You could even run this tool peering with an ASR/MX and use it to choose your egress peer instead of limiting the number of routes you want. All the functionality is built in the form of plugins.

      In summary, when building a solution you have to think about your use case, build something that is as simple as possible, and iterate on it. Always keep hardware in mind, too: whitebox switches might look attractive because they are cheap, but their form factor might not suit your needs for certain use cases, so I would rather stick with a generic solution I can run everywhere than a solution that forces me to use a specific platform.
  3. Well, the number of OF entries is really all about hardware, just like anything else. There are some tricks you can play with the Trident II to get >100K flow entries, just like you can get 128K LPM v4 entries vs. the default 32K. You can also get some PCI cards for COTS (like Flownic) which accelerate OVS and support millions of entries. I digress though, since I'm not really a fan of OF; it was just a means of programming the device. :)

    The key is definitely open and extensible. There are many use cases out there and of course not all solutions work for everyone. My Python skills aren't great, but I will take a look at the software and see if there is anything I could add plugin-wise. If you see Nic@Spotify, tell him Phil says hi. :)
  4. You may want to check:
    Their solution uses FPGAs, which provide programmable HW; feature enhancements to the HW become much easier than a forklift upgrade.
    1. Yes Corsa and Noviflow are two hardware vendors making actual OpenFlow switches which carry 1M+ entries. The cost is going to be higher than a white label Trident 2 box though. The Corsa boxes have a lot of buffer memory versus what you'll find on a Trident which drives the price up. Buffering is an issue if you are aggregating lots of 10G ToR connections in a datacenter to a transit switch with a 10G transit uplink. These cheap white label boxes do not have the buffers to deal with it.

      They also have relatively weak CPUs in them to keep costs down. It's the main hardware differentiator between Juniper's QFX5100 and their ONIE-compatible white label switch.
  5. I've just watched your presentation on YouTube and listened to the podcast, and I'd like to raise some points for further clarification.
    Well, if you are peering with someone, the normal behavior is for them to announce longer prefixes to you through their peer connections than through their transit.
    That alone would already make the traffic from your AS back to them go through the peer connection and 'offload' the data from the transit connections.
    Your destinations would be reachable via either transit or peers.
    In that case - assuming that your peers are not masochistic SoBs that prefer to use their transit connections instead of their peer connections - wouldn't it be simpler to just accept default routes from your upstream providers and naturally steer the traffic using the BGP announcement policies of your peers?
    Sorry for the dumb question.