Facebook Is Close to Having an IPv6-only Data Center

Whenever I mention the idea of IPv6-only data centers, I get the usual question: “Sounds great, but is anyone actually using it?” So far, my answer has been: “Yeah, I know a great guy in Norway who runs this in production.” As of last week, the answer is way more persuasive: “Facebook is almost there.”

Background: Paul Saab from Facebook gave a great presentation at last week’s V6 World Congress describing Facebook’s IPv6 deployment plans and the gotchas they encountered along the way.

Why: They ran out of RFC 1918 address space in their data center (that’s a nice problem to have). IPv6 was the only way forward. Also, they wanted to encourage developers to stop writing IPv4-only code (and taking IPv4 away seems to be the only way to do that).
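
Here’s a minimal sketch (my example, not Facebook’s code) of what “stop writing IPv4-only code” means in practice: let getaddrinfo() pick the address family instead of hard-coding AF_INET, so the same client code works on IPv4-only, dual-stack and IPv6-only hosts.

    import socket

    def connect(host, port):
        """Try every address returned by DNS, whatever the address family."""
        last_err = None
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, socktype, proto)
                sock.connect(sockaddr)
                return sock          # IPv6 on an IPv6-only host, IPv4 elsewhere
            except OSError as err:
                last_err = err
        raise last_err or OSError("no usable address for {}".format(host))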

How: They decided an IPv6-only data center was the way to go. It’s much easier to handle two protocols at the edge (where they have load balancers anyway) than throughout the data center.
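
To illustrate the “dual stack at the edge, IPv6-only inside” idea, here’s a minimal (and definitely not production-grade) sketch of an edge TCP proxy: one dual-stack listener accepts IPv4 and IPv6 clients, while every backend connection uses IPv6. The backend address and the ports are made-up placeholders, not anything Facebook published.

    import socket, threading

    BACKEND = ("2001:db8::10", 8080)          # hypothetical IPv6-only backend

    def pipe(src, dst):
        # Copy bytes one way until the connection closes.
        try:
            while data := src.recv(4096):
                dst.sendall(data)
        except OSError:
            pass
        finally:
            dst.close()

    def handle(client):
        upstream = socket.create_connection(BACKEND)   # always IPv6 inside the DC
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    # A single AF_INET6 socket with IPV6_V6ONLY disabled accepts both address
    # families; IPv4 clients show up as IPv4-mapped addresses (::ffff:a.b.c.d).
    listener = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    listener.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("::", 8443))
    listener.listen(128)
    while True:
        conn, _ = listener.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()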

Problems: Plenty of them, from switches falling back to process switching (lovely, isn’t it?) to TCAM limits (told you), software crashes, BGP problems, Linux kernel cache thrashing, glibc and curl errors… The usual when you’re the first one pushing the envelope.

How far did they get: They had IPv6 throughout the data center in 2011. All hosts support IPv6 now, and 75% of internal traffic (including 100% of memcached traffic) is already IPv6.

What next: Plans to remove IPv4 from first clusters by the end of 2014.

Why does it matter: Facebook proved it can be done at scale, and discovered (and helped fix) a lot of bugs on the way. Everyone might eventually have a slightly easier transition to IPv6 because of their efforts. Thank you!

Related technical details: Watch the IPv6-only data center videos (free) and other IPv6 webinars.

10 comments:

  1. Hi Ivan - a tangential question - I was wondering if there is a clear and simple depiction of what happens behind the scenes from the time someone types, say, youtube.com into a browser to when the video is loaded and played. Any pointers? Or maybe you could sketch something up? :P
    Replies
    1. Obviously we don't know (and never will) what exactly is happening at the application layer; for some network-related details, do check out my TCP/HTTP/SPDY course.

      http://demo.ipspace.net/get/SPDY#Videos
  2. DivSu: Not exactly what you are looking for, but YouTube published a lot of info about their stack two years ago:

    http://highscalability.com/blog/2012/3/26/7-years-of-youtube-scalability-lessons-in-30-minutes.html
  3. Hello, very good document on Facebook. I was wondering: how can it be possible that "Traffic goes only over links where ND happened before BGP"? This is a comment on page 13 of the document. How the hell could BGP come up BEFORE ND?

    Regards,
    - Martin B.
    Replies
    1. No idea. I'm guessing BGP gets stuck if it tries to establish the session before ND has happened. Nasty bug if that's true (hopefully fixed in GA code by now).
    2. We have to run the IPv6 BGP sessions over IPv4. It was covered in a slide. This is due to limitations in the aggregation switches, which cannot support twice the number of BGP sessions when you are doing both IPv4 and IPv6.
  4. "They ran out of RFC 1918 address space.." Wait.... Whaaaat?
    How?
    Good step forward for the Internet as a whole though
    Replies
    1. You'd need a few million servers for that (they used a /24 for each rack).
    2. Even if you had a few million servers, you'd have to have some fairly lax IP address allocation policies to run out of IPv4 addresses in your private network.

      Sounds like poor design to me.
    3. A /24 per rack in the 10/8 space is 65k racks. Presume they have some allocation inefficiencies (they probably have the IP space assigned in blocks to individual data centers) and they're out of IPs at 10k racks. Presume you have 40 servers in a rack; that's 400k servers, which doesn't seem unreasonable for Facebook. The main inefficiency here would seem to be using a /24 per rack. A big Internet company I used to work for used a /25 per rack, which seems a little more reasonable. A /26 is probably cutting it a bit close. But even a /25 doesn't move the needle very far; it is still quite reasonable for them to be running out of RFC 1918 space (quick arithmetic check below).
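
      A quick back-of-the-envelope check of that arithmetic, in Python (the 10k racks and 40 servers per rack are assumptions from the comment above, not Facebook's published numbers):

        # All inputs are assumptions taken from the comment above.
        subnets_in_10_slash_8 = 2 ** (24 - 8)       # number of /24s in 10.0.0.0/8
        usable_racks = 10_000                       # after assumed allocation inefficiencies
        servers_per_rack = 40
        print(subnets_in_10_slash_8)                # 65536 racks at best with a /24 per rack
        print(usable_racks * servers_per_rack)      # 400000 servers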