Facebook Is Close to Having an IPv6-only Data Center

Whenever I mention the idea of IPv6-only data centers, I get the usual question: “Sounds great, but is anyone actually using it?” So far, my answer has been: “Yeah, I know a great guy in Norway who runs this in production.” As of last week, the answer is way more persuasive: “Facebook is almost there.”

Background: Paul Saab from Facebook gave a great presentation at last week’s V6 World Congress describing Facebook’s IPv6 deployment plans and the gotchas they encountered along the way.

Why: They ran out of RFC 1918 address space in their data center (that’s a nice problem to have). IPv6 was the only way forward. Also, they wanted to encourage developers to stop writing IPv4-only code (and taking IPv4 away seems to be the only way to do that).
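
Here’s a minimal sketch (my example, not Facebook’s code) of what “stop writing IPv4-only code” means in practice: let getaddrinfo() pick the address family instead of hard-coding AF_INET, so the same client code works on IPv4-only, dual-stack and IPv6-only hosts.

    import socket

    def connect(host, port):
        """Try every address returned by DNS, whatever the address family."""
        last_err = None
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, socktype, proto)
                sock.connect(sockaddr)
                return sock          # IPv6 on an IPv6-only host, IPv4 elsewhere
            except OSError as err:
                last_err = err
        raise last_err or OSError("no usable address for {}".format(host))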

How: They decided an IPv6-only data center was the way to go. It’s much easier to handle two protocols at the edge (where they have load balancers anyway) than throughout the data center.
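
To illustrate the “dual stack at the edge, IPv6-only inside” idea, here’s a minimal (and definitely not production-grade) sketch of an edge TCP proxy: one dual-stack listener accepts IPv4 and IPv6 clients, while every backend connection uses IPv6. The backend address and the ports are made-up placeholders, not anything Facebook published.

    import socket, threading

    BACKEND = ("2001:db8::10", 8080)          # hypothetical IPv6-only backend

    def pipe(src, dst):
        # Copy bytes one way until the connection closes.
        try:
            while data := src.recv(4096):
                dst.sendall(data)
        except OSError:
            pass
        finally:
            dst.close()

    def handle(client):
        upstream = socket.create_connection(BACKEND)   # always IPv6 inside the DC
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    # A single AF_INET6 socket with IPV6_V6ONLY disabled accepts both address
    # families; IPv4 clients show up as IPv4-mapped addresses (::ffff:a.b.c.d).
    listener = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    listener.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("::", 8443))
    listener.listen(128)
    while True:
        conn, _ = listener.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()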

Problems: Plenty of them, from switches falling back to process switching (lovely, isn’t it?) to TCAM limits (told you), software crashes, BGP problems, Linux kernel cache thrashing, glibc and curl errors… The usual when you’re the first one pushing the envelope.

How far did they get: They had IPv6 throughout the data center in 2011. All hosts support IPv6 now, and 75% of internal traffic (including 100% of memcached traffic) is already IPv6.

What next: Plans to remove IPv4 from first clusters by the end of 2014.

Why does it matter: Facebook proved it can be done at scale, and discovered (and helped fix) a lot of bugs on the way. Everyone might eventually have a slightly easier transition to IPv6 because of their efforts. Thank you!

Related technical details: Watch the IPv6-only data center videos (free) and other IPv6 webinars.

10 comments:

  1. Hi Ivan - a tangential question - I was wondering if there is a clear and simple depiction of what happens behind the scenes from the time someone types, say, youtube.com into a browser to when the video is loaded and played. Any pointers? Or maybe you could sketch something up? :P
    Replies
    1. Obviously we don't know (and never will) what exactly is happening at the application layer; for some network-related details, do check out my TCP/HTTP/SPDY course.

      http://demo.ipspace.net/get/SPDY#Videos
  2. DivSu: Not exactly what you are looking for, but YouTube published a lot of info about their stack two years ago:

    http://highscalability.com/blog/2012/3/26/7-years-of-youtube-scalability-lessons-in-30-minutes.html
  3. Hello, very good document on Facebook. I was wondering: how can it be possible that "Traffic goes only over links where ND happened before BGP"? This is a comment on page 13 of the document. How the hell could BGP come up BEFORE ND?

    Regards,
    - Martin B.
    Replies
    1. No idea. I'm guessing BGP gets stuck if it tries to establish the session before ND has happened. Nasty bug if that's true (hopefully fixed in GA code by now).
    2. We have to run the IPv6 BGP sessions over IPv4. It was covered in a slide. This is due to limitations in the aggregation switches, which cannot support twice the number of BGP sessions when you are doing both IPv4 and IPv6.
  4. "They ran out of RFC 1918 address space.." Wait.... Whaaaat?
    How?
    Good step forward for the Internet as a whole though
    Replies
    1. You'd need a few million servers for that (they used a /24 for each rack).
    2. Even if you had a few million servers, you'd have to have some fairly lax IP address allocation policies to run out of IPv4 addresses in your private network.

      Sounds like poor design to me.
    3. A /24 per rack in the 10/8 space is 65k racks. Presume they have some allocation inefficiencies (they probably have the IP space assigned in blocks to individual data centers) and they're out of IPs at 10k racks. Presume you have 40 servers in a rack; that's 400k servers, which doesn't seem unreasonable for Facebook. The main inefficiency here would seem to be using a /24 per rack. A big Internet company I used to work for used a /25 per rack, which seems a little more reasonable. A /26 is probably cutting it a bit close. But even a /25 doesn't move the needle very far; it is still quite reasonable for them to be running out of RFC 1918 space (quick arithmetic check below).
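
      A quick back-of-the-envelope check of that arithmetic, in Python (the 10k racks and 40 servers per rack are assumptions from the comment above, not Facebook's published numbers):

        # All inputs are assumptions taken from the comment above.
        subnets_in_10_slash_8 = 2 ** (24 - 8)       # number of /24s in 10.0.0.0/8
        usable_racks = 10_000                       # after assumed allocation inefficiencies
        servers_per_rack = 40
        print(subnets_in_10_slash_8)                # 65536 racks at best with a /24 per rack
        print(usable_racks * servers_per_rack)      # 400000 servers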