What Went Wrong: the Socket API
You might think that the lack of a decent session layer in the TCP/IP protocol suite is the main culprit for our reliance on IP multihoming and the related explosion of the IP routing tables. Unfortunately, we have an even bigger problem: the Berkeley Socket API, which is around 40 years old and used in almost all TCP/IP software implementations and clients (including high-level scripting languages like Perl or Python).
To establish a client-to-server connection using the Socket API, you have to perform these calls:
- Create a socket with the socket() call
- Convert a hostname into an L3 address (IPv4 or IPv6) with the getaddrinfo() or (obsolete) gethostbyname() call.
- Connect to the remote L3 address with the connect() call.
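The three calls above can be sketched in Python, whose socket module is a thin wrapper around the same API. This is a minimal illustration of the naive pattern, not a recommendation; the function name connect_first_address and the timeout value are assumptions made for the example.

```python
import socket


def connect_first_address(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Resolve host, then connect to the first returned address only."""
    # Name resolution: returns a list of candidate addresses, reachable or not.
    candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    family, socktype, proto, _canonname, sockaddr = candidates[0]

    # Socket creation: the address family is only known after resolution.
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # Session establishment: only the L3 address is used from here on;
        # the hostname and the remaining candidates are forgotten.
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock
```

Note that the application, not the operating system, carries the address from the resolver to connect(), which is exactly the disconnect discussed next.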
The set of calls you have to perform is not surprising; the Socket API is older than DNS. However, the reliance on IP addresses passed around inside the application and the total disconnect between name resolution and session establishment are a disaster.
Just to give you an example: you might have a server farm offering a service, and describe it in DNS with numerous A records for the same name (for example, scs.msg.yahoo.com).
nslookup scs.msg.yahoo.com.
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
scs.msg.yahoo.com canonical name = vcs0.msg.g03.yahoodns.net.
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.114.52
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.114.76
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.114.68
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.114.97
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.121.49
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.120.52
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.114.81
Name: vcs0.msg.g03.yahoodns.net
Address: 66.196.121.40
The DNS entry for scs.msg.yahoo.com looks awesome, but doesn’t help a bit unless the client application uses that information. In reality, most applications:
- Perform the getaddrinfo() call which returns the list of addresses (regardless of whether they are reachable or not)
- Use the first address (or all of them in sequence) in the connect() call (happy eyeballs implementations are an obvious exception).
If the DNS lookup returned a temporarily unreachable IP address and the application never tries another one, you’re doomed.
Obviously you could reinvent happy eyeballs. You could make DNS calls yourself using the resolver library (or parse the information returned by getaddrinfo()), collect all IP addresses and try to connect to more than one of them. Web browsers usually do that quite well, or we would quickly stop using them.
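As a rough sketch of the "try more than one address" idea, the following Python function walks the whole getaddrinfo() result with a short per-attempt timeout. It is a plain sequential fallback, far simpler than real happy eyeballs (which races attempts in parallel); the names connect_any and per_attempt_timeout, and the resolver hook, are assumptions made for this example.

```python
import socket


def connect_any(host, port, per_attempt_timeout=2.0, resolver=socket.getaddrinfo):
    """Try each address returned for host, in order, until one connects.

    resolver is a hypothetical hook (not part of any standard API) that makes
    the name lookup replaceable, for example in tests.
    """
    last_error = None
    for family, socktype, proto, _canon, sockaddr in resolver(
        host, port, type=socket.SOCK_STREAM
    ):
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError as exc:  # e.g. address family not supported locally
            last_error = exc
            continue
        sock.settimeout(per_attempt_timeout)  # do not hang on a dead address
        try:
            sock.connect(sockaddr)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
            sock.close()
            continue
        sock.settimeout(None)  # back to blocking mode for the caller
        return sock
    raise last_error or OSError(f"no usable address for {host!r}")
```

The worst case is still slow: with N dead addresses ahead of a live one, the caller waits up to N times the per-attempt timeout before the session is established.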
You could even implement a connection-failure cache listing those addresses that were recently unreachable to speed up the future session setup process. But let’s be realistic: how many application programmers do you know that really understand the intricacies of TCP/IP (let’s lower the bar: how many of them could use the resolver library)? Most of them want to get their job done and end up using recipes from sources like Network Programming with Perl.
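To show how little (and yet how much more than most applications do) a connection-failure cache involves, here is a minimal Python sketch. The class name FailureCache, its methods, and the 60-second lifetime are all assumptions made for illustration; a real implementation would also have to consider thread safety and per-port state.

```python
import time


class FailureCache:
    """Remember addresses that recently failed so they are tried last."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._failed_at = {}  # address -> time of the last failure

    def mark_failed(self, address) -> None:
        self._failed_at[address] = self._clock()

    def mark_ok(self, address) -> None:
        self._failed_at.pop(address, None)

    def recently_failed(self, address) -> bool:
        failed_at = self._failed_at.get(address)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            del self._failed_at[address]  # entry expired, forget it
            return False
        return True

    def order(self, addresses):
        """Return addresses with recently failed ones moved to the end."""
        # sorted() is stable, so the resolver's order is kept within each group.
        return sorted(addresses, key=self.recently_failed)
```

An application would call order() on the resolved addresses before attempting connections, mark_failed() after each failed connect(), and mark_ok() after a successful one.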
The name-to-address mapping problem should have been abstracted into the OS kernel (or a system library) decades ago (at the latest when DNS became widespread), and applications should have been kept blissfully unaware of the complexities; the connect() call should accept a hostname and do the rest behind the scenes. Even Microsoft got that right with the NetBIOS API. But then, what could you expect: the Socket API is a direct mapping to the TCP/IP protocol stack (where DNS is just one of the applications). To make matters worse, it looks like we missed another opportunity to get the networking API right: according to Drew DeVault, the Plan 9 operating system treated network connections like files, but those ideas were never ported to Linux.
With the sorry state of the Socket API, the best you can do if your service is reachable through multiple IP addresses is to randomize the DNS responses (this will give you some limited load sharing), adjust the list of A records in the DNS responses based on server availability (while hoping that the intermediate DNS servers or the clients will not ignore the TTL settings in the DNS responses) … and as the last resort make sure all the IP addresses are always reachable, which brings us back to where we started: IP multihoming. You could also use a load balancer and a single (obviously multihomed) IP address.
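Some language runtimes paper over the gap in a library. As one example, Python's standard socket.create_connection() accepts a hostname, resolves it, and tries the returned addresses in turn. The thin wrapper below (connect_by_name is a name invented for this sketch) shows what a hostname-accepting call looks like from the application's point of view; it still sits on top of the same Socket API and does nothing about parallel attempts or remembering failures.

```python
import socket


def connect_by_name(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to host:port without handling addresses directly."""
    # create_connection() calls getaddrinfo() internally and attempts each
    # returned address in order; if every attempt fails it raises an OSError.
    return socket.create_connection((host, port), timeout=timeout)


def fetch_banner(host: str, port: int) -> bytes:
    """Usage example: connect by name and read whatever the server sends first."""
    try:
        with connect_by_name(host, port) as conn:
            return conn.recv(1024)
    except OSError:
        # Resolution failed or no address was reachable; deal with it somehow.
        return b""
```

The application code never sees an IP address, which is the property the paragraph above asks for, but the fix lives in one language's library rather than in the API every program shares.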
Revision History
- 2016-07-08
- gethostbyname is obsolete. Also added a reference to happy eyeballs which got popular after this blog post was written.
- 2022-11-19
- Added a pointer to Plan9 blog post, removed obsolete links, and polished the text a bit.
First of all, I love good discussion ... but prefer to have it with people who have at least a unique (even if fictitious) identity, so I would appreciate if you could use a unique identifier for comments that might evolve into an (hopefully interesting) discussion.
Now for getaddrinfo: I don't understand what you mean by the "correct loop". While getaddrinfo is supposed to provide more than one address, it looks like that's not always the case. What were you trying to say?
What I'm saying is that the Socket API is conceptually broken and that the handling of L3 addresses that the applications are forced to do severely hinders our ability to address problems we're having with the exploding Internet.
My reference to the Wikipedia is not meant to give a programmer a pointer to a reference documentation (which, BTW, differs between operating systems), but to give some background information to those that are not familiar with the structure of the Socket API (not everyone has been blessed with exposure to C++ programming), so that they could better understand my arguments.
If someone is trying to learn programming by reading my blog posts, they've found the wrong source.
I cannot speak for Windows or Linux systems, but in FreeBSD (and other BSD systems) getaddrinfo does the job properly.
Of course, since a name can have several IP addresses, you need to loop over them until you get a connection. You can even use wildcards as parameters to getaddrinfo so you'll be able to try all addresses in all address families available for that name. It's up to the programmer to do it properly.
From "man getaddrinfo" (FreeBSD 7.1):
The following code tries to connect to ``www.kame.net'' service ``http''
via a stream socket. It loops through all the addresses available,
regardless of address family. If the destination resolves to an IPv4
address, it will use an AF_INET socket. Similarly, if it resolves to
IPv6, an AF_INET6 socket is used. Observe that there is no hardcoded
reference to a particular address family. The code works even if
getaddrinfo() returns addresses that are not IPv4/v6.
    struct addrinfo hints, *res, *res0;
    int error;
    int s;
    const char *cause = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = PF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    error = getaddrinfo("www.kame.net", "http", &hints, &res0);
    if (error) {
        errx(1, "%s", gai_strerror(error));
        /*NOTREACHED*/
    }
    s = -1;
    for (res = res0; res; res = res->ai_next) {
        s = socket(res->ai_family, res->ai_socktype,
            res->ai_protocol);
        if (s < 0) {
            cause = "socket";
            continue;
        }

        if (connect(s, res->ai_addr, res->ai_addrlen) < 0) {
            cause = "connect";
            close(s);
            s = -1;
            continue;
        }

        break;  /* okay we got one */
    }
    if (s < 0) {
        err(1, "%s", cause);
        /*NOTREACHED*/
    }
    freeaddrinfo(res0);
They are still using the BSD API, but didn't update their version, though.
Have a look at the other utility libraries there; this is their design principle.
http://tools.ietf.org/html/draft-ietf-tsvwg-sctpsocket-19
It would appear that the multihoming issue could get better, assuming the one-to-many API gets deployed. I haven't had a chance to play around with it yet, but libsctp appears to be the Linux implementation.
"If a bind() is not called prior to a sendmsg() call that initiates a
new association, the system picks an ephemeral port and will choose
an address set equivalent to binding with a wildcard address. One of
those addresses will be the primary address for the association.
This automatically enables the multi-homing capability of SCTP."
Also, sctp_getpaddrs looks promising, returning all of the addresses of an existing endpoint. How does that work in practice? Not sure...
And as usual, thanks for the insights!
If we truly want reliable network applications, the API for opening a socket should be something along the lines of:
    Stream myStream;
    try {
        myStream = Network.Connect("somehost.foobar.com", "http");
    } catch (NetworkException) {
        // Oops, it didn't work. Deal with it somehow.
    }
    DoSomething(myStream);
Anything more complicated will result in a handful of well-behaved applications and a vast multitude of crap.
Does it fix "what went wrong"?
I think that one of the main issues is that the current networking "protocol suite" is oriented towards moving data between interfaces of computers, instead of allowing applications to communicate (the issue is not just the sockets interface...)
"However, most of the applications will perform the gethostbyname() call which returns one of the addresses (regardless of whether it’s reachable or not) that is then passed to the connect() call. "
The long deprecated gethostbyname() can indeed return multiple IP addresses if the DNS reply has multiple A records. The main reason gethostbyname() was deprecated in favor of getaddrinfo() is because of the lack of IPv6 support in the former and not because the former could return only one IP address. The link you gave to explain the issue with getaddrinfo() is invalid now. If there is an alternate link please share it.