What Went Wrong: the Socket API

You might think that the lack of a decent session layer in the TCP/IP protocol suite is the main culprit for our reliance on IP multihoming and the related explosion of the IP routing tables. Unfortunately, we have an even bigger problem: the Berkeley Socket API, which is around 40 years old and used in almost all TCP/IP software implementations and clients (including high-level scripting languages like Perl or Python).

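To establish a client-to-server connection using the Socket API, you have to perform these calls:

  • getaddrinfo() to translate the host name into a list of IP addresses;
  • socket() to create a socket;
  • connect() to connect the socket to one of those IP addresses.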

The set of calls you have to perform is not surprising; the Socket API is older than DNS. However, the reliance on IP addresses passed around inside the application and the total disconnect between name resolution and session establishment are a disaster.
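To make the complaint concrete, here is a minimal C sketch of that call sequence as a naive client might write it. The helper name connect_first_address() is made up for this illustration (it is not part of any standard API). The point is that raw addresses cross the application boundary between getaddrinfo() and connect(), and nothing forces the programmer to look past the first one.

```c
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): resolve a name, then hand the
 * first returned address to connect(). Returns a connected socket
 * descriptor, or -1 on failure.
 */
int connect_first_address(const char *host, const char *service)
{
    struct addrinfo hints, *res;
    int rc, s;

    assert(host != NULL && service != NULL);

    /* Step 1: name resolution. The application now holds raw addresses. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    rc = getaddrinfo(host, service, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Step 2: create a socket matching the first entry in the list. */
    s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);

    /* Step 3: connect to that one address; res->ai_next is never examined. */
    if (s >= 0 && connect(s, res->ai_addr, res->ai_addrlen) < 0) {
        close(s);
        s = -1;
    }

    freeaddrinfo(res);
    return s;
}
```

If the first address in the list happens to be down, this function fails even when every other address would have worked.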

Just to give you an example: you might have a server farm offering a service, and describe it in DNS with numerous A records for the same name (for example, scs.msg.yahoo.com).

A sample DNS entry for scs.msg.yahoo.com on November 19th 2022
nslookup scs.msg.yahoo.com.
Server:		1.1.1.1
Address:	1.1.1.1#53

Non-authoritative answer:
scs.msg.yahoo.com	canonical name = vcs0.msg.g03.yahoodns.net.
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.114.52
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.114.76
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.114.68
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.114.97
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.121.49
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.120.52
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.114.81
Name:	vcs0.msg.g03.yahoodns.net
Address: 66.196.121.40

The DNS entry for scs.msg.yahoo.com looks awesome, but doesn’t help a bit unless the client application uses that information. In reality, most applications:

  • Perform the getaddrinfo() call which returns the list of addresses (regardless of whether they are reachable or not)
  • Use the first address (or all of them in sequence) in the connect() call (happy eyeballs implementations are an obvious exception).

If the DNS lookup returns a temporarily unreachable IP address and the application doesn't try anything else, you're doomed.

Obviously you could reinvent happy eyeballs. You could make DNS calls yourself using the resolver library (or parse the information returned by getaddrinfo()), collect all IP addresses and try to connect to more than one of them. Web browsers usually do that quite well, or we would quickly stop using them.
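For illustration, a much-simplified C sketch of that approach follows. It is not a faithful Happy Eyeballs implementation (the real algorithm staggers the attempts and prefers IPv6); it simply starts a non-blocking connect() to every address at once and keeps whichever socket connects first. The names connect_any() and MAX_ATTEMPTS, and the per-wait timeout parameter, are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_ATTEMPTS 16   /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: race connection attempts to all addresses of
 * host/service. timeout_ms bounds each wait for progress, not the total
 * time. Returns a connected descriptor in blocking mode, or -1.
 */
int connect_any(const char *host, const char *service, int timeout_ms)
{
    struct addrinfo hints, *res0, *res;
    struct pollfd fds[MAX_ATTEMPTS];
    int n = 0, winner = -1, i, rc;

    assert(host != NULL && service != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    rc = getaddrinfo(host, service, &hints, &res0);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Launch one attempt per address without waiting for any of them. */
    for (res = res0; res != NULL && n < MAX_ATTEMPTS; res = res->ai_next) {
        int s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (s < 0)
            continue;
        fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK);
        if (connect(s, res->ai_addr, res->ai_addrlen) == 0 ||
            errno == EINPROGRESS) {
            fds[n].fd = s;
            fds[n].events = POLLOUT;
            fds[n].revents = 0;
            n++;
        } else {
            close(s);                    /* failed immediately */
        }
    }
    freeaddrinfo(res0);

    /* Wait until one attempt succeeds or all of them have failed. */
    while (winner < 0 && n > 0) {
        if (poll(fds, (nfds_t)n, timeout_ms) <= 0)
            break;                       /* timeout or poll error */
        for (i = 0; i < n && winner < 0; ) {
            int err = 0;
            socklen_t len = sizeof(err);
            if (fds[i].revents == 0) {
                i++;
                continue;
            }
            /* SO_ERROR tells us whether the finished attempt succeeded. */
            if (getsockopt(fds[i].fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                winner = fds[i].fd;
            else
                close(fds[i].fd);
            fds[i] = fds[--n];           /* drop slot i; do not advance i */
        }
    }

    for (i = 0; i < n; i++)              /* abandon the slower attempts */
        close(fds[i].fd);
    if (winner >= 0)
        fcntl(winner, F_SETFL, fcntl(winner, F_GETFL, 0) & ~O_NONBLOCK);
    return winner;
}
```

Even this stripped-down version needs non-blocking sockets, poll() and SO_ERROR handling, which is exactly the kind of detail most application programmers never want to see.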

You could even implement a connection-failure cache listing those addresses that were recently unreachable to speed up the future session setup process. But let’s be realistic: how many application programmers do you know who really understand the intricacies of TCP/IP (let’s lower the bar: how many of them could use the resolver library)? Most of them want to get their job done and end up using recipes from sources like Network Programming with Perl.
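A connection-failure cache does not have to be complicated. The sketch below is one possible shape for it, assuming a small fixed-size table keyed by the textual IP address. All names (struct failure_cache, cache_mark_failed(), cache_recently_failed()) and the sizes are invented for this example, and the caller passes in the current time so the logic stays easy to test.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS   32   /* arbitrary table size for this sketch */
#define ADDR_TEXT_LEN 46   /* room for a textual IPv6 address plus NUL */

struct failure_cache {
    char   addr[CACHE_SLOTS][ADDR_TEXT_LEN];
    time_t failed_at[CACHE_SLOTS];
    int    next;           /* next slot to overwrite (round-robin) */
};

/* Start with an empty table. */
void cache_init(struct failure_cache *c)
{
    memset(c, 0, sizeof(*c));
}

/* Remember that a connection attempt to addr failed at time now. */
void cache_mark_failed(struct failure_cache *c, const char *addr, time_t now)
{
    int i, slot = -1;

    assert(c != NULL && addr != NULL);

    /* Refresh the existing entry for this address if there is one. */
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (strcmp(c->addr[i], addr) == 0) {
            slot = i;
            break;
        }
    }
    /* Otherwise overwrite the oldest slot in round-robin order. */
    if (slot < 0) {
        slot = c->next;
        c->next = (c->next + 1) % CACHE_SLOTS;
    }
    strncpy(c->addr[slot], addr, ADDR_TEXT_LEN - 1);
    c->addr[slot][ADDR_TEXT_LEN - 1] = '\0';
    c->failed_at[slot] = now;
}

/* Return 1 if addr failed less than max_age seconds before now, else 0. */
int cache_recently_failed(const struct failure_cache *c, const char *addr,
                          time_t now, time_t max_age)
{
    int i;

    assert(c != NULL && addr != NULL);

    for (i = 0; i < CACHE_SLOTS; i++) {
        if (c->addr[i][0] != '\0' && strcmp(c->addr[i], addr) == 0)
            return (now - c->failed_at[i]) < max_age;
    }
    return 0;
}
```

A client would consult cache_recently_failed() before each connect() and push suspect addresses to the end of its candidate list instead of skipping them outright, since the cached verdict might already be stale.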

It looks like people writing Yahoo Messenger knew what they were doing; otherwise it wouldn’t make sense to have numerous A records for their IM servers.

The name-to-address mapping problem should have been abstracted into the OS kernel (or a system library) decades ago (at the latest when DNS became widespread), and the applications should have been kept blissfully unaware of the complexities; the connect() call should accept a hostname and do the rest behind the scenes. Even Microsoft got that right with the NetBIOS API. But then, what could you expect: the Socket API is a direct mapping to the TCP/IP protocol stack (where DNS is just one of the applications). To make matters worse, it looks like we missed another opportunity to get the networking API right: according to Drew DeVault, the Plan 9 operating system treated network connections like files, but those ideas were never ported to Linux.

With the sorry state of the Socket API, the best you can do if your service is reachable through multiple IP addresses is to randomize the DNS responses (this will give you some limited load sharing) and adjust the list of A records in the DNS responses based on server availability (while hoping that the intermediate DNS servers or the clients will not ignore the TTL settings in the DNS responses). As the last resort, you can make sure all the IP addresses are always reachable, which brings us back to where we started: IP multihoming. You could also use a load balancer and a single (obviously multihomed) IP address.
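The DNS-side workaround (publish only the servers that are currently alive, in random order) boils down to something like the following C sketch. It is only an illustration of the selection logic, not how any particular DNS server implements it; struct server, build_answer() and the health flag are hypothetical, and the health checks that would set the flag are left out.

```c
#include <assert.h>
#include <stdlib.h>

struct server {
    const char *address;   /* textual IP address published as an A record */
    int healthy;           /* nonzero if the last health check succeeded */
};

/*
 * Hypothetical helper: copy the addresses of healthy servers into out[]
 * (which must have room for n entries), shuffle them, and return how
 * many were copied.
 */
size_t build_answer(const struct server *pool, size_t n, const char **out)
{
    size_t count = 0, i;

    assert(pool != NULL && out != NULL);

    /* Availability-based filtering: unhealthy servers are left out. */
    for (i = 0; i < n; i++) {
        if (pool[i].healthy)
            out[count++] = pool[i].address;
    }

    /* Fisher-Yates shuffle; rand() is good enough for a sketch. */
    for (i = count; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        const char *tmp = out[i - 1];
        out[i - 1] = out[j];
        out[j] = tmp;
    }
    return count;
}
```

Whether the clients ever see the updated list still depends on everyone between the server and the application honoring the TTL.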

Revision History

2016-07-08
gethostbyname is obsolete. Also added a reference to happy eyeballs which got popular after this blog post was written.
2022-11-19
Added a pointer to Plan9 blog post, removed obsolete links, and polished the text a bit.

18 comments:

  1. I agree this API is totally brain-dead. It is essentially synchronous (well, you can use non-blocking sockets for connect/send/recv/...) whereas it should be event-driven. The DNS API doesn't provide a way to resolve names asynchronously (you can use threads with "_r" functions, but the programs quickly become a big mess). It's also a big mess when you want to do low-level things like setting the TTL or other funny things, which are often not portable.
  2. Any thoughts on SCTP?
  3. getaddrinfo provides the correct loop. Please do not copy and waste programs from 1983.
  4. And PLEASE, pretty please, do link to documentation pages! Wikipedia is NOT a trusted source for programmers. If you really program without reading the manual ... please do not blog about it.
  5. Dear Guest!

    First of all, I love good discussion ... but prefer to have it with people who have at least a unique (even if fictitious) identity, so I would appreciate if you could use a unique identifier for comments that might evolve into an (hopefully interesting) discussion.

    Now for the getaddrinfo: I don't understand what you mean by the "correct loop". While getaddrinfo is supposed to provide more than one address, it looks like that's not always the case. What were you trying to say?
  6. SCTP looks good (at least from the distance), but is unfortunately totally useless (also because of broken Socket API :).
  7. Please note that this is NOT a programming blog. I am not trying to teach anyone how to program client-server architectures in C (or any other programming language); there are millions of people better qualified to do that.

    What I'm saying is that the Socket API is conceptually broken and that the handling of L3 addresses that the applications are forced to do severely hinders our ability to address problems we're having with the exploding Internet.

    My reference to the Wikipedia is not meant to give a programmer a pointer to a reference documentation (which, BTW, differs between operating systems), but to give some background information to those that are not familiar with the structure of the Socket API (not everyone has been blessed with exposure to C++ programming), so that they could better understand my arguments.

    If someone is trying to learn programming reading my blog posts, he's found a wrong source.
  8. There's always something more to be said :) Last but not least, the Wikipedia article about Berkeley sockets does provide all the relevant references and external links at the end of the article, so those that need a trusted official source can find them ;)
  9. Ivan,

    I cannot speak for Windows or Linux systems, but in FreeBSD (and other BSD systems) getaddrinfo does the job properly.

    Of course, since a name can have several IP addresses, you need to loop over them until you get a connection. You can even use wildcards as parameters to getaddrinfo so you'll be able to try all addresses in all address families available for that name. It's up to the programmer to do it properly.

    From "man getaddrinfo" (FreeBSD 7.1):

    The following code tries to connect to ``www.kame.net'' service ``http''
    via a stream socket. It loops through all the addresses available,
    regardless of address family. If the destination resolves to an IPv4
    address, it will use an AF_INET socket. Similarly, if it resolves to
    IPv6, an AF_INET6 socket is used. Observe that there is no hardcoded
    reference to a particular address family. The code works even if
    getaddrinfo() returns addresses that are not IPv4/v6.

    struct addrinfo hints, *res, *res0;
    int error;
    int s;
    const char *cause = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = PF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    error = getaddrinfo("www.kame.net", "http", &hints, &res0);
    if (error) {
        errx(1, "%s", gai_strerror(error));
        /*NOTREACHED*/
    }
    s = -1;
    for (res = res0; res; res = res->ai_next) {
        s = socket(res->ai_family, res->ai_socktype,
            res->ai_protocol);
        if (s < 0) {
            cause = "socket";
            continue;
        }

        if (connect(s, res->ai_addr, res->ai_addrlen) < 0) {
            cause = "connect";
            close(s);
            s = -1;
            continue;
        }

        break; /* okay we got one */
    }
    if (s < 0) {
        err(1, "%s", cause);
        /*NOTREACHED*/
    }
    freeaddrinfo(res0);

    They are still using the BSD API, but didn't update their version, though.
  10. I can't agree with this. libc provides you with basic functionality for a simple reason: the more functionality you put there, the more bugs you create and the more testing is required; this will have a performance impact as well. libc is not the right place to put this additional functionality.
    Have a look at the other utility libraries out there; this is their design principle.
  11. Do you believe this offers hope?

    http://tools.ietf.org/html/draft-ietf-tsvwg-sctpsocket-19

    It would appear that the multihoming issue could get better, assuming the one-to-many API gets deployed. I haven't had a chance to play around with it yet, but libsctp appears to be the Linux implementation.

    "If a bind() is not called prior to a sendmsg() call that initiates a
    new association, the system picks an ephemeral port and will choose
    an address set equivalent to binding with a wildcard address. One of
    those addresses will be the primary address for the association.
    This automatically enables the multi-homing capability of SCTP."

    Also, sctp_getpaddrs looks promising, returning all of the addresses of an existing endpoint. Now how that works in practice? Not sure...

    And as usual, thanks for the insights!
  12. The modified Socket API is already implemented in Linux and not used for a simple reason: you have to indicate which transport protocol to use in your application and there's no push to change existing applications.
  13. This assumes the programmer knows what a network is and how it works. Very few among the best of programmers do, and most of those are too lazy to do this much work to create a robust application.

    If we truly want reliable network applications, the API for opening a socket should be something along the lines of:

    Stream myStream;

    try {
    myStream = Network.Connect("somehost.foobar.com", "http");
    } catch (NetworkException) {
    // Oops, it didn't work. Deal with it somehow.
    }

    DoSomething(myStream);

    Anything more complicated will result in a handful of well-behaved applications and a vast multitude of crap.
  14. Thank you, Phil! A fantastic summary of what I’ve been trying to say.
  15. I'd be interested to hear what you think about the recent extension to sockets called ZeroMQ.
    Does it fix "what went wrong" ?
  16. I agree that the sockets API doesn't facilitate building distributed applications at all. A decent API would be oriented towards Inter Process Communication (IPC), and provide primitives to allocate flows between applications by name, allowing a certain QoS specification (for example: allocate me a flow to application B, and data should be delivered in order, reliably, maximum this delay). How to honour this request is the business of the "networking stack", not the application.

    I think that one of the main issues is that the current networking "protocol suite" is oriented towards moving data between interfaces of computers, instead of allowing applications to communicate (the issue is not just the sockets interface...)
  17. Ivan, this is an old post, but I would like to point out that the following statement is patently false:

    "However, most of the applications will perform the gethostbyname() call which returns one of the addresses (regardless of whether it’s reachable or not) that is then passed to the connect() call. "

    The long deprecated gethostbyname() can indeed return multiple IP addresses if the DNS reply has multiple A records. The main reason gethostbyname() was deprecated in favor of getaddrinfo() is because of the lack of IPv6 support in the former and not because the former could return only one IP address. The link you gave to explain the issue with getaddrinfo() is invalid now. If there is an alternate link please share it.
    Replies
    1. (Somewhat) fixed the text. Thank you!