What went wrong: the Socket API

You might think that the lack of a decent session layer in the TCP/IP protocol suite is the main culprit for our reliance on IP multihoming and the related explosion of the IP routing tables. Unfortunately, we have an even bigger problem: the Berkeley Socket API, which is over 25 years old and used in almost all TCP/IP software implementations (including high-level scripting languages like Perl).

To establish a client-to-server connection using Socket API you have to perform these calls:
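
The classic sequence is name resolution, socket creation and session establishment. The following is only a minimal sketch of that sequence, assuming an IPv4-only client; the helper name connect_first_address() is hypothetical and the error handling is reduced to returning -1.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Classic pre-getaddrinfo() sequence: resolve, create a socket, connect.
 * connect_first_address() is a hypothetical helper name, IPv4 only.
 * Returns a connected socket descriptor, or -1 on failure. */
int connect_first_address(const char *hostname, unsigned short port)
{
    struct hostent *he;
    struct sockaddr_in sin;
    int s;

    /* 1. Name resolution: the application gets raw L3 addresses back. */
    he = gethostbyname(hostname);
    if (he == NULL || he->h_addrtype != AF_INET || he->h_addr_list[0] == NULL)
        return -1;

    /* 2. Socket creation, with the address family hardcoded. */
    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    /* 3. Session establishment: only the first address is ever tried. */
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    memcpy(&sin.sin_addr, he->h_addr_list[0], sizeof(sin.sin_addr));
    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

Note how the address picked in step 1 is carried by the application into step 3; nothing in the API ties the hostname to the session.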

The set of calls you have to perform is not surprising; the Socket API is older than DNS. However, the reliance on L3 addresses passed around inside the application and the total disconnect between name resolution and session establishment are a disaster.

Just to give you an example: you might have a server farm offering a service (for example, scs.msg.yahoo.com or www.X.google.com) properly set up in DNS with numerous A records for the same name. However, most applications call gethostbyname() and take just one of the addresses it returns (typically the first, regardless of whether it’s reachable), which is then passed to the connect() call. If that address is temporarily unreachable, you’re doomed.

When properly implemented, the getaddrinfo() call could return more than one address associated with the hostname … but that’s not always the case.

Obviously you could write better application code. You could make DNS calls yourself using the resolver library (or parse the information returned by getaddrinfo()), collect all IP addresses and try to connect to more than one of them. Telnet clients usually do that quite well.

You could even implement a connection-failure cache listing those addresses that were recently unreachable to speed up the future session setup process. But let’s be realistic: how many application programmers do you know that really understand the intricacies of TCP/IP (let’s lower the bar: how many of them could use the resolver library)? Most of them want to get their job done and end up using recipes from sources like Network Programming with Perl.
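
A connection-failure cache does not have to be elaborate. The sketch below is one possible shape, not a recommendation: all names (fail_cache, fail_cache_add, fail_cache_is_bad), the table size and the 30-second hold-down time are assumptions made for illustration. Addresses are keyed by their textual form, and the caller passes the current time explicitly.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define FAIL_CACHE_SLOTS 16   /* assumed table size */
#define FAIL_CACHE_TTL   30   /* assumed hold-down time in seconds */

struct fail_entry {
    char   addr[64];   /* textual IP address, e.g. from inet_ntop() */
    time_t failed_at;  /* when the last connect() to it failed */
};

/* Must be zero-initialized (for example with memset) before first use. */
struct fail_cache {
    struct fail_entry slot[FAIL_CACHE_SLOTS];
    unsigned next;     /* next slot to overwrite (round-robin) */
};

/* Remember that connecting to addr failed at time now. */
void fail_cache_add(struct fail_cache *c, const char *addr, time_t now)
{
    unsigned i;
    struct fail_entry *e;

    /* Refresh an existing entry instead of duplicating it. */
    for (i = 0; i < FAIL_CACHE_SLOTS; i++) {
        if (strcmp(c->slot[i].addr, addr) == 0) {
            c->slot[i].failed_at = now;
            return;
        }
    }
    /* Otherwise overwrite the oldest slot in round-robin order. */
    e = &c->slot[c->next];
    c->next = (c->next + 1) % FAIL_CACHE_SLOTS;
    snprintf(e->addr, sizeof(e->addr), "%s", addr);
    e->failed_at = now;
}

/* Return 1 if addr failed within the last FAIL_CACHE_TTL seconds. */
int fail_cache_is_bad(const struct fail_cache *c, const char *addr, time_t now)
{
    unsigned i;

    for (i = 0; i < FAIL_CACHE_SLOTS; i++) {
        if (strcmp(c->slot[i].addr, addr) == 0)
            return now - c->slot[i].failed_at < FAIL_CACHE_TTL;
    }
    return 0;
}
```

A client would consult fail_cache_is_bad() before each connect() attempt and try the recently failed addresses last, rather than skipping them outright.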

It looks like people writing Yahoo Messenger knew what they were doing; otherwise it wouldn’t make sense to have numerous A records for their IM servers.

The name-to-address mapping problem should have been abstracted into the OS kernel (or system library) decades ago (at the latest when DNS became widespread) and the applications should have been kept blissfully unaware of the complexities; the connect() call should accept a hostname and do the rest behind the scenes. Even Microsoft got that right with the NetBIOS API. But then, what could you expect: the Socket API is a direct mapping to the TCP/IP protocol stack (where DNS is just one of the applications).
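
No such call exists in the Socket API, but a user-space approximation is easy to sketch. The helper below, with the hypothetical name connect_by_name(), wraps getaddrinfo() and tries every returned address in turn; it is a minimal illustration under those assumptions, with no timeouts, no failure cache and no parallel attempts.

```c
#define _POSIX_C_SOURCE 200112L  /* expose getaddrinfo() under strict ISO C */

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: connect to host/service by name, trying every
 * address the resolver returns. Returns a connected socket or -1. */
int connect_by_name(const char *host, const char *service)
{
    struct addrinfo hints, *res, *res0;
    int s = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whatever resolves */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, service, &hints, &res0) != 0)
        return -1;

    for (res = res0; res != NULL; res = res->ai_next) {
        s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (s < 0)
            continue;                 /* family not usable, try the next */
        if (connect(s, res->ai_addr, res->ai_addrlen) == 0)
            break;                    /* connected */
        close(s);                     /* unreachable, try the next */
        s = -1;
    }
    freeaddrinfo(res0);
    return s;
}
```

The application never touches an L3 address here: a call such as connect_by_name("www.example.com", "http") takes two strings and hands back a connected socket.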

With the sorry state of the Socket API, the best you can do if your service is reachable through multiple IP addresses is to randomize the DNS responses (this will give you some limited load sharing), adjust the list of A records in the DNS responses based on server availability (while hoping that the intermediate DNS servers or the clients will not ignore the TTL settings in the DNS responses) … and as the last resort make sure all the IP addresses are always reachable, which brings us back to where we started: IP multihoming. You could also use a load balancer and a single (obviously multihomed) IP address.

16 comments:

  1. I agree this API is totally brain-dead. It is essentially synchronous (well, you can use non-blocking sockets for connect/send/rcv/...) whereas it should be event-driven. The DNS API doesn't provide a way to resolve names asynchronously (you can use threads with "_r" functions, but the programs quickly become a big mess). It's also a big mess when you want to do low-level things like setting the TTL or other funny things, which are often not portable.

  2. Any thoughts on SCTP?

  3. getaddrinfo provides the correct loop. Please do not copy and paste programs from 1983.

  4. And PLEASE, pretty please, do link to documentation pages! Wikipedia is NOT a trusted source for programmers. If you really program without reading the manual ... please do not blog about it.

  5. Ivan Pepelnjak 25 August, 2009 15:22

    Dear Guest!

    First of all, I love a good discussion ... but prefer to have it with people who have at least a unique (even if fictitious) identity, so I would appreciate it if you could use a unique identifier for comments that might evolve into a (hopefully interesting) discussion.

    Now for the getaddrinfo: I don't understand what you mean by the "correct loop". While getaddrinfo is supposed to provide more than one address, it looks like that's not always the case. What were you trying to say?

  6. Ivan Pepelnjak 25 August, 2009 15:23

    SCTP looks good (at least from a distance), but is unfortunately totally useless (also because of the broken Socket API :).

  7. Ivan Pepelnjak 25 August, 2009 15:27

    Please note that this is NOT a programming blog. I am not trying to teach anyone how to program client-server architectures in C (or any other programming language); there are millions of people better qualified to do that.

    What I'm saying is that the Socket API is conceptually broken and that the handling of L3 addresses that the applications are forced to do severely hinders our ability to address problems we're having with the exploding Internet.

    My reference to the Wikipedia is not meant to give a programmer a pointer to a reference documentation (which, BTW, differs between operating systems), but to give some background information to those that are not familiar with the structure of the Socket API (not everyone has been blessed with exposure to C++ programming), so that they could better understand my arguments.

    If someone is trying to learn programming by reading my blog posts, he's found the wrong source.

  8. Ivan Pepelnjak 25 August, 2009 15:30

    There's always something more to be said :) Last but not least, the Wikipedia article about Berkeley sockets does provide all the relevant references and external links at the end of the article, so those that need a trusted official source can find them ;)

  9. Ivan,

    I cannot speak for Windows or Linux systems, but in FreeBSD (and other BSD systems) getaddrinfo does the job properly.

    Of course, since a name can have several IP addresses, you need to loop over them until you get a connection. You can even use wildcards as parameters to getaddrinfo, so you'll be able to try all addresses in all address families available for that name. It's up to the programmer to do it properly.

    From "man getaddrinfo" (FreeBSD 7.1):

    The following code tries to connect to ``www.kame.net'' service ``http''
    via a stream socket. It loops through all the addresses available,
    regardless of address family. If the destination resolves to an IPv4
    address, it will use an AF_INET socket. Similarly, if it resolves to
    IPv6, an AF_INET6 socket is used. Observe that there is no hardcoded
    reference to a particular address family. The code works even if
    getaddrinfo() returns addresses that are not IPv4/v6.

    struct addrinfo hints, *res, *res0;
    int error;
    int s;
    const char *cause = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = PF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    error = getaddrinfo("www.kame.net", "http", &hints, &res0);
    if (error) {
            errx(1, "%s", gai_strerror(error));
            /*NOTREACHED*/
    }
    s = -1;
    for (res = res0; res; res = res->ai_next) {
            s = socket(res->ai_family, res->ai_socktype,
                res->ai_protocol);
            if (s < 0) {
                    cause = "socket";
                    continue;
            }

            if (connect(s, res->ai_addr, res->ai_addrlen) < 0) {
                    cause = "connect";
                    close(s);
                    s = -1;
                    continue;
            }

            break; /* okay we got one */
    }
    if (s < 0) {
            err(1, "%s", cause);
            /*NOTREACHED*/
    }
    freeaddrinfo(res0);

    They are still using the BSD API, but didn't update their version, though.

  10. I can't agree with this. libc provides only basic functionality for a simple reason: the more functionality you put there, the more bugs you create and the more testing is required, and this has a performance impact as well. libc is not the right place for this additional functionality.
    Have a look at the other utility libraries out there; this is their design principle.

  11. SCTP question Guest 27 August, 2009 18:07

    Do you believe this offers hope?

    http://tools.ietf.org/html/draft-ietf-tsvwg-sctpsocket-19

    It would appear that the multihoming issue could get better, assuming the one-to-many API gets deployed. I haven't had a chance to play around with it yet, but libsctp appears to be the Linux implementation.

    "If a bind() is not called prior to a sendmsg() call that initiates a
    new association, the system picks an ephemeral port and will choose
    an address set equivalent to binding with a wildcard address. One of
    those addresses will be the primary address for the association.
    This automatically enables the multi-homing capability of SCTP."

    Also, sctp_getpaddrs looks promising, returning all of the addresses of an existing endpoint. Now how that works in practice? Not sure...

    And as usual, thanks for the insights!

  12. Ivan Pepelnjak 30 August, 2009 10:10

    The modified Socket API is already implemented in Linux and not used for a simple reason: you have to indicate which transport protocol to use in your application and there's no push to change existing applications.

  13. This assumes the programmer knows what a network is and how it works. Very few among the best of programmers do, and most of those are too lazy to do this much work to create a robust application.

    If we truly want reliable network applications, the API for opening a socket should be something along the lines of:

    Stream myStream;

    try {
        myStream = Network.Connect("somehost.foobar.com", "http");
    } catch (NetworkException) {
        // Oops, it didn't work. Deal with it somehow.
    }

    DoSomething(myStream);

    Anything more complicated will result in a handful of well-behaved applications and a vast multitude of crap.

  14. Ivan Pepelnjak 02 April, 2010 11:21

    Thank you, Phil! A fantastic summary of what I’ve been trying to say.

  15. I'd be interested to hear what you think about the recent extension to sockets called ZeroMQ.
    Does it fix "what went wrong" ?

  16. I agree that the sockets API doesn't facilitate building distributed applications at all. A decent API would be oriented towards Inter Process Communication (IPC), and provide primitives to allocate flows between applications by name, allowing a certain QoS specification (for example: allocate me a flow to application B, and data should be delivered in order, reliably, maximum this delay). How to honour this request is the business of the "networking stack", not the application.

    I think that one of the main issues is that the current networking "protocol suite" is oriented towards moving data between interfaces of computers, instead of allowing applications to communicate (the issue is not just the sockets interface...)



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.