Worth Reading: Notes on Distributed Systems

A long while ago I stumbled upon an excellent resource describing why distributed systems are hard (which is what I was claiming years ago, when OpenFlow was at the peak of the hype cycle ;)… lost it, and found it again a few weeks ago.

If you want to understand why networking is hard (apart from the obvious MacGyver reasons) read it several times; here are just a few points:

  • Distributed systems are hard because they fail more often;
  • Writing robust distributed systems costs more than writing robust single-machine systems;
  • Coordination is hard;
  • Find ways to be partially available;

The one thing I’d add to the list is “you have to deal with Byzantine failures”.
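
To make the “find ways to be partially available” point a bit more concrete, here is a minimal Python sketch (mine, not taken from the linked notes): a fan-out query that returns whatever answered before a deadline and reports the rest as missing, instead of failing the whole request. The names query_shard and query_all, the shard IDs, and the delays are all made up for illustration.

```python
import concurrent.futures as cf
import time


def query_shard(shard_id, delay):
    """Hypothetical shard lookup; `delay` simulates a slow or stuck node."""
    time.sleep(delay)
    return f"rows-from-shard-{shard_id}"


def query_all(shard_delays, timeout):
    """Fan out to every shard and wait at most `timeout` seconds.

    Returns (results, missing): answers from shards that replied in time,
    and the IDs of shards that were too slow or raised an error.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(shard_delays))
    futures = {
        pool.submit(query_shard, shard_id, delay): shard_id
        for shard_id, delay in shard_delays.items()
    }
    done, _not_done = cf.wait(futures, timeout=timeout)

    results, missing = {}, []
    for future, shard_id in futures.items():
        if future in done and future.exception() is None:
            results[shard_id] = future.result()
        else:
            missing.append(shard_id)

    # Do not block on stragglers; their worker threads finish in the background.
    pool.shutdown(wait=False)
    return results, missing
```

Calling query_all({"a": 0.0, "b": 0.0, "c": 1.0}, timeout=0.3) should come back with the answers from a and b and list c as missing. What the caller does with an incomplete answer (serve it, mark it as degraded, retry later) is the hard part, and the sketch deliberately leaves it out.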

Next time someone tells you “networking engineers are so obtuse, we solved $whatever in some other domain in no time” point them to this document… not that it would help: RFC 1925 rule 4 cannot be beaten.

7 comments:

  1. Obligatory mention: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
  2. So does that mean that OpenFlow is the solution to the problem?
    Replies
    1. OpenFlow might be a solution to a problem... now we just have to find the suitable problem ;))
    2. You can't be wrong with OpenFlow because Google runs it in production. So that proves it works.
    3. If you think Google is solving the problems you have, you'll get the results you deserve ;)
  3. The real issue is timing.

    When we learn to program, we usually learn to program a single-threaded solution, running on a single machine, with a single core. Keep it simple. We know exactly in which order stuff happens. The way most people think is very focused on this single-thread, single-machine model.

    When we go distributed (multi-thread, multi-CPU or multi-machine), we need to keep timing in mind. In what order does stuff happen? In which order can it happen? What's the worst that can happen? When I send a message, does it arrive? What do I do if it doesn't arrive? How can I know it has arrived? What do I do if I have to wait? How long should I wait? When I send a message, does the receiver (already) know what I expect it to know? Etc., etc. There are zillions of new problems and new scenarios I need to keep in the back of my head. Just using semaphores in threads or using TCP in a network is not the solution.

    This timing stuff is hard. IMHO it's the core of the problem. Most people can't program, even if they spent 10 years trying to learn. And I think the majority of decent programmers will not be able to write good distributed software, no matter what. On top of that, there still seems to be very little educational material that teaches people to write distributed programs (beyond the basic stuff that's in every book).

    And that's a story about the people who build networking equipment. When you talk about the people deploying that equipment, life probably won't get better.
  4. Something black/white proponents of SDx and microservices architectures will learn the hard way. Oh wait, by that time they'll have moved on and some other poor sod will have to cope with it.
    Repeatedly asking "why and for whom?", realizing that it is "a tool for a purpose", and understanding RFC 1925 continue to prove difficult.
