Does dedicated iSCSI infrastructure make sense?

Chris Marget recently asked a really interesting question:

I’ve encountered an environment where the iSCSI networks are built just like FC networks: Multipathing software in use on servers and storage, switches dedicated to “SAN A” and “SAN B” VLANs, and full isolation of paths (redundant paths) between server and storage. I understand creating a dedicated iSCSI VLAN, but why would you need two? Isn’t the whole thing running on top of TCP? Am I missing something?

Well, it actually makes sense in some mission-critical environments.

First, there’s the layer 8-10 part: things you do yourself will never fail (after all, you’re doing them). Things that are far enough away from you (and that you’re therefore totally clueless about) will never fail either (after all, how could a unicorn-powered reality-distortion magic black box ever fail?) – see also the Fallacies of Distributed Computing. Things that are adjacent to what you do are the scary part: you know enough about them to know they could fail, and you’re absolutely sure the people running them can never be as good as you are ;) That’s why programmers worry about server failures while ignoring the quality of their own code, and remain completely oblivious to network failures.

Now for more realistic arguments.

Compared to networking, storage is serious business. If you drop a user session (or a phone call), nobody but the affected user cares. If that user happens to be your boss or the CEO (or you drop thousands of sessions for minutes or hours), you might feel the heat; otherwise, everyone accepts that just-good-enough contains some elements of shit-happens.

Storage is a totally different beast. A SCSI failure arriving at just the right time can easily result in the famous colored screen (or some other kernel panic), bringing down an application server or an operating system, not just a session. If that server happens to be your database or SAP server, someone has to start polishing their résumé.

Furthermore, we (= networking engineers) don’t really care whether user sessions experience data corruption. After all, if we did, we wouldn’t be using 16-bit checksums in IP and TCP; those checksums fail to detect errors more often than one would think.
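
To see how weak that protection is, here’s a minimal sketch (plain Python, with a made-up eight-byte payload) of the RFC 1071 Internet checksum used by IP and TCP. Because the checksum is just a ones’-complement sum of 16-bit words, swapping any two of those words produces exactly the same value, so that kind of corruption sails through undetected.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit words (as used by IP and TCP)."""
    if len(data) % 2:
        data += b"\x00"                                # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)       # fold the carry back in
    return ~total & 0xFFFF

original  = b"ABCDEFGH"
corrupted = b"CDABEFGH"   # first two 16-bit words swapped by a (hypothetical) buggy forwarding path

# Same checksum -> the corruption is invisible to IP/TCP
print(internet_checksum(original) == internet_checksum(corrupted))   # True
```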

Storage experts developing iSCSI realized that wasn’t good enough and added application-level checksums (the optional CRC32C header and data digests) to iSCSI to prevent data corruption on inter-subnet iSCSI sessions. You wouldn’t want to store corrupted data in a database (or anywhere else) where it can linger for years, would you?
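
For contrast, here’s a minimal sketch (again plain Python, same made-up payload as above) of the CRC-32C (Castagnoli) polynomial that iSCSI uses for its header and data digests; the word swap that is invisible to the Internet checksum changes the digest immediately. Real initiators and targets use table-driven or hardware-assisted CRC32C, not this bit-by-bit loop.

```python
def crc32c(data: bytes) -> int:
    """Bit-by-bit CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

assert crc32c(b"123456789") == 0xE3069283   # standard CRC-32C check value

original  = b"ABCDEFGH"
corrupted = b"CDABEFGH"   # the same two swapped 16-bit words as in the previous sketch

print(crc32c(original) == crc32c(corrupted))   # False -> the digest catches the corruption
```

On Linux, for example, open-iscsi initiators typically turn this protection on by negotiating the HeaderDigest and DataDigest parameters to CRC32C with the target (assuming the target supports it).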

Someone from IBM published a great article explaining the need for iSCSI checksums a while ago (and of course I can’t find it; RFC 3385 does contain some hints), and in case you wonder whether routers can actually corrupt packets, the answer is YES (someone even managed to isolate a faulty router three AS-es down the road). I have also seen cut-through switches helping broken NICs (which didn’t check the Ethernet checksum) corrupt data. Admittedly that was 20 years ago, but history has a nasty circular habit.

Of course, some vendors cutting corners launched crippled boxes that require layer-2 connectivity for iSCSI replication. The only explanation I can find for that abominable restriction is the lack of iSCSI-level checksums1. Ethernet checksums are orders of magnitude better than IP/TCP ones, and actually comparable to iSCSI checksums, so if you’re too lazy (or your hardware is too crappy) to do the proper thing, you stomp on the complexity sausage and push the hard work the other way. After all, what could possibly go wrong with long-distance STP-assisted layer-2 connectivity (aka “the thing we don’t understand at all, but it seems to work”)?

It turns out those vendors believe in the CRC fairy. A stretched VLAN is not a data protection measure, and once you start transporting Ethernet frames over VXLAN, all bets are off anyway. If your iSCSI vendor doesn’t support application-level checksums, your data is at risk no matter what.

Last but definitely not least, when the proverbial substance hits the rotating blades, it’s usually not limited to a single box, particularly in layer-2 environments favored by the just-mentioned storage vendors. A single STP loop (or any other loop-generating bug, including some past vPC bugs) can bring down the whole layer-2 domain, including both server-to-storage paths.

With all this in mind, it makes perfect sense to have two air-gapped iSCSI networks in some environments. Obviously some people (or their CFOs) don’t care enough about data availability to invest in them, and prefer the cargo-cult approach of using two iSCSI VLANs to pretend they have FC-like connectivity, reminding me of the people who don’t want to invest in proper application development (including the use of DNS for IP address resolution), load balancers, etc. Believing in fairies and unicorns is so much easier.

Finally, a storage protocol rant wouldn’t be complete without a passing mention of FCoE. Every single vendor (ahem, both of them) is so proud of the A/B separation at the server-to-ToR boundary (which, BTW, violates IEEE 802.1AX and FC-BB-5) that they forget to mention that A/B separation at the edge of a single network doesn’t help once you have a network-wide meltdown.

This is what happens when I start discussing FCoE and LAG (author: Jon Hudson)


Revision History

2023-04-14
Cleanup, added VXLAN transport considerations

  1. There’s also the tiny problem of not having a decent routing stack. VMware finally fixed that around vSphere 6. ↩︎

7 comments:

  1. Ivan,

    I think you're looking for the following:

    http://www.research.ibm.com/haifa/satran/ips/PaloAlto-MarkBakke-crc-recovery.pdf

    Cheers :)
  2. We run iSCSI this way simply because it's what the SAN vendor design guide specifies. Storage vendors love to blame the network, so following their design gives them one less thing to complain about.
  3. I have never seen the benefit of running iSCSI this way. It only made sense to me if you were running the lower-speed switches that were available when iSCSI was first popularized (i.e., the 3750). When you have fast switches (Nexus, Brocade, etc.), why not collapse the storage distribution A and B sides and mix them in with the front-side transport? 802.1Qbb is supposed to let you mix lossy and lossless transport, the MTBF on network gear is just as high as on storage gear in my experience (one failure each in five years), and both events ended up as nothing-burgers thanks to redundancy. I still don't see the need for Ethernet A/B separation from data traffic, let alone separation of the two sides from each other. The temptation to put in lesser gear once you have built that much "redundancy" is just too strong for management, and I think that will result in more failures than just putting in big, bad monster switches and calling it a day.

    Now to be fair, you work with much larger DCs than I do so this might just be a scale thing.

  4. I interviewed a few people with this setup a while back (4 years?) and the answers were:
    - We know Ethernet better than we know FC, so if we build an iSCSI SAN we don't have to hire FC people.
    - We know Ethernet can handle iSCSI and we know it can handle our front-end traffic, but we're not sure it can do both at the same time.
    - If we keep it separate, the storage vendor can't blame the network (as jswan mentioned).
    - I don't recall many specific technical reasons for the decision; most were political or just being conservative.
  5. While the alleged "FC-BB-5 violation" does indeed give you two independent paths from the server into the ToR FCoE switch, it hardly provides "A/B separation". For that, you need an actual air gap between two completely independent fabrics.

    In your own words: "A single STP loop (or any other loop-generating bug, including some past vPC bugs) can bring down the whole layer-2 domain ... including both server-to-storage paths," and I couldn't agree more. That's why my stance on FCoE is that it's great, but if you really want (more like need) the HA required for mission-critical applications, you'd better break it out into native FC and go to separate FC A/B fabrics as soon as possible (at the first hop). Remember, logical A/B separation has been possible in native FC environments for many years (with VSANs and Virtual Fabrics), yet no mission-critical FC storage environment is built on this logical isolation, and there's a good reason for that.

    Same thing goes for iSCSI. If you really, *really* want to build a truly HA, mission-critical iSCSI SAN then you need two dedicated iSCSI networks (and not VLANs on the same network). Remember, the proverbial substance *will* hit the rotating blades one day...
  6. "A SCSI failure arriving at just the right time could easily result in the famous colored screen".
    Kernel panics only happen if the root filesystem is on the SAN. Meaning that the corruption has to be to the system files that are running the OS. And even then only the components running in kernel space and some important user space components like the service manager can crash the OS. Other system files are just user processes.
    The OS in most deployments is connected through a DAS. Diskless thinclients connected to SANs are not really popular because HDDs are dirt cheap and engineering a SAN where hundreds of PCs pull their OS data through a fabric would cost sick amounts in dollars.
  7. "Remember, the proverbial substance *will* hit the rotating blades one day..."
    I mean, who cares. If you are running Linux and to a lesser degree, Windows, you probably are going to get burned by cascading bad updates much sooner than a L2 meltdown.
    Don't people remember when Dreamhost went down for days because of a bad Debian update.
    Unless you are running some hardened OS like CapROS or VOS you are looking at a software caused meltdown to be as realistic as catching cold in the flu season.