Decide How Badly You Want to Fail
Every time I run a data center-related workshop I inevitably get pulled into a stretched VLAN and stretched cluster discussion. While I always tell the attendees what the right way to do it is, and explain the challenges of stretched VLANs from all perspectives (application, database, storage, routing, and broadcast domains), the sad truth is that sometimes there’s nothing you can do.
In those sad cases, I can give the workshop attendees only one piece of advice: face reality and figure out how badly you might fail. It’s useless to pretend you won’t get into a split-brain scenario - redundant equipment just makes it less likely, unless you’ve over-complicated the design, in which case adding redundancy actually reduces availability. It’s also useless to pretend you won’t be facing a forwarding loop.
Be an engineer, do your job, and figure out how things are going to fail and under what conditions:
- Start by identifying all potential failure scenarios (a simple scenario register, sketched after this list, is one way to keep track of them);
- Try to figure out how likely they are. A forwarding loop is probably more likely than a major earthquake (unless you’re in California or a few other places) or a similar natural disaster, yet almost everyone focuses on geographic redundancy and believes in the magic powers of $vendor unicorn dust;
- Evaluate how the whole application stack will behave under each failure scenario. Figure out whether a particular failure results in a working application stack or not (hint: how will routing toward a split subnet work?);
- Don’t focus just on networking. In some scenarios, the disaster recovery plans that seem great in PowerPoint never work in practice because someone forgot to consider a “small” component like storage;
- Once you KNOW (or think you know) what’s about to happen, test it. Usually there’s a gap between theory and reality;
- Don’t forget to test applications that are supposed to be migrated to another data center on-the-fly under general panic/increased latency/reduced bandwidth conditions (one crude way to inject latency is sketched below). It might turn out they’d be useless anyway, so it’s safer to shut them down and restart them in the other data center.
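To make the first few bullets a bit more tangible, here’s a minimal sketch of what such a failure-scenario register could look like. Everything in it (scenario names, likelihood and impact numbers) is a made-up placeholder; the point isn’t the tooling, it’s forcing yourself to write down how likely each failure is, how badly it hurts, and whether you ever actually tested it.

```python
# Minimal failure-scenario register sketch. All scenarios and numbers below
# are made-up placeholders -- replace them with your own data.
from dataclasses import dataclass

@dataclass
class FailureScenario:
    name: str
    likelihood: int        # rough estimate, 1 (rare) .. 5 (expect it this year)
    impact: int            # 1 (annoyance) .. 5 (business stops)
    tested: bool           # have you actually tried it, not just drawn a slide?
    notes: str = ""

scenarios = [
    FailureScenario("Forwarding loop on the stretched VLAN", 4, 5, False,
                    "Broadcast storm takes down both data centers at once"),
    FailureScenario("DCI link failure / split brain", 3, 4, False,
                    "How does ingress routing toward the split subnet behave?"),
    FailureScenario("Storage replication falls behind or breaks", 3, 4, False),
    FailureScenario("Major natural disaster takes out one site", 1, 5, True),
]

# Sort by rough risk (likelihood * impact) and flag everything you never tested.
for s in sorted(scenarios, key=lambda s: s.likelihood * s.impact, reverse=True):
    flag = "" if s.tested else "  <-- NEVER TESTED"
    print(f"risk={s.likelihood * s.impact:2d}  {s.name}{flag}")
```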
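As for the last bullet: one crude way to get a first feel for how an application copes with inter-DC latency is to put a delay-adding TCP proxy between its tiers. The sketch below is an illustration only - the listen/target addresses, ports and the 50 ms delay are assumptions you’d adjust to your environment, and for serious testing you’d use a proper WAN emulator or tc/netem instead.

```python
# A minimal sketch (not a production tool): a TCP proxy that adds artificial
# latency between an application front end and its back end, so you can see
# how the stack behaves when the back end is suddenly 50 ms away.
import asyncio

LISTEN_HOST, LISTEN_PORT = "0.0.0.0", 8080      # where the app client connects (assumed)
TARGET_HOST, TARGET_PORT = "10.0.2.15", 80      # the "remote" back end (assumed)
ONE_WAY_DELAY = 0.05                            # 50 ms added in each direction (assumed)

async def pump(reader, writer, delay):
    """Copy bytes from reader to writer, delaying every chunk."""
    try:
        while data := await reader.read(65536):
            await asyncio.sleep(delay)          # simulate inter-DC latency
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    server_reader, server_writer = await asyncio.open_connection(TARGET_HOST, TARGET_PORT)
    await asyncio.gather(
        pump(client_reader, server_writer, ONE_WAY_DELAY),
        pump(server_reader, client_writer, ONE_WAY_DELAY),
    )

async def main():
    server = await asyncio.start_server(handle_client, LISTEN_HOST, LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```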
Unfortunately, you can’t rely on vendor engineers to get the job done for you, or expect to find vendor blog posts (or white papers) explaining how their products fail. The only material I found on that topic was a few explanations of how VSS, vPC or HP IRF behave in split-brain scenarios; nobody ever addressed the whole picture, with the notable exception of Duncan Epping’s How to Test Failure Scenarios - and even there he focused exclusively on non-networking VMware components.
Once you have all that data in place, sit down with everyone else who should be involved in the discussion (application, server, virtualization, storage and security teams) and figure out what the best solution would be for your company, not for individual teams or beloved $vendors. Sometimes you’ll need to step back and change quite a few things.