VMware VSAN Can Stretch – Should It?
Pirmin Sidler read the stretched VSAN blog posts by Duncan Epping (intro, HA/DRS considerations, demo) and asked me what I think about stretched VSAN considering my opinions on long-distance vMotion.
TL;DR answer: it makes way more sense than long-distance vMotion. However…
Does It Need Stretched VLANs?
VSAN is a storage replication technology that runs on top of TCP, so it can run over any L3 network. It does need IP multicast between the VSAN nodes, but not between VSAN nodes and the witness node(s).
A stretched vSphere HA cluster running VSAN thus doesn’t require stretched VLANs (unless, of course, you plan to move VMs willy-nilly across the WAN links – most often a bad idea).
Can You Use It for Disaster Recovery?
VSAN data replication is almost identical (at least conceptually) to synchronous storage replication – every write is immediately queued for mirroring, and acknowledged to the writer only when the other VSAN nodes acknowledge it.
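To make the consequence of those semantics concrete, here's a minimal Python sketch of a synchronous write (the Node class and its latencies are made-up illustrations, not VSAN's actual implementation): the writer is released only after the slowest replica acknowledges, so inter-node latency adds directly to write latency.

```python
import concurrent.futures
import time

class Node:
    """Toy replica with a fixed write latency (illustration only)."""
    def __init__(self, name, write_latency_s):
        self.name = name
        self.write_latency_s = write_latency_s

    def write(self, block):
        time.sleep(self.write_latency_s)   # disk access (+ network RTT for remote)
        return f"ack from {self.name}"

def synchronous_write(block, replicas):
    """Return only after EVERY replica has acknowledged the write."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(node.write, block) for node in replicas]
        concurrent.futures.wait(futures)   # the slowest replica gates the writer
    return [f.result() for f in futures]

# Local node: ~2 ms disk access; remote node: same disk + 1 ms metro RTT
replicas = [Node("local", 0.002), Node("remote", 0.003)]
start = time.perf_counter()
synchronous_write(b"some block", replicas)
print(f"write acknowledged after {(time.perf_counter() - start) * 1000:.1f} ms")
```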
Do keep in mind that VSAN runs on top of a vSphere HA cluster, and if you don’t want to move VMs between data centers, you MUST use affinity rules to keep them contained (for more details, read Duncan’s blog post).
As you’d be relying on vSphere HA mechanisms for disaster recovery, you won’t get any of the goodies SRM brings to the table: each VM will be restarted automatically in whatever order vSphere HA picks – or not at all, if the remaining part of the cluster doesn’t have enough resources. This might be good enough for small independent workloads, but maybe not for complex application stacks.
Finally, you’ll have to solve network connectivity challenges, unless you plan to deploy stretched VLANs (and even Gartner agrees they’re not a good idea) and stretched firewall clusters (even worse idea).
All things considered, it might be best to write an orchestration solution (it would be even better if VMware did that) that would do the following (a rough sketch follows the list):
- Recreate the lost subnets in the new data center;
- Configure any other network services that may be required (unless you’re using virtual appliances);
- Restart VMs in proper startup order.
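Here's a minimal sketch of what such an orchestration script might look like; every class, method, and name below is a hypothetical placeholder (a real implementation would drive vSphere APIs such as pyVmomi and the network gear's own interfaces):

```python
from dataclasses import dataclass, field

# Everything below is a hypothetical placeholder; a real solution would call
# vSphere APIs (e.g. pyVmomi) and the network devices' own interfaces.

@dataclass
class Subnet:
    prefix: str
    gateway: str

@dataclass
class RecoverySite:
    subnets: list = field(default_factory=list)

    def create_subnet(self, subnet):
        print(f"provisioning {subnet.prefix} (gateway {subnet.gateway})")
        self.subnets.append(subnet)

def recover(lost_subnets, recovery_site, vm_startup_order):
    # Step 1: recreate the subnets lost with the failed data center
    for subnet in lost_subnets:
        recovery_site.create_subnet(subnet)

    # Step 2: reconfigure other network services (DHCP, firewall rules,
    # load balancers...) -- skipped here; unnecessary with virtual appliances

    # Step 3: restart VMs in the proper order, e.g. database -> app -> web
    for vm in vm_startup_order:
        print(f"powering on {vm} and waiting until it reports ready")

recover([Subnet("10.1.1.0/24", "10.1.1.1")],
        RecoverySite(),
        ["db01", "app01", "web01"])
```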
VMware does have dedicated products for asynchronous remote replication, and one can probably combine VSAN with them. But please don't ignore the added latency and the physics behind it :-)
As a worst-case example: your HDD has an average rotational latency of around 2-3 ms, the time until a sector can be read or written. Even assuming the sector is then written instantly, an average write operation still takes about 2 ms.
If you're doing replication in a metro area with one millisecond of round-trip network latency, that latency adds to every write request: your remote HDD won't have its data committed in 2 ms, but in 2+1 = 3 ms.
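In Python, the back-of-the-envelope arithmetic (using the assumed numbers from above, not measured values):

```python
# Assumed figures from the example above (not measurements)
hdd_rotational_latency_ms = 2.0   # average wait until the sector is under the head
metro_rtt_ms = 1.0                # round-trip network latency within a metro area

local_write_ms = hdd_rotational_latency_ms
replicated_write_ms = hdd_rotational_latency_ms + metro_rtt_ms

print(f"local write:      {local_write_ms:.0f} ms")        # 2 ms
print(f"replicated write: {replicated_write_ms:.0f} ms")   # 3 ms
```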
Depending on what your application actually does and how often data is synchronously forced onto disk, synchronous replication in this setup may functionally decrease the overall hard disk performance by up to 50%.
Of course, in real life the various write-back caches in operating systems, hypervisors, RAID controllers and hard disks lie about having something "really" written to disk, so those 50% are the worst case in which every single sector/block is forced to disk. Even if the write is not forced to disk, the network latency still accrues before the remote system can promise "having it written". Overall, the network latency adds to the access time.
However, even in a standard OLTP mix (70% read, 30% write), the impact of high-latency writes is obvious: the read performance doesn't change, while the write performance gets noticeably worse.
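To put a number on that mix (assuming, for illustration, a 2 ms HDD access time for both reads and writes, plus the 1 ms metro round trip from above):

```python
read_ms, write_ms, rtt_ms = 2.0, 2.0, 1.0   # assumed HDD access times and metro RTT

# Average time per operation in a 70% read / 30% write OLTP mix
avg_local  = 0.7 * read_ms + 0.3 * write_ms             # 2.0 ms
avg_synced = 0.7 * read_ms + 0.3 * (write_ms + rtt_ms)  # 2.3 ms

print(f"writes alone: {write_ms:.1f} ms -> {write_ms + rtt_ms:.1f} ms (+50%)")
print(f"whole mix:    {avg_local:.1f} ms -> {avg_synced:.1f} ms "
      f"(+{avg_synced / avg_local - 1:.0%})")   # +15% average per operation
```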
If your application can't cope with extra latency on writes and you still require synchronous writes, you may need to switch from HDD to SSD (reducing the local access time from ~2 ms to close to zero, leaving you with pure network latency).
With more distant locations, the problem gets worse: a 3 ms round trip is negligible in the WAN world, but if your 2 ms hard disk suddenly takes 5 ms before data can be written, that's a considerable slowdown.
And when your top-notch, high-performance database's average write latency suddenly jumps from 0.1 ms (local SSD) to 3.1 ms (remote SSD), someone will probably notice (+3000%).
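The relative jump is easy to verify (again with the assumed numbers from the example):

```python
ssd_write_ms = 0.1        # assumed local SSD write latency
inter_site_rtt_ms = 3.0   # assumed round trip to the remote site

remote_write_ms = ssd_write_ms + inter_site_rtt_ms
increase = (remote_write_ms - ssd_write_ms) / ssd_write_ms
print(f"{ssd_write_ms} ms -> {remote_write_ms} ms (+{increase:.0%})")
# prints: 0.1 ms -> 3.1 ms (+3000%)
```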
I also replied on the blog post Ivan wrote; I want to post it here as well for those who don't see that one.
Just to point out, Virtual SAN always writes to SSD. That is how the architecture has been designed from the start. We take advantage of the SSD, buffer the writes and then destage them when needed. The write is acknowledged to the Guest OS / Application as soon as it hits the SSD buffer. So the latency for a write to a device like this will not be 2ms but much lower than that.
I understand what you are saying, but we are forgetting that we are trying to solve a business problem here, not introduce one. Any stretched storage platform has the same challenge when it comes to latency, yet NetApp MetroCluster, EMC VPLEX, 3PAR etc. are still relatively popular solutions. Why? Simply because in many cases it is 10x easier to provide this level of resiliency through an infrastructure-level solution than to rely on 3rd-party application providers to change their full architecture to provide you the resiliency you need. As you know, getting large vendors to change their application architecture isn't easy, and can take years... if at all.
These types of solutions are developed for relatively short distances and relatively low latency. Sure, it has been validated to tolerate a 5 ms latency hit, but that doesn't mean that from a customer point of view this would (or should) be acceptable. That decision is up to the customer. The same applies to bandwidth: what you can afford, what is available in your region / between sites, etc.
Stretched infrastructures are not easy to architect, or deploy for that matter, but I truly believe that with Virtual SAN we made the storage aspects 10x easier to manage and deploy than they have ever been before.