ATAoE: response from Coraid
A few days after writing my ATAoE post I got a very nice e-mail from Sam Hopkins from Coraid responding to every single point I’ve raised in my post. I have to admit I’ve missed the tag field in the ATAoE packets which does allow parallel requests between a server and a storage array, solving some of the sequencing/fragmentation issues. I’m still not convinced, but here is the whole e-mail (I did just some slight formatting) with no further comments from my side.
The protocol does not contain a single sequence number that would allow servers and storage arrays to differentiate between requests or split a single request into multiple Ethernet frames. A server can thus have only a single outstanding request with any particular storage array. (Or maybe LUN -- who knows? The protocol specifications are silent.)
As packets between initiator and target are not connection based, sequence numbers are irrelevant. A client can, however, have multiple requests outstanding with different tag values, which is how a target differentiates between requests. Splitting across Ethernet frames is performed by the client, which is responsible for turning a large request into a series of MTU-sized requests (a 64 KB request becomes eight 8 KB requests with jumbo frames, for example). All storage protocols do this somewhere, right down to the SATA/SAS wire, which chunks blocks into 4 KB data FISes.
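The chunking described above can be sketched in a few lines. This is a hypothetical illustration, not Coraid's driver: the function and field names are made up, and the 22-byte header overhead (10-byte AoE common header plus 12-byte ATA message header, after the Ethernet header) is taken from the AoE spec.

```python
# Illustrative sketch: an AoE initiator chopping a large read into
# MTU-sized requests, each with a unique tag so the target can
# service them in parallel.

AOE_HEADERS = 22          # 10-byte AoE header + 12-byte ATA header
SECTOR = 512

def chunk_request(lba, sectors, mtu=1500):
    """Split a (lba, sector-count) request into per-frame requests."""
    per_frame = (mtu - AOE_HEADERS) // SECTOR   # 2 at MTU 1500, 17 at 9000
    requests, tag = [], 0
    while sectors > 0:
        n = min(sectors, per_frame)
        requests.append({"tag": tag, "lba": lba, "count": n})
        tag += 1
        lba += n
        sectors -= n
    return requests

# A 64 KB read (128 sectors) needs 64 two-sector frames at MTU 1500...
print(len(chunk_request(lba=0, sectors=128, mtu=1500)))   # 64
# ...but only 8 frames with 9000-byte jumbo frames:
print(len(chunk_request(lba=0, sectors=128, mtu=9000)))   # 8
```

Note how the per-frame payload drops to two sectors at the standard MTU, which is exactly the limitation discussed further down.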
An AoE Major.Minor is used in practice to reflect a Shelf#.Lun. Shelf# is your IP equivalent and is assigned per storage server.
The protocol does not specify any packet loss detection or recovery mechanism. I can only assume that the only recovery the creators envision is request retransmission after a timeout, hoping that all requests can be repeated multiple times without side effects.
Yes; because retransmitted requests do not change in LBA/data, they are idempotent. Initiator drivers possess a TCP-like congestion-avoidance algorithm to control the retransmit timer as well as the window of outstanding requests. Initial negotiation of the window is performed via the query-config message type.
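The idempotency argument can be shown with a minimal sketch (assuming nothing about Coraid's actual driver internals): replaying a write with the same LBA and data leaves the target in the same end state, so a timeout-driven retransmit is harmless even when the original request actually arrived.

```python
# Toy target: duplicates arrive, but the end state is identical,
# so blind retransmission needs no sequence numbers.

class Target:
    def __init__(self):
        self.blocks = {}
        self.writes_seen = 0

    def write(self, lba, data):
        self.writes_seen += 1     # the duplicate is really processed...
        self.blocks[lba] = data   # ...but the stored data is unchanged

t = Target()
t.write(100, b"payload")          # original request
t.write(100, b"payload")          # timeout-driven retransmit (duplicate)
assert t.blocks[100] == b"payload" and t.writes_seen == 2
```

Contrast this with a non-idempotent operation (say, "append"), where a duplicate would corrupt state and a sequence number would be mandatory.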
ATAoE requests fit directly into Ethernet frames, and there's no way to fragment a single request into multiple frames and achieve streamlined data flow. Unless you use jumbo frames, you'll be able to transfer at most two sectors (at 512 bytes) in each request. (iSCSI can transfer megabytes in a single transaction.) Add a few switches to the mix, and watch performance plummet.
As I mentioned above, the initiator chops up large requests into multiple MTU-sized requests. Adding switches to the mix does not cause performance to plummet. On the contrary: because AoE is not connection based, you can achieve performance you simply can't with iSCSI. It is a simple matter to connect multiple network ports on the initiator and target and have all AoE requests flow across all possible network paths without any higher-level (i.e., bonding) configuration. This round-robin mechanism is also provided by the initiator driver. Coraid's performance numbers are achievable because it's just this simple to attain performance.
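The multi-path idea reduces to round-robin assignment of per-frame requests to paths. The sketch below is purely illustrative (the names are not the Linux aoe driver's internals); the point is that with no connection state, any request can legally go out any port.

```python
# Hedged sketch: spray outstanding AoE requests across every
# available (local NIC, target MAC) pair, round-robin, with no
# bonding configuration needed.

from itertools import cycle

def spread(requests, paths):
    """Assign each outstanding request to the next path in turn."""
    rr = cycle(paths)
    return [(req, next(rr)) for req in requests]

paths = [("eth0", "aa:bb:cc:00:00:01"),
         ("eth1", "aa:bb:cc:00:00:02")]
assignments = spread(range(4), paths)
# Requests alternate eth0, eth1, eth0, eth1 ...
```

A connection-oriented protocol can't do this per frame: once a TCP session is pinned to a path, all its traffic follows it unless you add bonding or MPIO on top.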
Can you imagine that? A protocol proposed for use in the data center that has no authentication whatsoever! The Wikipedia article proves that whoever designed (or described) this protocol was a total stranger to the finer details of network security when they wrote: "The non-routability of AoE is a source of inherent security."
Security is intentionally disregarded in the protocol. What users need is a way to have shared block storage on a SAN without the fuss. iSCSI was originally designed to provide shared block storage over the Internet. It has many bells and whistles that you just don't need in a closed environment with, e.g., tens of servers running your VM architecture of choice connected to a single SAN storage pool.
The "security" comes from not being able to route AoE packets, so people outside your SAN broadcast domain don't see your data "accidentally". The designers didn't trumpet that, but some people think it's a benefit.
Another amazing omission. A server does not have to establish a session with the storage. As soon as you guess a LUN number, you can start reading and writing its data.
If your SAN isn't secure, neither is your data. If you plug in a host to your SAN that can do all sorts of root-like things and explore your SAN maliciously, yes, you have a problem. Root users can also scramble your local disks.
Weak support for asynchronous writes
Due to lack of sequencing and retransmissions, asynchronous writes are handled in a truly cavalier fashion: You can use them and the storage array always returns a success status, but the actual operation might fail -- and the server will never be informed. I thought it would be nice to know if your write operation fails (after all, you might need that data in the future), but apparently that's not the case.
That's a wart in the protocol that will be removed in the next revision. No one uses asynchronous writes. We have had custom configurations where responses were not needed, but we've worked around this another way.
A protocol like this could have been good enough 30 years ago when TFTP was designed, but today it would make a great example of a totally broken protocol in any protocol design class. As for its usability: Go ahead and use it when building your home network. I would definitely not consider it for mission-critical data center applications.
I challenge you to dig a little deeper and correct your article. Many, many people use AoE based products for mission-critical data in situations where fast, affordable, scalable, easy to configure storage is needed. I think your primary complaint is that the protocol definition itself does not clarify all aspects of how AoE functions in practice. That's a fair argument. Our history is one of Bell Labs culture and in writing this we documented the essential core leaving many details up to discretion.
This forces a separate physical Ethernet network for storage to achieve some security. Not what we want at all!!
AoE achieves its features through sheer simplicity (connectionless like UDP, no security, dumb MTU handling), not because of its inherent greatness.
It has no security. Also, it is not a pony, and will not make you taller or get the crabgrass out of your lawn.
If you need a storage protocol that you can expose to malicious unauthenticated users on the Internet, you're looking in the wrong category.
"I have to admit I’ve missed the tag field in the ATAoE packets which does allow parallel requests between a server and a storage array, solving some of the sequencing/fragmentation issues. I’m still not convinced,....."
Why are you still not convinced? It's not clear what your reasons are.
Their customer list just keeps growing. Maybe it does work and work well.
We've adopted it over iSCSI (initially running in tandem with iSCSI, but after a few months the benefits were very clearly in favour of ATAoE) for all of our (oil/gas geological) bulk storage uses for 2 years now. You could probably call it Big Data, at over 3000 disks total and 5 Petabytes of data per site.
An unplanned side effect of this is that we have now abandoned all the Microsoft storage and use Debian and FreeBSD, giving us a lot less trouble overall, which was quite a surprise here.
At the time you wrote your blog post, we were just commissioning the systems, so I read it with great interest and I worried. I have to say that in practice, we haven't hit any of the worries you raised. The speeds over iSCSI on the same architecture were a big plus and it was far simpler to set up and in use is totally transparent and hasn't given a single problem down to the protocol choice, which is more than we can say for the iSCSI half of the project, which was disappointing in comparison.
Note that we don't use Coraid, so this is purely a comment on Debian-based ATAoE itself.
An alternative approach using the hardware RAID in the servers even showed slightly higher levels of failed hashes over time (every file is hashed and re-checked on a rolling basis when the load is below a certain threshold, using idle time). This setup gave identical performance, so it was decided to move to individual drives per LUN for all of the servers while we investigated. The failures stopped, so we retested with hardware RAID. Failures came back: small numbers of random fails, no common cause. We then trialed the JBOD LUN method for 6 months on the same suspect hardware, and the hashing anomalies stopped immediately and never came back. All sites were affected to much the same extent, with very low levels of unseen corruption of individual files over time. The sites were commissioned independently, using no common staff, designs, or vendors/hardware.
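The rolling integrity check described above boils down to: hash every file once, then re-hash during idle periods and flag mismatches. A minimal sketch (the paths and the idle-time scheduling are left out; names are illustrative, not the commenter's actual tooling):

```python
# Hash files and re-check them later against a known-good baseline,
# reporting any file whose content has silently changed.

import hashlib
import os

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def recheck(baseline):
    """baseline: {path: known-good digest}; returns paths that changed."""
    return [p for p, digest in baseline.items()
            if os.path.exists(p) and sha256_of(p) != digest]
```

In a real deployment you would run `recheck` from a low-priority job that only fires when load is below a threshold, exactly as the comment describes.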
Very curious indeed. It's now all 1 drive per LUN permanently at all sites and will stay that way.
Since the same hardware was in use all the time, this tends to suggest that the hardware RAID is actually less reliable than mdadm, while performing identically for ATAoE, probably because the transfer limits are dictated by the network fabric rather than the controllers themselves. A surprising result indeed.
The hardware is a homogeneous mix at all sites, as initially we thought we'd need a much more complex layout to get the desired performance on some racks for certain applications. In the end it didn't matter: it all worked fine, though some of it was strained under iSCSI before we abandoned it for ATAoE.
Ease of admin has been excellent throughout and we have learned a lot about convenience along the way. We are now looking into ZFS, which we earlier thought was far too complex and slow. It turns out that one of our competitors has been using both ZFS and ATAoE for a similar purpose for 3+ years, so we have some catching up to do!
The end-of-cycle audits have shown TRC/TCO of just under half of what was allocated and expected, so we will never go back to the bad old ways, though the primary driver was reliability alone. The simplicity of ATAoE appears to be the key to the success.
We are very pleased we were ordered to try it in tandem for a comparison trial. The ATAoE model has just been adopted globally at the end of the 2-year trial with Asia running smoothly for almost 4 months now using an identical approach but just over 19PB so far.
The vendor-free approach has been extremely flexible though I can see some major problems for the salesmen due to this - they were very hostile and the scare stories were getting tedious. They have gone strangely quiet now :-)
In practice, most SAN deployments rely fairly heavily on security of the network segments the protocols run over.
NFS is usually run without much in the way of authentication beyond the IP of the requester, and none of the iSCSI deployments I've seen in practice use CHAP, etc.
The Coraid implementation of AoE allows for 'masking' of resources to a specific initiator. IMO, if the storage fabric is effectively a trusted/protected zone, masking may be 'good enough' for a lot of use cases.
From what I'm told, it should also be possible to lock which MACs are allowed on a given VLAN at the switching layer, or at least to alert when previously unseen MACs are detected. If that's achievable, then accidental misconfiguration allowing untrusted initiators onto the fabric can be effectively mitigated.
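On many switches this is standard port security. As a hedged illustration (assuming a Cisco IOS-style switch; the interface, VLAN number, and MAC address are made up), locking an access port to a single known initiator MAC and logging violations could look like:

```
interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 100
 switchport port-security
 switchport port-security maximum 1
 switchport port-security mac-address 00aa.bb00.cc01
 switchport port-security violation restrict
```

With `violation restrict`, frames from unknown MACs are dropped and the switch raises a log/SNMP notification, which covers both the "lock" and the "alert on unseen MACs" cases mentioned above.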
So given that, maybe I'm missing something... what's the problem (that isn't realistically a problem with other converged fabrics currently in use)?
For example, in an ESXi cluster (based on NFS or iSCSI storage), ANY of the hosts can delete pretty much all the data presented to them, as they are the primary custodians of that data anyway. Is that a risk? Certainly, but it's not increased by using AoE LUNs, as far as I can tell.
All of that is assuming a fairly typical use case: Virtualization providing the abstraction / segregation between untrusted and trusted zones. Now if we're talking about presenting the converged fabric out to potentially untrusted initiators.... then yes, more security (and a fatter protocol) is probably required.
In this particular design, AoE is not supported natively by the ESXi cluster. We could add AoE HBAs, but in a blade environment that means retrofitting, which we wanted to avoid.
So we use AoE to present storage up to ZX headers. These use ZFS to aggregate the storage, which is then presented to the ESXi cluster via NFS over a converged 10G fabric.
This effectively means the attack surface for AoE is just the ZX headers and the SRX shelves themselves.
The big advantage of this architecture is that we can choose to add other presentation headers/gateways later, utilizing the same AoE back-end disk trays, and not necessarily a Coraid header solution either.
Headers can then add on any of the features we want to implement without any rip-replace of the backend disk trays. A simple, lightweight protocol is a big advantage in that model.
Proof of the pudding will be in the eating though, granted :)
For those who use it: are you still happy with it? Is it reliable?
I'm thinking of an open-source-based AoE setup. Which software are you using? Which OS?
Do you have any advice for building a stable open-source AoE infrastructure?