An interesting solution brief on Hyper-Converged Infrastructure from DataON got my cogs ticking on a topic that has bugged me for a while… All-Flash storage.
Firstly, this DataON solution is a seriously slick contender… a 3M IOPS 4-node HCI in 8U. The tech geek in me would be quite pleased to have that bad boy driving services on my platform.
But is it necessary?
Ultimately, that is a question that only you can answer based on your requirements, but after deploying several Hyper-Converged Infrastructure (HCI) and Storage Spaces Direct (S2D) solutions from various vendors over the last 12 months, I wanted to share my thoughts on this All-Flash phenomenon.
Let’s start with taking a quick look at Storage Spaces Direct (S2D).
Unlike the legacy world of iSCSI & FC SANs, S2D is a distributed architecture that scales linearly as you add more nodes. A recent 4-node low-end (SSD & HDD) solution easily delivered 680K IOPS simulating everyday workloads (70% read/30% write), with each node contributing approx. 165K IOPS to the aggregated total. Adjusting our benchmark to 100% random read resulted in the platform hitting almost 1.1M IOPS. Compare that to a 3Par 8450 4N (4-node All-Flash array): 3Par boasts around the same 1M IOPS at 100% random read…
Example from TechNet demonstrating the linear scale of S2D:
In a crude but direct comparison, both platforms mentioned above achieve (what I call ‘cheeky’) read IOPS over the 1 million mark. Which is great. But the S2D solution is about a third of the cost, plus it is Hyper-Converged. This means it also hosts the compute workload, not just the storage. A quick confession: I’m a fan of disaggregated architecture, but this is primarily based on experience over the years when an HCI architecture was either not possible or in its infancy. I do have to admit though, HCI architecture in a managed platform build makes a lot of sense these days and allows for shifting or removal of bottlenecks and/or the all-important single points of failure. But that’s a whole other discussion…
All about the cache
All-Flash sounds sexy. It is sexy. No doubt! But Software Defined Storage (SDS) with S2D is all about the cache. The additional cost of ‘All-Flash’ makes sense with legacy SANs in their monolithic architecture, but is it worth it in the modern world of SDS and distributed architecture?
I had a chat with one of Dell’s Storage Architects at the Azure Stack Airlift in Redmond earlier this year (Azure Stack uses S2D under the hood) and we touched on this very topic. I challenged the ‘All-Flash’ concept, arguing that with S2D it’s all about the cache, so for the average deployment All-Flash is potentially a waste of money…
As you can imagine, that didn’t go down too well. But once I explained how the cache works in S2D as well as the underlying architecture (Yes, a hack like me had to explain how storage works to the Dell Storage Architect, but I digress…) he was struggling to find genuine reasons for All-Flash over an SSD/HDD platform. One point he did make is that the density of SSD has surpassed the available density of the spindle… for example, Seagate’s 60TB 3.5in SSD.
Ok, I agree that density is a potential reason for SSD over HDD, but at what cost?
A more realistic large-capacity SSD would be a 15.4TB drive, which is the biggest supported by HPE and Dell in their certified S2D offerings. Quick point to note is that those drives are Read Intensive (RI) and have a Drive Writes Per Day (DWPD) rating of less than 1, which is not ideal in this kind of implementation, but in the interest of science let’s continue anyway. We can get up to 28 and 30 of these in a 2U server from HPE or Dell respectively. I’m sure there are other vendors with more slots available but these are the two vendors I’ve worked with on these solutions of late. Assuming the vendors can get the config working (controllers etc) then we could have potentially 26 or 28 SSDs in a 2U server. Now this makes for an interesting discussion, as with S2D there is no read caching if the capacity tier is SSD. More detail on that from Stephane Budo’s deep dive into S2D from MS Ignite AU. Caching behaviour is discussed at about 21:00…
$$ per TB
Now a solution with 20x SSDs and 4x NVMe per node (the current config option) would give us some serious IOPS, but I’m slightly uncomfortable with a few things here…
Firstly, the 15.4TB SSDs go for about $20K (AUD) each. Ok sure, if we’re looking at buying 80 (20 per node in a 4-node cluster) then I reckon we’d get a bit of a bulk discount, but they’re still not going to be cheap. Add in some NVMe for cache and we’d be looking at approx. $1.4M for the solution. In a configuration like this, thanks to this calculator, we’d have a usable 390TB. That’s a decent amount of storage squeezed into 8U, and an interesting thought nonetheless… That’s about $3.5K per usable TB…
Secondly, for most storage deployments, you’d want to think seriously about the bandwidth per usable TB. Assuming our workload demanded 390TB of All-Flash, we’d probably want more than 2x 10/25GbE bandwidth to each node meaning we’d likely need to consider a 100GbE network fabric… Not going to be cheap.
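To put a rough number on that bandwidth concern, here is a minimal sketch that divides the cluster's aggregate NIC line rate by the 390TB usable figure. The function name is hypothetical, two ports per node at 100GbE is my assumption (the post only says "100GbE network fabric"), and line rate ignores protocol overhead and S2D's east-west replication traffic, so treat the output as an upper bound.

```python
# Rough bandwidth-per-usable-TB estimate for the 4-node All-Flash example.
# Illustrative only: line rate is an upper bound, not achievable throughput.

NODES = 4          # nodes in the All-Flash example
USABLE_TB = 390    # usable capacity quoted in the post


def gbps_per_usable_tb(ports_per_node: int, port_speed_gbe: int) -> float:
    """Aggregate NIC line rate across the cluster divided by usable capacity."""
    cluster_gbps = NODES * ports_per_node * port_speed_gbe
    return cluster_gbps / USABLE_TB


if __name__ == "__main__":
    # 2x 10GbE and 2x 25GbE come from the post; 2x 100GbE is an assumed layout.
    for speed in (10, 25, 100):
        ratio = gbps_per_usable_tb(ports_per_node=2, port_speed_gbe=speed)
        print(f"2x {speed}GbE per node: {ratio:.2f} Gb/s per usable TB")
```

With 2x 25GbE per node this works out to roughly half a gigabit per second of line rate for every usable TB, which is why a faster fabric starts to look necessary at this density.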
Thirdly, scale. If we need to scale the solution, then we’d have to add another node of the same sizing and config. To increase in size, we’re looking at about $350K just to entertain the thought of expansion. And that’s assuming we get the same discounted rate as we did in the original purchase…
Get to the point! Ok, sorry. This eventually leads me to reinvestigate one of the main benefits of S2D, the distributed architecture. If we have the need for 390TB, trying to cram that into 4 nodes just doesn’t make any sense…
Image: courtesy of Guinness world records
Scale out, not up…
For most storage requirements, a much more sensible approach would be to scale out to a few more nodes and convert most of the SSDs to spindles. In this configuration S2D will use the cache for reads as well, keeping performance high. For your typical storage build, squeezing 15.4TB SSDs in for the cache is probably not within the current budget. So, we’ll look at the more mainstream 1.6TB NVMe (Write Intensive with over 10 DWPD) for about $8K in a solution like this.
Using the calculator again, with 3x 1.6TB SSD and 9x 10TB HDD (we’re switching to 3.5in to accommodate the fatter and slower spindles) we get 406TB usable with 4.8TB of cache per node (67.2TB aggregated) in a 14-node configuration. The 10TB spindles are much cheaper so this solution would be in the vicinity of $490K for 406TB with a smidge over 67TB of cache. That’s about $1.2K per TB. This solution would easily deliver 3M IOPS in a 70% read / 30% write benchmark.
$1.4M vs $490K for approx. 400TB of super slick storage. Or $3.5K vs $1.2K per TB…
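The arithmetic behind that comparison can be sketched as below. The totals, node counts and usable capacities are the figures quoted in the post; the `Config` class and its property names are hypothetical, and the per-node cost assumes the total is spread evenly across nodes (which is how the roughly $350K expansion figure for the All-Flash build falls out).

```python
from dataclasses import dataclass


@dataclass
class Config:
    """One candidate S2D build, using the rough AUD figures from the post."""
    name: str
    nodes: int
    total_cost_aud: float
    usable_tb: float

    @property
    def cost_per_usable_tb(self) -> float:
        return self.total_cost_aud / self.usable_tb

    @property
    def cost_per_node(self) -> float:
        # Assumption: cost is spread evenly across nodes, so this also
        # approximates the price of adding one more identical node.
        return self.total_cost_aud / self.nodes

    @property
    def usable_tb_per_node(self) -> float:
        return self.usable_tb / self.nodes


all_flash = Config("All-Flash (4 nodes)", 4, 1_400_000, 390)
hybrid = Config("NVMe cache + HDD (14 nodes)", 14, 490_000, 406)

if __name__ == "__main__":
    for cfg in (all_flash, hybrid):
        print(
            f"{cfg.name}: ${cfg.cost_per_usable_tb:,.0f} per usable TB, "
            f"~${cfg.cost_per_node:,.0f} and ~{cfg.usable_tb_per_node:.1f}TB per node"
        )
```

Under these assumptions the scale-out build also grows in much smaller steps: roughly $35K per added node instead of $350K.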
In the above price-optimized solution, we’ve used 32U in a rack as opposed to the 8U requirement in the All-Flash. Sure, there is some added management overhead for 14 nodes vs 4, but our resiliency is also much higher. Part of the capital saving could go towards purchasing/leasing another rack to offset the size of 14 nodes. We could also go off-brief with our chosen vendors and leverage some Intel Optane or Samsung Z-SSD for our caching (always check you’re using certified hardware on the WSC). These are typically slightly more cost effective than the choices provided by Dell or HPE. As usual though, many things to consider.
Image: Samsung Z-SSD and Intel Optane
If you’re a service provider or enterprise with your everyday workloads, then this would likely be more than adequate. Plus, if you’re upgrading from a 3Par/EMC/NetApp (insert legacy SAN vendor here) then you’re going to get a significant performance boost and have a pocket full of spare cash to invest elsewhere. To be fair to the SAN market, the costs of legacy type SANs are tanking… I do wonder if this has anything to do with SDS becoming a more viable solution.
Now as with everything, this is always dependent on your workload. If you have a very specific workload such as 350TB of extremely hot random access databases, then an All-Flash option likely makes perfect sense to you. But most of the storage requirements I see out there don’t have that type of workload and our budgets probably wouldn’t allow for such considerations.
What are we doing in our environment?
Over the past 4 years, our Storage Spaces (2012 R2) platform has on average served 96% of our IOs from the cache. Additionally, on an average day we need only 20% of our storage to be cache to deliver 100% of our IO. Our cache tier is slightly over 10% of our total storage capacity. This raises the question: why spend big $$ on All-Flash that will largely serve as a capacity tier when you could just increase your cache? It makes sense to have at least 10% of our storage capacity as high performance cache. But with more than 20% cache we’re at risk of wasting our money.
Right now we’re in the process of building our next platform. Among the many choices to make, we’re aiming to have at least 15% of our storage as cache. Any more than that and we’d struggle to see any real return on the investment.
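As a quick sanity check of that rule of thumb, the sketch below computes a cache-to-capacity ratio and places it against the 10% and 20% marks mentioned above. The function names are hypothetical, and measuring cache against usable (rather than raw) capacity is my assumption; the post does not say which one its percentages refer to. The sample input is the 14-node example from earlier (4.8TB of cache per node, 406TB usable).

```python
def cache_ratio(cache_tb: float, capacity_tb: float) -> float:
    """Cache size as a fraction of capacity (usable capacity assumed here)."""
    return cache_tb / capacity_tb


def assess(ratio: float, low: float = 0.10, high: float = 0.20) -> str:
    """Place a ratio against the rough 10% floor and 20% ceiling from the post."""
    if ratio < low:
        return "below the ~10% floor"
    if ratio > high:
        return "above ~20%, likely diminishing returns"
    return "within the 10-20% band"


if __name__ == "__main__":
    # 14 nodes x 4.8TB NVMe cache against 406TB usable (figures from the post).
    ratio = cache_ratio(14 * 4.8, 406)
    print(f"cache ratio: {ratio:.1%} -> {assess(ratio)}")
```

Measured against raw capacity instead, the same build would show a much lower ratio, so it is worth being explicit about which denominator you use when comparing platforms.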
Storage Spaces Direct is super flexible
The above highlights another reason why S2D is awesome and having a huge impact on the market: its flexibility. The configuration choices are endless and the S2D team are pushing out regular updates and blogs (like this one) demonstrating some of the features, flexibility and potential performance of the technology.
All that said, what is the point I’m trying to make? All-Flash is sexy. Makes good sense in legacy monolithic based storage architectures. But in a modern world of SDS with a game changer like S2D, you should think long and hard about whether it is the right choice for your environment.
As usual this all depends on your appetite, but nevertheless, this is still some good food for thought…
Happy software defined storage-ing!