The question: To enable or disable CSV balancer in S2D?
Before we continue, I want to clarify that the Cluster AutoBalancer and the CSV Balancer are separate things.
The Cluster AutoBalancer relates to cluster roles, or in this case the VMs. The CSV Balancer is about the CSV storage.
Info on each can be found here:
Cluster Autobalancer aka VM Load balancing: Clustering and High-Availability
CSV Balancer, i.e. what we are talking about here: Automatic SMB Scale-Out Rebalancing (this is related to SOFS and 2012R2 but gets the info across)
All up to date? Great 😀
I’ve had this discussion a couple of times now, so I thought it worthwhile to note it down here, but first some context.
When deploying an S2D cluster, what most engineers typically do for performance benchmarking is use a tool called VMFleet. I won’t go into the detail of how VMFleet works, but part of its process is to disable CSV balancing to get the best performance statistics it can.
Now, the discussion is usually around whether or not to enable CSV balancer again.
As with all technical decisions, the most appropriate answer for you comes down to my favourite response: ‘it depends’.
In my case, I like to simplify my management life as much as possible, so my preference is to enable CSV balancer so that each node manages its own CSV. OK, more context: I have 1 CSV per node in each production instance. The S2D recommendation is to have an equal number of CSVs per node, so if you have more, the same logic applies. Also, most of my fabric is not HCI; it’s disaggregated, so my S2D clusters sit behind Scale-Out File Servers. For me, CSV balancing means all CSVs are distributed evenly across the nodes, giving the best SMB throughput available.
One of the main arguments for not enabling CSV balancer is that in a world of HCI, the best possible performance is achieved when the node that owns the CSV is also the hypervisor running the VMs stored on it. This is true, no debate here.
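If you want to see how your CSVs are currently spread across the nodes, something like the sketch below should do it. It assumes the FailoverClusters PowerShell module is installed and that you run it on (or remotely against) a cluster node. It only reads state and changes nothing.

```powershell
# Count how many CSVs each node currently owns.
# Assumes the FailoverClusters module is available on this machine.
Get-ClusterSharedVolume |
    Group-Object -Property { $_.OwnerNode.Name } |
    Select-Object -Property Name, Count
```

With one CSV per node and the balancer doing its job, I would expect each node name to show the same count.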
My thinking around this though is that once we hand over VM creation to the masses via WAP or SCVMM, the newly deployed VMs could land anywhere… We could get creative with VM templates and storage profiles etc, but in my honest opinion we’re only adding unnecessary admin overhead to our platforms.
Another discussion I seem to have occasionally is about the effort required to achieve the best possible performance, or to squeeze as many IOPS into a VM as we can. I often ask, do you really need it?
Whilst it is nice to know a single VM could push IOPS up into the several hundreds of thousands, is it really required for the majority of workloads we’re managing? As an example, on an average day one of my production clusters with over 500 VMs hums along at about 2-3K IOPS with a snappy response time of approx 300μs. That system is capable of over 1.4M IOPS, so the day-to-day load is around 0.2% of capacity at most, and trying to optimize to get the last nth of performance doesn’t seem like a worthwhile exercise. Plus the characteristics of the workloads can evolve over time…
But it’s not all about IOPS, what about latency? Good question! And you’re absolutely correct. But again, to what benefit? We’re squabbling over a few microseconds here.
Granted, this is not the correct approach for everyone, but if your workloads require the kind of IOPS and low latency where 100μs has a serious impact, then I doubt you’ll be here asking this question anyway.
So, my answer: At the end of the day, I suggest enabling CSV balancer and just living with it. If at some point you have a performance bottleneck, I highly doubt that CSV balancer will be your answer.
To enable CSV balancer:
(Get-Cluster).CsvBalancer = 1
Disable CSV balancer:
(Get-Cluster).CsvBalancer = 0
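Before flipping the setting either way, it is worth checking what it is currently set to, especially after a VMFleet run. A minimal sketch, again assuming the FailoverClusters module and a session on a cluster node:

```powershell
# Read the current CSV balancer setting without changing it.
# Per the commands above: 1 = enabled, 0 = disabled.
(Get-Cluster).CsvBalancer

# List each CSV with its current owner node and state.
Get-ClusterSharedVolume | Select-Object -Property Name, OwnerNode, State
```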
This all said and done, if you want to dabble in some scripted/automated balancing then here is a script to do this for you by Aidan Finn. I personally have not run this but have pointed clients to it in the past and have not heard any negative feedback.
Happy to discuss further if you feel like dropping me a line.
Anyway, this is just my humble opinion…