A bit of Sanity for VSAN

Head Data

I had a meeting a month or so back with an IT Director at one of the large banks in London. He’s always been a big fan of NetApp technology but is also very pragmatic about looking at all options when deciding on an infrastructure. As with most companies I meet, they are heavily invested in virtualisation, in this case with VMware. Toward the end of our meeting he stated…

“as soon as VSAN is available then we’ll use it for our virtual environments along with commodity storage”

It really was right at the end of our meeting, so I had no time to get into a discussion with him about it. After the meeting I started thinking about how I would respond, and I realised very quickly that I didn’t have a good and relatively brief response for someone at this level; every time I thought about it I ended up digging down into the details. Now that I have thought it through some more, here’s what I’m thinking.

I think there are some environments where VSAN might be an interesting fit, although I think it needs significantly more capabilities than it has today and needs to be more than a version 1 product. However, right now, for most environments you are likely to increase the cost of your storage, not reduce it; you’ll introduce significantly more complexity, not simplify things; and above all you will bring additional risk to your data and applications.

Cost – If you think that using VSAN will save you money, that simply may not be the case, and here’s why.

You’ll obviously need to buy more servers to hold all the data: the operating systems, the applications and all the data that they create. What you may not initially realise is that you have to create 3 copies of all of the data, maybe even 4 if you actually want to achieve five nines (99.999%) of availability. In the first version of VSAN there’s also no deduplication and no compression, in fact basically no data reduction capabilities at all. And don’t forget you’ve got to pay license costs for VSAN.

Yes, you removed that pesky efficient external storage array, and in doing so you just exploded your storage requirements. All of the efficiencies are gone, and on top of that you also have 3 or 4 copies of this now inefficient data. I’m not convinced that this will make your storage costs go down, especially as you have to find room and power for all these additional disks you’ve just bought.
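To put rough numbers on the cost argument, here is a minimal back-of-the-envelope sketch. The 100 TB of data and the 2:1 data reduction ratio are purely hypothetical figures chosen for illustration, not measurements from any product, and the function names are my own. It ignores RAID parity, spares and licensing, so treat it as a way to reason about the multipliers rather than a sizing tool. It also shows how small the downtime budget for five nines actually is.

```python
def raw_capacity_tb(logical_tb, copies=1, reduction_ratio=1.0):
    """Raw disk capacity needed to hold `logical_tb` of data.

    copies:          number of full copies kept of every block
    reduction_ratio: logical-to-stored ratio from dedup/compression
                     (1.0 = no data reduction, 2.0 = data halves in size)
    """
    if copies < 1 or reduction_ratio <= 0:
        raise ValueError("copies must be >= 1 and reduction_ratio > 0")
    return logical_tb * copies / reduction_ratio


def downtime_minutes_per_year(nines):
    """Downtime allowed per 365-day year at an availability of N nines."""
    unavailability = 10 ** -nines          # e.g. 5 nines -> 0.00001
    return unavailability * 365 * 24 * 60  # minutes in a year


if __name__ == "__main__":
    logical_tb = 100  # hypothetical amount of application data

    # Replicated copies with no data reduction, as described above.
    for copies in (3, 4):
        print(f"{copies} copies, no reduction: "
              f"{raw_capacity_tb(logical_tb, copies=copies):.0f} TB raw")

    # Single copy with an ASSUMED 2:1 reduction ratio (illustrative only).
    print(f"1 copy, assumed 2:1 reduction: "
          f"{raw_capacity_tb(logical_tb, reduction_ratio=2.0):.0f} TB raw")

    print(f"Five nines allows about "
          f"{downtime_minutes_per_year(5):.2f} minutes of downtime a year")
```

Swap in your own data volume, copy count and whatever reduction ratio you actually measure on your current array; the point is simply that the copy count multiplies raw capacity while data reduction divides it.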

Complexity – This assumes that you currently think that an external storage array is complex. Compared to what? To VSAN? I disagree.

In addition to now having a lot more data and a lot more copies of it, and therefore significantly more physical disks and servers to manage, you’ve now also got to work out how you’re going to protect the applications and the data they create. The value of our storage arrays is the way that they integrate Snapshots, SnapRestore and SnapMirror capabilities with the applications in one consistent method. The challenge you now have with VSAN is that none of this type of capability exists. You can indeed take a snapshot of a VM, which consumes huge resources when you do, but it is no more than crash consistent for the application inside. So you will now need to consider how you’re going to protect these applications without suffering unacceptable recovery time and recovery point objectives.

Let this sink in for a second… You now have to find a completely new way to protect the applications and data that were all previously taken care of by the storage array, and it has to be able to deliver Recovery Time and Recovery Point Objectives (RTO and RPO) of seconds or minutes. Do you think you can find one tool that can do this for all of your virtualised apps? If not, then you’ve just massively increased your complexity, and again you have to add this to the cost of the VSAN solution.

Risk – Storage is different from the other layers in the infrastructure. The servers run the applications that create the data and the networks move it, but it lives on the storage. Everywhere else it’s just in transit, but at the storage layer we expect it to be safe for years, and that brings many different considerations.

The latest IDC storage tracker showed that we shipped 1,168,018 Terabytes of storage in Q3. I’m not going to do a detailed breakdown here, but this obviously means we’ve shipped many millions of disks over the 20+ years we’ve been in business. Amongst the things we’ve learnt is that drives fail in the most unusual ways, from obvious physical failures to lost writes and torn pages. Ensuring that the data is protected as these events occur is a major part of what we do, and these lessons learned and technologies developed simply cannot be overlooked or understated.

If a server fails it’s inconvenient, but with VMware you can just bring the virtual servers back online on a new server. If storage fails, or worse still data becomes corrupted, then it’s going to have consequences. You must expect your storage environment to reduce the possibilities for silent corruption and to offer you tools to recover any data, for any application, in a consistent state in seconds or minutes. You simply don’t get this with VSAN today, and you have to decide if it’s a risk you’re prepared to take.

You are going to hear the most incredible sales pitches for VSAN. If you think the industry spoke a lot about Cloud, it’s just getting warmed up with Software Defined Storage. Don’t get me wrong here: I think that VSAN has the potential to be an interesting technology in the future, but building storage arrays with software and hardware is not a trivial task, so I just wanted to provide my perspective on what it is and what it offers right now.

The next time someone tells me they are going to move to VSAN for their virtual environments, I’ll ask them why. I really want to know, because from my perspective, as it stands today:

You will not necessarily reduce your costs, you will significantly increase complexity and you will introduce risk. And what do you get? What benefits are you getting that would make you accept this? With the capabilities in VSAN today, I would say very, very little.

6 thoughts on “A bit of Sanity for VSAN”

  1. Nice article… It’s easy for customers to forget what the storage array currently does quietly in the Data Centre: protecting a company’s most valuable asset. It’s a little scary to think that large organisations will be swayed by a sales pitch for VSAN as a tool to remove the wicked greedy storage vendors. I suspect customers like the one you have mentioned will not deploy VSAN in anger, and I suspect at the POC stage the realisation about data protection and efficiency will become clear. What do they say about things that seem too good to be true?

  2. You certainly do have some valid points around storage in general. And yes, even VMware are aware of that. VMware have targeted VSAN at very specific use cases. It is not currently designed for Tier 1 storage. We have 4 key use cases:
    1. VDI Workloads
    2. Tier 2-3 Workloads
    3. Test/Dev
    4. Remote Office

    It is important that the customer is made aware of these specific use cases.

    In all these environments, 3 copies of the data are very unlikely, so the management of data and disks is certainly reduced.

    VSAN is not just about a distributed storage system; it’s all about how we consume data. It’s about setting policies around VMs, and making sure that the VM always has the associated policy/performance/SLA regardless of where it resides on the storage. There are also no RAID groups, volumes, LUNs, etc. This reduces the complexity of setting up storage requirements for an IT department. Managing disks is not the difficult part; it’s the black magic that is often used to make sure that the right level of performance/redundancy is assigned to a specific workload.

    You will see that many, if not all, of the storage vendors will start adopting this style of storage consumption in the near future (VVOLS).

    • Thanks for the response. I hope you don’t mind, but I’m not really in favour of having links to other blogs inside my comments, so I’ve taken them out. I’m sure anyone wanting to find more details on VSAN and VVOLS can search and find their own sources for this information.

      Your response actually supports one of my reasons for writing this blog. I find it interesting when people say ‘VSAN is targeted for very specific use cases’ and then list Tier 2 – 3 workloads as one of these. Many large companies I meet have somewhere between 1000 and 2000 applications and would probably class 50 – 60% of these as Tier 2 – 3 workloads, so the ‘very specific’ use case for VSAN is actually the majority of applications that a business runs… I just don’t see how this can be classed as specific. Now add on VDI, Test / Dev and Remote Office, and again the ‘very specific’ use cases actually cover most of what a business runs today.

      Very specific is actually very vague. The level of data protection that I’d be comfortable with for VDI or Test / Dev is VERY different from what I would expect for many other applications, so to just group them all together this way is, I think, misleading. People need to be aware of the huge amount of data protection features they are sacrificing by going the VSAN route right now, and a list of workloads like this does not make that clear.

      Look at this the other way: what’s not in this very specific list? Tier 1, that’s it. Apparently VSAN, a version 1 product, is already good enough for everything else? I’m sorry, but that is simply not the case.

      And finally, many of the things that VVOLS promises in order to reduce complexity are what some of the storage vendors have already been doing for many years.

      Appreciate you stimulating a discussion on this.

  3. Interesting read. I am involved with the Spiceworks community. Over there they LOVE VSAN. Keep in mind that Spiceworks is geared towards the SMB market, and not enterprise-class. SAN seems to be a dirty word there.

    Many times I hear people say a dual controller NAS/SAN is actually less reliable than a physical server. Seems absurd – I know. They also say that instead of buying a dual controller NAS/SAN it would be far more reliable to set up two physical servers running ESX and use Veeam to make a backup every hour to the second server, so that in case of failure you just power on the VMs on the other server. Their reasoning is that the complexity of the dual controller system actually makes the system less reliable. They point to firmware updates and the like that might cause both controllers to go down.

    Not that you have time, but here is a thread: http://community.spiceworks.com/topic/574708-san-nas-das-thoughts-on-a-direction

    They like to refer to the “inverted pyramid of doom” (3-2-1), where the NAS/SAN is the 1, network switches are the 2 and hosts are the 3. So they basically say never buy a NAS/SAN because of this (unless you want to buy 2 separate SANs) and use VSAN or even the 2 server scenario with hourly backups to avoid the inverted pyramid of doom. If I could plainly show that the chances of the “1” failing were lower than the chances of seeing a neutrino, it would help!

    I wish I had some real facts about dual controller reliability so I could effectively counter the “inverted pyramid of doom”.

    • One thing to address right off the bat that I find to be very interesting: the definition of SMB. The attitude on Spiceworks derives from the idea that SMB doesn’t need anything special because its requirements are different from those of any other tier. I beg to differ; in many cases the requirements are even more stringent. The question that I always ask: if the requirements for reliability and significant feature sets for workloads are so much lower, then why not just move to a public cloud deployment and remove all of the complexity and decision making from the equation? The failure domains will be similar, and it also moves your company to an OPEX vs. CAPEX model, which for every company in the US is a significant tax relief. Given the apparent reluctance to move the workload to a public cloud, there needs to be reassurance of local reliability. I believe this may be due to the need for HA of the application state in their environment to prevent loss of productivity.

      The productivity issue is key. Two ways that this can be looked at are:

      SMB businesses generate fewer $$ per hour, so an outage of several hours will not have a material impact on the operations of the business.

      SMB businesses being smaller means that an outage may have an impact on a higher percentage of the company’s assets, thus stopping the entire company until resolved.

      If the former, then just go to a service provider to host, and enjoy the savings. If the latter, then you need to architect for resiliency based on an SLO for recovery that defines the acceptable level of resiliency.

      As far as distributed SAN (VSAN) and local disks vs. SAN/NAS go, especially in a hypervisor environment, there are advantages to NAS/SAN: coalescing and caching of IO, deduplication and compression for capacity savings (distributed/VSAN requires multiple copies), snapshots so no secondary copy is needed for local recovery, etc. In addition to this, there are licensing issues in a vSphere environment, such as the number of cores and hosts in the vSphere license and the current need to be on Enterprise+ for VSAN. SAN/NAS also has data availability numbers for NetApp approaching six nines. I might guest post to elaborate on this more and get into the detailed points, but this should give you an idea of the challenges that need to be addressed and how NAS/SAN from a top vendor can help.

      Transparency: I have 25+ years’ experience in IT architecture and have worked at startups, SMBs, top enterprises, resellers, and at vendors such as SGI and HP. I currently work for NetApp, where I create best practice architectures for virtualized environments.

      • Nathan, just to add to what Joel has said above: in our experience, when a failure occurs, it can frequently be attributed to software. One of the reasons why storage arrays are very reliable is that they ‘just do storage’. If they were to be used for running hypervisors, operating systems and applications, then you would start to introduce a lot of potential failure areas, which is effectively the VSAN / Server Storage model.

        As I also mentioned in my blog, aside from noticeable and obvious physical failures, you absolutely have to take into account ‘torn pages’, ‘lost writes’, etc. These are the silent corruptions that may be happening in the background that you won’t ever know about until you suddenly find that some part of your data can’t be read. Storage arrays know about these and ensure that they either don’t happen or are immediately fixed, which is not something that your VSAN / Server Storage is even considering.

        With our scale out storage architecture, software or firmware upgrades don’t cause any interruption to service, hardware upgrades or replacements don’t cause any interruption to service, silent corruptions are being dealt with and we’re doing all of this with massively high availability.

        I do think VSAN / Server Storage offers interesting possibilities, but ‘The inverted pyramid of doom’ seems to me to be a sensational post that makes a myriad of unsubstantiated assumptions about storage controllers being less reliable than servers (with absolutely nothing to back this up) and that doesn’t focus on any of the issues regarding data availability and protection, which is actually the whole point of storage.
