By Marvin the Paranoid Biological Android

HCHCI - The Great Broadcom Killer

Yesterday's Hero (3-Tier) vs HCHCI (the new kid on the block)

Many of you familiar with HCI will ask: well, what is this HCHCI thang anyway? In my book, it is merely Hybrid Cloud Hyper-Converged Infrastructure.

What makes HCHCI different from plain vanilla HCI is that with this architectural model, the cluster operating system runs anywhere: on-prem, in any participating public cloud, or on the edge.

Before we go there though, let us go back in time a wee bit to around 2012, when I was considered by many to be an EMC, Hitachi, NetApp and IBM storage guru. It was at about this time that I started to get a tad nervous about the big storage systems we were shoving down several customers' throats, with rather exorbitant price tags to match.

For many years I had my eye on Sun Microsystems' Amber Road project, as well as Microsoft lab Frankenstein server experiments that turned x64 servers into clustered compute architectures with everything they needed to run applications within the clustered system.

That is, all storage, memory and CPU pooled together from all participating servers, controlled by orchestration software so as to appear as one single system that could scale from two servers to thousands, as required.

Storage access, along with the other roles of an old-school 3-tier system, was identified by myself and several other Silicon Valley startup folks as where the opportunity lay: apply a new type of solution thinking to the myriad complexities that came with several stand-alone IT systems pressed into service as a single working unit. Most companies' IT folks had to wrestle exactly those silos together just so the business could run its applications as simply and easily as possible.

In other words, make one appliance that does what up to nine other appliances do.

The idea being that the cluster of servers' resources would be consumed by the applications, making the infrastructure invisible to the application.

When you ran out of resources you just added servers to the existing cluster, which allowed more precise and more affordable scaling upwards, with none of the boundaries that traditional three-tier data center solutions have with their separate components.

The administrator would no longer have to worry about which server, resources, networking, operating systems, storage and the myriad of other stuff they normally had to provision to allow an app to run on a server (physical or virtual).

All that infrastructure setup would be done by a whole pile of processes controlled by a cluster operating system which ran self-healing and self-administration algorithms on a continual 24x7 basis.

The cluster could be expanded or shrunk by adding or removing servers, but the starting minimum number of servers (nodes, or hosts) would be three, for basic HA purposes. Four is a more serious HA starting point with real resilience, but for companies serious about resilience and uptime for tier 0 and tier 1 applications, five nodes should be the minimum starting point.

The fab thing about cluster technology is that the more servers you add, the faster everything gets: it is not just linear capacity expansion, it is also near-linear performance increases, with stronger HA protection the more servers you throw into the mix.
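The node-count guidance above can be sketched as simple arithmetic. This is a back-of-napkin illustration of my own, not any vendor's sizing formula, assuming a cluster keeps RF (replication factor) copies of each data block on distinct nodes:

```python
# Illustrative only: how node count relates to the failures a cluster can
# absorb, assuming RF copies of every data block live on different nodes.

def failures_tolerated(nodes: int, rf: int) -> int:
    """With RF copies of each block, up to RF - 1 nodes can fail and
    every block still has at least one surviving copy."""
    if nodes < rf:
        return 0  # cannot even place RF distinct copies
    return rf - 1

def can_self_heal(nodes: int, rf: int) -> bool:
    """After losing one node, self-healing needs at least RF surviving
    nodes to restore full redundancy for every block."""
    return nodes - 1 >= rf

# 3 nodes with 2 copies: basic HA, and the cluster can re-protect itself.
# 5 nodes with 3 copies: tier 0/1 territory, survives two failures.
for n, rf in ((3, 2), (4, 2), (5, 3)):
    print(n, "nodes, RF", rf, "->",
          failures_tolerated(n, rf), "failure(s) tolerated,",
          "self-heals" if can_self_heal(n, rf) else "no rebuild headroom")
```

This is why three nodes is the floor and five is the serious starting point: below RF nodes you cannot even place the copies, and without a spare node's worth of headroom the cluster cannot re-protect data after a failure.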

Another benefit of such a cluster-based system was layering complex automation and orchestration tools on top of the cluster function by embedding the capability into the cluster operating system software bus.

These were not to be third-party bolt-on additions, but rather built into the cluster operating system from inception. The same thinking applied to security: protecting the data and the data integrity, which also covered replication schemas to remote clusters.

Basically, all the add-on features you can think of, from encryption to firewalls and such, needed to be mere license-activation-level efforts within the capability of the cluster operating system.

What this architecture allows you to do is offer the services that stand-alone appliances deliver as part of the cluster OS.

Thus you have a software bus with full API support for anything that you just enable from the Admin console.

Virtual machines get managed by a virtual machine service; storage gets enabled by block, NAS and object services; high availability gets served by replication and HA services; backup gets served by BCDR services; security gets served by a myriad of specialist security services covering ransomware, malware, data protection and so on, ad infinitum as the Romans used to say.
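To picture the license-activation idea, here is a toy sketch of my own: the services ship built into the cluster OS software bus, and the admin merely flips them on. Every name here (ClusterOS, the service keys) is my illustration, not any real product's API:

```python
# Hypothetical sketch: built-in services on a cluster OS software bus,
# activated by license toggle rather than bolted on as third-party add-ons.

class ClusterOS:
    def __init__(self):
        # registry of services baked in at inception, keyed by feature name
        self._services = {
            "vm": "Virtual Machine service",
            "block": "Block storage service",
            "nas": "NAS file service",
            "object": "Object storage service",
            "replication": "Replication/HA service",
            "bcdr": "Backup/DR service",
            "security": "Ransomware/malware protection",
        }
        self._enabled = set()

    def enable(self, name: str) -> str:
        """Flip a built-in service on; no install, just activation."""
        if name not in self._services:
            raise KeyError(f"no such built-in service: {name}")
        self._enabled.add(name)
        return f"{self._services[name]} enabled"

    def enabled(self) -> list[str]:
        return sorted(self._enabled)

cluster = ClusterOS()
cluster.enable("vm")
cluster.enable("replication")
print(cluster.enabled())  # ['replication', 'vm']
```

The point of the sketch is the shape of the thing: one registry, one bus, and "adding a firewall" is a dictionary lookup plus a license flag, not a new appliance in a new rack.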

After playing with several Frankenstein cluster computer systems I built myself using Asus and Supermicro servers, I concluded that this was not as easy as it sounded.

The problem being that the operating systems on each server were single entities designed to run as sole contributors.

Sure, Active Directory made any collection of servers part of an Active Directory domain, but this was not really a single cohesive clustered architecture that behaved like a mainframe, tasking sub-units with compute jobs Borg style under the eye of the Borg Queen.

The problem was that everyone wanted their schema to be the winning solution, which meant you had to decide to be either Windows-based or open source Linux-based, and not really both, on what was mainly a sole-hypervisor, VMware-only basis.

I did some paper-napkin blueprints of my clustering ideas and tried to interest some startup folks in the clustered x64 concept, as I felt it could lower compute and IT costs in the average company by a drastic amount: build a single cluster-based solution that did everything you needed it to do, and do away with all the separate silos of equipment and expertise the average corporation's IT group had to deal with.

Simplifying things drastically, in other words.

I had mentioned some of the Amber Road ideas I had seen to a few bright fellows over at Oracle back in 2006, and we had identified several areas requiring development that stood in the way of achieving these ambitious goals.

I had also mentioned these ideas to several bright folks over at EMC and Cisco, and a few years later VCE Corporation was formed by some like-minded souls to try to solve this problem in another unique way.

Unfortunately they gave up all the basics we had agreed on in terms of how it would work, and it ended up being three separate systems with a mediocre orchestration effort.

I myself was transitioning from Chiron Corp to Silicon Valley Bank at that time, and a short time afterward decided to join Forsythe Solutions Group as a storage specialist, focusing on Hitachi, EMC and NetApp storage systems sales and architecture, as this was where the big money was in IT at the time.

I had a great deal of success with this and picked up a good few architect-level certifications while I was at it, along with a ton of deployment and operating observations from the customers using said solutions.

I also encouraged the Cisco folks with their UCS server project as a new entrant into the server market, and that too saw developments along these lines in the VCE alliance formed around 2008.

In this venture Cisco did the servers and the Fibre Channel and Ethernet networking, VMware did virtualization, management and analytics, and EMC did the storage required by the virtualization hosts, in what they called a Vblock.

They basically built a whole rack of everything a target customer required and turned it into a single system pre-installed in racks that a customer bought turnkey style.

To me this was just an elaborate and super-expensive systems integration exercise: shenanigans.

I was offered architect roles within VCE several times along their journey but declined, as the fruit of their labors was merely a collection of the same old systems under the umbrella of semi-serious software they wrote to control it all, software that used APIs to sub the tasks out to the default OS each component had been designed to run.

Essentially they were gluing all the stand-alone silo systems together and using a new GUI to make it look like a single entity, but it was not.

It was also as slow as shit.

This sort of super-orchestration defeated the purpose of building a single server-based cluster to serve all compute needs, and you still needed to log on to the various components' CLIs to get real shit done when the API missed the boat.

The result was actually way more complicated than the individual silos were collectively.

After a stint at Dimension Data USA, I decided to join a small channel specialist company called Nexus IS, which Dimension Data later bought, before various folks I had worked with there joined ePlus and Presidio.

When I arrived at Nexus they had done $800K of EMC storage systems sales the entire previous year, and I grew this to $24 million solo after I got rid of some local Hall competition from my Forsythe days, who were themselves at Nexus IS and whom I did not want to work with.

The guy who hired me at Nexus IS left about a year after I came on board to join a startup called SimpliVity, who were in fact trying to do what I had napkined out years earlier.

However, their technology centered on a proprietary bit of hardware they did not own, plus some software running on specific Supermicro servers to make it all happen. While promising, they had not gathered enough investment to properly write an OS that could accommodate any hypervisor: their effort was centered around VMware ESXi only.

Like all startups, they looked to where the money currently was, sans a future view, and as they saw it the money was all in VMware ESXi, so they developed for that platform to make money as fast as possible, hoping VMware would never die.

I could see where that was going after SimpliVity failed to accommodate Microsoft's Hyper-V hypervisor in their platform; I pointed out that most IT shops were using both Linux and Windows, and that Linux was eventually going to claw a lot of market away from Microsoft.

While you could run Linux on ESXi, almost everything was centered around Windows Server at first.

Everybody laughed at my notion, touted at that time, that Linux would grow to a huge size in the average IT shop.

As one of my roles at Nexus IS was evaluating emerging technology, one day back in 2012 I got a visit from a new startup in San Jose whose name I thought was similar to Neuticles, which made me laugh at the time, as I had fitted one of my dogs with said Neuticles shortly before I ran into these folks.

The company name was Nutanix, and a guy there called Mohit was on the same rough track I was thinking along: a cluster-based architecture with its own OS that had a software bus you could plug any hypervisor, or add-on service like security, into.

My idea was based on AMD and IBM silicon only, by the way; I have no love for anything Intel makes CPU-wise.

As the Nutanix concept was what I then considered pre-alpha, albeit with a solid concept framework, all I could do was encourage them and ask how it compared with SimpliVity's efforts.

At that time SimpliVity was about two years ahead of the early Nutanix Acropolis Operating System (AOS) efforts.

Mohit schooled me on the differences and why his architecture was a winner vs SimpliVity which was very informative and worth the two weeks of my time.

I decided to keep my eye on Nutanix's progress though, and in 2014 I started promoting a Hyper-Converged Infrastructure group as a specialty within the reseller I was then working for, selling SimpliVity and Nutanix to customers, as both were ready for prime time by then.

I later added Cisco HyperFlex and EMC's VxRail to our early HCI portfolio lineup, but it was only after Dell bought EMC that VxRail started to become a half-decent product.

This was because Michael Dell, who had a great relationship with the Nutanix folks, had made early Dell commitments to Nutanix which the EMC folks tried to undo.

It was only after the EMC brain trust sat down with the Nutanix folks in San Jose for a five-week session that we started to see VxRail evolve from the real crap platform it was at 3.x into a more reasonable one, albeit still an ESXi-only hypervisor platform.

As Dell had also acquired VMware when it bought EMC, they all had a vested interest in ESXi-only platforms being at the center of everything they sold.

The difference between 3.x and version 4 of VxRail is like night and day, thanks to the critical schooling the Dell EMC folks got from the Nutanix gurus.

The thing Nutanix lacked at that time was a dense storage target for backups, and to my amazement they built the 6000 series to accommodate this ask from myself and, no doubt, several other customers as well.

They made it feel like a response to my personal input, which it was not, but it was a nice touch nonetheless.

It was unusual for an IT manufacturer to actually accommodate customer asks, which is a stupid and short-sighted failing on the part of most IT manufacturers, in my humble opinion.

I started to get these same Nutanix folks to focus on high-performance compute use cases, as I saw a huge market segment for this sort of solution running Oracle and SQL clustered databases in particular.

The tier 2 and 3 strategy Nutanix had declared as their initial focus was lacking in ambition, and I lost no time telling them they needed to go for everything, wall-to-wall style, bar networking fabrics.

PayPal down the road were in fact super interested in what Gartner was now calling the Hyper-Converged market segment in the data center IT space; they had tried SimpliVity, but it was a toy for their needs at that time, with more cons than pros.

Nonetheless, I had grand schemes for PayPal with Nutanix, which I saw come to reality in 2019 when I was Director of Engineering at Rahi Systems.

The initial Nutanix platforms back in 2009 seemed to be focused on tier 3 and 4 applications and use cases, mainly VDI for Citrix and VMware Horizon.

As Nutanix developed more serious platforms and evolved AOS to solve the bigger everyday data center compute problems they ran into along the way, the whole thing became rather impressive.

AOS is an operating system that orchestrates server hosts to make sure any virtual machine can run on any cluster host with everything it needs: RAM, CPU cores, storage and networking resources.

It turned a group of vanilla computer servers with components ruggedized for HCI operations into a super mainframe type cluster that could literally serve all data center needs.

This is where a lot of people make a mistake about Nutanix: trying to understand it from their traditional VMware vSAN perspective.

People always try to compare something new with what they know, and most people understand storage systems and vSAN.

Nutanix is not just a storage system though; storage is a mere small gear in the overall gearbox of what Nutanix AOS actually does.

For some reason many folks try to compare AOS to vSAN, and this is where they fall into the mud pit, mind- and concept-wise.

You have to compare ALL the three-tier architecture silos to what Nutanix AOS does and delivers.

AOS works with any hypervisor from a common bus point of view because it was designed to accommodate any hypervisor out of the box.

All you have to do is imagine you have an operating system on the bare metal that ESXi or AHV plugs into with full API data flow either way.

Most people cannot compute ESXi plugging into AOS rather than into bare metal.

AOS takes all the storage from the nodes and has other processes, whose names were borrowed from the Stargate TV series, that manage each virtual machine's access to the cluster's resource pool, including the storage.

This way the VM does not need to make calls for its data to a NAS or SAN storage array sitting remote from it, because with Nutanix AOS all resources are local to where the VM is running, with HA copies spread across the participating servers' (nodes') SSDs and HDDs in case a host goes out of service or fills up.
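The locality idea can be sketched in a few lines. The placement policy below is my own deliberate oversimplification (real AOS placement is far smarter): one copy of each block stays on the VM's own node so reads stay local, and the remaining RF-1 copies land on other nodes so a host failure loses nothing:

```python
# Illustrative sketch of data locality with replication: the VM's node
# always holds one copy, the rest are spread across distinct other nodes.
# Round-robin placement here is my simplification, not a real algorithm.

def place_replicas(local_node: int, nodes: int, rf: int) -> list[int]:
    """Return the node IDs holding copies of a block: the VM's own node
    first (local read path), then the next rf - 1 nodes round-robin."""
    if nodes < rf:
        raise ValueError("need at least RF nodes to place RF distinct copies")
    return [(local_node + i) % nodes for i in range(rf)]

# 5-node cluster, RF3: a VM on node 2 reads locally from node 2,
# while nodes 3 and 4 hold the protection copies.
print(place_replicas(local_node=2, nodes=5, rf=3))  # [2, 3, 4]
```

The payoff is the read path: since element zero of the placement is always the VM's own node, the common case never crosses the network at all, which is exactly why the latency numbers later in this piece are possible.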

Nutanix uses Ethernet as the common bus, which makes the network fabric the HCI bus and not some proprietary single-point-of-failure backplane idea everyone else in this game got stuck at and on.

Because HCI traffic is East-West in nature, and not the traditional North-South that most Ethernet switching fabrics were built for, some development and special switch types soon proved to be the order of the day for serious Nutanix cluster HCI workloads.

Juniper and Mellanox leapt straight into it with Nutanix and pretty soon suitable switches specifically designed for HCI East-West flow traffic started to appear from other vendors as HCI became a thing.

Aruba and even Cisco swiftly jumped on board with the concept, and the Cisco Nexus 9300 series was born out of this need.

I wanted to join Nutanix back in 2017, when I started to realize AOS was now ripe and ready to be a real game changer, and I had a standing offer from several of their folks I was working with. However, those folks joined Mohit at his other startup endeavor, Cohesity, so I had to cultivate other contacts at Nutanix to pull me in. In the meantime I went to work for NetApp, whose HCI effort I had heard was getting serious, for a look-see while I waited for Nutanix to get serious about making me an offer.

NetApp, however, had completely misjudged what AOS and HCI actually were. They assumed it was merely VCE-light plus PowerShell madness.

They concluded this after looking at what Microsoft was doing with a similar HCI project that took individual IT components and sewed them together with complicated PowerShell scripts and nothing more.

To say I was disappointed with the F-grade SolidFire-based effort NetApp strung together with really stupid PowerShell scripts would be a massive understatement.

Microsoft was also doing the same VCE-styled thang with Dell, HPE and Lenovo servers and storage systems, sewing these individual systems together into a single-GUI illusion show.

I think some of this fallacious Microsoft thinking bled into NetApp unchecked and they screwed themselves in a flight of fantasy that went nowhere in particular.

It took Microsoft a long time to realize they needed to build an actual HCI OS, like Nutanix had done, to overcome the failings of PowerShell and these various stand-alone appliances.

I told my boss at NetApp he had made a serious error of judgement on this piece, as around $3 billion of investment had been poured into Nutanix AOS by this stage of the game, which he point-blank refused to accept as reality.

After assessing the NetApp HCI system as effectively as useful as tits on a bull, I reached out to my contact at Nutanix and got serious about switching from NetApp, stat.

The role Nutanix had at first intended for me was held up, so I came onboard as an SE, the idea being I would switch up into the piece I wanted to be in.

Unfortunately my sponsor swiftly departed to AWS with the same chunk of people he had nurtured at Nutanix and then COVID struck...

I had deployed Epic software systems on EMC VMAX and PowerMax storage arrays at many hospitals, mainly under the KP umbrella, as well as at many independent children's hospitals on the West Coast, and had also deployed their competitor's system on Windows-based servers.

That company was called Meditech.

Even though the average hospital pays hundreds of millions to run EPIC software, they do not have to buy the best and most complicated storage arrays and IT systems to run it all on.

They just do, because of all the cash they paid for Epic.

I had sold a lot of healthcare concerns a lot of VMAX platforms to run their Epic software on though, and had even earned my hard-to-get Symmetrix Architect certification at the fifth attempt. I even passed the still more difficult EMC Symmetrix Speed Guru exam, which you needed 100% on to pass.

These two certifications were by far the most difficult ones I ever attempted, and I am not bragging when I tell you I have done nearly all of them at one stage of my career or another.

However the thinking of folks who paid Epic these sums of money was that the Infrastructure they run it on needs to be made up of the best components possible.

I built many architectures based on EMC Symmetrix DMX-series storage arrays at first, then VMAX, and then PowerMax.

I also did this with Hitachi Data Systems and the re-badged Hitachi Arrays that HPE were also selling.

Most of these Epic systems then ran on IBM AIX platforms, but I did the first Linux one at Stanford Children's Hospital back in 2016.

In any event, a cluster-based solution for either Epic or Meditech is far faster, better and more HA-resilient than anything a PowerMax- or Hitachi-based storage system with x64 or AIX hosts could ever hope to deliver.

Especially with locality of reference delivering no more than 200 microseconds of latency!

Thus my solutions for Meditech and Epic systems start at 9-node clusters running RF3 and stretch to 14 nodes.

This is far more affordable and resilient, with way better HA uptime for critical patient care, than anything a 3-tier system could ever hope to offer.
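The RF3 arithmetic behind those cluster sizes is worth a quick sketch. This is my own back-of-napkin math, not a vendor sizer, and the 20 TiB per-node figure is a hypothetical example: with RF3 every block exists three times, so usable capacity is roughly raw divided by three, and you want enough headroom to re-protect data after losing a node:

```python
# Back-of-napkin RF3 capacity math (illustrative numbers, not a sizer).

def usable_tib(nodes: int, tib_per_node: float, rf: int = 3) -> float:
    """With RF copies of every block, usable capacity is roughly raw / RF
    (ignoring overheads like metadata and compression/dedupe gains)."""
    return nodes * tib_per_node / rf

def usable_after_node_loss(nodes: int, tib_per_node: float, rf: int = 3) -> float:
    """Capacity you can safely fill and still re-protect everything to
    full RF after one node drops out of the cluster."""
    return (nodes - 1) * tib_per_node / rf

# Hypothetical 9-node cluster with 20 TiB raw per node:
print(usable_tib(9, 20.0))              # 60.0 TiB usable at RF3
print(usable_after_node_loss(9, 20.0))  # ~53.3 TiB if you want rebuild headroom
```

That headroom line is the real design constraint for tier 0 patient-care workloads: size to the post-failure number, not the shiny raw total, and a node loss becomes a non-event.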

For super resilience and uber performance I also design these on AMD EPYC CPU nodes.

So far the hospitals that have tried these cannot stop enthusing about how awesome, resilient and performant my architecture is, and it also saved them a ton of cash on CAPEX as well as OPEX.

The bulk of the money in IT is spent on OPEX, and my solution grabs a hold of that by the balls with a solid death grip!

I also build these solutions on 25/100GbE network fabrics that cost a fraction of what the legacy network OEM vendors hawk to their customers for 10/25GbE fare.

Faster, Better and cheaper with 9X more HA and resilience? Who does not want some of that Pie?!

I am expecting to convert over 85% of my old-school deployment$ to this more resilient architecture, and boy, is it ever popular!

The financial companies love this stuff as well, especially the banks...

