This blog post comes off the back of a cluster expansion I did a few days ago. Here’s a quick rundown of the situation before the expansion:
- Single large node deployment at 16 vCPUs, 48 GB RAM and 3.5TB flash disk
- Large infrastructure being monitored – 10,000+ VMs, 400 hosts, 50 clusters, etc.
- Default vSphere management pack – the only one in use
- 31 million configured metrics, 4+ million metrics being collected
Needless to say, vROps was struggling to keep up with the ever-expanding environment. Dashboards would time out collecting data, searches were slow, and reports took a long time to run. Not a good situation given how awesome vROps otherwise is! Something had to be done about this.
VMware have this pretty handy sizing calculator available in the form of an Excel sheet you can look at here. I punched my environment’s size into the calculator and it spat out what I needed:
Away I went with deploying a new node to add to the mix. Keep in mind that the new node must be exactly the same version as the existing one, not one up nor one down (not even a minor release). I’d have thought VMware would support rolling upgrades, but found 18.104.22.1687xxx and 22.214.171.124xxxx don’t go too well together. Anyway, just get the exact matching OVA from VMware downloads and install the new node. I didn’t need an HA cluster, so I didn’t enable high availability. I won’t post screenshots of the steps involved; VMware have done a good job of that here. Here are some of the more interesting things the cluster expansion went through.
Given the size of the environment, during the expansion I saw the following message on and off:
The master node does not immediately hand over work to the data node:
You’ll see the status of the data node change to “Analytics is starting”
After a few hours, the cluster stabilized and you can see the data node now sharing the load:
I also recommend you stop the collection of metrics while the expansion is taking place. The master node works really hard when it’s handing over load to the data node, and collecting new metrics on top of that maxes out CPU and RAM. I saw 90%+ RAM and CPU utilization while the master was handing over load and collecting metrics at the same time.
Now, the cluster runs like a dream!
For some good information on how many metrics and objects a node or multi-node installation can handle, I recommend you read this KB article.
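To get a feel for the arithmetic behind that kind of sizing, here is a rough sketch in Python. It is not VMware's calculator and not how the Excel sheet works internally; the function name `nodes_needed` is my own, and the per-node limits in the example are placeholders, so swap in the real figures from the KB article for your version and node size. The object count only adds up the VMs, hosts and clusters listed earlier, so treat it as a lower bound.

```python
import math


def nodes_needed(collected_metrics, objects,
                 max_metrics_per_node, max_objects_per_node):
    """Estimate how many analytics nodes a cluster needs.

    The cluster must satisfy both limits, so the larger of the two
    per-limit node counts wins. Always returns at least one node.
    """
    by_metrics = math.ceil(collected_metrics / max_metrics_per_node)
    by_objects = math.ceil(objects / max_objects_per_node)
    return max(by_metrics, by_objects, 1)


# Environment figures from this post (object count is a lower bound:
# 10,000 VMs + 400 hosts + 50 clusters).
collected = 4_000_000
objects = 10_000 + 400 + 50

# PLACEHOLDER per-node limits, NOT taken from the KB article.
PLACEHOLDER_MAX_METRICS = 2_500_000
PLACEHOLDER_MAX_OBJECTS = 8_000

print(nodes_needed(collected, objects,
                   PLACEHOLDER_MAX_METRICS, PLACEHOLDER_MAX_OBJECTS))
```

With those placeholder limits the sketch comes out at two nodes, but the only numbers you should trust are the ones from the official sizing guidance.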
Special thanks to VMware’s James Polizzi for his assistance with the numerous questions I asked. He’s a gun at vROps and breathes the product!