vROps 6.x cluster expansion – how and why

This blog post comes off the back of a cluster expansion I did a few days ago. Here’s a quick rundown of the situation before the expansion:

  • Single large node deployment: 16 vCPUs, 48 GB RAM and 3.5 TB of flash disk
  • Large infrastructure being monitored – 10,000+ VMs, 400 hosts, 50 clusters, etc.
  • Default vSphere management pack – the only one in use
  • 31 million configured metrics, 4+ million metrics being collected

Needless to say, vROps was struggling to keep up with the burgeoning, ever-expanding environment. Dashboards would time out collecting data, searches were slow, and reports took a long time to run. Not a good situation given how awesome vROps otherwise is! Something had to be done about this.

VMware have a pretty handy sizing calculator available in the form of an Excel sheet you can look at here. I punched my environment’s numbers into the calculator and it spat out what I needed:

[Screenshot: vROpsclusterexpansionA – sizing calculator output]
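The calculator’s logic boils down to simple arithmetic: take your collected metric and object counts, divide by the per-node limits for your chosen node size, and round up. Here’s a rough sketch of that calculation – the per-node capacity figures below are placeholders you’d replace with the real limits from VMware’s sizing sheet:

```python
import math

def nodes_needed(collected_metrics, objects, metrics_per_node, objects_per_node):
    """Return the node count required to stay within per-node limits.

    metrics_per_node and objects_per_node come from VMware's sizing
    spreadsheet for your chosen node size -- the values used in the
    example call below are illustrative, not official limits.
    """
    by_metrics = math.ceil(collected_metrics / metrics_per_node)
    by_objects = math.ceil(objects / objects_per_node)
    # The binding constraint wins: you need enough nodes for both.
    return max(by_metrics, by_objects)

# Roughly my environment: 4M+ collected metrics, ~10,400 objects.
print(nodes_needed(4_000_000, 10_400, metrics_per_node=2_500_000, objects_per_node=6_000))
```

With these placeholder limits the answer comes out to 2 nodes, which matches what the calculator told me for my environment.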

Away I went with deploying a new node to add to the mix. Keep in mind: the new node must be the exact same version, not one up nor one down (not even a minor release). I’d have thought VMware would support rolling upgrades, but I found that builds 6.0.2.277xxx and 6.0.2.24xxxx don’t go too well together. Anyway, grab the exact matching OVA from VMware downloads and install the new node. I didn’t need an HA cluster, so I didn’t enable high availability. I won’t post screenshots of the steps involved; VMware have done a good job of that here. Here are some of the more interesting things the cluster expansion went through.
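The “exact match” rule is stricter than the usual semantic-versioning intuition – the build number has to match too. A quick sketch of the kind of check I now do before joining a node (the build strings in the example are hypothetical; only the comparison logic matters):

```python
def version_mismatch(a: str, b: str):
    """Compare two dotted version/build strings component by component.

    Returns None when they match exactly; otherwise the name of the
    first differing component. For a vROps cluster join, anything
    other than None means the new node's OVA is the wrong build.
    """
    labels = ["major", "minor", "patch", "build"]
    pa, pb = a.strip().split("."), b.strip().split(".")
    for label, x, y in zip(labels, pa, pb):
        if x != y:
            return label
    if len(pa) != len(pb):
        return "build"
    return None

# Hypothetical build strings -- nothing short of an exact match is supported:
print(version_mismatch("6.0.2.2777062", "6.0.2.2777062"))  # None: safe to join
print(version_mismatch("6.0.2.2777062", "6.0.2.2474520"))  # build: do not join
```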

Given the size of the environment, during the expansion I saw the following message on and off:

[Screenshot: vROpsclusterexpansion1 – intermittent status message during the expansion]

The master node does not immediately hand over work to the data node:

[Screenshot: vROpsclusterexpansion2 – master node still handling all of the work]

You’ll see the status of the data node change to “Analytics is starting”:

[Screenshot: vROpsclusterexpansion2.5 – data node status showing “Analytics is starting”]

After a few hours the cluster stabilised, and you can see the data node now sharing the load:

[Screenshot: vROpsclusterexpansion3 – data node sharing the load]

I also recommend you stop the collection of metrics while the expansion is taking place. The master node works really hard when it’s handing over load to the data node, and collecting new metrics at the same time maxes out CPU and RAM; I saw 90%+ RAM and CPU utilisation while the master was doing its thing and collecting metrics.
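If you want to keep an eye on that pressure from a shell on the node rather than the vROps UI, a minimal sketch like this works on any Linux appliance with /proc (this is generic Linux plumbing, not vROps tooling, and the 90% threshold is just the point where I saw things get uncomfortable):

```python
def mem_used_pct():
    """Read memory utilisation from /proc/meminfo (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])  # values are in kB
    total = info["MemTotal"]
    # MemAvailable is the kernel's estimate of reclaimable memory;
    # fall back to MemFree on very old kernels.
    available = info.get("MemAvailable", info["MemFree"])
    return 100 * (total - available) / total

pct = mem_used_pct()
print(f"memory used: {pct:.0f}%")
if pct > 90:
    print("WARNING: node under memory pressure -- consider pausing metric collection")
```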

Now, the cluster runs like a dream!

For some good information on how many metrics and objects a single-node or multi-node installation can handle, I recommend you read this KB article.

Special thanks to VMware’s James Polizzi for his assistance with the numerous questions I asked – he’s a gun at vROps and breathes the product!

2 Comments

  1. This is a great article. I am currently in between the medium and large sizes for my vROps cluster. Is there any benefit to going with 2 mediums vs changing my resources to match a single large?

    Glenn

    • Thanks for the comment, Glenn! I’d go with 2 mediums – scale out. You (likely) keep within NUMA boundaries with smaller nodes, you could cluster them for HA, and if you lose one it’s not like you’d need a rebuild. There’s a handy sizing guide you could refer to which’ll walk you through sizing the nodes. Obviously, without knowing any more about your setup, multiple mediums is my recommendation. Cheers!

