I have recently had the good fortune of working on a large-scale deployment of VMware’s vRealize Log Insight 2.5. The project included the design, deployment and some administration of the product. I thought I’d blog about my experience given the paucity of blog posts pertaining to the design and deployment of Log Insight. For brevity, I’ll hereafter call the product LI.
First off, kudos to VMware on continuing the development of this software (after its acquisition) – an awesome piece of code! Highlights:
- Makes troubleshooting LOTS easier. Problem host? Search for it in Interactive Analytics, choose the search time period, look at logs, fix problem.
- No need to generate log bundles: VMware GSS can WebEx in, run through the problem with you, look at Log Insight and troubleshoot. Problem solved. Well, sometimes it’s not so straightforward, but you get my drift.
- A variety of Content Packs allow you to look through the logs of say your Vblocks or your storage arrays and narrow down to potential causes of problems.
- Email alerting to warn you of potential issues if something’s happening over and over in a certain time period.
- Centralized logging for everything from your hosts and vCenter Servers to your arrays and network devices.
- A slick and very responsive interface.
Need I say more?! I highly recommend you download a trial, throw it in your lab at work or home and see the log ingestion magic for yourself.
Circling back to my project, I needed something that would be able to report what was going on in my vSphere environment and who was doing it. Cutting to the chase, Log Insight fitted the bill nicely and licensing was procured.
- Central logging of filtered data
- User access management
- Monitor hosts and vCenter Server systems only
- Auditing ability
- Create custom dashboards, reports and alerts
- Archive master LI logs for a number of years
- Clustering (or Log Insight HA) not required
- Minimize number of instances deployed
- Master LI to be deployed in major datacenter 2 as its own instance
- Limited WAN links between remote sites and central LI instance
- Limited inter-site links between Prod and DR sites
- VERY tight deadline, ~3 weeks to handover
- Many vCenter tasks performed by the ‘vpxuser’ account
- Another syslog collector platform will continue to serve its purpose of being the primary syslog server
- Minimal log data over the inter-site link at remote locations (between respective prod and DR sites)
- Authorized users will create their own dashboards
- Training will be provided
- No HA (apart from VMware HA for the LI virtual machine) = no local logging in the event the LI virtual machine at any site was restarted (till the time it was available again)
- Master LI may run out of disk
- New instances of Log Insight, post-deployment, may send unfiltered logs to the Master LI
I’ll split this up into two sections – Conceptual and Logical Design.
- Each site will have its own instance.
- Multiple remote instances will send filtered logs to a central instance over WAN links (constrained in some cases). I’ll explain the importance of ‘filtered logs’ a bit later.
- Admins and authorized users will be able to log on to each instance.
- Other syslog collector platform – no impact to or from this platform. It will continue to serve its purpose. WAN utilization increased marginally as a result of Log Insight being deployed – countered by the use of filters.
- Inter-site link at remote locations may be heavily used – alerts and QoS already in place. Minimal activity at these sites = minimal activity across the links.
- Training provided to authorized users – in progress.
Since I won’t include a physical design here (for obvious reasons), I’ll put more detail into this section.
Looks like pictures do say a thousand words! Let’s elaborate though.
LI appliance sizing:
- Remote sites: Low number of hosts, the biggest remote site had fewer than 20 hosts. ‘Small’ instance will suffice, low IOPS requirements. Arrays at all sites able to handle required IOPS.
- Major datacenters 2 and 3: Large number of hosts, let’s just say upwards of 300 hosts but fewer than 700. ‘Large’ instance needed. A number of busy vCenter Servers managing large VDI deployments. Other vCenter Servers managing massive server VI ~ 20,000+ VMs.
- Major datacenter 1: 150+ hosts, ‘Medium’ instance needed. Couple of busy vCenter Servers managing sizable VDI deployments. Other vCenter Servers managing fairly large server VI.
- Master LI instance: Minimal disk and low IOPS requirements – only meant to be receiving filtered logs. ‘Small’ version sufficient.
All remote locations:
- All remote locations had both prod and DR sites. Given the small numbers of hosts and 1-2 vCenter Servers at these sites, it was not deemed necessary to deploy an instance in both prod and DR sites – single instance in the prod location was enough.
- The project also required a minimum number of instances for lower administrative overhead. While the inter-site links between the remote sites were constrained, the link utilization was expected to be minimal given the small size of these sites.
- Each of these instances in the prod sites would then send filtered logs to the master LI instance.
All 3 major datacenters:
- Each major datacenter was large enough to need its own LI instance instead of having to send logs over the inter-datacenter links.
- Each of these instances would then send filtered logs to the master LI instance.
Master LI instance:
- Deployed in datacenter 2 as required.
From the other syslogging platform, it was determined that hosts in datacenters 2 and 3 were generating 45GB/day and under 1000 IOPS. Plenty of grunt in the array to handle these write IOPS. The same applied to all other locations (far lower numbers on the other arrays).
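To put the 45GB/day figure in perspective, here is a rough back-of-the-envelope conversion to an average ingest rate. It is only a sketch: it assumes decimal units (1 GB = 1000 MB) and a perfectly even spread over the day, which real log traffic never has, and the function name is mine, not anything from Log Insight.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_mb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in MB/s for a daily log volume given in GB.

    Assumes decimal units and a flat rate across the day; bursts will
    be higher than this average.
    """
    return gb_per_day * 1000 / SECONDS_PER_DAY


# 45 GB/day is the figure reported for datacenters 2 and 3.
rate = average_ingest_mb_per_s(45)
print(f"{rate:.2f} MB/s average")
```

Even allowing generously for bursts, that average is small, which fits with the arrays having plenty of headroom.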
This product’s about logging, right?!
- Central logging: This was a highlight of this product and this deployment. All LI instances needed to send filtered logs down to the Master LI.
- Filtered logs: The version of Log Insight deployed was 2.5.5439339. This version allowed multiple forwarders to the same destination Master LI. Here, the forwarders act as filters that permit remote LI instances to send filtered logs only. Because of the requirements of this project, the filters reduced the logs being sent down by up to 99% in some instances. Pretty handy considering versions pre-2.5 did not permit multiple forwarders/filters to the same destination. Filter screenshot is shown in the next section.
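To get a feel for what that kind of reduction means on a constrained link, here is a small sketch of the arithmetic. The 45GB/day input is the figure measured for datacenters 2 and 3, and 99% is the best case seen in this project; the other reduction values and the function name are purely illustrative assumptions.

```python
def forwarded_gb_per_day(ingested_gb_per_day: float, reduction: float) -> float:
    """Daily volume still forwarded after filtering.

    `reduction` is the fraction of logs dropped by the filters,
    e.g. 0.99 for a 99% reduction.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return ingested_gb_per_day * (1.0 - reduction)


# Illustrative reduction levels; only 99% comes from this deployment.
for reduction in (0.90, 0.95, 0.99):
    forwarded = forwarded_gb_per_day(45, reduction)
    print(f"{reduction:.0%} filtered -> {forwarded:.2f} GB/day forwarded")
```

At the best-case reduction, a 45GB/day site would forward under half a gigabyte a day to the Master LI.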
User access management/auditing:
This name may be a bit of a misnomer and a tad misleading. It implies keeping an eye on what was performed on any cluster, who made the change, when it happened and if it was authorized. After a lot of trial and error and with VMware’s assistance, a number of filters were used to try and capture changes to clusters/VMs/hosts/vCenter Servers. These filtered logs were then sent to the Master LI. Here’s a screenshot of the filters, names blanked out for obvious reasons:
Pre-vSphere 6.0, many vCenter tasks were performed under the context of the vpxuser account (this was pointed out as a constraint).
The Master LI needed archival due to it being the most important part of this project. However, because it only received filtered logs from other LI instances, it had very low disk requirements. It was deemed it didn’t need a special NFS share to archive to. In fact, in the month and a bit since this deployment, it has only used about 2GB. To be honest, I’ve been keeping tabs on disk consumption, just in case.
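For anyone wanting to sanity-check a similar setup, a simple linear projection is enough to start with. This is a sketch only: I’m assuming “a month and a bit” is roughly 35 days, that growth stays linear, and a hypothetical 5-year retention period (the actual number of years isn’t stated here). New instances or looser filters would change the picture.

```python
def projected_disk_gb(used_gb: float, observed_days: float, retention_years: float) -> float:
    """Linear projection of disk consumption over a retention period.

    Assumes the daily growth observed so far stays constant.
    """
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    daily_growth_gb = used_gb / observed_days
    return daily_growth_gb * 365 * retention_years


# 2 GB used so far; 35 days and 5 years are assumptions for illustration.
estimate = projected_disk_gb(used_gb=2, observed_days=35, retention_years=5)
print(f"Projected consumption: {estimate:.0f} GB")
```

Under those assumptions the projection stays very modest, but it is only as good as the filters remaining in place, hence keeping tabs on it.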
LI High Availability:
LI was deemed to not need high availability, so the product was not clustered. Keep in mind, clustering for LI isn’t the same as Microsoft failover clustering. An LI cluster won’t fail over logging to surviving node(s) in the cluster; it really is load balancing at best (and you need a 3rd party load balancer). Alternatively, you could get your hosts and vCenter Server systems to log to 2 LI instances per site, but that’s doubling the link utilization at remote sites, at least in this project. VMware HA was deemed to be sufficient for this deployment.
A number of dashboards were created and set to be available to all users of the systems. The widgets inside the dashboards are highly customizable and you are able to create views of just about every logging aspect you can think of. Pretty cool – especially the ability to create views of say what particular hosts were doing, over customizable time periods and sort by a huge variety of factors.
Log Insight allows users to create custom alerts that fire off emails to configurable email addresses about certain situations. A number of alerts were configured and an example one looks like this:
Log Insight is not a reporting product per se, but it does allow a user to generate custom reports that basically produce a csv/json file of the current view you’d see in Interactive Analytics. So if a user were to search for a VM, define a time to search over along with grouping according to customizable criteria, they can generate a report like so:
This deployment had 3 named risks.
- No HA: Every LI instance was placed in an HA-enabled cluster. The machine was given higher priority where possible for a restart during an HA event.
- Disk consumption by Master LI: Given the requirement to archive Master LI logs for a number of years, disk consumption needed to be watched. To begin with, the Master LI had minimal disk requirements as it only received greatly filtered logs. To ensure a prompt response, a system alert was configured to send an email on a low free disk warning:
- New LI instances: Process established to ensure filters were configured.
I call out special thanks to outstanding VMware folks – Iwan Rahabok, Vardan Movivsyan, Steve Flanders + their R&D department for helping out with the design and the constant support post-implementation. You guys rock!
Iwan’s done some great write-ups about architecting Log Insight, check here.
Steve’s written numerous articles here, no one does a better job with talking about Log Insight than this guy.