Monitoring HA VM Restarts in VCF Operations

Someone recently asked me whether there’s a straightforward way to notify virtual machine owners as soon as their VMs are affected by an HA event. Apparently, suitable alert definitions are not available out of the box in VCF Operations, and the customer asked for a solution.

Scenario

The scenario is therefore quite simple:

  • Each host can run VMs from different owners.
  • Each VM has the owner recorded in its metadata, in this case as a vSphere Tag.
  • The owners want to be informed when their VMs are restarted by an HA event on a different ESX host.
  • The owners should be notified via a REST Webhook.

First Approach

Initially, I considered how to determine which VMs were running on an ESX host at the time of an HA alert, and how to then inform their owners as simply as possible. Since there’s an alert for ESX Host HA Events, my first thought was to experiment with this alert and various ways of grouping VMs, essentially making them ‘dependent’ on it. However, I quickly realized that this approach wouldn’t lead to the desired outcome. By the time the alert is evaluated, the relationship between an ESX host and its VMs may already reflect the post-failover placement, depending on when the last VCF Operations collection cycle ran. On top of that, Custom Groups bring their own complexities and dependencies. A different solution was needed.

Solution

Then I wondered whether the underlying vCenter event might simply be deactivated in the settings of the VCF Operations Adapter, which would explain why there’s no out-of-the-box Symptom/Alert. To save myself the search for the configuration path, I quickly consulted my friend Brock, a great VCF Operations expert (https://www.brockpeterson.com/post/pulling-vcenter-events-into-aria-operations).

To my surprise, the corresponding event, com.vmware.vc.ha.VmRestartedByHAEvent, isn’t deactivated at all. It’s there, and my tests confirmed that VCF Operations actually receives it, so it can be used for alerts. From this point on, solving the task was relatively simple.

Figure 01: HA Event in VCF Operations.

Event-Based Symptom

The first step is our own Symptom Definition based on this event. The following screenshot shows that we use a Message Event of type Change and that the event message itself must contain the text “vSphere HA restarted virtual machine”. That’s it; nothing more is needed.

Figure 02: Symptom Definition.
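
To make the matching rule explicit, here is a small Python model of what the symptom condition checks: the event must be of type Change, and its message must contain the fixed text. This is only an illustration. The dict keys `type` and `message` are assumptions, the example messages are made up, and this is not how VCF Operations evaluates symptoms internally.

```python
# Hypothetical model of the event-based symptom, for illustration only.
# The dict keys "type" and "message" are assumptions, not a VCF Operations API.
SYMPTOM_EVENT_TYPE = "Change"
SYMPTOM_MESSAGE_TEXT = "vSphere HA restarted virtual machine"


def matches_ha_restart_symptom(event: dict) -> bool:
    """Return True if the event is a Change message containing the HA restart text."""
    return (
        event.get("type") == SYMPTOM_EVENT_TYPE
        and SYMPTOM_MESSAGE_TEXT in event.get("message", "")
    )


if __name__ == "__main__":
    # Made-up example messages; real vCenter wording may differ slightly.
    restarted = {
        "type": "Change",
        "message": "vSphere HA restarted virtual machine app01 on host esx02",
    }
    powered_off = {"type": "Change", "message": "app01 is powered off"}
    print(matches_ha_restart_symptom(restarted))    # True
    print(matches_ha_restart_symptom(powered_off))  # False
```

A "contains" check is enough here because only the fixed part of the message matters. The VM and host names that follow it vary from event to event.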

Alert Definition

The next step is the actual Alert Definition, which, of course, only contains the single symptom we created in the first step. It’s all very simple. The next screenshot shows the definition.
As always when creating custom alerts and symptoms, do not forget to enable them in the corresponding VCF Operations Policy.

Figure 03: Alert Definition.

Webhook

Since the requirement is to send the notifications via REST to a target system, we naturally first need a Webhook Notification Plugin Outbound Instance and a Payload Template.

For this blog post, I did not use a production REST endpoint. In real environments, there can be many different REST endpoints, each expecting its own specific payload. For simplicity’s sake, I built a small Python REST server for testing purposes. It implements only one method, pushAlert, which is sufficient for the demonstration. The following image shows the configuration of the Outbound Instance. Because my server doesn’t implement a dedicated test method, the test run produces an error. This can be ignored, since the pushAlert method used later works correctly.

Figure 04: Webhook Notification Plugin Outbound Instance Definition.
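
The original test server isn’t shown here, so the following is only a minimal sketch of what such a mock endpoint could look like, built on Python’s standard `http.server` module. The `/pushAlert` path, port 8080, the `{"status": "ok"}` response and the helper names `make_server`/`run` are assumptions made for illustration. The sketch accepts only POST requests to that one path, which matches the idea of a server with a single method and no dedicated test method.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path; it only has to match the REST method set in the Payload Template.
PUSH_ALERT_PATH = "/pushAlert"

# In-memory list of received payloads, good enough for a demo.
received_alerts = []


class MockRestHandler(BaseHTTPRequestHandler):
    """Accepts POST /pushAlert with a JSON body; everything else fails.

    Only do_POST is defined, so BaseHTTPRequestHandler answers other HTTP
    methods with 501, and unknown paths get a 404 from this handler.
    """

    def do_POST(self):
        if self.path != PUSH_ALERT_PATH:
            self.send_error(404, "Unknown method")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # covers invalid JSON and invalid UTF-8
            self.send_error(400, "Body is not valid JSON")
            return
        received_alerts.append(payload)
        self.log_message("pushAlert received: %s", payload)
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Create the mock server; host and port are example defaults."""
    return HTTPServer((host, port), MockRestHandler)


def run() -> None:
    """Start the mock server and block until interrupted."""
    make_server().serve_forever()
```

Calling `run()` starts the server on port 8080 and logs every payload it receives to stderr, which is all a quick end-to-end test needs.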

For the REST endpoint to process our notification correctly, we also need a Payload Template. In this template, we specify exactly what data to send, in what format, using which HTTP method, and to which REST method. In a real-world scenario, this configuration might be a bit more extensive, but it’s still not rocket science 🙂 The following image shows my simple configuration, which is perfectly sufficient for my mock REST server.

Figure 05: Payload Template.
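
Conceptually, a payload template is a text body with placeholders that get filled in for each alert. The sketch below mimics that idea with Python’s `string.Template`. The field names (`alertName`, `vmName`, `owner`), the placeholder names and the method/path constants are all hypothetical. The real placeholders are whatever the Payload Template editor offers, and the field names only need to match what your endpoint expects.

```python
import json
from string import Template

# Hypothetical request settings mirroring what the Payload Template defines.
HTTP_METHOD = "POST"
REST_METHOD_PATH = "/pushAlert"

# Hypothetical JSON body; field and placeholder names are illustrative only.
PAYLOAD_TEMPLATE = Template(
    '{"alertName": "$alert_name", "vmName": "$vm_name", "owner": "$owner"}'
)


def _json_escape(value: str) -> str:
    """Escape a value for use inside a JSON string literal."""
    # json.dumps wraps the result in quotes; strip them to keep the inner part.
    return json.dumps(value)[1:-1]


def render_payload(alert_name: str, vm_name: str, owner: str) -> dict:
    """Fill the template, then parse it to make sure the result is valid JSON."""
    body = PAYLOAD_TEMPLATE.substitute(
        alert_name=_json_escape(alert_name),
        vm_name=_json_escape(vm_name),
        owner=_json_escape(owner),
    )
    return json.loads(body)
```

In this sketch, each value is escaped before substitution because the substitution is purely textual. Without escaping, a VM name or tag containing a double quote or backslash would produce invalid JSON.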

Notification Definition

Now we have all the ingredients and can assemble them into a complete solution using a VCF Operations Notification. The alert, triggered by our event-based symptom, fires a notification, which in turn sends the predefined REST payload to the configured REST endpoint. The next image shows my Notification configuration.

Figure 06: Notification Definition.

At this point, it’s important to select the correct Alert Definition in the Notification’s Criteria; otherwise, we will receive notifications for the wrong events. Equally important is the configuration of the REST endpoint and the payload; otherwise, we might send the data to the wrong recipient. The following two images illustrate the configuration.

Figure 07: Notification Definition – Criteria.
Figure 08: Notification Definition – Outbound Method and Payload.
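
The Notification Criteria already restrict which alerts are sent, but a receiving endpoint can add a cheap defensive check of its own before informing the owners. The following is a minimal sketch under these assumptions: the payloads carry hypothetical `alertName`, `vmName` and `owner` fields, and the alert definition is named "VM restarted by vSphere HA". It drops unrelated alerts and groups the restarted VMs by owner.

```python
from collections import defaultdict

# Hypothetical alert definition name; it must match your own Alert Definition.
EXPECTED_ALERT_NAME = "VM restarted by vSphere HA"


def group_by_owner(payloads: list) -> dict:
    """Group restarted VM names by owner, ignoring unrelated alerts."""
    by_owner = defaultdict(list)
    for payload in payloads:
        if payload.get("alertName") != EXPECTED_ALERT_NAME:
            continue  # e.g. a wrong Alert Definition selected in the criteria
        # VMs without an owner tag are collected under a fallback key.
        owner = payload.get("owner") or "unassigned"
        by_owner[owner].append(payload.get("vmName", "<unknown>"))
    return dict(by_owner)
```

Grouping by owner also means that, when a host failure restarts several VMs of the same owner, the endpoint can send that owner one combined message instead of one per VM.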

All configurations and definitions from this post can be found in my Git repo: https://github.com/tkopton/aria-operations-content/tree/main/VM-HA-Event-Notification

Stay safe.

Thomas – https://twitter.com/ThomasKopton
