Fixing “Virtual Machine Power Metrics Display in mW” using Aria Operations

In VMware Aria Operations 8.6 (previously known as vRealize Operations), VMware introduced sustainability dashboards designed to display the amount of carbon emissions saved through compute virtualization. These dashboards also offer insights into reducing the carbon footprint by identifying and optimizing idle workloads.

This was taken further with the introduction of Sustainability v2.0 in the Aria Operations Cloud update released in October 2022 as well as in the Aria Operations 8.12 on-premises edition. Sustainability v2.0 is centered around three key themes:

  1. Assessing the Current Carbon Footprint
  2. Monitoring Carbon Emissions with a Green Score
  3. Providing Actionable Recommendations for Enhancing the Green Score.

When working with Virtual Machine power-related metrics, you need to be careful if your VMs are running on certain ESXi 7.0 versions: as the title says, on these versions the VM power metrics are displayed in mW instead of W.

VMware has released a KB describing the issue: https://kb.vmware.com/s/article/92639

This issue has been resolved in ESXi 7.0 Update 3l.

Quick Solution in Aria Operations

The issue can be very easily fixed in Aria Operations using two simple Super Metrics. The first one is correcting the Power|Power (Watt) metric:

${this, metric=power|power_average} / 1000

Figure 01: Super Metric fixing the power usage metric.

And the second Super Metric fixes the Power|Total Energy (Wh) metric:

${this, metric=power|energy_summation_sum} / 1000

Figure 03: Super Metric fixing the energy consumption metric.
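
To make the effect of the two Super Metrics tangible, here is a minimal Python sketch of the same arithmetic. It is only an illustration: the function names and the sample readings are made up, and the only thing taken from the post is the division by 1000.

```python
def milli_to_base(value: float) -> float:
    """Convert a reading reported in milli-units (mW, mWh) to base units (W, Wh)."""
    return value / 1000.0


def fix_series(samples: list[float]) -> list[float]:
    """Apply the correction to every collected sample, as the Super Metric does."""
    return [milli_to_base(sample) for sample in samples]


if __name__ == "__main__":
    # Hypothetical power samples of one VM, reported in mW by an affected host.
    print(fix_series([35000.0, 36250.0, 34100.0]))
```

The Super Metric does the same per collection cycle, so the original metric stays untouched and the corrected value is stored as a new metric.
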
Applying the Super Metric – Automatically

Super Metrics are activated on certain objects in Aria Operations using Policies. The most common construct which is being used to group objects and apply a Policy to them is the Custom Group.

In this case I am using two Custom Groups. The first one contains all ESXi Host System objects with version affected by the issue described in the KB. The second Custom Group contains all Virtual Machine objects running on Host Systems belonging to the first group.

To create the first group and its member criteria I have used this overview of ESXi version numbers: https://kb.vmware.com/s/article/2143832.

The following picture shows how to define the membership criteria. And now you may see the problem: it takes a lot of clicking to include all 23 versions. There is an easier way, though. Simply create the Custom Group with two criteria as shown below.

Figure 04: Custom Group containing the affected ESXi servers.

In the next step, export the Custom Group into a file and open this JSON file with your favorite editor. The membership criteria are stored as an array, so you can simply copy and paste the existing entries and adjust the version number in each copy.

Figure 05: Custom Group as code – membership criteria array.
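
If you prefer not to copy and paste by hand, the duplication can be scripted. The following Python sketch is an assumption-laden illustration: the shape of a single criterion, the key holding the version, and the version strings are all hypothetical placeholders, so take the real ones from your exported file and from the KB overview.

```python
import copy
import json


def expand_criteria(template: dict, value_key: str, versions: list[str]) -> list[dict]:
    """Return one copy of the template criterion per version string."""
    expanded = []
    for version in versions:
        criterion = copy.deepcopy(template)  # keep nested fields independent
        criterion[value_key] = version
        expanded.append(criterion)
    return expanded


if __name__ == "__main__":
    # Hypothetical criterion; copy a real entry from the exported Custom Group.
    template = {"key": "version-property", "operator": "CONTAINS", "value": "PLACEHOLDER"}
    # Hypothetical version strings; use the affected versions from the KB.
    versions = ["version-A", "version-B", "version-C"]
    print(json.dumps(expand_criteria(template, "value", versions), indent=2))
```

The printed array can then be pasted into the exported file in place of the original membership criteria.
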

Save the file and import it into Aria Operations overwriting the existing Custom Group.

Figure 06: Importing the modified Custom Group.

Now this Custom Group contains all affected ESXi servers and we can proceed with the VM group. The membership criteria are simple, as shown in the next picture.

Figure 07: Custom Group containing the affected VMs (running on affected ESXi servers).
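
Conceptually, the second group is just a filter over the VM to host relationship. A minimal Python sketch of that logic, using made-up VM and host names (Aria Operations evaluates this through the Custom Group membership criteria, not through code like this):

```python
def vms_on_affected_hosts(vm_to_host: dict[str, str], affected_hosts: set[str]) -> list[str]:
    """Return the names of all VMs whose parent host is in the affected set."""
    return sorted(vm for vm, host in vm_to_host.items() if host in affected_hosts)


if __name__ == "__main__":
    # Hypothetical inventory: VM name -> parent ESXi host.
    inventory = {"vm-01": "esx-a", "vm-02": "esx-b", "vm-03": "esx-a"}
    print(vms_on_affected_hosts(inventory, {"esx-a"}))
```
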

You can download the Custom Group definition here and adjust the name, description and the policy to meet your requirements.

With this relatively simple approach Aria Operations provides correct VM level power and energy metrics.

Figure 08: Fixed metrics.

Happy dashboarding!

Stay safe.

Thomas – https://twitter.com/ThomasKopton

Aria Operations Heatmap Widget meets Custom Groups – Grouping by Property

The Aria Operations Heatmap Widget which is very often used in many Dashboards provides the possibility of grouping objects within the Heatmap by other related object types, like in the following screenshot showing Virtual Machines grouped by their respective vSphere Cluster.

Figure 01: Heatmap with VMs grouped by their respective vSphere Cluster.

Problem description

This is a great feature but the grouping works only using object types.

Today I was asked if there is a way to group Virtual Machines by their vCenter Custom Attribute value. To my knowledge this is not possible in the widget itself.

Possible solution / workaround

Assuming the possible values of the Custom Attribute are known, the Aria Operations Custom Group object type can be used to implement such requirements.

The use case is:

“I want to group my Kubernetes Virtual Machines by their K8s function (master, worker).”

The steps are:

  • create a custom Group Type
  • create required Custom Groups and associate them with the new Group Type
  • use the Custom Groups for grouping in the Heatmap Widget

Create Custom Group Type

Group Type is an abstract construct to, well, group together Custom Groups belonging together, like the OOTB provided types Environment, Function, etc.

For my use case I created the type tk-kubernetes-functions as shown in the next picture.

Figure 02: New Group Type.

Please note that you have to create the type before you create the groups! Once a group has been assigned to a type you cannot change the assignment, at least not using the UI.

Create Custom Groups

To group K8s masters and workers I created two groups. The configuration of the groups is based on the property value reflecting the vCenter Custom Attributes:

Summary|Custom Tag:kubernetes-role|Value.

Figure 03: Custom Group configuration.

That way I group my three TKGi Virtual Machines in two groups.
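
What the two Custom Groups do can be pictured as a simple "group by property value". The Python sketch below only illustrates that idea; the VM names and role values are hypothetical, and only the property key is taken from the configuration above.

```python
from collections import defaultdict

# Property key as used in the Custom Group configuration above.
ROLE_PROPERTY = "Summary|Custom Tag:kubernetes-role|Value"


def group_by_property(vms: dict[str, dict[str, str]], property_key: str) -> dict[str, list[str]]:
    """Group VM names by the value of one property; VMs without it are skipped."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, properties in vms.items():
        value = properties.get(property_key)
        if value is not None:
            groups[value].append(name)
    return {value: sorted(names) for value, names in groups.items()}


if __name__ == "__main__":
    # Hypothetical VMs and role values.
    vms = {
        "k8s-vm-1": {ROLE_PROPERTY: "master"},
        "k8s-vm-2": {ROLE_PROPERTY: "worker"},
        "k8s-vm-3": {ROLE_PROPERTY: "worker"},
        "other-vm": {},
    }
    print(group_by_property(vms, ROLE_PROPERTY))
```

Each key of the resulting dictionary corresponds to one Custom Group, which is why the possible values need to be known up front.
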

Figure 04: New Custom Groups.

Configure Heatmap Widget using Custom Groups for grouping

In the last step I configured the Heatmap widget to use the new Group Type for grouping. The next picture shows the corresponding configuration. You will find the new Group Type in the Container section of available object types.

Figure 05: Heatmap configuration.

Now my Heatmap shows Kubernetes Virtual Machines grouped by their role.

Figure 06: Dashboard containing a Heatmap Widget with “property grouping”.

This approach will not work for all use cases, especially when the attribute values are not known upfront. In such scenarios Aria Automation could automate the creation of the Custom Groups in Aria Operations, but this is stuff for another blog post ;-)

Stay safe.

Thomas – https://twitter.com/ThomasKopton

How to detect Windows Blue Screen of Death using VMware Aria Operations

Problem statement

Recently I was asked by a customer what would be the best way to get alerted by VMware Aria Operations when a Windows VM stopped because of a Blue Screen of Death (BSoD) or a Linux machine suddenly quit working due to a Kernel Panic.

Even if it looks like a piece of cake (we have tons of metrics and properties collected by Aria Operations), it turns out that it is not that simple to recognize such crashes without looking at the console.

So, challenge accepted:-)

In this blog post I am focusing on Windows BSoD and the overall idea was to figure out the metrics which combined are indicating a BSoD occurrence.

NOTE: Windows as well as Linux will restart a crashed OS with default settings and the restart usually is quick enough to remain “undetected” by Symptom Definitions unless you are using Near-Real Time Monitoring in VMware Aria Operations SaaS. A restart can also be initiated by vCenter HA settings in case of missing heartbeats from the OS.

My “Windows BSoD” Approach

I quickly created a Windows Server 2019 VM with this configuration:

Figure 01: Windows VM configuration.

And with the help of tools like Testlimit and DiskSpd plus some usual activities on the VM I created a “quick and dirty” baseline using metrics shown in the next picture (please ignore the color coding in the following picture for a moment). You will notice that in the Scoreboard the VMTools status is missing. I was not sure if I should include it or not as “tools not running” does not necessarily mean that the OS crashed, it could be also a crashed service.

Figure 02: Windows 2019 Server VM baseline.

Blue Screen of Death Examples

NotMyFault is a perfect tool to crash Windows in various ways. As I wanted to check if different BSoD types have different symptoms, I used that tool to force a few crashes and collected a set of metrics for comparison.

First crash type.

I started with probably the most known Blue Screen:

Figure 03: IRQL not less or equal – BSoD.

The first surprise was that the CPU Demand of the VM immediately increased to almost 100%. To ensure that this is not related to Windows collecting data for some time after the crash, I checked this metric after two collection cycles (10 minutes) and it did not go down. The second finding is that the RAM Usage decreases only slowly. I assume this is simply due to the fact that memory is a really virtualized resource whereas CPU cycles inside the VM are actual cycles on the ESXi CPU. I also added the Guest Tools Status to the Scoreboard, but I would not use it as a symptom in an Alert Definition.

Figure 04: IRQL not less or equal – metrics.
Second crash type.
Figure 05: IRQL gt zero at system service – BSoD.

As you can see in the next picture, the metrics behave similarly to the first crash. Of course, no disk or network usage at all was expected, but seeing CPU Demand and RAM Usage follow the same pattern is interesting and makes for a very promising symptom.

Figure 06: IRQL gt zero at system service – metrics.

This time I waited a few more collection cycles to see how far the RAM usage would decrease, and apparently it goes down to approx. 0.99% after more than 4 cycles.

Figure 07: RAM and CPU usage pattern.
Third crash type.
Figure 08: Unexpected kernel mode trap – BSoD.

This BSoD type again resulted in the same metrics values.

Figure 09: Unexpected kernel mode trap – metrics.

RAM on ESXi as metric seems to be dependent on the memory usage of the VM before the crash. I did not fully test it but Aria Operations Metric Correlation feature shows the same pattern for the respective metrics.

Figure 10: Memory metrics correlation.

Just to be sure that the RAM Usage metric values do not change with different memory configurations of the VM, I did two more tests, the first one with 4GB RAM configured and the second with 17GB to check the metric with an odd RAM config.

Figure 11: 4GB VM metrics after crash.

And here the 17GB RAM config.

Figure 12: 17GB VM metrics after crash.

Constraints, assumptions and conclusions

Please be aware that I did not test every possible scenario, this is what I used:

  • Windows 2019 Server Datacenter as OS
  • VM Version 19
  • VMware ESXi, 7.0.3, 20328353
  • 3 different BSoD types tested
  • VMTools not used as symptom
  • OS Uptime not used as symptom as the metric is not available after OS crashed
  • No Guest metrics used as such metrics will not be available after OS crashed

With the observations made during the crash tests I created 6 new Symptom Definitions and an Alert Definition using these new symptoms and one Condition for the power state of the VM. In the following two pictures you see the symptom and alert definitions.

Figure 13: New Symptom Definitions.
Figure 14: Alert Definition.
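
To illustrate how such symptoms combine, here is a minimal Python sketch of the decision logic. Please note that this is not the exported content: the metric names, the thresholds and the optional memory check are assumptions derived from the observations above (CPU close to 100%, no disk and network activity, memory slowly converging towards 0, VM still powered on).

```python
from dataclasses import dataclass

# Assumed thresholds for illustration; tune them to your own baseline.
CPU_DEMAND_MIN_PCT = 95.0
MEM_USAGE_MAX_PCT = 1.0


@dataclass
class VmSample:
    powered_on: bool
    cpu_demand_pct: float
    disk_iops: float
    network_kbps: float
    mem_usage_pct: float


def looks_like_bsod(sample: VmSample, require_low_memory: bool = False) -> bool:
    """Return True when all crash symptoms are active at the same time."""
    crashed = (
        sample.powered_on
        and sample.cpu_demand_pct >= CPU_DEMAND_MIN_PCT
        and sample.disk_iops == 0
        and sample.network_kbps == 0
    )
    if require_low_memory:
        # Memory needs several collection cycles to drop, so this check is optional.
        crashed = crashed and sample.mem_usage_pct <= MEM_USAGE_MAX_PCT
    return crashed


if __name__ == "__main__":
    shortly_after_crash = VmSample(True, 99.5, 0.0, 0.0, 40.0)
    print(looks_like_bsod(shortly_after_crash))
    print(looks_like_bsod(shortly_after_crash, require_low_memory=True))
```

Making the memory check optional mirrors the trade-off between a fast alert and a more specific one.
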

DO NOT forget to activate your new symptoms and alert definition in the Aria Operations Policy assigned to your VMs!

This is what the symptoms look like on a crashed Windows Server 2019 VM.

Please be aware that the highlighted low memory usage symptom requires several collection cycles to become active. If you need fast response, remove it from the Alert Definition.

Figure 15: Active symptoms.
Figure 16: Active alert.

The small dashboard I created is shown in the next picture.

Figure 17: BSoD Dashboard.

You can download the content from VMware Code.

Update 03.03.2023

One of my colleagues (thank you Brandon) suggested testing the behavior with VMTools not running at all, as this has an impact on memory usage metrics. Brandon also suggested adding CPU Usage or replacing CPU Demand with it, as demand will be affected by high CPU usage on the ESXi host. I have added this metric to the Metric Configuration file and uploaded it to VMware Code.

NOTE: CPU, Disk and Network metrics are basically instantly affected by the crash, whereas memory slowly converges toward 0.

As you can see in the following screenshot, you can use CPU Usage instead of CPU Demand as it will also increase to 98-100% after the BSoD.

Figure 18: BSoD Dashboard with new metrics.

I would like to mention once again that the CPU metrics are available basically right after the crash and if you use Aria Operations SaaS, which is definitely the recommended way of using Aria Operations, you will get the symptoms triggered roughly after 40-60 seconds.

The memory metrics, as you can see in the next picture, will need several minutes to decrease to a level near 0.

Figure 19: Slowly converging memory metrics.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

VMware Explore Follow-up 2 – Aria Operations Dashboard Permissions Management

Another question I was asked during my “Meet the Expert – Creating Custom Dashboards” session which I could not answer due to the limited time was:

How to manage access permissions to Aria Operations Dashboards in a way that allows only a specific group of content admins to edit only a specific group of dashboards?

Even if there is no explicit feature providing such functionality, there is a way to implement it using Access Control and Dashboard Sharing capabilities of Aria Operations.

My solution

The assumption is that, for example, the following AD users and groups are available: content admins are responsible for creating dashboards, and content users consume their dedicated content.

Figure 01: AD Users and Groups

I have imported the AD groups in Aria Operations Access Control and for the sake of simplicity I have assigned them the predefined roles Content Admin and Read Only respectively and granted access to all objects in Aria Operations.

Figure 02: AD Groups in Operations Access Control

I have also created two sample dashboards and two dashboard folders for these two dashboards. This is not really required but it makes it easier to find the dashboards if you have a larger number of them with a more complex categorization.

Figure 03: Aria Operations dashboard folders

The last thing to do is to configure dashboard sharing accordingly in the dashboard management shown in the next picture.

Figure 04: Aria Operations dashboard management

A dashboard can be shared with multiple user groups. In my example I have shared one sample dashboard with one editor group and one user group, and the other sample dashboard with another editor group and another user group. This way the dedicated editors (the members of the AD group) have access only to dashboards shared with them and, of course, to any other dashboard shared with the built-in group Everyone. Regular users get access to their respective content in the very same way.

Figure 05: Aria Operations dashboard sharing
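
The resulting visibility rule can be sketched in a few lines of Python. This is a simplified model with hypothetical group and dashboard names, not how Aria Operations implements sharing internally: a user sees a dashboard if it is shared with at least one of the user's groups or with the built-in group Everyone.

```python
BUILT_IN_GROUP = "Everyone"


def visible_dashboards(user_groups: set[str], shares: dict[str, set[str]]) -> list[str]:
    """Return the dashboards shared with any of the user's groups or with Everyone."""
    effective_groups = user_groups | {BUILT_IN_GROUP}
    return sorted(name for name, groups in shares.items() if groups & effective_groups)


if __name__ == "__main__":
    # Hypothetical sharing configuration: dashboard -> groups it is shared with.
    shares = {
        "Sample Dashboard A": {"editors-a", "users-a"},
        "Sample Dashboard B": {"editors-b", "users-b"},
        "Common Dashboard": {"Everyone"},
    }
    print(visible_dashboards({"editors-a"}, shares))
```

Whether a user can edit what they see is still decided by the assigned role (Content Admin versus Read Only in my example).
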

Of course this approach requires a proper user group and dashboard sharing concept but such a concept is recommended anyway.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

VMware Explore Follow-up – Aria Operations and SNMP Traps

During one of my “Meet the Expert” sessions this year in Barcelona I was asked if there is an easy way to use SNMP traps as an Aria Operations Notification and let the SNMP trap receiver decide what to do with the trap based on included information other than the alert definition, object type or object name itself.

The requirement is to make it as simple and as generic as possible thus creating separate alert definitions and notifications for e.g. Windows and Linux teams or Dev or Test environments is not an option.

Solution

I had a few ideas in mind, but I had to test them first as working with SNMP traps is not something I do very often.

Basically we have two easy options to include additional information in the notification:

  • Add metrics and/or properties to the payload template which can be used as a differentiator.
  • Modify the alert definition to always include an additional symptom which can be used as the differentiator, like for example include a vSphere tag based symptom.

Aria Operations Payload Templates allow you to add any additional metrics and properties to the notification. These metrics and properties do not have to be related to the actual alert definition but might help to organize and route the alerts in the receiving system based on that additional information.

In the following picture you can see my payload template which includes one additional metric and one property. My test alert definition will be triggered on Virtual Machine object type.

Figure 01: Payload Template

For my tests I have also created a new, very simple Symptom Definition. This symptom is triggered every time a Virtual Machine has any vSphere tag assigned to it. A specific tag can then be parsed later on by the receiver to make the required decisions.

Next picture shows the symptom definition.

Figure 02: “Dummy” Symptom Definition

My Aria Operations Alert Definition includes the actual symptom I am interested in. For simplicity, it is also based on a certain vSphere tag which I can quickly set and remove to trigger the alert. It is combined with the dummy symptom definition using a boolean AND.

Figure 03: Alert Definition

As the last step in Aria Operations I have created a Notification which sends the SNMP trap to my Aria Orchestrator instance, where I can inspect the SNMP message to see what is actually included.

Figure 04: Notification Definition

SNMP Message

And here is what Aria Operations is sending as the SNMP message. For completeness I have included the entire message here. The additional information is the dummy symptom (see Element 11) and the modified payload (see Element 23). The links at the end of this post describe the Aria Operations MIB and help identify and parse the relevant parts.

Element 1:
=============
oid: 1.3.6.1.2.1.1.3.0
type: Number
snmp type: Timeticks
value: 3112273537

Element 2:
=============
oid: 1.3.6.1.6.3.1.1.4.1.0
type: String
snmp type: OID
value: 1.3.6.1.4.1.6876.4.50.1.0.46

Element 3:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.1.0
type: String
snmp type: Octet String
value: vrops.cpod-cmbu-vcf01.az-muc.cloud-garage.net

Element 4:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.2.0
type: String
snmp type: Octet String
value: ansible

Element 5:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.3.0
type: String
snmp type: Octet String
value: General

Element 6:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.4.0
type: String
snmp type: Octet String
value: 1669559583519

Element 7:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.5.0
type: String
snmp type: Octet String
value: warning

Element 8:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.6.0
type: String
snmp type: Octet String
value: New alert by id 1ab40eba-c480-4475-91e2-a0cc682fe945 is generated at Sun Nov 27 14:33:03 UTC 2022;

Element 9:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.7.0
type: String
snmp type: Octet String
value: https://172.28.4.33/ui/index.action#environment/object-browser/hierarchy/d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee/alerts-and-symptoms/alerts/1ab40eba-c480-4475-91e2-a0cc682fe945

Element 10:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.8.0
type: String
snmp type: Octet String
value: 1ab40eba-c480-4475-91e2-a0cc682fe945

Element 11:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.9.0
type: String
snmp type: Octet String
value: symptomSet: 1242e208-cc7f-40db-9bb0-ecc8a55b1f9b
relation: self
totalObjects: 1
violatingObjects: 1
symptom: tk-Include-vSphere-Tags
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric: 
obj.1.info: Property [<OS-Type-Windows3.11>] matches regular expression .*
symptom: tk-TriggerTestAlert
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric: 
obj.1.info: Property [<OS-Type-Windows3.11>, <killSwitch-On>] contains <killSwitch-On>


Element 12:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.10.0
type: String
snmp type: Octet String
value: Application

Element 13:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.11.0
type: String
snmp type: Octet String
value: Performance

Element 14:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.12.0
type: String
snmp type: Octet String
value: warning

Element 15:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.13.0
type: String
snmp type: Octet String
value: warning

Element 16:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.14.0
type: String
snmp type: Octet String
value: info

Element 17:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.15.0
type: String
snmp type: Octet String
value: 

Element 18:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.16.0
type: String
snmp type: Octet String
value: VirtualMachine

Element 19:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.17.0
type: String
snmp type: Octet String
value: tk-TestAlert-01

Element 20:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.18.0
type: String
snmp type: Octet String
value: 

Element 21:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.19.0
type: String
snmp type: Octet String
value: health

Element 22:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.20.0
type: String
snmp type: Octet String
value: tk-SNMP-Trap-Test

Element 23:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.21.0
type: String
snmp type: Octet String
value: Number of KPIs Breached : 0.0

Parent Host : esx03.cpod-cmbu-vcf01.az-muc.cloud-garage.net



Element 24:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.22.0
type: String
snmp type: Octet String
value: 

Element 25:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.23.0
type: String
snmp type: Octet String
value: 

The following links explain the SNMP content in detail.

https://github.com/librenms/librenms/blob/master/mibs/vmware/VMWARE-VCOPS-EVENT-MIB

https://mibs.observium.org/mib/VMWARE-VROPS-MIB/
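
On the receiving side, the tags can be extracted from the symptom text with a few lines of code. The following Python sketch is one possible approach, not a reference parser: it assumes the text layout seen in Element 11 above (OID 1.3.6.1.4.1.6876.4.50.1.2.9.0 in my message), and the function name is made up.

```python
import re

# Value of Element 11 from the message above, shortened to the relevant lines.
SAMPLE_SYMPTOM_TEXT = """symptomSet: 1242e208-cc7f-40db-9bb0-ecc8a55b1f9b
relation: self
symptom: tk-Include-vSphere-Tags
active: true
obj.1.name: ansible
obj.1.info: Property [<OS-Type-Windows3.11>] matches regular expression .*
symptom: tk-TriggerTestAlert
active: true
obj.1.name: ansible
obj.1.info: Property [<OS-Type-Windows3.11>, <killSwitch-On>] contains <killSwitch-On>
"""


def extract_symptom_tags(symptom_text: str) -> dict[str, list[str]]:
    """Map each symptom name to the tags listed in its 'info' line."""
    tags_by_symptom: dict[str, list[str]] = {}
    current = None
    for raw_line in symptom_text.splitlines():
        line = raw_line.strip()
        if line.startswith("symptom:"):
            current = line.split(":", 1)[1].strip()
            tags_by_symptom[current] = []
        elif current is not None and line.startswith("obj.") and ".info:" in line:
            # The tags are listed as <tag> entries inside "Property [...]".
            bracket = re.search(r"Property \[(.*?)\]", line)
            if bracket:
                tags_by_symptom[current] = re.findall(r"<([^<>]+)>", bracket.group(1))
    return tags_by_symptom


if __name__ == "__main__":
    print(extract_symptom_tags(SAMPLE_SYMPTOM_TEXT))
```

With the tags of the dummy symptom at hand, the trap receiver can route the alert, for example by the OS-Type tag, without a separate alert definition per team.
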

Stay safe.

Thomas – https://twitter.com/ThomasKopton

How vRealize Operations helps size new vSphere Clusters

In ESXi Cluster (non-HCI) Rightsizing using vRealize Operations I have described how to use vRealize Operations and the numbers calculated by the Capacity Engine to estimate the number of ESXi hosts which might be moved to other clusters or decommissioned. The corresponding dashboard is available via VMware Code.

In this post, I describe the opposite scenario.

Problem Statement

The question I will answer is: “How can I use vRealize Operations to help me size new vSphere clusters using completely new ESXi hosts I plan to purchase?”

With its Capacity Engine and the “What-If Analysis” scenarios, vRealize Operations provides powerful features to help with infrastructure and workload planning. In case you are not familiar with “What-If”, the following picture shows the supported scenarios.

Figure 01: vRealize Operations “What-If Analysis” supported scenarios.

What we are missing here is a scenario covering local workload (Virtual Machines) migrations from existing vSphere clusters to new planned and not yet existing clusters. Usually, you know what kind of compute hardware you are planning to buy or at least what choices you have, what you do not know is how many of them you need to run any specific workloads.

Solution

NOTE: I am using the demand model in this use case. The allocation model would be similar to implement.

Using vROps and knowing what type of hardware will be used, we have everything we need to estimate the number of hosts required to migrate all workloads from an “old” vSphere Cluster.

These are the ingredients:

  • “Recommended Total Capacity (Mhz)” as calculated by the vROps Capacity Engine
  • “Recommended Total Capacity (GB)” as calculated by the vROps Capacity Engine
  • total CPU resources (in MHz) provided by the new hardware
  • total RAM resources (in GB) provided by the new hardware

Now we need to do some simple math:

"Recommended Total Capacity (Mhz)" / "total CPU resources (in MHz) provided by the new hardware"

"Recommended Total Capacity (GB)" / "total RAM resources (in GB) provided by the new hardware"

I use two vROps Super Metrics with the simplest possible formula, a plain number, to represent the potential new resources.

In this example, it is a Cisco Blade system with a certain CPU and RAM configuration.

Figure 02: Super Metrics representing the new hardware configuration.

Another three Super Metrics, attached to Cluster Compute Resource as object type, simply calculate the required number of such new hosts, from the CPU and RAM perspective, and identify the higher one as the number of required hosts.

Figure 03: Super Metrics calculating the number of required new hosts.
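
The math behind these Super Metrics can be sketched in Python as follows. The host and cluster numbers in the example are hypothetical, and rounding up to whole hosts is my assumption for the sketch; the Super Metrics themselves are shown in the figure above.

```python
import math


def hosts_required(
    recommended_cpu_mhz: float,
    recommended_mem_gb: float,
    host_cpu_mhz: float,
    host_mem_gb: float,
) -> tuple[int, int, int]:
    """Return (hosts by CPU, hosts by RAM, hosts required) for one cluster."""
    if host_cpu_mhz <= 0 or host_mem_gb <= 0:
        raise ValueError("host capacity must be greater than zero")
    by_cpu = math.ceil(recommended_cpu_mhz / host_cpu_mhz)
    by_mem = math.ceil(recommended_mem_gb / host_mem_gb)
    # The more demanding resource decides how many hosts are needed.
    return by_cpu, by_mem, max(by_cpu, by_mem)


if __name__ == "__main__":
    # Hypothetical values: recommended capacity of the old cluster and one new host.
    print(hosts_required(100000, 1500, 48000, 512))
```

Keep in mind that this number covers the demand only; any failover capacity you want in the new cluster comes on top.
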

To make it easier to consume I have created a dashboard similar to the rightsizing one.

Figure 04: Local migration planning dashboard.

You can download the Super Metrics, Views, and Dashboard from VMware Code.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

vRealize Operations and Telegram – Alerts on your Phone

As you all know vRealize Operations is a perfect tool to manage and monitor your SDDC and in case of issues vROps creates alerts and informs you as quickly as possible providing many details related to an alert.

Without any additional customization, vRealize Operations displays the alerts in the Alerts tab. Sending alerts via email is the most common way to quickly attract your attention and increase the chances of reacting to the alert quickly.

With the Webhook Notification Plugin, it is possible to integrate vROps with almost any REST API endpoint. Telegram provides an easy-to-use API and allows for sending vROps alerts as messages into a Telegram chat.

Telegram Chat, Bot, and Token

The first step is to prepare Telegram to allow vROps to send messages to a chat. Basically, you will need a chat, a bot, and a token. The process is very well described here.

The actual REST API POST call uses the chat ID and the Token to ingest messages into the chat.

https://api.telegram.org/bot$token/sendMessage?chat_id=$chat&text=Hello+World

The POST content is pretty simple.

{
  "text": "my message"
}
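
Before configuring vROps, it can help to test the call outside of it. Here is a minimal Python sketch using only the standard library; it mirrors the call above (chat ID in the URL, text in the JSON body). The function names are made up, the token and chat ID in the example are placeholders, and `send_message` performs a real network call, so it is not executed in the example.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.telegram.org"


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build the POST request for the Telegram sendMessage method."""
    url = f"{API_BASE}/bot{token}/sendMessage?{urlencode({'chat_id': chat_id})}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_message(token: str, chat_id: str, text: str, timeout: float = 10.0) -> dict:
    """Send the message and return the decoded JSON response."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder token and chat ID; replace them with your own values.
    request = build_send_message_request("<token>", "<chat-id>", "my message")
    print(request.get_method(), request.data)
```
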

vROps Outbound Instance

The next step is the vROps Outbound Webhook Plugin instance that will point to our telegram chat. For simplicity, the Outbound URL contains the entire URL we need to send messages to Telegram.

In my environment, the instance looks as depicted in the following figure; bot ID and token are truncated for visibility and my data privacy.

Figure 01: Outbound Webhook Plugin instance.

You do not have to specify any user or password as the ingestion uses the token in the URL. When you run the test you will see an error, but in our case this is OK.

vROps Payload Template

Following the Outbound Plugin instance, we need to specify the details of the REST call and the payload. In my example, I would like to receive datastore space-related alarms in my chat. Details of the alert definition are not described in this post.

The Payload Template is where we define these settings, subdivided into:

  • Details: Specifies the name of the instance and the Outbound Method to use, in this case, the Webhook Notification Plugin.
  • Object Content: Here we can add additional information, like metrics and properties to be used in the actual payload. I have added a few additional metrics to help qualify the alarm.
  • Payload Details: Specifies the actual REST call, like method or content type, and describes the body of the call as required by the receiving instance. In the body itself, we can use the predefined parameters (uppercase) and our additional metrics or properties or related objects (lowercase).

The following three pictures show my Payload Template.

Figure 02: Payload Template – Details.
Figure 03: Payload Template – Object Content.
Figure 04: Payload Template – Payload Details.

vROps Notification

The last step of the process is to bring everything together: the alert definition, the outbound instance, and the payload template have to work together to send the message to Telegram.

This is where the good old vROps Notifications come into play. They wire everything together and do exactly what the name implies – notify you in case of an alarm.

The process of notification creation has four steps:

  1. Notification: Here you give the notification a name and enable or disable it.
  2. Define Criteria: This step gives you a large number of options to exactly specify when the notification should be triggered. In my use case, I want this notification to be triggered only on one specific object type (Datastore) and only for one specific alert definition (tk-vSAN Remaining Space Low).
  3. Set Outbound Method: Here we define the Outbound Method (Webhook Notification Plugin) and the Outbound Instance we created for Telegram.
  4. Select Payload Template: In the last step we define the Payload Template to use, in our use case the template we created for Telegram.

The following pictures show my Notification settings.

Figure 05: vROps Notification – Notification.
Figure 06: vROps Notification – Define Criteria.
Figure 07: vROps Notification – Set Outbound Method.
Figure 08: vROps Notification – Select Payload Template.

The Result

When all steps are finished successfully, any time the specified alarm has been triggered you will receive a message in your Telegram chat, as depicted in the following figure.

Figure 09: vROps Alarm in Telegram chat.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

Exclude “Aggregate” Instanced Metric using vRealize Operations Super Metric

As you know vRealize Operations is collecting tons of various metrics. Some of these metrics are so-called “Instanced Metrics” and disabled in the default configuration in newer vROps versions. A list of disabled instanced metrics for e.g. Virtual Machine object type is available here:

https://docs.vmware.com/en/vRealize-Operations/8.6/com.vmware.vcom.metrics.doc/GUID-1322F5A4-DA1D-481F-BBEA-99B228E96AF2.html#disabled-instanced-metrics-18

If you need any of those metrics, you can enable them in your vRealize Operations Policy.

Figure 01: Enabling disabled instanced metric in vROps policy.

As you can see in the previous picture, there is an option to specify instances you would like to include or exclude. In my example, I am excluding the CPU (or CPUs) containing “1” in the instanced metric name. Yes, it does not make any sense, it is just an example:-)

Problem statement

In addition or as a replacement for some of the disabled instanced metrics vRealize Operations provides the “Aggregate of all instances” metric, like in this example for Virtual Disk metrics.

Figure 02: “Aggregate of all instances” metric.

The problem now is that in certain situations where you would like to evaluate the instanced metrics to find the maximum, minimum, etc. the aggregated metric may also be taken into the equation, like in views or super metrics.

Use case

One of my customers described a very interesting and important use case.

“I want to determine the highest average write request size”.

One logical way would be to use vRealize Operations Super Metric and create a formula like this one:

max({This Resource: Virtual Disk|Average Write request size (bytes)}) 

Unfortunately, that approach does not work.

As described in the “Problem statement” this calculation includes the aggregated metric, “VirtualDisk|Aggregate of all Instances”, which leads to a wrong result.

Possible solution

Please be aware that this is ONE possible solution with one drawback that I will explain at the end.

The approach is to exclude the aggregated metric from the formula.

What we cannot do, or at least I do not know how to do, is exclude a metric based on the instance name.

What we can do is leverage the assumption that the aggregate will usually be greater than any single instance, as it is the sum of all instances. And this is the mentioned drawback: the approach works only when the following assumptions are true:

  • count of instances is > 1
  • at least 2 instances have a value > 0 at the occurrence of super metric evaluation

I am working on an improved version of the formula to get rid of these assumptions. For the time being, this is what works as long as the assumptions hold:

max(${this, attribute=virtualDisk|writeIOSize_latest, where=($value < ${metric=virtualDisk:Aggregate of all instances|writeIOSize_latest})})

This formula evaluates only the instances whose values are lower than the value of the aggregated metric.

Figure 03: Highest average write request size.
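The filtering logic of the formula above can be illustrated outside of vROps with a few lines of Python. This is only a sketch of the idea, not how vRealize Operations evaluates super metrics internally. The instance names and values are made up, and the aggregate is assumed to be the sum of all instances, as stated in the assumption above.

```python
from typing import Mapping, Optional

# Name of the aggregated pseudo-instance, as shown in the vROps metric tree.
AGGREGATE = "Aggregate of all instances"


def highest_instance_value(samples: Mapping[str, float]) -> Optional[float]:
    """Return the largest value that is strictly below the aggregate.

    Mirrors the "where=($value < aggregate)" filter of the super metric.
    Returns None when no value passes the filter.
    """
    aggregate = samples[AGGREGATE]
    candidates = [value for value in samples.values() if value < aggregate]
    return max(candidates) if candidates else None


# Hypothetical sample data: two virtual disks, both with a value > 0.
two_disks = {"scsi0:0": 4096.0, "scsi0:1": 16384.0, AGGREGATE: 20480.0}

# Hypothetical sample data: a single disk, so the aggregate equals the instance.
one_disk = {"scsi0:0": 4096.0, AGGREGATE: 4096.0}

print(highest_instance_value(two_disks))  # 16384.0
print(highest_instance_value(one_disk))   # None
```

The second call shows the drawback: with only one instance the aggregate is not greater than the instance value, so the filter removes everything and no result is produced.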

Outlook

The improved formula will include some if-then-else statements.

Stay safe

Thomas – https://twitter.com/ThomasKopton

Capacity Management for n+1 and n*2 Clusters using vRealize Operations

When it comes to capacity management in vSphere environments using vRealize Operations, customers frequently ask for guidelines on how to set up vROps to properly manage n+1 and n*2 ESXi clusters.

Just as a short reminder, n+1 in the context of an ESXi cluster means that we are tolerating (and are hopefully prepared for) the failure of exactly one host. If we need to cope with the failure of 50% of all hosts in a cluster, as with two fault domains, we often use the term n*2.

In general we have two options to make vRealize Operations aware of the failure strategy for the ESXi clusters:

  • the “out-of-the-box” and very easy approach using vSphere HA and Admission Control
  • the vROps way, which is almost as easy, using vRealize Operations Policies

vSphere HA and Admission Control

If configured, Admission Control automatically calculates the reserved CPU and Memory failover capacity. In the first example, my cluster is configured to tolerate the failure of one host, which makes it 25% for my 4-host cluster.

Figure 1: vSphere and HA settings – n+1 cluster

vRealize Operations collects this information and calculates the remaining capacity accordingly. In the following picture you can see vROps recognizing the configured HA buffer of 25%.

Figure 2: vROps HA buffer for n+1 cluster

If we now change the Admission Control settings to n*2, in my case two ESXi hosts, vSphere calculates the new required CPU and Memory buffer. We could also set the buffer manually to 50% or whatever value is required.

Figure 3: vSphere and HA settings – n*2 cluster
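The two buffer values used in this post (25% for n+1 and 50% for n*2 in a 4-host cluster) follow from simple arithmetic. The sketch below assumes identically sized hosts. The function name is made up for illustration and is not part of any vSphere or vROps API.

```python
def failover_buffer_percent(total_hosts: int, tolerated_failures: int) -> float:
    """Share of cluster capacity reserved for failover, in percent.

    Assumes all hosts contribute the same amount of CPU and Memory.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    if not 0 <= tolerated_failures <= total_hosts:
        raise ValueError("tolerated_failures must be between 0 and total_hosts")
    return 100.0 * tolerated_failures / total_hosts


# n+1: tolerate the failure of exactly one host in a 4-host cluster.
print(failover_buffer_percent(4, 1))  # 25.0

# n*2: tolerate the failure of half of the hosts in a 4-host cluster.
print(failover_buffer_percent(4, 2))  # 50.0
```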

After a collection cycle, vRealize Operations retrieves the new settings and starts calculating capacity related metrics using the adjusted values for available CPU and Memory capacity.

Figure 4: vROps HA – available capacity reflecting new HA settings

The “Capacity Remaining” decreases in line with the new available capacity, and the widget shows the new buffer values in %.

Figure 5: vROps HA buffer for n*2 cluster

vRealize Operations Capacity Buffer and Policies

Sometimes the vSphere HA Admission Control is not being used and customers need another solution for their capacity management requirements.

This is where vROps Policies and Capacity Buffer settings help manage vSphere resources.

vRealize Operations applies various settings to groups of objects using vROps Policies. One section of a policy is Capacity Settings.

Figure 6: vROps Capacity Settings via Policy

Within the Capacity Settings you can define a buffer for CPU, Memory and Disk Space to reduce the available capacity of a vSphere cluster or a group of clusters. You can set the values for both capacity models, Demand and Allocation, separately.

Figure 7: vROps Capacity Settings – Buffer
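How a buffer reduces the available capacity can be expressed as a simple calculation. The following sketch only illustrates the principle. The numbers are made up and the exact way vROps applies the buffer internally is not covered here.

```python
def usable_capacity(total_capacity: float, buffer_percent: float) -> float:
    """Capacity left for workloads after subtracting the buffer."""
    if not 0 <= buffer_percent <= 100:
        raise ValueError("buffer_percent must be between 0 and 100")
    return total_capacity * (1 - buffer_percent / 100)


# Hypothetical cluster: 100 GHz CPU and 512 GB Memory in total.
# Hypothetical policy: 25% CPU buffer and 50% Memory buffer.
print(usable_capacity(100.0, 25))  # 75.0 (GHz)
print(usable_capacity(512.0, 50))  # 256.0 (GB)
```

Because the policy allows separate values for the Demand and Allocation models, the same calculation would simply be applied once per model with the respective buffer.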

In my example, I have disabled Admission Control in vCenter and set buffers in vROps.

Figure 8: vROps capacity remaining using buffer setting via policy

vRealize Operations is now using the new values for available resources to calculate cluster capacity metrics.

Btw. Custom Groups are the vROps way to group similar clusters together and treat all of them the same way.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

vRealize Operations and Logging via CFAPI and Syslog

Without any doubt, configuring vRealize Operations to send log messages to a vRealize Log Insight instance is the best way to collect, parse and display structured and unstructured log information.

In this post I will explain the major differences between CFAPI and Syslog as the protocol used to forward log messages to a log server like vRealize Log Insight.

The configuration of the log forwarding in vRealize Operations is straightforward. Under “Administration” –> “Management” –> “Log Forwarding” you will find all options to quickly configure vRLI as the target for the selected log files.

The following figure shows how to configure vRealize Operations to send all log messages to vRealize Log Insight using the CFAPI protocol via HTTP.

Figure 1: Log Forwarding configuration

The CFAPI protocol, over HTTP or HTTPS, used by the vRealize Log Insight agent provides additional information used by the vROps Content Pack. The extracted information flows into the various dashboards and alert definitions delivered through the Content Pack. The following picture shows one of the available dashboards populated with data when using CFAPI and vRLI.

Figure 02: vROps Content Pack

In case you (for whatever strange reason) cannot use CFAPI, you can configure vROps to use Syslog. It is as simple as selecting Syslog as the protocol option in the configuration page shown in the following picture.

Figure 03: Syslog as configured protocol

The drawback of using Syslog here is that the additional information parsed by the agent and used by the content pack will no longer be available, and you will need to create your own extracted fields in vRLI to parse data from the log messages.

In the next two pictures you can see the empty dashboards and log messages without any vROps specific fields in the interactive analytics.

Figure 04: Empty dashboards when using Syslog
Figure 05: Missing vROps specific fields when using Syslog

It is important to know that vROps is using Syslog over TCP when configured via UI as shown in figure 03.

But what if you are forced to use Syslog over UDP?

There is no such option in the UI, but since vROps is using the regular vRLI agent, there has to be a way to configure it to use UDP instead of TCP.

The vRLI agent config file explains how to set the corresponding option:

Figure 06: liagent.ini config file

You can just replace

proto = syslog

with

proto = syslog_udp

then restart the agent

service liagentd restart

and your vROps node starts to forward log messages to your log server using UDP.
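Since the change has to be repeated on every node, it may be worth scripting. The following Python sketch shows one possible way to rewrite the proto line in the text of a liagent.ini file. The sample file content (section name, hostname and port) is only an assumption for illustration, and reading and writing the real file as well as restarting the agent are left out.

```python
import re

# Matches an active "proto = syslog" line, but not "syslog_udp", not other
# protocols and not lines that are commented out.
PROTO_SYSLOG = re.compile(r"^([ \t]*proto[ \t]*=[ \t]*)syslog[ \t]*$", re.MULTILINE)


def switch_to_syslog_udp(ini_text: str) -> str:
    """Return the config text with "proto = syslog" changed to "syslog_udp"."""
    return PROTO_SYSLOG.sub(r"\g<1>syslog_udp", ini_text)


# Hypothetical excerpt of a liagent.ini file.
sample = (
    "[server]\n"
    "hostname=loghost.example.com\n"
    "; proto = syslog\n"
    "proto = syslog\n"
    "port=514\n"
)

updated = switch_to_syslog_udp(sample)
print(updated)
```

Running the function a second time leaves the text unchanged, so the script can be applied repeatedly, for example after an update has reset the file.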

I have set up a fake log server listening on UDP port 514 using netcat:

Figure 07: Syslog over UDP in NC
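If netcat is not at hand, a comparable fake log server can be sketched in Python with the standard socket module. This is just a minimal test receiver, not a replacement for a real log server. To keep the example runnable without root privileges it listens on a random free port on the loopback interface and sends itself one made-up test message. For a real test you would bind to port 514 on an address reachable from the vROps nodes.

```python
import socket


def open_listener(host: str, port: int) -> socket.socket:
    """Create a UDP socket bound to the given address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock: socket.socket, timeout: float = 5.0) -> str:
    """Wait for a single datagram and return it as text."""
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(8192)
    return data.decode("utf-8", errors="replace")


# Port 0 lets the operating system pick a free port for this local demo.
listener = open_listener("127.0.0.1", 0)
port = listener.getsockname()[1]

# Simulate a node sending one syslog-style message (content is made up).
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"<14>vrops-node test message", ("127.0.0.1", port))
sender.close()

received = receive_one(listener)
listener.close()
print(received)
```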

If you configure the vRLI agent in vROps directly via the config file, please keep in mind:

  • you are using a function which is not officially supported by VMware
  • you will need to make such manual changes on every node
  • you will need to monitor any changes to that file which can be triggered via the UI or vROps updates

Stay safe.

Thomas – https://twitter.com/ThomasKopton