Enhancing Sustainability Data in VMware Aria Operations using REST API and Automation

Overview of OOTB Sustainability

With its VMware Greenscore, VMware Aria Operations not only provides a great way of showing the effects of your current efforts toward more sustainable operations, it also offers multiple approaches that help organizations improve their operational efficiency, save money and reduce carbon emissions.

Figure 01: VMware Aria Operations Greenscore

Major use cases possible OOTB

With the power usage and utilization related metrics VMware Aria Operations is collecting, and the predefined set of Dashboards, operations teams can quickly identify opportunities for improvement.

From the Clean Demand overview, which helps find potentially unused resources (Virtual Machines, Snapshots, etc.) or improperly sized Virtual Machines, through Lean Operations, focusing on provider efficiency, up to Green Supply, making energy consumption and carbon emissions transparent, Aria Operations offers the right tools to make the sustainability strategy a success.

Figure 02: The three pillars of sustainability in VMware Aria Operations

Leveraging the available information, many use cases can be easily implemented out-of-the-box:

  • resource reclamation and rightsizing
  • workload balancing
  • insights into energy consumption, both at provider and at the consumer level
  • insights into carbon emissions
  • insights into energy costs

Use cases not possible OOTB

Even if the requirements and use cases are basically always the same, they usually differ in the details, and of course these details cannot all be mapped as features in the product.
On the other hand, VMware Aria Operations offers excellent options for expanding the feature set without waiting for a new version, and thus for addressing practically all use cases.

One such set of requirements was:

  • I need to see the power usage of my vSphere Clusters in this month, from the beginning of the month up to today – Month To Date or MTD values
  • I want to see the energy costs of my vSphere Clusters as MTD value
  • I want to display my current Power Usage Effectiveness (PUE) value in VMware Aria Operations

Before describing the possible solution, the next two sections give you a quick overview of the VMware Aria Operations functionalities that form the foundation of my concept.

VMware Aria Operations REST API

The very well documented Aria Operations REST API is probably the best way to extend the functionality almost infinitely. In the end, Aria Operations is a data lake with many methods and algorithms to generate helpful insights from raw data.

Every Aria Operations instance, SaaS as well as on-premises, provides a Swagger UI which makes it really easy to get familiar with the very extensive REST API. You can access the Swagger UI by simply navigating to:

https://www.mgmt.cloud.vmware.com/vrops-cloud/suite-api

or

https://$localurl/suite-api

Figure 03: VMware Aria Operations REST API documentation

In my concept I will use a few of the Resource and Resources methods to gather information, calculate new values and add them to existing objects in Aria Operations.
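Just to illustrate what these calls look like, here is a minimal curl sketch against an on-premises instance (the SaaS instance authenticates with a CSP API token instead; user, password, host and token below are placeholders, and all parameters are documented in the Swagger UI):

# acquire an authentication token
curl -sk -X POST "https://$localurl/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"username": "my-api-user", "password": "my-password"}'

# list all Host System objects collected by the VMWARE (vCenter) adapter
curl -sk "https://$localurl/suite-api/api/resources?adapterKind=VMWARE&resourceKind=HostSystem" \
  -H "Accept: application/json" \
  -H "Authorization: vRealizeOpsToken <token-from-previous-call>"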

VMware Aria Operations Super Metrics

Sometimes the task is to determine new metrics based on already known metrics or properties. You often do not even have to resort to programmatic approaches; VMware Aria Operations Super Metrics provide almost everything needed to generate new information from existing data with little effort.

Super Metrics are another part of my concept and if you would like to learn how to create your own Super Metrics, please go and check out my video series here:

Figure 04: Now known as VMware Aria Operations Super Metrics Made Easy

The Concept

To stay focused on the objectives, here again the requirements:

  • I need to see the power usage of my vSphere Clusters in this month, from the beginning of the month up to today – Month To Date or MTD values
  • I want to see the energy costs of my vSphere Clusters as MTD value
  • I want to display my current Power Usage Effectiveness (PUE) value in VMware Aria Operations (still WiP)

Based on the requirements, here is the data or metrics I need in Aria Operations:

  • MTD value – how many seconds, minutes or hours have elapsed since the beginning of the current month (see the small sketch after this list)
  • power consumption of my ESXi hosts per hour (or minute etc.)
  • current price of energy for any of my ESXi hosts (might be different per host)
  • the total energy consumption of my data center(s) (facility including cooling, lighting etc.)
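The MTD time window itself is simple date arithmetic. Just to illustrate the idea (my actual implementation is an Orchestrator action, not a shell script), the hours elapsed since the beginning of the current month can be computed like this with GNU date:

# epoch seconds at the start of the current month
MONTH_START=$(date -d "$(date +%Y-%m-01) 00:00:00" +%s)
# hours elapsed since then
echo $(( ($(date +%s) - MONTH_START) / 3600 ))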

Using this information I can finally calculate the data requested in the use cases. The overall process in my concept is depicted in the following figure (I have removed the login operations for better visibility).

Figure 05: Communication diagram (simplified)

Let’s start with the easiest part – the energy price.

There are different approaches to adding this data to Aria Operations. I have decided to use the Custom Group construct to organize my clusters and automatically add a new custom property, Electricity Rate.

Figure 06: Adding Energy Rate property using Custom Groups

If different energy prices for different ESXi hosts within one cluster are a requirement, the same method can be used to add the energy rate as a custom property to every single ESXi host system. Having this value at the host level makes it easier to calculate MTD energy costs for every single ESXi system.

The next step, slightly more complex, is the calculation of the MTD energy consumption for every Host System object that has the necessary power metrics, and pushing this new value back to Aria Operations as a custom metric. The next picture shows the Aria Automation Orchestrator workflows I have created to accomplish this task.

Figure 07: Aria Automation Orchestrator workflows

At first glance this workflow may appear confusing, but there is a deliberate concept behind this approach. The idea is to have an Aria Operations Automation Framework: re-usable building blocks you stitch together to implement new requirements. All workflows, actions, Super Metrics etc. will be available via VMware Code and my GitHub page. You will find all details at the end of this blog post.

This workflow runs every hour and pushes the values to Aria Operations. The next picture shows the MTDEnergyConsumption metric of my two ESXi hosts.

Figure 08: Host MTD Energy Consumption metrics
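Behind the scenes, pushing such a value boils down to one suite-api call per host and collection interval. Here is a minimal curl sketch (host UUID, stat key, timestamp and value are hypothetical; the exact body schema is documented in the Swagger UI under the Resources endpoints):

# push one MTD energy consumption sample onto a Host System object
curl -sk -X POST "https://$localurl/suite-api/api/resources/<host-uuid>/stats" \
  -H "Content-Type: application/json" \
  -H "Authorization: vRealizeOpsToken <token>" \
  -d '{"stat-content": [{"statKey": "Energy|MTDEnergyConsumption", "timestamps": [1693900800000], "data": [123.4]}]}'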

Finally we have the basic metrics, MTD energy consumption and energy price, to calculate additional metrics and visualize them in Aria Operations.

Super Metrics are the easiest way to calculate energy consumption at higher levels like clusters and data centers, as well as the price of the consumed energy on all levels. The following picture shows two examples of such Super Metrics, cluster-level energy usage and energy costs – both as MTD values.

Figure 09: Cluster MTD Energy Consumption and costs Super Metrics
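As a rough sketch of the cluster-level formulas (the ${...} token and key names below are placeholders; the Super Metric editor generates the exact syntax for you via autocomplete), the MTD energy consumption of a cluster is simply the sum of the host metric one level down:

sum(${adaptertype=VMWARE, objecttype=HostSystem, metric=MTDEnergyConsumption, depth=1})

The MTD cost Super Metric multiplies that sum by the Electricity Rate property added via the Custom Group earlier.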

Finally, we can use the new metrics to create Aria Operations Dashboards like this one, which shows the Month to Date energy consumption and costs for my Aria Operations Datacenter as well as the breakdown at cluster level.

Figure 10: Datacenter MTD Energy Consumption and costs dashboard

Outlook

The calculation of the PUE value is still work in progress, thus you will receive an error message when using the dashboard.

In the next blog post I will show how to add even more energy-related data, coming from basically any device, to VMware Aria Operations and how to use it.

Sources

The VMware Aria Automation Orchestrator workflows and the Aria Operations content are available here:

https://github.com/tkopton/aria-operations-content/tree/main/Sustainability-01

Please be aware that you will need to create a configuration in Aria Orchestrator, which I will use in an improved version of my workflows:

Figure 11: Aria Automation Orchestrator configuration

You will also need to create Custom Groups in your Aria Operations to group Hosts and/or Clusters and add the Energy Rate as a Custom Property to these objects.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

How to provision local SaltStack Master to work with VMware Aria Automation Config Cloud (aka SaltStack Config)

Recently I had to install and configure a SaltStack Master in my home lab and connect this master to my Aria Automation Config Cloud (aka SaltStack Config; I will use both names in this post) instance.

Even if the official documentation has improved a lot, there are still some pitfalls, especially if you are not experienced with the Salt setup.

In this blog post you will see, step by step, the entire process of preparing CentOS 8, installing Salt components and configuring the master to join the Cloud RaaS instance.

My new SaltStack Master will run on a CentOS 8 box:

4.18.0-348.2.1.el8_5.x86_64 #1 SMP

Step 1 – Prepare CentOS 8

The first thing you need to ensure is that the firewall is not blocking the Salt master ports.

[root@saltmaster02 ~]# firewall-cmd --permanent --add-port=4505-4506/tcp
success
[root@saltmaster02 ~]# firewall-cmd --reload
success

You can find the details on the official page of the salt project: https://docs.saltproject.io/en/latest/topics/tutorials/firewall.html

You also need to ensure that the gcc package is installed; if it is not available, simply run:

sudo yum install gcc python3-devel

Step 2 – Install Salt on your Salt master

You must install the Salt master service and the Salt minion service, plus a few more packages if needed, on the Salt master. The following instructions install the latest Salt release on CentOS 8 (RHEL 8).

In the Salt master’s terminal, run the following commands to install the Salt Project repository and key:

sudo rpm --import https://repo.saltproject.io/py3/redhat/8/x86_64/latest/SALTSTACK-GPG-KEY.pub
curl -fsSL https://repo.saltproject.io/py3/redhat/8/x86_64/latest.repo | sudo tee /etc/yum.repos.d/salt.repo

Run sudo yum clean expire-cache to refresh the repository metadata, then install the salt-master and salt-minion services on your Salt master:

sudo yum install salt-master
sudo yum install salt-minion

# Optional packages
sudo yum install salt-ssh
sudo yum install salt-syndic
sudo yum install salt-cloud
sudo yum install salt-api

Enable and start the services for salt-master, salt-minion, and any other Salt components you installed:

sudo systemctl enable salt-master && sudo systemctl start salt-master
sudo systemctl enable salt-minion && sudo systemctl start salt-minion
sudo systemctl enable salt-syndic && sudo systemctl start salt-syndic
sudo systemctl enable salt-api && sudo systemctl start salt-api

See the Salt Install guide for information about installing Salt on other operating systems.

Step 3 – Create initial master configuration

Create a master.conf file in the /etc/salt/minion.d directory. In this file, point the local Salt minion to the Salt master itself:

master: localhost

Restart the Salt master service and Salt minion service (the services have been enabled in the previous step):

sudo systemctl restart salt-master
sudo systemctl restart salt-minion

Step 4 – Install and configure the Master Plugin

After you install Salt on your on-premises infrastructure, you must install and configure the Master (SSEAPE) Plugin, which enables your Salt masters to communicate with Aria Automation Config (SaltStack Config) Cloud.

To install and configure the Master (SSEAPE) Plugin, you first need to install the required Python libraries. Log in to your local master and run:

sudo pip3 install pyjwt
sudo pip3 install pika

Download the latest Master Plugin wheel from Customer Connect. You will find the file in the package highlighted in the following picture.

Figure 01: Package containing the SSEAPE Plugin.

The Master Plugin is included in the Automated Installer .tar.gz file. After you download and extract the .tar.gz file, you can find the Master Plugin in the sse-installer/salt/sse/eapi_plugin/files directory.

Put the wheel file into your /root directory and install the Master Plugin by manually installing the Python wheel. Use the following example command, replacing the file name with the exact name of your wheel file:

sudo pip3 install SSEAPE-file-name.whl --prefix /usr

Verify that the /etc/salt/master.d directory exists and create it if needed.

Run the following command to generate the master configuration file.

sudo sseapi-config --all > /etc/salt/master.d/raas.conf

If running this command causes an error, see Troubleshooting SaltStack Config Cloud.

Restart the Salt master service.

sudo systemctl restart salt-master

Step 5 – Generate an API token

Before you can connect your Salt master to Aria Automation Config Cloud, you must generate an API token using the Cloud Services Console. This token is used to authenticate your Salt master with VMware Cloud Services.

NOTE: You must have the same role(s) as the role(s) you are configuring for the token. So, for example, if you are assigning the Organization Administrator role to the new token, you must be an Organization Administrator yourself! Please also see: https://docs.vmware.com/en/VMware-Cloud-services/services/Using-VMware-Cloud-Services/GUID-E2A3B1C1-E9AD-4B00-A6B6-88D31FCDDF7C.html

To generate an API token:

On the Cloud Services Console toolbar, click your user name and select My Account > API Tokens.

Click Generate Token.

Figure 02: Generate new API token.

Enter a name for the token.

Select the token’s Time to Live (TTL). The default duration is six months. Note: A non-expiring token can be a security risk if compromised. If this happens, you must revoke the token.

Define scopes for the token. To access the Aria Automation Config Cloud service, you must select the Organization Admin or Organization Owner roles as well as the Salt Master Service Role.

Figure 03: Specify the Organization and Service Roles.

(Optional) Set an email preference to receive a reminder when your token is about to expire.

Click Generate and the newly generated API token will appear in the Token Generated window.

Save the token value to a secure location. After you generate the token, you will only be able to see the token’s name on the API Tokens page, not the token value itself. To regenerate the token, click Regenerate.

Step 6 – Connect your Salt master to Aria Automation Config (SaltStack Config) Cloud

After you have generated an API token, you use it to connect your Salt master to Aria Automation Config Cloud.

To connect your Salt master, first set an environment variable to store the API token you created in the previous step:

export CSP_API_TOKEN=<api token value>

Run the sseapi-config join command to connect your Salt master to Aria Automation Config Cloud. You have to replace the ssc-url and csp-url values with your region-specific URLs. See the following table for the region-specific URLs.

Region name    | SSC URL                                      | CSP URL
US             | https://ssc-gateway.mgmt.cloud.vmware.com    | https://console.cloud.vmware.com
DE (Germany)   | https://de.ssc-gateway.mgmt.cloud.vmware.com | https://console.cloud.vmware.com
IN (India)     | https://in.ssc-gateway.mgmt.cloud.vmware.com | https://console.cloud.vmware.com
Table: Region-specific URLs

Run the sseapi-config join command:

sudo sseapi-config join --ssc-url <SSC URL> --csp-url <CSP URL>

In my example the command will be:

sseapi-config join --ssc-url https://ssc-gateway.mgmt.cloud.vmware.com --csp-url https://console.cloud.vmware.com

If you need to redo the joining process, re-run the sseapi-config join command and pass the flag --override-oauth-app.

sseapi-config join --ssc-url <SSC URL> --csp-url <CSP URL> --override-oauth-app

The --override-oauth-app flag deletes the OAuth app used to get an access token and recreates it.

Restart the Salt master service.

systemctl restart salt-master

Repeat this process for each Salt master. Note: After you connect each Salt master to Aria Automation Config Cloud, you can delete the API token. It is only required for connecting your Salt masters.

After you run the sseapi-config command, an OAuth app is created in your Organization for each Salt master. Salt masters use the OAuth app to get an access token which is appended to every request to Aria Automation Config Cloud. You can view the details of the OAuth app by selecting Organization > OAuth Apps.

The command also creates pillar data called CSP_AUTH_TOKEN on the Salt master. Pillars are structures of data stored on the Salt master and passed through to one or more minions that have been authorized to access that data. The pillar data is stored in /srv/pillar/csp.sls and contains the client ID, the secret, your organization ID, and CSP URL. If you need to rotate your secret, you can re-run the sseapi-config join command.

Example pillar data:

  CSP_AUTH_TOKEN:
   csp_client_id: kH8wIvNxMJEGGmk7uCx4MBfPswEw7PpLaDh
   csp_client_secret: ebH9iuXnZqUOkuWKwfHXPjyYc5Umpa00mI9Wx3dpEMlrUWNy95
   csp_org_id: 6bh70973-b1g2-716c-6i21-i9974a6gdc85
   csp_url: https://console.cloud.vmware.com

Step 7 – Accept Salt master keys

After you have connected your Salt master(s) to Aria Automation Config Cloud, you must accept the Salt master’s key in the Aria Automation Config Cloud user interface.

You must have the Superuser role in SaltStack Config Cloud to accept the Salt master’s key.

To accept the Salt master’s key:

  1. Log in to the SaltStack Config Cloud user interface.
  2. From the top left navigation bar, click the Menu, then select Administration to access the Administration workspace. Click the Master Keys tab.
  3. Check the box next to the master key to select it. Then, click Accept Key.
  4. If you already connected your Salt minions to your Salt master, an alert appears indicating that you have pending minion keys to accept. To accept these minion keys, go to Minion Keys > Pending.
    1. Check the boxes next to your minions to select them. Then, click Accept Key.
    The key is now accepted. After several seconds, the minion appears under the Accepted tab and in the Targets workspace.
Figure 04: Accepted master keys.

You can verify that your Salt master and Salt minions are communicating by running a test.ping command in the Aria Automation Config Cloud user interface.
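Alternatively, you can run a quick check from the Salt master’s shell, for example:

sudo salt '*' test.ping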

Stay safe.

Thomas – https://twitter.com/ThomasKopton

Aria Operations Heatmap Widget meets Custom Groups – Grouping by Property

The Aria Operations Heatmap Widget, which is used in many Dashboards, provides the possibility of grouping objects within the Heatmap by related object types, as in the following screenshot showing Virtual Machines grouped by their respective vSphere Cluster.

Figure 01: Heatmap with VMs grouped by their respective vSphere Cluster.
Problem description

This is a great feature, but the grouping only works with object types.

Today I was asked if there is a way to group Virtual Machines by their vCenter Custom Attribute value. To my knowledge this is not possible in the widget itself.

Possible solution / workaround

Assuming the possible values of the Custom Attribute are known, the Aria Operations Custom Group object type can be used to implement such requirements.

The use case is:

“I want to group my Kubernetes Virtual Machines by their K8s function (master, worker).”

The steps are:

  • create a custom Group Type
  • create required Custom Groups and associate them with the new Group Type
  • use the Custom Groups for grouping in the Heatmap Widget
Create Custom Group Type

A Group Type is an abstract construct to, well, group Custom Groups that belong together, like the OOTB provided types Environment, Function, etc.

For my use case I created the type tk-kubernetes-functions as shown in the next picture.

Figure 02: New Group Type.

Please note that you have to create the type before you create the groups! Once a group has been assigned to a type you cannot change the assignment, at least not using the UI.

Create Custom Groups

To group K8s masters and workers I created two groups. The configuration of the groups is based on the property value reflecting the vCenter Custom Attributes:

Summary|Custom Tag:kubernetes-role|Value.

Figure 03: Custom Group configuration.
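For the “master” group, the membership criteria boil down to something like this (just a sketch; the attribute name kubernetes-role comes from my environment and will differ in yours):

Object Type: Virtual Machine
Criteria:    Properties – Summary|Custom Tag:kubernetes-role|Value  is  “master”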

That way I group my three TKGi Virtual Machines into two groups.

Figure 04: New Custom Groups.
Configure Heatmap Widget using Custom Groups for grouping

In the last step I configured the Heatmap Widget to use the new Group Type for grouping. The next picture shows the corresponding configuration. You will find the new Group Type in the Container section of the available object types.

Figure 05: Heatmap configuration.

Now my Heatmap shows Kubernetes Virtual Machines grouped by their role.

Figure 06: Dashboard containing a Heatmap Widget with “property grouping”.

This approach will not work for all use cases, especially when the attribute values are not known upfront. In such scenarios Aria Automation could automate the creation of the Custom Groups in Aria Operations, but that is stuff for another blog post ;-)

Stay safe.

Thomas – https://twitter.com/ThomasKopton

How to detect Windows Blue Screen of Death using VMware Aria Operations

Problem statement

Recently I was asked by a customer what would be the best way to get alerted by VMware Aria Operations when a Windows VM stops because of a Blue Screen of Death (BSoD) or a Linux machine suddenly quits working due to a kernel panic.

Even if it looks like a piece of cake (we have tons of metrics and properties collected by Aria Operations), it turns out that it is not that simple to recognize such crashes without looking at the console.

So, challenge accepted:-)

In this blog post I am focusing on the Windows BSoD, and the overall idea was to figure out the metrics which, combined, indicate a BSoD occurrence.

NOTE: With default settings, both Windows and Linux will restart a crashed OS, and the restart is usually quick enough to remain “undetected” by Symptom Definitions unless you are using Near Real-Time Monitoring in VMware Aria Operations SaaS. A restart can also be initiated by vSphere HA (VM Monitoring) in case of missing heartbeats from the OS.

My “Windows BSoD” Approach

I quickly created a Windows Server 2019 VM with this configuration:

Figure 01: Windows VM configuration.

With the help of tools like Testlimit and DiskSpd, plus some usual activities on the VM, I created a “quick and dirty” baseline using the metrics shown in the next picture (please ignore the color coding for a moment). You will notice that the VMTools status is missing from the Scoreboard. I was not sure if I should include it, as “tools not running” does not necessarily mean that the OS crashed; it could also be a crashed service.

Figure 02: Windows 2019 Server VM baseline.
Blue Screen of Death Examples

NotMyFault is a perfect tool to crash Windows in various ways. As I wanted to check if different BSoD types have different symptoms, I used that tool to force a few crashes and collected a set of metrics for comparison.

First crash type.

I started with probably the most known Blue Screen:

Figure 03: IRQL not less or equal – BSoD.

The first surprise was that the CPU Demand of the VM immediately increased to almost 100%. To ensure that this is not related to Windows collecting data for some time after the crash, I checked this metric after two collection cycles (10 minutes) and it did not go down. The second finding is that the RAM Usage is only slowly decreasing; I assume this is simply due to the fact that memory is a truly virtualized resource, whereas CPU cycles inside the VM are actual cycles on the ESXi CPU. I also added the Guest Tools Status to the Scoreboard, but I would not use it as a symptom in an Alert Definition.

Figure 04: IRQL not less or equal – metrics.
Second crash type.
Figure 05: IRQL gt zero at system service – BSoD.

As you can see in the next picture, the metrics behave similarly to the first crash. Of course, no disk and network usage at all was expected, but seeing that CPU Demand and RAM Usage follow the same pattern is interesting and a very promising symptom.

Figure 06: IRQL gt zero at system service – metrics.

This time I waited a few more collection cycles to see how far the RAM usage would decrease; apparently it goes down to approx. 0.99% after more than 4 cycles.

Figure 07: RAM and CPU usage pattern.
Third crash type.
Figure 08: Unexpected kernel mode trap – BSoD.

This BSoD type again resulted in the same metric values.

Figure 09: Unexpected kernel mode trap – metrics.

The RAM on ESXi metric seems to depend on the memory usage of the VM before the crash. I did not fully test it, but the Aria Operations Metric Correlation feature shows the same pattern for the respective metrics.

Figure 10: Memory metrics correlation.

Just to be sure that the RAM Usage metric values do not change with different memory configurations of the VM, I did two more tests, the first one with 4GB RAM configured and the second with 17GB to check the metric with an odd RAM config.

Figure 11: 4GB VM metrics after crash.

And here is the 17GB RAM config.

Figure 12: 17GB VM metrics after crash.
Constraints, assumptions and conclusions

Please be aware that I did not test every possible scenario; this is what I used:

  • Windows 2019 Server Datacenter as OS
  • VM Version 19
  • VMware ESXi, 7.0.3, 20328353
  • 3 different BSoD types tested
  • VMTools not used as symptom
  • OS Uptime not used as symptom as the metric is not available after OS crashed
  • No Guest metrics used as such metrics will not be available after OS crashed

With the observations made during the crash tests I created 6 new Symptom Definitions and an Alert Definition using these new symptoms and one Condition for the power state of the VM. In the following two pictures you see the symptom and alert definitions.

Figure 13: New Symptom Definitions.
Figure 14: Alert Definition.

DO NOT forget to activate your new symptoms and alert definition in the Aria Operations Policy assigned to your VMs!

This is how the symptoms look on a crashed Windows Server 2019 VM.

Please be aware that the highlighted low memory usage symptom requires several collection cycles to become active. If you need a faster response, remove it from the Alert Definition.

Figure 15: Active symptoms.
Figure 16: Active alert.

The small dashboard I created is shown in the next picture.

Figure 17: BSoD Dashboard.

You can download the content from VMware Code.

Update 03.03.2023

One of my colleagues (thank you, Brandon) suggested testing the behavior with VMTools not running at all, as this has an impact on the memory usage metrics. Brandon also suggested adding CPU Usage or replacing CPU Demand with it, as demand will be affected by high CPU usage on the ESXi host. I have added this metric to the Metric Configuration file and uploaded it to VMware Code.

NOTE: CPU, Disk and Network metrics are basically instantly affected by the crash, whereas memory slowly converges toward 0.

As you can see in the following screenshot, you can use CPU Usage instead of CPU Demand as it will also increase to 98-100% after the BSoD.

Figure 18: BSoD Dashboard with new metrics.

I would like to mention once again that the CPU metrics are available basically right after the crash, and if you use Aria Operations SaaS, which is definitely the recommended way of using Aria Operations, you will get the symptoms triggered after roughly 40-60 seconds.

The memory metrics, as you can see in the next picture, will need several minutes to decrease to a level near 0.

Figure 19: Slowly converging memory metrics.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

VMware Explore Follow-up 2 – Aria Operations Dashboard Permissions Management

Another question I was asked during my “Meet the Expert – Creating Custom Dashboards” session, which I could not answer due to the limited time, was:

How to manage access permissions to Aria Operations Dashboards in a way that allows only a specific group of content admins to edit a specific group of dashboards?

Even if there is no explicit feature providing such functionality, there is a way to implement it using the Access Control and Dashboard Sharing capabilities of Aria Operations.

My solution

The assumption is that, for example, the following AD users and groups are available; content admins are responsible for creating dashboards and content users will be consuming their dedicated content.

Figure 01: AD Users and Groups

I have imported the AD groups into Aria Operations Access Control and, for the sake of simplicity, assigned them the predefined roles Content Admin and Read Only respectively, and granted access to all objects in Aria Operations.

Figure 02: AD Groups in Operations Access Control

I have also created two sample dashboards and two dashboard folders for these two dashboards. This is not really required but it makes it easier to find the dashboards if you have a larger number of them with a more complex categorization.

Figure 03: Aria Operations dashboard folders

And the last thing to do is to configure dashboard sharing accordingly in the dashboard management view shown in the next picture.

Figure 04: Aria Operations dashboard management

A dashboard can be shared with multiple user groups. In my example I have shared one sample dashboard with one editor group and one user group, and the other sample dashboard with another editor group and another user group. This way dedicated editors (the members of the AD group) have access only to the dashboards shared with them, and of course to any other dashboard shared with the built-in group Everyone. Regular users get access to their respective content in the very same way.

Figure 05: Aria Operations dashboard sharing

Of course this approach requires a proper user group and dashboard sharing concept but such a concept is recommended anyway.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

VMware Explore Follow-up – Aria Operations and SNMP Traps

During one of my “Meet the Expert” sessions this year in Barcelona I was asked if there is an easy way to use SNMP traps as Aria Operations notifications and let the SNMP trap receiver decide what to do with the trap based on information included beyond the alert definition, object type or object name itself.

The requirement is to make it as simple and as generic as possible; thus, creating separate alert definitions and notifications for e.g. Windows and Linux teams or for Dev and Test environments is not an option.

Solution

I had a few ideas in mind, but I had to test them first, as working with SNMP traps is not something I do very often.

Basically we have two easy options to include additional information in the notification:

  • Add metrics and/or properties to the payload template which can be used as a differentiator.
  • Modify the alert definition to always include an additional symptom which can be used as the differentiator, for example a vSphere tag based symptom.

Aria Operations Payload Templates allow you to add additional metrics and properties to the notification. These metrics and properties do not have to be related to the actual alert definition, but they can help organize and route the alerts in the receiving system based on that additional information.

In the following picture you can see my payload template which includes one additional metric and one property. My test alert definition will be triggered on Virtual Machine object type.

Figure 01: Payload Template

For my tests I have also created a new, very simple Symptom Definition; this symptom basically triggers every time a Virtual Machine has any vSphere tag assigned to it. A specific tag can then be parsed later on by the receiver to make the required routing decisions.

The next picture shows the symptom definition.

Figure 02: “Dummy” Symptom Definition

My Aria Operations Alert Definition includes the actual symptom I am interested in; for simplicity it is also based on a certain vSphere tag, which I can quickly set and remove to trigger the alert, combined via a boolean AND with the dummy symptom definition.

Figure 03: Alert Definition

As the last step in Aria Operations, I have created a Notification which sends the SNMP trap to my Aria Orchestrator instance, where I can inspect the SNMP message to see what is actually included.

Figure 04: Notification Definition
SNMP Message

And here is what Aria Operations sends as the SNMP message. For completeness, I have included the entire message and highlighted the additional information, both the dummy symptom and the modified payload. The links at the end of this section describe the Aria Operations MIB and help identify and parse the relevant parts.

Element 1:
=============
oid: 1.3.6.1.2.1.1.3.0
type: Number
snmp type: Timeticks
value: 3112273537

Element 2:
=============
oid: 1.3.6.1.6.3.1.1.4.1.0
type: String
snmp type: OID
value: 1.3.6.1.4.1.6876.4.50.1.0.46

Element 3:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.1.0
type: String
snmp type: Octet String
value: vrops.cpod-cmbu-vcf01.az-muc.cloud-garage.net

Element 4:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.2.0
type: String
snmp type: Octet String
value: ansible

Element 5:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.3.0
type: String
snmp type: Octet String
value: General

Element 6:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.4.0
type: String
snmp type: Octet String
value: 1669559583519

Element 7:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.5.0
type: String
snmp type: Octet String
value: warning

Element 8:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.6.0
type: String
snmp type: Octet String
value: New alert by id 1ab40eba-c480-4475-91e2-a0cc682fe945 is generated at Sun Nov 27 14:33:03 UTC 2022;

Element 9:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.7.0
type: String
snmp type: Octet String
value: https://172.28.4.33/ui/index.action#environment/object-browser/hierarchy/d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee/alerts-and-symptoms/alerts/1ab40eba-c480-4475-91e2-a0cc682fe945

Element 10:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.8.0
type: String
snmp type: Octet String
value: 1ab40eba-c480-4475-91e2-a0cc682fe945

Element 11:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.9.0
type: String
snmp type: Octet String
value: symptomSet: 1242e208-cc7f-40db-9bb0-ecc8a55b1f9b
relation: self
totalObjects: 1
violatingObjects: 1
symptom: tk-Include-vSphere-Tags
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric: 
obj.1.info: Property [<OS-Type-Windows3.11>] matches regular expression .*
symptom: tk-TriggerTestAlert
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric: 
obj.1.info: Property [<OS-Type-Windows3.11>, <killSwitch-On>] contains <killSwitch-On>


Element 12:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.10.0
type: String
snmp type: Octet String
value: Application

Element 13:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.11.0
type: String
snmp type: Octet String
value: Performance

Element 14:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.12.0
type: String
snmp type: Octet String
value: warning

Element 15:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.13.0
type: String
snmp type: Octet String
value: warning

Element 16:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.14.0
type: String
snmp type: Octet String
value: info

Element 17:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.15.0
type: String
snmp type: Octet String
value: 

Element 18:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.16.0
type: String
snmp type: Octet String
value: VirtualMachine

Element 19:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.17.0
type: String
snmp type: Octet String
value: tk-TestAlert-01

Element 20:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.18.0
type: String
snmp type: Octet String
value: 

Element 21:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.19.0
type: String
snmp type: Octet String
value: health

Element 22:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.20.0
type: String
snmp type: Octet String
value: tk-SNMP-Trap-Test

Element 23:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.21.0
type: String
snmp type: Octet String
value: Number of KPIs Breached : 0.0

Parent Host : esx03.cpod-cmbu-vcf01.az-muc.cloud-garage.net



Element 24:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.22.0
type: String
snmp type: Octet String
value: 

Element 25:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.23.0
type: String
snmp type: Octet String
value: 

In this dump, the two highlighted pieces a receiver would parse are Element 11 (OID 1.3.6.1.4.1.6876.4.50.1.2.9.0, the symptom details including the vSphere tag information) and Element 23 (OID 1.3.6.1.4.1.6876.4.50.1.2.21.0, the additional metrics and properties added via the payload template). The following links explain the SNMP content in detail.

https://github.com/librenms/librenms/blob/master/mibs/vmware/VMWARE-VCOPS-EVENT-MIB

https://mibs.observium.org/mib/VMWARE-VROPS-MIB/

Stay safe.

Thomas – https://twitter.com/ThomasKopton

vRealize Log Insight and Direct Forwarding to Log Insight Cloud

Since the release of vRealize Log Insight 8.8, you can configure log forwarding from vRealize Log Insight to vRealize Log Insight Cloud without the need to deploy any additional Cloud Proxy. The Cloud Forwarding feature in Log Management makes it very easy to forward all or only selected log messages from your decentralized vRLI instances into a centralized vRealize Log Insight Cloud instance.

Setup procedure

Create an API Key in vRealize Log Insight Cloud

As usual for SaaS, the first thing we need is an API access key. That key is quickly generated in the API Keys section in vRealize Log Insight Cloud.

Figure 01: Generating API Key in vRLI Cloud.

Once we have generated the key, we can copy it as well as the URL which we will need to set up the Cloud Forwarding in vRLI later on.

Figure 02: API Key and the target URL in vRLI Cloud.
Configure Cloud Forwarding in vRLI

With the API Key and the URL, we can set up the Cloud Forwarding in vRealize Log Insight. The Cloud Forwarding option is one of the features of vRLI Log Management. A click on New Channel opens the forwarding configuration dialog.

Figure 03: New Channel – Cloud Forwarding configuration.

Alongside the name, key and URL of the new channel, we can also specify:

  • Tags we want to add to every forwarded log message to easily filter for certain sources in the target vRLI Cloud instance. In my example, I am tagging the messages according to the originating environment.
  • Filter to exactly specify which log messages should be forwarded to vRLI Cloud. In my example, I am forwarding all vRealize Operations messages to vRLI Cloud.
Figure 04: New Channel options.
Read Only Forwarding Option

If you activate the Read Only toggle button, vRealize Log Insight acts as a relay and does not store or index the logs forwarded to vRealize Log Insight Cloud. This option can be used to decrease the load on the on-premises vRealize Log Insight and save disk space.

Figure 05: Read Only forwarding option.
Browsing Logs in vRLI Cloud

Once the logs are forwarded to vRealize Log Insight Cloud, we can use all Cloud features, like Live Trail, to inspect the log messages.

Figure 06: Forwarded log messages in vRLI Cloud.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

How vRealize Operations helps size new vSphere Clusters

In ESXi Cluster (non-HCI) Rightsizing using vRealize Operations I have described how to use vRealize Operations and the numbers calculated by the Capacity Engine to estimate the number of ESXi hosts which might be moved to other clusters or decommissioned. The corresponding dashboard is available via VMware Code.

In this post, I describe the opposite scenario.

Problem Statement

The question I will answer is: “How can I use vRealize Operations to help me size new vSphere clusters using completely new ESXi hosts I plan to purchase?”

With its Capacity Engine and the “What-If Analysis” scenarios, vRealize Operations provides powerful features to help with infrastructure and workload planning. In case you are not familiar with “What-If”, the following picture shows the supported scenarios.

Figure 01: vRealize Operations “What-If Analysis” supported scenarios.

What we are missing here is a scenario covering local workload (Virtual Machine) migrations from existing vSphere clusters to newly planned, not yet existing clusters. Usually, you know what kind of compute hardware you are planning to buy, or at least what your choices are; what you do not know is how many hosts you need to run specific workloads.

Solution

NOTE: I am using the demand model in this use case. The allocation model would be similar to implement.

Using vROps and knowing what type of hardware will be used, we have everything we need to estimate the number of hosts required to migrate all workloads from an “old” vSphere Cluster.

These are the ingredients:

  • “Recommended Total Capacity (Mhz)” as calculated by the vROps Capacity Engine
  • “Recommended Total Capacity (GB)” as calculated by the vROps Capacity Engine
  • total CPU resources (in MHz) provided by the new hardware
  • total RAM resources (in GB) provided by the new hardware

Now we need to do some simple math:

"Recommended Total Capacity (Mhz)" / "total CPU resources (in MHz) provided by the new hardware"

"Recommended Total Capacity (GB)" / "total RAM resources (in GB) provided by the new hardware"

I use two vROps Super Metrics with the simplest possible formula, a constant number, to represent the potential new resources.

In this example, it is a Cisco Blade system with a certain CPU and RAM configuration.

Figure 02: Super Metrics representing the new hardware configuration.

Another three Super Metrics, attached to Cluster Compute Resource as object type, simply calculate the required number of such new hosts, from the CPU and RAM perspective, and identify the higher one as the number of required hosts.

Figure 03: Super Metrics calculating the number of required new hosts.

To make it easier to consume I have created a dashboard similar to the rightsizing one.

Figure 04: Local migration planning dashboard.

You can download the Super Metrics, Views, and Dashboard from VMware Code.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

vRealize Log Insight Daily Disk Consumption – Quick Tip

Recently I was asked how to quickly estimate the vRealize Log Insight (vRLI) data usage per day, and I have checked several more or less obvious approaches.

Usually, this question comes up when a customer is either re-designing the vRLI cluster or migrating to the vRealize Log Insight Cloud instance. To answer that question we first need to specify what “data usage” means.

Data usage on disk is certainly the most important metric when you need to size your instance. The first idea was to use the vRLI System Monitor and evaluate the IOPS statistics – Bytes Written metric. In the example shown in the next picture, we have a value of around 13MB/s on average, which would mean roughly 1TB per day of written data (13MB/s x 86,400s ≈ 1.1TB)! In my environment that cannot be correct. Firstly, it is an average; secondly, it is IOPS, which means not every byte written is data being stored – update and delete operations are also written data.

Figure 01: vRLI System Monitor – Bytes Written.

The number is still very valuable as you usually need the IOPS information to size your storage backend, but it does not help with the data usage on disk.

The next idea was to leverage vRealize Operations (vROps) to estimate the daily usage or daily incoming data. In vROps you simply go to the vRLI node(s) and check the Guest File System usage. Unfortunately, that does not work if the disk is already at 97% usage, as at this point vRLI starts deleting old buckets and you will see frequent drops instead of a continuously increasing metric value. That approach also may not work if you have created multiple index partitions with different retention periods and vRLI is frequently removing data. In the next figure, you see the file system usage in a case where the file system is already breaching the 97% threshold.

Figure 02: vRLI file system usage monitored by vRealize Operations.

As long as the file system is way below 97% and there is only one index partition configured, this vROps number, or more specifically the difference between the value at day_n and day_n-1, will give you a pretty good estimation.

All that means is we need something more accurate.

And here it comes. Working on System Alerts, I coincidentally discovered that the Repository Retention Time system alert gives you exactly the number you need. To trigger that alert, you simply configure the threshold to a number that will be breached, and you will get your information.

Figure 03: vRLI Retention Notification settings.

After the next evaluation (once every 24 hours) you will receive an email with the needed numbers. In the next figure, you see that my vRLI instance consumes around 10GB per day. I received that message for the next two days and the number remained stable at around 9.8 – 9.9 GB/day.

Figure 04: Disk consumption per day information.

I hope this helps you size your vRLI, for example when migrating to vRealize Log Insight Cloud.

Stay safe.

Thomas – https://twitter.com/ThomasKopton

vRealize Operations and Telegram – Alerts on your Phone

As you all know, vRealize Operations is a perfect tool to manage and monitor your SDDC; in case of issues, vROps creates alerts and informs you as quickly as possible, providing many details related to the alert.

Without any additional customization, vRealize Operations displays the alerts in the Alerts tab. Sending alerts via email is the most common way to quickly attract your attention and increase the chances of a quick reaction to the alert.

With the Webhook Notification Plugin, it is possible to integrate vROps with almost any REST API endpoint. Telegram provides an easy-to-use API and allows for sending vROps alerts as messages into a Telegram chat.

Telegram Chat, Bot, and Token

The first step is to prepare Telegram to allow vROps to send messages to a chat. Basically, you will need a chat, a bot, and a token. The process is very well described here.

The actual REST API POST call uses the chat ID and the Token to ingest messages into the chat.

https://api.telegram.org/bot$token/sendMessage?chat_id=$chat&text=Hello+World

The POST content is pretty simple.

{
  "text": "my message"
}
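Before wiring up vROps, it is worth verifying the bot and the chat ID with a quick test call, for example (replace $token and $chat with your own values):

curl -s -X POST "https://api.telegram.org/bot$token/sendMessage" \
  -H "Content-Type: application/json" \
  -d "{\"chat_id\": \"$chat\", \"text\": \"Hello from vROps\"}"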

vROps Outbound Instance

The next step is the vROps Outbound Webhook Plugin instance that will point to our Telegram chat. For simplicity, the Outbound URL contains the entire URL we need to send messages to Telegram.

In my environment, the instance looks as depicted in the following figure; the bot ID and token are truncated for readability and my data privacy.

Figure 01: Outbound Webhook Plugin instance.

You do not have to specify any user or password, as the ingestion uses the token in the URL. When you run the test you will see an error, but in our case this is OK.

vROps Payload Template

Following the Outbound Plugin instance, we need to specify the details of the REST call and the payload. In my example, I would like to receive datastore space-related alarms in my chat. Details of the alert definition are not described in this post.

The Payload Template is where we define these settings, subdivided into:

  • Details: Specifies the name of the instance and the Outbound Method to use, in this case, the Webhook Notification Plugin.
  • Object Content: Here we can add additional information, like metrics and properties to be used in the actual payload. I have added a few additional metrics to help qualify the alarm.
  • Payload Details: Specifies the actual REST call, like method or content type, and describes the body of the call as required by the receiving instance. In the body itself, we can use the predefined parameters (uppercase) and our additional metrics or properties or related objects (lowercase).

The following two pictures show my Payload Template.

Figure 02: Payload Template – Details.
Figure 03: Payload Template – Object Content.
Figure 04: Payload Template – Payload Details.

vROps Notification

The last step of the process is to bring everything together, the alert definition, the outbound instance, and the payload template, so that they work together and send the message to Telegram.

This is where the good old vROps Notifications come into play. A notification wires everything together and does exactly what the name implies – it notifies you of an alarm event.

The process of notification creation has four steps:

  1. Notification: Here you give the notification a name and enable or disable it.
  2. Define Criteria: This step gives you a large number of options to exactly specify when the notification should be triggered. In my use case, I want this notification to be triggered only on one specific object type (Datastore) and only for one specific alert definition (tk-vSAN Remaining Space Low).
  3. Set Outbound Method: Here we define the Outbound Method (Webhook Notification Plugin) and the Outbound Instance we created for Telegram.
  4. Select Payload Template: In the last step we define the Payload Template to use, in our use case the template we created for Telegram.

The following pictures show my Notification settings.

Figure 05: vROps Notification – Notification.
Figure 06: vROps Notification – Define Criteria.
Figure 07: vROps Notification – Set Outbound Method.
Figure 08: vROps Notification – Select Payload Template.

The Result

When all steps are finished successfully, you will receive a message in your Telegram chat any time the specified alarm is triggered, as depicted in the following figure.

Figure 09: vROps Alarm in Telegram chat.

Stay safe.

Thomas – https://twitter.com/ThomasKopton