In VMware Aria Operations 8.6 (previously known as vRealize Operations), VMware introduced pioneering sustainability dashboards designed to display the amount of carbon emissions avoided through compute virtualization. Additionally, these dashboards offer insights into reducing the carbon footprint by identifying and optimizing idle workloads.
This progress was taken even further with the introduction of Sustainability v2.0 in the Aria Operations Cloud update released in October 2022 as well as in the Aria Operations 8.12 on-premises edition. Sustainability v2.0 is centered around three key themes:
Assessing the Current Carbon Footprint
Monitoring Carbon Emissions with a Green Score
Providing Actionable Recommendations for Enhancing the Green Score.
When working with Virtual Machine power-related metrics, you need to be careful if your VMs are running on certain ESXi 7.0 versions, as the affected builds report these values a factor of 1,000 too high.
The issue can be fixed very easily in Aria Operations using two simple Super Metrics. The first one corrects the Power|Power (Watt) metric:
${this, metric=power|power_average} / 1000
And the second Super Metric fixes the Power|Total Energy (Wh) metric:
${this, metric=power|energy_summation_sum} / 1000
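Both formulas are simple unit corrections. As a sketch of what they do (assuming, as the division by 1000 implies, that the affected builds report values 1,000 times too high):

```python
# Sketch of the two Super Metric corrections above: affected ESXi 7.0 builds
# report power/energy values a factor of 1,000 too high, so we divide by 1000.

def corrected_power_watt(power_average: float) -> float:
    """Equivalent of ${this, metric=power|power_average} / 1000."""
    return power_average / 1000

def corrected_energy_wh(energy_summation_sum: float) -> float:
    """Equivalent of ${this, metric=power|energy_summation_sum} / 1000."""
    return energy_summation_sum / 1000

print(corrected_power_watt(12500.0))  # raw value from an affected host -> 12.5 W
```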
Applying the Super Metric – Automatically
Super Metrics are activated on certain objects in Aria Operations using Policies. The most common construct used to group objects and apply a Policy to them is the Custom Group.
In this case I am using two Custom Groups. The first one contains all ESXi Host System objects with version affected by the issue described in the KB. The second Custom Group contains all Virtual Machine objects running on Host Systems belonging to the first group.
The following picture shows how to define the membership criteria. And now you may see the problem: it would be a lot of clicking to include all 23 affected versions. But there is an easier way. Simply create the Custom Group with two criteria as shown below.
In the next step, export the Custom Group to a file, open the JSON file with your favorite editor, copy and paste the membership criteria (it is an array), and adjust the version numbers.
Save the file and import it into Aria Operations overwriting the existing Custom Group.
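The copy-and-adjust step can also be scripted. The sketch below assumes a hypothetical layout for the exported definition (a JSON array of rules, each matching one version string); the real export format and key names may differ, so treat them as placeholders:

```python
import copy
import json

# Hypothetical shape of an exported Custom Group definition: the membership
# criteria are an array of rules, each matching one ESXi version string.
# Key names are placeholders, not the real export schema.
group = {
    "name": "tk-affected-esxi-hosts",
    "membershipCriteria": [
        {"property": "summary|version", "operator": "EQ", "value": "7.0.0"},
    ],
}

affected_versions = ["7.0.1", "7.0.2", "7.0.3"]  # extend to all affected builds

template = group["membershipCriteria"][0]
for version in affected_versions:
    rule = copy.deepcopy(template)   # copy the criterion ...
    rule["value"] = version          # ... and adjust the version number
    group["membershipCriteria"].append(rule)

print(json.dumps(group, indent=2))   # save this and re-import the group
```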
Now this Custom Group contains all affected ESXi servers, and we can proceed with the VM group. Its membership criteria are simple, as shown in the next picture.
You can download the Custom Group definition here and adjust the name, description and the policy to meet your requirements.
With this relatively simple approach, Aria Operations provides correct VM-level power and energy metrics.
The Aria Operations Heatmap widget, which is used in many dashboards, lets you group objects within the Heatmap by other related object types, as in the following screenshot showing Virtual Machines grouped by their respective vSphere Cluster.
Problem description
This is a great feature, but the grouping works only with object types.
Today I was asked if there is a way to group Virtual Machines by their vCenter Custom Attribute value. To my knowledge this is not possible in the widget itself.
Possible solution / workaround
Assuming the possible values of the Custom Attribute are known, the Aria Operations Custom Group object type can be used to implement such requirements.
The use case is:
“I want to group my Kubernetes Virtual Machines by their K8s function (master, worker).”
The steps are:
create a custom Group Type
create required Custom Groups and associate them with the new Group Type
use the Custom Groups for grouping in the Heatmap Widget
Create Custom Group Type
A Group Type is an abstract construct to, well, group together Custom Groups that belong together, like the OOTB types Environment, Function, etc.
For my use case I created the type tk-kubernetes-functions as shown in the next picture.
Please note that you have to create the type before you create the groups! Once a group has been assigned to a type you cannot change the assignment, at least not using the UI.
Create Custom Groups
To group K8s masters and workers I created two groups. The configuration of the groups is based on the property value reflecting the vCenter Custom Attributes:
Summary|Custom Tag:kubernetes-role|Value.
That way I group my three TKGi Virtual Machines into two groups.
Configure Heatmap Widget using Custom Groups for grouping
In the last step I configured the Heatmap widget to use the new Group Type for grouping. The next picture shows the corresponding configuration. You will find the new Group Type in the Container section of available object types.
Now my Heatmap shows Kubernetes Virtual Machines grouped by their role.
This approach will not work for all use cases, especially when the attribute values are not known upfront. In such scenarios Aria Automation could automate the creation of the Custom Groups in Aria Operations, but this is stuff for another blog post;-)
Recently I was asked by a customer what would be the best way to get alerted by VMware Aria Operations when a Windows VM stopped because of a Blue Screen of Death (BSoD) or a Linux machine suddenly quit working due to a Kernel Panic.
Even if it looks like a piece of cake (we have tons of metrics and properties collected by Aria Operations), it turns out that it is not that simple to recognize such crashes without looking at the console.
So, challenge accepted:-)
In this blog post I am focusing on Windows BSoD and the overall idea was to figure out the metrics which combined are indicating a BSoD occurrence.
NOTE: Windows as well as Linux will restart a crashed OS with default settings, and the restart is usually quick enough to remain “undetected” by Symptom Definitions unless you are using Near-Real Time Monitoring in VMware Aria Operations SaaS. A restart can also be initiated by vSphere HA settings in case of missing heartbeats from the OS.
My “Windows BSoD” Approach
I quickly created a Windows Server 2019 VM with this configuration:
And with the help of tools like Testlimit and DiskSpd, plus some usual activities on the VM, I created a “quick and dirty” baseline using the metrics shown in the next picture (please ignore the color coding for a moment). You will notice that the VMTools status is missing in the Scoreboard. I was not sure whether to include it, as “tools not running” does not necessarily mean that the OS crashed; it could also be a crashed service.
Blue Screen of Death Examples
NotMyFault is a perfect tool to crash Windows in various ways. As I wanted to check if different BSoD types have different symptoms, I used that tool to force a few crashes and collected a set of metrics for comparison.
First crash type.
I started with probably the most known Blue Screen:
The first surprise was that the CPU Demand of the VM immediately increased to almost 100%. To ensure that this was not related to Windows collecting data after the crash for some time, I checked this metric after two collection cycles (10 minutes) and it did not go down. The second finding is that the RAM Usage decreases only slowly; I assume this is simply due to the fact that memory is a truly virtualized resource, whereas CPU cycles inside the VM are actual cycles on the ESXi CPU. I also added the Guest Tools Status to the Scoreboard, but I would not use it as a symptom in an Alert Definition.
Second crash type.
As you can see in the next picture, the metrics behave similarly to the first crash. Of course, no disk and network usage at all was expected, but seeing the CPU Demand and RAM Usage follow the same pattern is interesting and a very promising symptom.
This time I waited a few more collection cycles to see how far the RAM usage would decrease, and apparently it goes down to approx. 0.99% after more than four cycles.
Third crash type.
This BSoD type again resulted in the same metrics values.
The RAM on ESXi metric seems to depend on the memory usage of the VM before the crash. I did not fully test it, but the Aria Operations Metric Correlation feature shows the same pattern for the respective metrics.
Just to be sure that the RAM Usage metric values do not change with different memory configurations of the VM, I did two more tests, the first one with 4GB RAM configured and the second with 17GB to check the metric with an odd RAM config.
And here the 17GB RAM config.
Constraints, assumptions and conclusions
Please be aware that I did not test every possible scenario, this is what I used:
Windows 2019 Server Datacenter as OS
VM Version 19
VMware ESXi, 7.0.3, 20328353
3 different BSoD types tested
VMTools not used as symptom
OS Uptime not used as symptom as the metric is not available after OS crashed
No Guest metrics used as such metrics will not be available after OS crashed
With the observations made during the crash tests I created 6 new Symptom Definitions and an Alert Definition using these new symptoms and one Condition for the power state of the VM. In the following two pictures you see the symptom and alert definitions.
DO NOT forget to activate your new symptoms and alert definition in the Aria Operations Policy assigned to your VMs!
This is how the symptoms look on a crashed Windows Server 2019 VM.
Please be aware that the highlighted low memory usage symptom requires several collection cycles to become active. If you need fast response, remove it from the Alert Definition.
The small dashboard I created is shown in the next picture.
One of my fellow colleagues (thank you, Brandon) suggested testing the behavior with VMTools not running at all, as this has an impact on memory usage metrics. Brandon also suggested adding or replacing CPU Demand with CPU Usage, as demand will be affected by high CPU usage on the ESXi host. I have added this metric to the Metric Configuration file and uploaded it to VMware Code.
NOTE: CPU, Disk and Network metrics are basically instantly affected by the crash, whereas memory slowly converges toward 0.
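Put together, the observations above can be sketched as a simple predicate. The thresholds below are my illustrative assumptions based on the described tests, not official values from the Symptom Definitions:

```python
# Sketch of the combined BSoD symptoms: after a crash, CPU usage jumps to
# ~100%, disk and network I/O drop to zero immediately, and guest memory
# usage slowly converges toward ~1%. All thresholds are illustrative.

def looks_like_bsod(powered_on: bool, cpu_usage_pct: float,
                    disk_iops: float, net_kbps: float,
                    mem_usage_pct: float) -> bool:
    return (
        powered_on                  # the VM itself is still running
        and cpu_usage_pct >= 98.0   # CPU pegged right after the crash
        and disk_iops == 0.0        # no disk activity at all
        and net_kbps == 0.0         # no network activity at all
        and mem_usage_pct <= 1.0    # memory decayed to ~1% (takes several cycles)
    )

print(looks_like_bsod(True, 99.5, 0.0, 0.0, 0.9))       # crashed-VM pattern: True
print(looks_like_bsod(True, 45.0, 120.0, 300.0, 60.0))  # healthy VM: False
```

Note that, as in the Alert Definition, the memory condition makes the check slow to trigger; drop it if you need a fast response.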
As you can see in the following screenshot, you can use CPU Usage instead of CPU Demand as it will also increase to 98-100% after the BSoD.
I would like to mention once again that the CPU metrics are available basically right after the crash and if you use Aria Operations SaaS, which is definitely the recommended way of using Aria Operations, you will get the symptoms triggered roughly after 40-60 seconds.
The memory metrics, as you can see in the next picture, will need several minutes to decrease to a level near 0.
Another question I was asked during my “Meet the Expert – Creating Custom Dashboards” session which I could not answer due to the limited time was:
“How to manage access permissions to Aria Operations Dashboards in a way that will allow only specific group of content admins to edit only specific group of dashboards?“
Even if there is no explicit feature providing such functionality, there is a way to implement it using Access Control and Dashboard Sharing capabilities of Aria Operations.
My solution
The assumption is that, for example, the following AD users and groups are available; content admins are responsible for creating dashboards, and content users will consume their dedicated content.
I have imported the AD groups in Aria Operations Access Control and for the sake of simplicity I have assigned them the predefined roles Content Admin and Read Only respectively and granted access to all objects in Aria Operations.
I have also created two sample dashboards and two dashboard folders for these two dashboards. This is not really required but it makes it easier to find the dashboards if you have a larger number of them with a more complex categorization.
And the last thing to do is to configure dashboard sharing accordingly in the dashboard management shown in the next picture.
A dashboard can be shared with multiple user groups. In my example, I have shared one sample dashboard with one editor and user group, and the other sample dashboard with another editor and another user group. This way, dedicated editors (the members of the AD group) have access only to the dashboards shared with them, and of course to any other dashboard shared with the built-in group Everyone. Regular users get access to their respective content the very same way.
Of course this approach requires a proper user group and dashboard sharing concept but such a concept is recommended anyway.
During one of my “Meet the Expert” sessions this year in Barcelona, I was asked if there is an easy way to send SNMP traps as Aria Operations Notifications and let the SNMP trap receiver decide what to do with a trap based on information included beyond the alert definition, object type, or object name itself.
The requirement is to make it as simple and as generic as possible thus creating separate alert definitions and notifications for e.g. Windows and Linux teams or Dev or Test environments is not an option.
Solution
I had a few ideas in mind, but I had to test them first, as working with SNMP traps is not something I do very often.
Basically we have two easy options to include additional information in the notification:
Add metrics and/or properties to the payload template which can be used as a differentiator.
Modify the alert definition to always include an additional symptom which can be used as the differentiator, like for example include a vSphere tag based symptom.
Aria Operations Payload Templates allow you to add any additional metrics and properties to the notification. These metrics and properties do not have to be related to the actual alert definition but might help to organize and route the alerts in the receiving system based on that additional information.
In the following picture you can see my payload template which includes one additional metric and one property. My test alert definition will be triggered on Virtual Machine object type.
For my tests I have also created a new, very simple Symptom Definition; this symptom basically triggers every time a Virtual Machine has any vSphere tag assigned to it. A specific tag can then be parsed later on to make the required decisions.
The next picture shows the symptom definition.
My Aria Operations Alert Definition includes the actual symptom I am interested in, for simplicity reasons it is also a certain vSphere tag which I can quickly set and remove to trigger the alert, combined using a boolean AND with the dummy symptom definition.
As the last step in Aria Operations, I have created a Notification which sends the SNMP trap to my Aria Orchestrator instance, where I can inspect the SNMP message to see what is actually included.
SNMP Message
And here is what Aria Operations is sending as the SNMP message. For completeness I have included the entire message here and highlighted the additional information, both the dummy symptom and the modified payload. The following links describe the Aria Operations MIB and help identify and parse the relevant parts.
Element 1:
=============
oid: 1.3.6.1.2.1.1.3.0
type: Number
snmp type: Timeticks
value: 3112273537
Element 2:
=============
oid: 1.3.6.1.6.3.1.1.4.1.0
type: String
snmp type: OID
value: 1.3.6.1.4.1.6876.4.50.1.0.46
Element 3:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.1.0
type: String
snmp type: Octet String
value: vrops.cpod-cmbu-vcf01.az-muc.cloud-garage.net
Element 4:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.2.0
type: String
snmp type: Octet String
value: ansible
Element 5:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.3.0
type: String
snmp type: Octet String
value: General
Element 6:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.4.0
type: String
snmp type: Octet String
value: 1669559583519
Element 7:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.5.0
type: String
snmp type: Octet String
value: warning
Element 8:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.6.0
type: String
snmp type: Octet String
value: New alert by id 1ab40eba-c480-4475-91e2-a0cc682fe945 is generated at Sun Nov 27 14:33:03 UTC 2022;
Element 9:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.7.0
type: String
snmp type: Octet String
value: https://172.28.4.33/ui/index.action#environment/object-browser/hierarchy/d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee/alerts-and-symptoms/alerts/1ab40eba-c480-4475-91e2-a0cc682fe945
Element 10:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.8.0
type: String
snmp type: Octet String
value: 1ab40eba-c480-4475-91e2-a0cc682fe945
Element 11:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.9.0
type: String
snmp type: Octet String
value: symptomSet: 1242e208-cc7f-40db-9bb0-ecc8a55b1f9b
relation: self
totalObjects: 1
violatingObjects: 1
symptom: tk-Include-vSphere-Tags
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric:
obj.1.info: Property [<OS-Type-Windows3.11>] matches regular expression .*
symptom: tk-TriggerTestAlert
active: true
obj.1.name: ansible
obj.1.id: d83f38d5-ec7d-44e8-81dc-54b02b3cd3ee
obj.1.metric:
obj.1.info: Property [<OS-Type-Windows3.11>, <killSwitch-On>] contains <killSwitch-On>
Element 12:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.10.0
type: String
snmp type: Octet String
value: Application
Element 13:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.11.0
type: String
snmp type: Octet String
value: Performance
Element 14:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.12.0
type: String
snmp type: Octet String
value: warning
Element 15:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.13.0
type: String
snmp type: Octet String
value: warning
Element 16:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.14.0
type: String
snmp type: Octet String
value: info
Element 17:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.15.0
type: String
snmp type: Octet String
value:
Element 18:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.16.0
type: String
snmp type: Octet String
value: VirtualMachine
Element 19:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.17.0
type: String
snmp type: Octet String
value: tk-TestAlert-01
Element 20:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.18.0
type: String
snmp type: Octet String
value:
Element 21:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.19.0
type: String
snmp type: Octet String
value: health
Element 22:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.20.0
type: String
snmp type: Octet String
value: tk-SNMP-Trap-Test
Element 23:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.21.0
type: String
snmp type: Octet String
value: Number of KPIs Breached : 0.0
Parent Host : esx03.cpod-cmbu-vcf01.az-muc.cloud-garage.net
Element 24:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.22.0
type: String
snmp type: Octet String
value:
Element 25:
=============
oid: 1.3.6.1.4.1.6876.4.50.1.2.23.0
type: String
snmp type: Octet String
value:
The following links explain the SNMP content in detail.
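As a sketch of how a receiver could pick the interesting varbinds out of such a trap, assuming the OID layout shown above (`1.3.6.1.4.1.6876.4.50.1.2.N.0`); the field names are my interpretation of the sample message, so verify them against the official MIB:

```python
# Sketch: map the vROps trap varbinds to named fields by their OID index.
# The index-to-name mapping is my reading of the sample trap above; check
# it against the Aria Operations MIB before relying on it.

TRAP_PREFIX = "1.3.6.1.4.1.6876.4.50.1.2."
FIELD_NAMES = {
    6: "message",
    7: "alert_url",
    8: "alert_id",
    9: "symptom_details",   # this is where the dummy symptom shows up
    21: "payload",          # the additional metrics/properties
}

def parse_vrops_trap(varbinds):
    """varbinds: iterable of (oid, value) string tuples from the receiver."""
    fields = {}
    for oid, value in varbinds:
        if oid.startswith(TRAP_PREFIX) and oid.endswith(".0"):
            index = int(oid[len(TRAP_PREFIX):-2])
            fields[FIELD_NAMES.get(index, f"field_{index}")] = value
    return fields

sample = [
    ("1.3.6.1.4.1.6876.4.50.1.2.8.0", "1ab40eba-c480-4475-91e2-a0cc682fe945"),
    ("1.3.6.1.4.1.6876.4.50.1.2.21.0", "Number of KPIs Breached : 0.0"),
]
print(parse_vrops_trap(sample))
```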
In ESXi Cluster (non-HCI) Rightsizing using vRealize Operations I have described how to use vRealize Operations and the numbers calculated by the Capacity Engine to estimate the number of ESXi hosts which might be moved to other clusters or decommissioned. The corresponding dashboard is available via VMware Code.
In this post, I describe the opposite scenario.
Problem Statement
The question I will answer is: “How can I use vRealize Operations to help me size new vSphere clusters using completely new ESXi hosts I plan to purchase?“
With its Capacity Engine and the “What-If Analysis” scenarios, vRealize Operations provides powerful features to help with infrastructure and workload planning. In case you are not familiar with “What-If”, the following picture shows the supported scenarios.
What we are missing here is a scenario covering workload (Virtual Machine) migrations from existing vSphere clusters to new, planned, not yet existing clusters. Usually, you know what kind of compute hardware you are planning to buy, or at least what choices you have; what you do not know is how many hosts you need to run specific workloads.
Solution
NOTE: I am using the demand model in this use case. The allocation model would be similar to implement.
Using vROps and knowing what type of hardware will be used, we have everything we need to estimate the number of hosts required to migrate all workloads from an “old” vSphere Cluster.
These are the ingredients:
“Recommended Total Capacity (Mhz)” as calculated by the vROps Capacity Engine
“Recommended Total Capacity (GB)” as calculated by the vROps Capacity Engine
total CPU resources (in MHz) provided by the new hardware
total RAM resources (in GB) provided by the new hardware
Now we need to do some simple math:
"Recommended Total Capacity (Mhz)" / "total CPU resources (in MHz) provided by the new hardware"
"Recommended Total Capacity (GB)" / "total RAM resources (in GB) provided by the new hardware"
I use two vROps Super Metrics with the simplest possible formula, a plain number, to represent the resources of the potential new hosts.
In this example, it is a Cisco Blade system with a certain CPU and RAM configuration.
Three more Super Metrics, attached to the Cluster Compute Resource object type, simply calculate the required number of such new hosts from the CPU and RAM perspectives and identify the higher one as the number of required hosts.
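The three host-count Super Metrics boil down to this calculation (a sketch; the capacity and host numbers below are illustrative, not from a real environment):

```python
import math

# Sketch of the host-count Super Metrics: divide the recommended capacity
# of the old cluster by the resources of one new host, round up, and take
# the higher of the CPU and RAM results. All numbers below are examples.

def required_hosts(recommended_mhz: float, recommended_gb: float,
                   host_mhz: float, host_gb: float) -> int:
    hosts_by_cpu = math.ceil(recommended_mhz / host_mhz)
    hosts_by_ram = math.ceil(recommended_gb / host_gb)
    return max(hosts_by_cpu, hosts_by_ram)

# e.g. a cluster needing 180,000 MHz and 1,500 GB, with new blades
# providing 70,000 MHz and 512 GB each:
print(required_hosts(180_000, 1_500, 70_000, 512))  # -> 3
```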
To make it easier to consume I have created a dashboard similar to the rightsizing one.
You can download the Super Metrics, Views, and Dashboard from VMware Code.
As you all know vRealize Operations is a perfect tool to manage and monitor your SDDC and in case of issues vROps creates alerts and informs you as quickly as possible providing many details related to an alert.
Without any additional customization, vRealize Operations displays the alerts in the Alerts tab. Sending alerts via email is the most common way to quickly attract your attention and increase the chances of reacting to the alert quickly.
With the Webhook Notification Plugin, it is possible to integrate vROps with almost any REST API endpoint. Telegram provides an easy-to-use API and allows for sending vROps alerts as messages into a Telegram chat.
Telegram Chat, Bot, and Token
The first step is to prepare Telegram to allow vROps to send messages to a chat. Basically, you will need a chat, a bot, and a token. The process is very well described here.
The actual REST API POST call uses the chat ID and the Token to ingest messages into the chat.
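For reference, the call the Outbound URL represents is Telegram's Bot API `sendMessage` method. A minimal sketch of that request (the token and chat ID are placeholders):

```python
import json
import urllib.request

# Minimal sketch of the REST call behind the webhook: Telegram's Bot API
# sendMessage endpoint. BOT_TOKEN and CHAT_ID are placeholders.
BOT_TOKEN = "123456:ABC-REPLACE-ME"
CHAT_ID = "-1001234567890"

def build_request(text: str) -> urllib.request.Request:
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    body = json.dumps({"chat_id": CHAT_ID, "text": text}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_request("vROps alert: datastore space low")
print(req.full_url)  # this full URL is what goes into the Outbound instance
```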
The next step is the vROps Outbound Webhook Plugin instance that will point to our Telegram chat. For simplicity, the Outbound URL contains the entire URL we need to send messages to Telegram.
In my environment, the instance looks like depicted in the following figure, bot ID and token are truncated for visibility and my data privacy.
You do not have to specify any user or password, as the ingestion uses the token in the URL. When you run the test, you will see an error, but in our case this is OK.
vROps Payload Template
Following the Outbound Plugin instance, we need to specify the details of the REST call and the payload. In my example, I would like to receive datastore space-related alarms in my chat. Details of the alert definition are not described in this post.
The Payload Template is where we define these settings, subdivided into:
Details: Specifies the name of the instance and the Outbound Method to use, in this case, the Webhook Notification Plugin.
Object Content: Here we can add additional information, like metrics and properties to be used in the actual payload. I have added a few additional metrics to help qualify the alarm.
Payload Details: Specifies the actual REST call, like method or content type, and describes the body of the call as required by the receiving instance. In the body itself, we can use the predefined parameters (uppercase) and our additional metrics or properties or related objects (lowercase).
The following two pictures show my Payload Template.
vROps Notification
The last step of the process is to bring everything together, the alert definition, the outbound instance, and the payload template to work together and send the message towards Telegram.
This is where the good old vROps Notifications come into play. A Notification wires everything together and does exactly what the name implies – notifies you in case of an alarm event.
The process of notification creation has four steps:
Notification: Here you give the notification a name and enable or disable it.
Define Criteria: This step gives you a large number of options to exactly specify when the notification should be triggered. In my use case, I want this notification to be triggered only on one specific object type (Datastore) and only for one specific alert definition (tk-vSAN Remaining Space Low).
Set Outbound Method: Here we define the Outbound Method (Webhook Notification Plugin) and the Outbound Instance we created for Telegram.
Select Payload Template: In the last step we define the Payload Template to use, in our use case the template we created for Telegram.
The following pictures show my Notification settings.
The Result
When all steps are finished successfully, any time the specified alarm has been triggered you will receive a message in your Telegram chat, as depicted in the following figure.
As you know vRealize Operations is collecting tons of various metrics. Some of these metrics are so-called “Instanced Metrics” and disabled in the default configuration in newer vROps versions. A list of disabled instanced metrics for e.g. Virtual Machine object type is available here:
If you need any of those metrics, you can enable them in your vRealize Operations Policy.
As you can see in the previous picture, there is an option to specify instances you would like to include or exclude. In my example, I am excluding the CPU (or CPUs) containing “1” in the instanced metric name. Yes, it does not make any sense, it is just an example:-)
Problem statement
In addition or as a replacement for some of the disabled instanced metrics vRealize Operations provides the “Aggregate of all instances” metric, like in this example for Virtual Disk metrics.
The problem now is that in certain situations where you would like to evaluate the instanced metrics to find the maximum, minimum, etc., the aggregated metric may also be taken into the equation, for example in views or super metrics.
Use case
One of my customers described a very interesting and important use case.
“I want to determine the highest average write request size”.
One logical way would be to use vRealize Operations Super Metric and create a formula like this one:
As described in the “Problem statement” this calculation includes the aggregated metric, “VirtualDisk|Aggregate of all Instances”, which leads to a wrong result.
Possible solution
Please be aware that this is ONE possible solution with one drawback that I will explain at the end.
The approach is to exclude the aggregated metric from the formula.
What we cannot do, or at least I do not know how, is exclude a metric based on the instance name.
What we can do is leverage the assumption that the aggregate will usually be greater than any single instance, as it is the sum of all instances. And this is the mentioned drawback: the approach works only when the following assumptions are true:
count of instances is > 1
at least 2 instances have a value > 0 at the occurrence of super metric evaluation
I am working on an improved version of the formula to get rid of the assumptions. For the time being, this is what works, taking the mentioned assumptions into account:
max(${this, attribute=virtualDisk|writeIOSize_latest, where=($value < ${metric=virtualDisk:Aggregate of all instances|writeIOSize_latest})})
This formula evaluates only the instances with values lower than the value of the aggregated metric.
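In other words, the formula behaves like this sketch (sample values are illustrative; the second call shows the drawback when the assumptions are violated):

```python
# Sketch of the Super Metric's where-clause: take the max over all values
# that are strictly lower than the aggregate (the sum of all instances).
# This relies on the stated assumptions: more than one instance, and at
# least two instances with a value > 0, so the aggregate exceeds every
# single instance.

def max_instance_only(values):
    aggregate = sum(values)              # "Aggregate of all instances"
    candidates = [v for v in values if v < aggregate]
    return max(candidates) if candidates else None

print(max_instance_only([16.0, 64.0, 8.0]))  # -> 64.0, the highest instance
print(max_instance_only([0.0, 64.0, 0.0]))   # -> 0.0: the only non-zero
                                             # instance equals the aggregate
                                             # and is wrongly excluded
```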
Outlook
The improved formula will include some if-then-else statements.
When it comes to capacity management in vSphere environments using vRealize Operations, customers frequently ask for guidelines on how to set up vROps to properly manage n+1 and n*2 ESXi clusters.
Just as a short reminder: n+1 in the context of an ESXi cluster means that we are tolerating (and are hopefully prepared for) the failure of exactly one host. If we need to cope with the failure of 50% of all hosts in a cluster, like with two fault domains, we often use the term n*2.
In general we have two options to make vRealize Operations aware of the failure strategy for the ESXi clusters:
the “out-of-the-box” and very easy approach using vSphere HA and Admission Control
the vROps way, almost as easy, using vRealize Operations Policies
vSphere HA and Admission Control
If configured, Admission Control automatically calculates the reserved CPU and Memory failover capacity. In the first example, my cluster is configured to tolerate the failure of one host, which makes it 25% for my 4-host cluster.
vRealize Operations is collecting this information and accordingly calculating the remaining capacity. In the following picture you can see vROps recognizing the configured HA buffer of 25%.
If we now change the Admission Control settings to n*2, in my case two ESXi hosts, vSphere calculates the new required CPU and Memory buffer. We could also set the buffer manually to 50% or whatever value is required.
After a collection cycle, vRealize Operations retrieves the new settings and starts calculating capacity related metrics using the adjusted values for available CPU and Memory capacity.
The “Capacity Remaining” decreases following the new available capacity and the widget shows the new buffer values in %.
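The buffer percentage that vROps picks up from Admission Control is simply tolerated hosts over total hosts, as a sketch:

```python
# Sketch of the Admission Control buffer math that vROps retrieves:
# the reserved failover capacity is tolerated_hosts / total_hosts.

def ha_buffer_pct(tolerated_hosts: int, total_hosts: int) -> float:
    return 100.0 * tolerated_hosts / total_hosts

print(ha_buffer_pct(1, 4))  # n+1 on a 4-host cluster -> 25.0
print(ha_buffer_pct(2, 4))  # n*2 on a 4-host cluster -> 50.0
```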
vRealize Operations Capacity Buffer and Policies
Sometimes the vSphere HA Admission Control is not being used and customers need another solution for their capacity management requirements.
This is where vROps Policies and Capacity Buffer settings help manage vSphere resources.
vRealize Operations applies various settings to groups of objects using vROps Policies. One section of a policy is the Capacity Settings.
Within the Capacity Settings you can define a buffer for CPU, Memory and Disk Space to reduce the available capacity of a vSphere cluster or a group of clusters. You can set the values for both capacity models, Demand and Allocation, separately.
In my example, I have disabled Admission Control in vCenter and set buffers in vROps.
vRealize Operations is now using the new values for available resources to calculate cluster capacity metrics.
Btw. Custom Groups are the vROps way to group similar clusters together and treat all of them the same way.
Without any doubt, configuring vRealize Operations to send log messages to a vRealize Log Insight instance is the best way to collect, parse, and display structured and unstructured log information.
In this post I will explain the major differences between CFAPI and Syslog as the protocol used to forward log messages to a log server like vRealize Log Insight.
The configuration of log forwarding in vRealize Operations is straightforward. Under “Administration” –> “Management” –> “Log Forwarding” you will find all options to quickly configure vRLI as the target for the selected log files.
The following figure shows how to configure vRealize Operations to send all log messages to vRealize Log Insight using the CFAPI protocol via HTTP.
The CFAPI protocol, over HTTP or HTTPS, used by the vRealize Log Insight agent provides additional information used by the vROps Content Pack. The extracted information flows into the various dashboards and alert definitions delivered with the Content Pack. The following picture shows one of the available dashboards populated with the data available when using CFAPI and vRLI.
In case you (for whatever strange reason) cannot use CFAPI, you can configure vROps to use Syslog. It is as simple as selecting Syslog as the protocol option in the configuration page shown in the following picture.
The drawback of using Syslog here is that the additional information parsed by the agent and used by the content pack will no longer be available and you will need to create your own extracted fields in vRLI to parse data from the log messages.
In the next two pictures you can see the empty dashboards and log messages without any vROps-specific fields in Interactive Analytics.
It is important to know that vROps uses Syslog over TCP when configured via the UI as shown in figure 03.
But what if you are forced to use Syslog over UDP?
There is no such option in the UI but since vROps is using the regular vRLI agent, there has to be a way to configure it to use UDP instead of TCP.
The vRLI agent config file explains how to set the corresponding option:
You can just replace
proto = syslog
with
proto = syslog_udp
restart the agent
service liagentd restart
and your vROps nodes start forwarding log messages to your log server using UDP.
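The proto edit can also be scripted. A sketch, assuming the usual Linux location of the agent config file (verify the path on your vROps nodes, and keep in mind this is the same unsupported manual change):

```python
# Sketch: switch the Log Insight agent from syslog (TCP) to syslog_udp by
# rewriting the proto line. The path below is an assumption based on the
# usual Linux agent config location; verify it on your vROps nodes.
CONFIG = "/var/lib/loginsight-agent/liagent.ini"

def switch_to_udp(text: str) -> str:
    # the trailing newline anchors the match, so an existing
    # "proto = syslog_udp" line is left untouched
    return text.replace("proto = syslog\n", "proto = syslog_udp\n")

sample = "[server]\nhostname=loginsight.example.com\nproto = syslog\n"
print(switch_to_udp(sample))
# apply the same transformation to CONFIG on every node,
# then: service liagentd restart
```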
I have set up a fake log server listening on UDP port 514 using netcat:
If you configure the vRLI agent in vROps directly via the config file, please keep in mind:
that you are using a function which is not officially supported by VMware
you will need to make such manual changes on every node
you will need to monitor any changes to that file which can be triggered via the UI or vROps updates