RemcoLengers
ServiceNow Employee

Using metrics, alongside logs, is an important part of building AIOps capabilities. With ServiceNow ITOM Health (AIOps), anomaly detection can play an important role in driving operational efficiencies by reducing MTTD and MTTR.

 

Metrics can come from many sources. If you are using Microsoft Azure as a cloud provider, getting Azure Monitor metrics into ServiceNow in an effective manner is important. This can be accomplished with a new capability that uses the Agent Client Collector in proxy and multi-CI mode to collect metrics from the Azure Batch API. Now 20 metrics for 50 CIs can be collected in one API call, greatly improving the efficiency of the mechanism, so significantly fewer ACC proxy agents are required to collect the Azure Monitor metrics.

 

This is a guide to setting up the Agent Client Collector (ACC-M) for gathering metrics from the Microsoft Azure Batch API.

Currently this feature is available for Azure Virtual Machines, Storage Accounts, Load Balancers, Application Gateways, and Redis.

 

Prerequisites

 

  • ServiceNow instance with ITOM Health or AIOps installed
  • Agent Client Collector Monitoring 3.10.4 or later (there are improvements in later versions)
  • Service Operations Workspace installed
  • MID Server is set up
  • Azure Cloud Discovery has been run (so we know which Azure resources to collect metrics for)
  • Agent Client Collector running on a Linux VM (to follow this guide)

 

Setup

To set up an ACC proxy agent to collect Azure metrics from the Azure Batch API, the following steps need to be completed:

 

  • Set up the MID Server (not included in this guide, but see the Reference section below for some info)
  • Set up Cloud Discovery (not included in this guide, but see the Reference section below for some info)
  • Configure the MID Server for ACC and Metric Intelligence (not included in this guide, but see the Reference section below for some info)
  • Set up ACC on a VM (not included in this guide, but see the Reference section below for some info)
  • Set up the ACC policy for Azure Batch API collection
  • View the results

 

Azure credentials

 

In order for the ACC to read from the Azure Batch API, it needs the right credentials. You may need to get these from your cloud team, or, if you already have Azure Discovery running, they may already be present.

 

Set up a credential of type “Azure Service Principal” under Discovery -> Credentials.

 

RemcoLengers_1-1717255380179.png

 

 

Cloud Provisioning and Governance -> Service Accounts

RemcoLengers_2-1717255380198.png

 

Currently there is no exact specification of the Azure role required for reading metric data from the Batch API; I have used the Reader role.
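If the service principal does not exist yet, a minimal sketch of creating one with the Reader role using the Azure CLI could look like this (the name and subscription-wide scope are only examples; your cloud team may prefer a narrower scope):

az ad sp create-for-rbac \
  --name "sn-acc-azure-metrics" \
  --role "Reader" \
  --scopes "/subscriptions/<subscription-id>"

The output contains the appId (Client ID), password (Secret key), and tenant (Tenant ID) values needed for the ServiceNow credential record.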

  

RemcoLengers_3-1717255380202.png

 

 

Locating ACC Check for Metrics in Azure Monitor via Batch API

 

Review the check that runs: Agent Client Collector -> Check Definitions.

Open the “Azure Metrics Collector” check definition and review the check that runs on the ACC.

 

RemcoLengers_4-1717255481837.png

 

List of Policies

Agent Client Collector -> Policies

 

The column “Multi-CI mode” has been added to the list view. Note how the policy “Azure VM Metrics” is running on the proxy agent “Agent_AccLinuxMetricsClient”.

 

RemcoLengers_5-1717255481851.png

 

Each CI type has its own policy, because the Azure Batch API can only be called for one CI type at a time.

 

Running ACC-M in proxy mode

 

When ACC is running in proxy mode it can use most of the resources of its host VM for processing. This is different from ACC-M running checks for the local VM, where resource consumption must be minimised. See the first comment in the blue box below:

 

RemcoLengers_6-1717255481857.png

 

Assigning the Policy to the Agent

 

To assign the policy to a proxy agent, use “Edit in Sandbox” and assign the ACC of your choice. After a short while the agent should show up under “Agents”.

 

RemcoLengers_7-1717255481871.png

 

Now that the policy is assigned to an ACC that will run it, we will configure which Azure VMs to collect metrics for.

 

RemcoLengers_8-1717255481882.png

 

If we Save, Activate, and Publish the policy, it will start executing. Let's look at the results.

 

Please allow a few minutes for the first results to show up.

 

Service Operations Workspace -> AIOps Dashboard -> Microsoft Azure Monitoring

 

RemcoLengers_15-1717256332099.png

 

Success, Nice!

 

How to drive anomaly detection from metrics is a topic for another blog.

 

The rest of this article contains a lot of detail; if you just wanted to get it running, feel free to stop reading here.

 

Deeper dive

 

High level design

 

The diagram details how the ServiceNow instance, MID Server, ACC and Azure components interact.

 

RemcoLengers_16-1717256727684.png

 

During the configuration, two configuration files are created: a list of resources to collect metrics for, and a list of metrics to collect for those resources. They play an important role; see more details below.

 

RemcoLengers_17-1717256727702.png

 

Azure and Metrics

 

Both Azure and ServiceNow have agents. ServiceNow’s ACC-M agent can collect guest OS metrics. Azure, by default, collects metrics at the VM instance level.

 

Azure Monitor Agent:

https://learn.microsoft.com/nl-nl/azure/azure-monitor/agents/agents-overview

 

 

  • ACC-M: local guest OS metric collection by ACC checks into the ServiceNow platform via the MID Server
  • Azure Monitor Agent: local guest OS metric collection into the Azure Monitor metrics database
  • ACC-M Proxy: ACC reads the metric data from Azure Monitor on a CI-by-CI basis
  • ACC-M Proxy with Batch API support: ACC reads the metric data from Azure Monitor via the Azure Batch API, with a maximum of 50 CIs and 20 metrics per API call

 

 

 


 

Azure Batch API documentation:

https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-05-01-...

Supported metrics index:

https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/metrics-index

 

Note: The Azure documentation does not specify which metrics are supported by the Azure Batch API. You need to call the Azure Batch API (with Postman or curl) for an Azure object and its metrics to see whether it supports Batch API access.
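As a rough sketch, such a probe with curl could look like the following. The region, subscription ID, resource ID, and api-version are placeholders, and the token audience used here is an assumption, so check the Azure Batch API documentation linked above for the exact endpoint and parameters:

# Illustrative probe only; values in <angle brackets> are placeholders
TOKEN=$(az account get-access-token --resource https://metrics.monitor.azure.com --query accessToken -o tsv)
curl -s -X POST \
  "https://<region>.metrics.monitor.azure.com/subscriptions/<subscription-id>/metrics:getBatch?metricnamespace=Microsoft.Compute/virtualMachines&metricnames=Percentage%20CPU&api-version=<api-version>" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"resourceids": ["/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>"]}'

If the call returns metric values, the combination is supported; an error response typically means it is not.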

 

Note 2: This functionality can quite easily be extended to other Azure services, provided they have Batch API support for their metrics. See the configuration files that need to be created for a new service below.

 

Configuration files

 

Two files are constructed by the check and passed to the proxy agent as input for the running check. acc_azure_static_vm_config.json contains the metrics to be collected. The file starting with AzureVMMetrics_.... contains the objects (VMs, for this check) that are dynamically added based on the filter in the policy. See the examples below:

 

RemcoLengers_10-1717256294486.png
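As a purely illustrative sketch (the key names here are assumptions; the authoritative layout is the one shown in the screenshot above), the static metrics file is essentially a metric namespace plus the list of metric names to collect, which could be recreated along these lines:

# Illustrative only: key names are assumptions; metric names must be valid for the namespace
cat > acc_azure_static_vm_config.json <<'EOF'
{
  "namespace": "Microsoft.Compute/virtualMachines",
  "metrics": ["Percentage CPU", "Disk Read Bytes", "Network In Total"]
}
EOF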

 

 

The files below can be changed manually if more, fewer, or different metrics need to be collected, or to support other Azure objects beyond those currently supported out of the box. A copy can be made, and the check can be pointed toward the updated copy if appropriate.

 

RemcoLengers_11-1717256294498.png

 

The file below is dynamically created (and edited slightly here for easier viewing):

 

RemcoLengers_12-1717256294506.png

 

The check instance is where the static config file and the credentials are attached to the policy. Check parameters:

RemcoLengers_13-1717256294514.png

 

Check Secure Parameters:

RemcoLengers_14-1717256294516.png

 

 

Troubleshooting

 

Debugging and well-known failure modes

 

  1. You can enable the azure-metrics-collector check log by updating the check instance configuration inside each Azure policy.
    The command should be updated to use the following flag:

    azure-metrics-collector --nolog=false

 

  2. The check log is written directly to the agent log; you can collect it from the instance by grabbing the agent logs.

    RemcoLengers_0-1717323857558.png

     

Possible points of failure:

  1. Wrong Azure credentials (fix: obtain credentials that allow reading the metrics)
  2. No CIs in the CMDB (fix: set up Discovery)
  3. The multi-CI mode configuration file script is failing
  4. The metrics configuration file has a corrupted format or contains incorrect metric definitions (metrics not supported by the Azure Batch API)
  5. Your credentials cover only some of the resources (fix: duplicate the policy, assign different credentials per policy, and filter the resources that are relevant for the provided credentials)

 

Important things to know

  • CIs need to be in the CMDB, otherwise metric collection will not occur. Discovery needs to be run regularly (Cloud Discovery plus event-based discovery, or the Service Graph connector).
  • Not all resources in Azure support the Batch API. This is currently not documented by Microsoft (find out by trial and error).
  • In ACC proxy mode the assumption is that ACC can use all resources on the host.
  • One Batch API call returns multiple metrics (max 20) for multiple CIs (max 50), but all CIs need to be of the same type (that is how the Azure Batch API works).
  • Azure allows an API request for only one Azure metric location. If the CIs are in 3 locations, that will result in 3 Batch API queries (see the worked example after this list).
  • Subscriptions are not relevant.
  • Credentials are needed for the initial setup (Tenant ID, Client ID, Secret key).
  • A MID Server is required as it plays an important role in the Metric Intelligence data pipeline.
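For example (with purely illustrative numbers): collecting up to 20 metrics for 120 VMs in a single region takes ceil(120/50) = 3 Batch API calls per collection interval; if the same 120 VMs are spread evenly over two regions (60 each), it takes ceil(60/50) = 2 calls per region, so 4 calls in total.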

 

Running in local mode

 

It is possible to run azure-metrics-collector in local mode for debugging purposes when you log into the Linux VM running the ACC checks.

 

Download the two configuration files from the ServiceNow instance (Agent Client Collector -> Configuration Files) and put them in a directory named config-files; this directory needs to be created. Sample content of these files can be found earlier in this article.

 

The screenshot below shows how to run azure-metrics-collector in local mode.

 

RemcoLengers_18-1717256909266.png

 

 

Content of the run.sh file for easy copy and paste

 

#!/bin/bash
# In local mode (--local) the credentials are read from these environment variables
export AZURE_TENANT_ID='xxxxxxxx-9979-491e-8683-d8ced0850bad'
export AZURE_CLIENT_ID='xxxxxxxx-244c-4615-b73b-25967c0ded29'
export AZURE_CLIENT_SECRET='xxxxxxxxxxxxxxxxxxxxxxxxxxx'
# -l log level, -c static metrics config, -r resource list, -i collection interval (s), -w sliding window (s)
/var/cache/servicenow/agent-client-collector/monitoring-plugin-azure-metrics-collector/bin/azure-metrics-collector --local -l info -c acc_azure_static_vm_config.json -r AzureVMMetrics_RL.json -i 60 -w 120

 

An example run with log level “info” can be seen below

 

RemcoLengers_19-1717256909278.png

 

 

Reference

 

Documentation links for setting up the MID Server, Cloud Discovery, ACC, and Metric Intelligence.

 

I recommend setting up a Windows MID Server (2 CPU, 4 GB) and a Linux ACC client (1 CPU, 1 GB) for functional testing.

 

Setup Windows MID server:
https://docs.servicenow.com/bundle/washingtondc-servicenow-platform/page/product/mid-server/concept/...

 

Create a Windows service account with "Log on as Service":

https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0867669

 

Setup Agent Client Collector

 

https://docs.servicenow.com/bundle/washingtondc-it-operations-management/page/product/agent-client-c...

https://docs.servicenow.com/bundle/washingtondc-it-operations-management/page/product/agent-client-c...

https://docs.servicenow.com/bundle/washingtondc-it-operations-management/page/product/agent-client-c...

https://docs.servicenow.com/bundle/washingtondc-it-operations-management/page/product/event-manageme...

 

 Do not forget to open the inbound firewall port on the Windows MID server. For example:

 

RemcoLengers_0-1717255285569.png

 

It is sufficient to run ACC in basic discovery mode. If the ACC has reported itself correctly, you can move on; see the Agent Health Dashboard.

 

A performance test has been conducted with the following results.

 

Host OS - Linux
Host Spec - Ubuntu 20, 8 CPUs, 16 GB RAM
Test Duration - 24 hrs
Policy Name - Azure VM Metrics (Linux Proxy Agent)
Check Name - Azure Metrics Collector
Number of VMs - 10K
Number of checks - 1
Number of metrics/minute - 350K
Network utilization - Tx (Agent -> MID) 16 MB/s
Network utilization - Rx (Agent <- MID) 18 MB/s
Memory consumption - ~180 MB
CPU of all checks - 0%
Process CPU utilization - ~70%
Host CPU utilization - ~95%

 

Check help page

 

./azure-metrics-collector  -h
A tool to collect Azure metrics and forward them to the acc agent

Usage:

  azure-metrics-collector [flags]

Flags:

  -g, --agg string     The list of aggregation types (comma separated) to retrieve. Examples: average,minimum,maximum,count,total (default "average")
  -h, --help           help for azure-metrics-collector
  -i, --interval int   Interval between metric collections in seconds (default 60)
  -l, --ll string      Provide log level. Possible values: debug, info, warn, error, fatal, trace (default "info")
      --local          Local mode. If true, Credentials will be collected from environment variables. If false, credentials will be collected from stdin
  -c, --mc string      Name of the config file contains namespace and list of metrics (default "acc_azure_static_config.json")
  -p, --mp string      metric prefix to be added to the metric name
  -m, --mpr int        Max number of metrics per request (Azure Default: 20) (default 20)
      --nolog          Skip logging to ACC. If true, the logs will not be sent to ACC. If false, the logs will be sent to ACC, default is true (default true)
  -n, --npr int        Number of parallel requests to Azure API (default 100)
  -r, --rc string      Name of the config file with the list of resources to collect metrics for (default "acc_azure_check_config.json")
  -s, --sci int        Sync resources config file interval in seconds (default 60)
      --scv            Skip certificate validation. If true, the certificate validation will be skipped. If false, the certificate validation will be enabled
  -w, --sw int         Sliding window in seconds to collect metrics from Azure Monitor (default 1)
  -t, --timeout int    Max number of seconds to wait for a response from Azure API (default 30)

 
