Disclaimer: The examples herein come with no support or warranty, implied or explicit. Caveat emptor!
Lightstep from ServiceNow offers unparalleled OpenTelemetry ingestion capability and enables SRE teams to gain deep insights into modern cloud-native applications. By integrating it with ServiceNow ITOM via the Service Graph Connector for Observability - Lightstep (currently available in the ServiceNow Store Innovation Lab), those insights can be turned into automated actions that deflect issues before any negative impact is felt. Here's an example of how.
ServiceNow Service Graph Connector
The first step in setting up my environment was to install the ServiceNow Service Graph Connector for Observability - Lightstep from the System Applications->All page in my ServiceNow instance.
Once the SG Connector was installed, I navigated to the setup page via SG Connector for Observability Lightstep->Setup. By following the guided setup, I imported projects and services from Lightstep into my CMDB. The connector creates a parent Application Service for each Lightstep "Project" and a child Application Service for each Lightstep "Service" inside those Projects. It prepends "LS-" to the names of these objects to avoid collisions with Application Services discovered via other means.
The final section in the connector setup configures a webhook destination in your Lightstep account which will be used to send alarm events into ServiceNow.
NOTE: You can also create a webhook destination directly from the Lightstep UI.
Lightstep Webhook
The foundational component for sending Lightstep alarms into ServiceNow is a webhook notification destination. The connector guided setup will create that for you or you can create it yourself; it is found under "Notification destinations" on the "Alerts" page in the Lightstep UI. The webhook URL format is as follows:
https://<user>:<password>@<instance>.service-now.com/api/sn_em_connector/em/inbound_event?source=SG-Lightstep
NOTE: While Lightstep does provide a test capability for a notification destination, the test payload may not include the mandatory key/value pairs required by the Event Management push connector, in which case no event record will be created.
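If you want to sanity-check the endpoint and credentials before wiring up Lightstep, a quick manual POST will do. The sketch below is illustrative only: the instance name, user, and payload fields are placeholders I made up, and the real body Lightstep sends follows its own webhook alert format, which the connector's transform then maps onto Event fields.

import requests
from requests.auth import HTTPBasicAuth

# Placeholders -- substitute your own instance and integration user.
instance = "your-instance"
user = "lightstep.integration"
password = "********"

# Same endpoint as the webhook URL; basic auth here is equivalent to
# embedding <user>:<password> in the URL itself.
url = (f"https://{instance}.service-now.com"
       "/api/sn_em_connector/em/inbound_event?source=SG-Lightstep")

# Hypothetical body loosely modeled on a Lightstep alert notification;
# the fields the SG-Lightstep transform actually expects come from
# Lightstep's webhook payload, so treat these as stand-ins.
payload = {
    "condition_name": "iOS api/get-catalog latency",
    "status": "critical",
    "project": "demo-project",
    "service": "iOS",
    "description": "Latency threshold exceeded",
}

resp = requests.post(url, json=payload,
                     auth=HTTPBasicAuth(user, password), timeout=30)
print(resp.status_code, resp.text)

A 2xx response at least confirms the URL and credentials are good; whether an Event record is created and fully populated still depends on the payload matching what the transform expects (the same caveat that applies to the built-in test).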
Lightstep Stream
With a webhook destination in place, my next step was to define a stream. In my case I selected a service of "iOS" and an operation of "api/get-catalog".
Lightstep Alert
I defined my test alert from within the view of my new stream by clicking the "Alerts" button to expand the alerts pane, then clicking "Create an alert".
I chose a suitable signal and threshold for the stream I had created, then selected the destination I'd created for my ServiceNow instance. After saving the new alert, I started seeing corresponding events coming into my ServiceNow Event table.
NOTE: Setting an artificially low or high threshold can be a useful way to trigger a new alert and get some events sent into ServiceNow.
Container Workload
As this exercise was a proof of concept and was using an auto-generated series of Lightstep data, I had to stand up a dummy service which could be controlled by my automation in order to show that actual changes were occurring in response to my alerts. To do this, I created a Git repo to house my fictional "app code" and populated it with the following:
"ios.yaml", a Kubernetes deployment which creates a replicaset using the image of my choice (in my case I used nginx) and tags it with the name "ios" in order to match the "iOS" service on the Lightstep side.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ios
  labels:
    svc: ios
spec:
  selector:
    matchLabels:
      svc: ios
  replicas: 1
  template:
    metadata:
      labels:
        svc: ios
    spec:
      containers:
      - image: nginx
        name: ios
        ports:
        - containerPort: 8080
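Applying this manifest (for example, kubectl apply -n demo-project -f ios.yaml, assuming the same "demo-project" namespace the scaling script below targets) creates the deployment whose replica count the automation will later adjust.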
"scale.py", a Python script which will be invoked automatically to adjust the replica count for the "ios" deployment. This piece will be highly specific based on the actual application and infrastructure environment, but I'll include the code here for completeness.
#!/bin/python3
import json
import subprocess
import sys

# initialize
incr=1
newReplicas=0

# parse args
serviceName=sys.argv[1]
upDown=sys.argv[2]
if (len(sys.argv)>3):
    incr=sys.argv[3]

# update Kubeconfig
updateKubeconfig=subprocess.run(["aws","--region","us-east-1","eks","update-kubeconfig","--name","mycluster"])

# retrieve JSON for existing object
getResource=subprocess.run(["kubectl","get","-n","demo-project","-f",serviceName+".yaml","-o","json"],stdout=subprocess.PIPE,encoding='UTF-8')
serviceJson=json.loads(getResource.stdout)

if (upDown=="up"):
    newReplicas=serviceJson["spec"]["replicas"]+int(incr)
if (upDown=="down"):
    newReplicas=serviceJson["spec"]["replicas"]-int(incr)
if (newReplicas<0):
    newReplicas=1

kubeScale=subprocess.run(["kubectl","scale","-n","demo-project","--replicas",str(newReplicas),"-f",serviceName+".yaml"],stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding='UTF-8')
print(kubeScale.stdout)
print(kubeScale.stderr)
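For reference, the script is driven purely by its positional arguments: an invocation like python scale.py ios up 2 reads the current replica count of the "ios" deployment from the cluster and scales it up by two, while python scale.py ios down scales it back down by the default increment of one.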
Since I planned to run this scaling script from my Windows MID Server, I installed the requisite packages there as needed (btw, Chocolatey is _awesome_). In my case I needed the AWS CLI, the Kubernetes CLI ("kubectl"), and Python.
All my infrastructure for this endeavor resided in my AWS account, so I used my MID server's EC2 instance profile to give it the needed permissions to enumerate and talk to my EKS cluster. Some helpful info on giving a MID server permissions in an EKS cluster can be found in this article:
ServiceNow Flow/Subflows/Action
With all the required external pieces in place, I turned my attention back to my ServiceNow instance to create my automated remediation flows. First, I created a custom action called "Scale K8s Resource". This action takes four inputs: a Lightstep project name, a resource value as populated into an Alert record, a replica increment number, and an "up/down" keyword to indicate which way to scale.
Then I added a PowerShell step (since my MID Server runs Windows) which invokes my scaling script and passes it the required inputs.
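In effect, the PowerShell step is a thin wrapper around the script above: the action's resource value selects which manifest to operate on (here "ios", matching ios.yaml), and the up/down keyword and increment are passed through as the script's second and third positional arguments. The exact command line will depend on where Python and the script live on your MID Server.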
After building the custom action, I wrapped it in a "Scale Up" remediation subflow. Since this subflow would be called from an alert management rule, I copied one of the default subflows included with the Alert Management Content application, which saved some time in populating the standard subflow inputs. Working on the copied subflow, I removed the existing steps and added a log step followed by a call to my custom "Scale K8s Resource" action, passing in the applicable input values.
To perform the inverse operation, scaling down, I copied the scale up subflow and modified the "Scale K8s Resource" step to pass an up/down value of "down".
Because alert management rules will not fire when an alert changes to "Closed", I needed a different approach to trigger the "Scale Down" subflow once the alert from Lightstep returned to an "OK" state. I chose to use the "Trigger" functionality included with Flows, creating a flow that fires when a matching alert record transitions to the Closed state and then invokes the scale down subflow.
ServiceNow Alert Management Rule
With my automation content created, what remained was to build an alert management rule which invoked the scale up subflow when a matching alert was opened (or re-opened). Navigating to Event Management->Rules->Alert Management Rules, I created a new record and set filter criteria ensuring that only Lightstep alerts of the type I had defined would trigger the rule.
On the "Actions" tab of the new alert management rule, I add the scale up subflow to the "Remediation Subflows" section and set it for automatic execution.
Conclusion/Result
With that step completed, I was able to verify that the "ios" deployment increased its replica count whenever the sample Lightstep data surpassed the threshold I'd defined in my alert. When the alert reverted to an "OK" state, the parent flow I had built to watch for matching alerts being closed invoked the scale down subflow, and the deployment returned to its original replica count.