Introduction
ServiceNow Support is always seeking to be more proactive in dealing with customer-impacting issues. This project uses two ServiceNow products, ServiceNow® Predictive Intelligence™ and ServiceNow® Event Management, together with the ServiceNow® Platform, to predict which customers' experience with ServiceNow is heading in the wrong direction. The premise is that by applying machine learning to the body of performance-related events, relative to known customer escalations, we can predict other customers in danger of escalation and reach out to them before smoke turns to fire.
- Uses a ServiceNow Predictive Intelligence™ Classification Solution with XGBoost.
- Supervised model using alert and escalation data as the training set.
- Patent Awarded: https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/11829233
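For readers unfamiliar with this kind of model, here is a minimal sketch of a binary classifier over alert-trend features, written against the open-source XGBoost library rather than the Predictive Intelligence UI. The feature names, data, and label rule are synthetic placeholders, not our production feature set.

```python
# Minimal, illustrative sketch of a supervised binary classifier like the one
# described above. The three feature columns and the synthetic label rule are
# assumptions for the example only.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Each row represents one instance-day of alert trend features,
# e.g., alert_count, median_alert_duration, alert_type_variance.
X = rng.random((1000, 3))
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)   # 1 = "Proposed", 0 = "Not Proposed"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print("Holdout accuracy:", round(model.score(X_test, y_test), 3))
```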
System Diagram
Terminology
Model: In Machine Learning, a Model refers generally to the idea of using mathematics to model desired real-world behavior. Models detect the hidden patterns within data.
Solution: In Machine Learning, a Solution is the result of training a model. A given model might be trained multiple times with different combinations of training data or feature criteria, resulting in multiple Solutions.
Classification: A supervised machine learning model type that groups phenomena into mutually exclusive classes. In our model we use binary classification: we classify each entry as either "Proposed" or "Not Proposed" for each day. At the time we created our model, ServiceNow offered two variants for Classification, Logistic Regression and XGBoost.
Label: In supervised learning, a label is a human designation of some desired outcome in a body of data. Machines use labels to learn how to organize phenomena in that data into groups. Our SMEs reviewed 28 days' worth of Event Management trend data for thousands of instances to decide whether the trends seemed to warrant being proposed for an escalation. We used a custom field, "Escalation Decision", as our label and gave it either the value "Proposed" or "Not Proposed".
Training Set: In Supervised Learning, a Training Set is a group of labeled data that is used to generate a ML Solution. The solution is then used to output predictions given new input from unlabeled data.
Feature: A feature is a variable of structured data that can act as input to a machine learning model. It is a characteristic of a phenomenon. Examples of our features include the count of alerts, median duration of alerts, and variance of alert type.
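As a rough illustration of what those example features look like in practice, the sketch below derives them from a flat table of alerts with pandas. The column names and sample data are assumptions for the example, not our actual schema.

```python
# Illustrative only: derive per-instance features (alert count, median alert
# duration, variance of alert type) from a small, made-up alert table.
import pandas as pd

alerts = pd.DataFrame({
    "instance": ["a", "a", "a", "b", "b"],
    "duration_min": [5, 12, 7, 30, 45],
    "alert_type_code": [1, 1, 3, 2, 2],
})

features = alerts.groupby("instance").agg(
    alert_count=("duration_min", "size"),
    median_duration=("duration_min", "median"),
    alert_type_variance=("alert_type_code", "var"),
)
print(features)
```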
Project Phases
1: POC (Proof of Concept)
The goal was to make sure we had enough empirical data to pass minimal precision and coverage thresholds against a small labeled data set.
2: POV (Proof of Value)
Was it better than what we already had?
To answer that question, we compared the results from our new solution against an existing solution to verify that it could give us (1) net new predictions and (2) earlier predictions for duplicates. Our existing solution also used Alert trending data to proactively identify escalations, but it was more reactive and did not use Machine Learning.
One major finding from the POV was that, to create a truly predictive model, we would need to add trending features such as linear regression, variance, and exponential decay, so our ML model had visibility into how the data was trending over time.
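As a sketch of what such trending features might look like, the snippet below computes a linear regression slope, variance, and an exponentially decayed average over a 28-day series of daily alert counts. The window length, decay rate, and sample values are illustrative assumptions.

```python
# Illustrative trending features over a 28-day alert count series for one instance.
import numpy as np

daily_alert_counts = np.array([3, 4, 2, 5, 6, 5, 7, 8, 6, 9, 11, 10, 12, 14,
                               13, 15, 14, 16, 18, 17, 19, 21, 20, 22, 24, 23, 25, 27])
days = np.arange(len(daily_alert_counts))

slope = np.polyfit(days, daily_alert_counts, 1)[0]        # linear regression trend
variance = daily_alert_counts.var()                        # spread of the series
decay_weights = np.exp(-0.1 * days[::-1])                  # recent days weighted more heavily
decayed_average = np.average(daily_alert_counts, weights=decay_weights)

print(f"slope={slope:.2f}, variance={variance:.2f}, decayed_average={decayed_average:.2f}")
```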
3: MVP (Minimal Viable Product) - Automating workflows with ServiceNow
Up until this point nothing was automated; it was all spreadsheets, Python, and manually executed scripts. So our next step was to hook up all the systems and use the ServiceNow platform to completely automate the process. This was a highly valuable and rapid exercise; we hit all our milestones early thanks to the rapid development process of the ServiceNow® platform.
4: Testing and Revising the Product
Feature Selection Process
With what we had learned during the previous stages, we set off to build the perfect Supervised Solution.
- We started a completely new data collection process to include the new features from the POV and build a larger data set. Then we had our domain experts do massive amounts of manual data labeling to build a Training Set.
- Next, we ranked hundreds of possible features so that we could down-select to just the right combination that would give us the best predictions. We fed our newly labeled data set into supervised learning models (Random Forest, logit, and Decision Matrix) to rank salient features (see the sketch after this list).
- We then tried a few dozen combinations of the highest ranked features to build different Solutions in ServiceNow. We used the Solution Statistics from ServiceNow® Predictive Intelligence™ to quickly narrow this down to only the top performing Solutions.
- Next we compared the detailed outputs of these Solutions by running our Training Set through ServiceNow's MLSolutionFactory API to see the detailed prediction score for each item in the Training Set. This is described in more detail in the "Solution Selection Analysis" section below.
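The sketch below shows the general shape of that feature-ranking step, using scikit-learn's random forest and logistic regression models as the rankers. The synthetic data, feature names, and model choices are assumptions for illustration; in practice X holds the candidate features and y holds the "Proposed" / "Not Proposed" labels.

```python
# Illustrative feature ranking: fit simple supervised models on the labeled set
# and rank candidate features by importance. Data here is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_names = [f"feature_{i}" for i in range(10)]
X = rng.random((500, 10))
y = (2 * X[:, 2] + X[:, 7] > 1.5).astype(int)   # only two features actually matter here

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
logit = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)

ranking = pd.DataFrame({
    "feature": feature_names,
    "rf_importance": rf.feature_importances_,
    "logit_abs_coef": np.abs(logit.coef_[0]),
}).sort_values("rf_importance", ascending=False)
print(ranking.head())
```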
5: Product Go-Live
Finally we launched the product and started taking escalations. We immediately saw the value of the new model. We let our new and old models run in parallel for a few months and, in the end, we found that each model provided enough unique value to keep them both.
Solution Selection Analysis
So how did we evaluate which Solution was giving us the best results? We are a small team of engineers, and reviewing nominated instances can be extremely time-consuming. So, in terms of model output, we knew that minimizing false positives was most important; having engineers spend significant time 'spinning their wheels' would be a huge waste of resources. The ServiceNow® Predictive Intelligence™ tool provides Solution Statistics so that you can automatically evaluate the quality of your models.
- Precision told us, out of those predicted "Proposed", how many were correct predictions. High Precision means low False Positives. It is a measure of exactness.
- Recall told us, out of the total that were labeled "Proposed", how many were predicted "Proposed". High Recall means low False Negatives – in other words, fewer misses. It is a measure of completeness.
- Coverage told us, out of all the records in the training set, for how many we could make some type of confident prediction – either "Proposed" or "Not Proposed". High Coverage means the model was able to "cover" (make decisions about) more of the dataset. However, it doesn't tell us whether those decisions were accurate.
Since reducing false positives was so important to us, Precision was a much more important factor than Recall or Coverage. These estimated solution statistics let us quickly eliminate some of our early models. For models that looked good, we then used ServiceNow's MLSolutionFactory API to get the prediction results for every entry in our training set. We used these results to perform a deeper analysis, selecting the models that gave us a very low number of false positives without missing too many obvious escalations.
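For readers who want a concrete sense of the three statistics above, here is a small worked example computed from a model's class probabilities. The confidence threshold, the sample numbers, and the exact way "not confident" records are counted are illustrative assumptions; Predictive Intelligence reports these statistics for you.

```python
# Worked example of precision, recall, and coverage from class probabilities.
import numpy as np

labels = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])                     # 1 = "Proposed"
proba  = np.array([0.95, 0.80, 0.40, 0.10, 0.20, 0.55, 0.05, 0.90, 0.85, 0.15])

threshold = 0.75                                     # only predictions this confident count
confident = (proba >= threshold) | (proba <= 1 - threshold)
pred = (proba >= threshold).astype(int)

tp = np.sum(confident & (pred == 1) & (labels == 1))
fp = np.sum(confident & (pred == 1) & (labels == 0))
fn = np.sum(confident & (pred == 0) & (labels == 1)) + np.sum(~confident & (labels == 1))

precision = tp / (tp + fp)          # correct among predicted "Proposed"
recall = tp / (tp + fn)             # found among actual "Proposed"
coverage = confident.mean()         # share of records with a confident prediction

print(f"precision={precision:.2f}, recall={recall:.2f}, coverage={coverage:.2f}")
```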
In 2024 we retrained the model. Since our team had grown, we decided to optimize the model for Recall, while still trying to maintain a 10% or lower false positive rate. This has served us well so far, and the model is providing more opportunities to engage customers proactively.
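A rough sketch of that tradeoff is below: sweep a decision threshold and keep the one that maximizes Recall while holding the false positive rate at or below 10%. The synthetic scores and the thresholding approach are illustrative assumptions, not how Predictive Intelligence is configured internally.

```python
# Illustrative threshold sweep: maximize recall subject to FPR <= 10%.
import numpy as np

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=2000)                                   # synthetic ground truth
scores = np.clip(0.3 * labels + rng.normal(0.4, 0.25, size=2000), 0, 1)  # synthetic model scores

best = None
for threshold in np.linspace(0.05, 0.95, 19):
    pred = scores >= threshold
    fpr = np.sum(pred & (labels == 0)) / max(np.sum(labels == 0), 1)
    recall = np.sum(pred & (labels == 1)) / max(np.sum(labels == 1), 1)
    if fpr <= 0.10 and (best is None or recall > best[1]):
        best = (round(threshold, 2), round(recall, 3), round(fpr, 3))

print("(threshold, recall, fpr) =", best)
```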
One issue that I ran into both times I trained the model is that you must not give the trainer what it perceives as invalid data. In my original model (from 2020), it seemed that a String or Decimal field with the value "0" would break the trainer. I got around this by adding infinitesimally small amounts to each number; for example, 0 might become 0.000023. However, in the second model this workaround was unsuccessful. Training would fail with the message "Error while training solution", and when you click the "Show training progress" link on the failed ML Solution record you will see "NSE0016:Failed to execute component Preprocessor : NCE0079:Input data is null or empty, logging details: : no thrown error".
It seems that a newer version of ServiceNow's Predictive Intelligence upgrades the solution type from "Classification" to "Workflow Classification", and when that happened, my model could no longer be trained. When using "Workflow Classification" there is a feature during Preprocessing that removes any columns with highly distinct values. My original workaround of adding infinitesimally small random numbers actually caused the failure, because it resulted in 100% of my rows being distinct. When I reverted to the previous solution version, the Trainer ran successfully. This can be done by setting the property "glide.platform_ml.api.enable_workflow_classification" to false. However, I didn't want to stay stuck on the old solution type, so I turned off the mechanism that was adding the small numbers to all my values, and the newer model started working, even with the value 0, regardless of whether the fields were integer, string, or floating point data types. Hopefully this helps if you run into the same issue. This is documented in KB1953573.
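If you suspect you are hitting the same preprocessing behavior, a quick sanity check is to look at the distinct-value ratio of each training column before training. The tiny data frame and the 0.99 cutoff below are illustrative assumptions, not the exact rule the Preprocessor applies.

```python
# Diagnostic sketch: flag training columns whose values are almost entirely
# distinct, since those are candidates for being dropped during preprocessing.
import pandas as pd

training = pd.DataFrame({
    "alert_count": [0, 3, 5, 0, 2],
    "jittered_count": [0.000023, 3.000017, 5.000041, 0.000009, 2.000033],  # old "+ tiny number" workaround
})

distinct_ratio = training.nunique() / len(training)
print(distinct_ratio)
print("Columns at risk:", list(distinct_ratio[distinct_ratio > 0.99].index))
```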
Conclusion
Ultimately, we landed on a binary classification model using 19 features, leveraging the XGBoost gradient boosting library. This gave us the accuracy we needed without wasting our engineers' time, yielding only a 3% false positive rate.
Since going live with our predictive model, we have engaged hundreds of customers per year and moved from only 11% proactive engagements to 68% of our engagements now being proactive!
I'm here from an excellent session at K24, and I have a similar requirement I was trying to solve with this approach. I would appreciate your guidance 🌸
I'm currently working on implementing a feature for the Alert records that will allow for the classification of noise and non-noise alerts. I plan to achieve this by creating a custom field (Choice or true/false) on Alert. For example, I identified 10K records that are real noise and marked those alerts as noise. Then I sent them to the ML classification model as labeled data.
My main concern is whether the SN ML model will be capable of identifying a new alert as noise if it matches the patterns from the trained 10k alerts, all of which are classified as noise.
Alternatively, I'm open to exploring other Predictive Intelligence solutions that may be more suitable for this scenario. Additionally, I'm interested in leveraging advanced ML parameters, and based on my research, I believe that XGBoost would be well-suited for this kind of precision-focused classification.
@Vivek Verma thanks for the question and I'm glad you enjoyed the K24 session! I think it makes sense to use the strategy you have outlined to identify which alerts are noise. It sounds like you've got a high-fidelity training set, so the key will be in discovering which Features you can feed into the model so it can pick out the pattern of what makes an alert noise or not. I've done some work on similar projects where we were trying to reduce noise from alerts. A while back we had an alert that was creating a lot of noise with the Edge Encryption product. There were a number of redundant components and we were alerting if any single component was down for X amount of time. We made a decision based on domain knowledge that the alert was noise unless at least 3/4 of the components were down for X data points in a row (a rough sketch of that rule follows the Q&A below).
So if I abstract that thought process it might look like:
Q. What is our goal?
A. We want less noise and more signal
Q. What empirical data distinguishes noise from signal?
A. The conditions of the alert threshold
Q. What are the conditions of the alert threshold (i.e. how do we know when impact is serious)?
A. Number of downed components. Amount of time components have been down.
Q. How can we change the conditions of our alert threshold so that we get higher fidelity signal?
A. etc...
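To make the Edge Encryption example concrete, here is a rough sketch of the "at least 3 of 4 components down for N consecutive data points" rule. The data layout and the value of N are assumptions for illustration.

```python
# Illustrative noise rule: treat the alert as signal only if at least 3 of 4
# redundant components have been down for N consecutive data points.
from typing import List

def is_signal(component_down_history: List[List[bool]], consecutive_points: int = 3) -> bool:
    """component_down_history[t][c] is True if component c was down at data point t."""
    streak = 0
    for point in component_down_history:
        if sum(point) >= 3:            # at least 3 of the 4 components down
            streak += 1
            if streak >= consecutive_points:
                return True
        else:
            streak = 0
    return False

history = [
    [True, True, False, False],
    [True, True, True, False],
    [True, True, True, True],
    [True, True, True, False],
]
print(is_signal(history))   # True: 3+ components down for 3 points in a row
```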
I think you will want your ML model to do something similar. So I think the process would be something like:
1. For your labeled data set, gather all the values of the alert threshold conditions to become potential Features
2. (optional step) rank the Features using your Labeled set with a feature attribution method.
3. Pick some of the highest ranking combinations of features and test those against your labeled set to see which Solution has the best Estimated Solution Statistics.
4. (optional step) Go one more level and output all 10,000 labeled alerts versus the Solution prediction with ServiceNow's MLSolutionFactory API (a rough sketch of this comparison follows this list).
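For step 4, a sketch of what that per-record comparison might look like is below. The data frame is a stand-in for whatever you join the prediction output against; the column names are assumptions.

```python
# Illustrative error analysis: line up each labeled alert with the Solution's
# prediction and pull out the disagreements for manual review.
import pandas as pd

results = pd.DataFrame({
    "alert_sys_id": ["a1", "a2", "a3", "a4"],
    "label":        ["noise", "noise", "signal", "noise"],
    "prediction":   ["noise", "signal", "signal", "noise"],
    "confidence":   [0.97, 0.61, 0.88, 0.92],
})

mismatches = results[results["label"] != results["prediction"]]
print(mismatches.sort_values("confidence", ascending=False))
```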
Please 👍 if Helpful. Thanks!
"Simplicity does not precede complexity, but follows it" - Alan Perlis
I think an important assumption in my previous comment is that your current alert design is already collecting the necessary measurements in the alert threshold conditions to make a determination between noise and signal. In other words, if your decision about which alerts were noise or not cannot in any way be derived from the features fed into your model, then the model won't be successful.
Certainly, and I'm contemplating adding an extra step to the solution, which would be:
1. Enhancing the Alert record by including alert tags and mapping more significant attributes.
@Vivek Verma Oh my, I just realized I missed the crux of your original question. Sorry! You said that all 10,000 of your alerts were marked noise. In that case, I think there is a problem with your population. My understanding is that the training set population should closely resemble the real world population but also have enough examples of each class for the trainer to understand what each class looks like. You'll need to add some examples of valid alert signal to your population, so the model can differentiate between your two classes - noise and signal. If you only train the model on one class it will only predict one class.
In our case, we had a population problem as well. We had too few real alert escalations in our population. When we grabbed a large enough sample, the positive class (i.e., likely escalations) was less than 1%. So, we artificially increased the number to 10%. That way the trainer had enough of each class to make good predictions. By artificially I mean we had a group of engineers look at randomized examples until we had identified enough examples in the positive class.
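The sketch below shows the mechanics of rebalancing a training set so the positive class makes up roughly 10%. Note that in our case we reached 10% by having engineers label more real positives, not by duplicating rows; the resample call here is only a stand-in to illustrate the target ratio, and the counts are synthetic.

```python
# Illustrative rebalancing: raise the positive class to ~10% of the training set.
import pandas as pd
from sklearn.utils import resample

data = pd.DataFrame({"label": [1] * 10 + [0] * 1990})    # <1% positive, synthetic

positives = data[data["label"] == 1]
negatives = data[data["label"] == 0]
target_positive_count = len(negatives) // 9              # ~10% of the final set

positives_boosted = resample(positives, replace=True,
                             n_samples=target_positive_count, random_state=0)
balanced = pd.concat([negatives, positives_boosted])
print(balanced["label"].value_counts(normalize=True).round(3))
```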
I know it's way after the fact, but maybe someone who reads this later will benefit.