Lener Pacania1
ServiceNow Employee

You’ve installed the Predictive Intelligence/Task Intelligence plugins on your instance and you’ve trained your first model.  But something looks wrong: your precision is low, or testing throws “unable to predict”.

Worst case, you can’t even train the model and you get the super helpful message to “Ask Support to use log key ######”.   You scratch your head and think you configured something wrong - let me stop you right there.

 

Updated June 2025 - I have added a PDF of the Knowledge Lab that I taught at K24 with some techniques to address data quality issues.

 

It’s not you, it’s probably your data.

 


 

I’ve compiled a data quality checklist to complete before you configure your first Predictive/Task Intelligence model.  Taking these steps should help you avoid the common data quality issues that can cause your Predictive/Task Intelligence models to underperform (low precision, recall, or coverage) or fail to train.

 

Data Quality Checklist:

 

Do you have enough data? 

The default minimum training size is 10,000 records for Predictive & Task Intelligence.  You can lower these minimums in sys_properties (for example, by changing glide.platform_ml.api.csv_min_line); however, that’s almost never a good idea.  Machine learning models need a good amount of data to learn from: our data scientists benchmarked our models and found that having 30,000 records with good variability improves model performance.  Note: If you’re curious about the algorithms used in classification, you can reference the ServiceNow AI community video -  AI Fundamentals Part 1.
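You can sanity-check your record count from a background script before you even open the solution definition. A minimal sketch, assuming an incident-based solution - the encoded query is an example, so substitute the filter you plan to use:

```javascript
// Count the records that would match a training-set filter, so you can
// check against the 10,000-record minimum (ideally 30,000+) up front.
var ga = new GlideAggregate('incident');
ga.addEncodedQuery('active=false'); // example filter; use your own conditions
ga.addAggregate('COUNT');
ga.query();
if (ga.next()) {
    gs.info('Candidate training records: ' + ga.getAggregate('COUNT'));
}
```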

 

Do you know what you are trying to solve with machine learning?

To answer this question, it helps to create a simple dashboard with some basic reports on it.  At minimum I create a dashboard like the one below with four simple metrics (a script sketch for checking metric 2 follows the screenshot):

  1. Potential cases that can be automated – a count of all tickets solved by the first agent, which may be called Level 1 (L1) for your Service Desk.  These are also known as First Call Resolved tickets.
  2. Potential cases with multiple re-assignments – a count of all tickets where the re-assignment count is greater than 2.  If your re-assignment count is high, this is a great use case for Predictive/Task Intelligence.
  3. L1 MTTR – the First Call Resolved MTTR.  If this value is high, it may indicate potential savings through self-service (Virtual Agent or knowledge article deflection).    You can use Predictive Intelligence’s Knowledge Demand Insights application to improve your knowledge base.
  4. AVG reassignment time – the average reassignment time when the reassignment count is greater than two. This is a sweet spot for Predictive Intelligence & Task Intelligence auto-routing.

[Image: example dashboard showing the four metrics]
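If you want the numbers without building a report first, here is a minimal sketch for metric 2, assuming the incident table; swap in your own task table as needed:

```javascript
// Count tickets that were reassigned more than twice - candidates for
// Predictive/Task Intelligence auto-routing.
var ga = new GlideAggregate('incident');
ga.addQuery('reassignment_count', '>', 2);
ga.addAggregate('COUNT');
ga.query();
if (ga.next()) {
    gs.info('Tickets with multiple reassignments: ' + ga.getAggregate('COUNT'));
}
```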

--Common data quality issues with output fields--

 

Do you have Class Imbalance in the output field you are trying to predict?

Class imbalance happens when a few assignment groups handle almost all the tickets for a particular issue in your training set.   You can easily check for class imbalance by going to the filter condition in your solution definition and clicking the record-count link - in my case, 1,998 records (yes, I am aware that I said you need 30k records).

[Image: solution definition filter condition showing 1,998 matching records]

A list view of my training set will open, and I can make a bar chart on the output field I am trying to predict.  The x-axis represents the output field.  Because I know my data, I know I have ten unique outputs.  But the problem is that almost all my training data falls into just two output classes: phone and chat.

[Image: bar chart of the output field showing most records in the phone and chat classes]
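You can also check the distribution with a quick script. A minimal sketch, assuming the output field is contact_type on incident - substitute your own table and field:

```javascript
// Count records per output class to spot class imbalance before training.
var ga = new GlideAggregate('incident');
ga.addAggregate('COUNT');
ga.groupBy('contact_type'); // the output field you plan to predict
ga.query();
while (ga.next()) {
    gs.info(ga.getValue('contact_type') + ': ' + ga.getAggregate('COUNT'));
}
```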

To fix this, we need to change the training set to include more of the classes we are trying to predict.  When you plot your training set, you want it to look like the chart below, with every class (output field value) represented in the distribution.  One solution is to add a True/False column and label the records you would like to use in the training set, giving you a balanced data set and eliminating the skew.

 

[Image: bar chart showing a balanced distribution across all output classes]

 

Does your output field have outliers that would skew the prediction?  

Outliers have a big impact on models such as regression, where you are trying to predict something like business resolve time.  It’s important to exclude outliers from your training set when training a regression solution.  You can do this by creating a box-and-whisker plot of the resolution times for each assignment group and eliminating the top whisker (which contains only the top 25% of the data).  For example, in your training dataset filter condition you would exclude cases with a business resolve time in the 20-25 day range.

 

[Image: box-and-whisker plot of resolution times by assignment group]
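If you prefer a script to a plot, here is a minimal sketch that finds a 75th-percentile cutoff you could apply in the filter condition. It assumes the incident table and the business_duration field; adjust both for your data:

```javascript
// Collect resolve durations, sort them, and report the value at the 75th
// percentile; records above it are top-whisker outlier candidates.
var times = [];
var gr = new GlideRecord('incident');
gr.addQuery('active', false);
gr.addNotNullQuery('business_duration');
gr.query();
while (gr.next()) {
    times.push(gr.business_duration.dateNumericValue()); // duration in milliseconds
}
times.sort(function (a, b) { return a - b; });
if (times.length > 0) {
    var cutoff = times[Math.floor(times.length * 0.75)];
    gs.info('75th percentile business resolve time (ms): ' + cutoff);
}
```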

Why is my regression prediction negative?

There is a great KB article (KB1213065) by Brian Bakker that explains why this happens.  In summary, depending on the range of the training data, when a prediction returns a low value and the confidence level is set high, the lower bound of the prediction range can be shown as a negative value.  For example, a predicted resolve time of 5 hours with a range of ±8 hours would show a lower bound of −3 hours.  However, the predicted value itself will never be negative.

 

Are your output fields of the right data type?

Output fields should have a finite list of possibilities, such as assignment group, HR service, category, or priority.  An output field can be a choice field or a string field, and it should have some causal connection to the input fields.

 

--Common issues with input fields--

 

Do your input fields have enough data variability?

If your input fields lack variability, your model will perform poorly.  This often happens when templates are used to populate the short description and description fields, as in the example below, where everything is template text and the only variability is a name.  Heavily templated short_description and description fields like these would NOT be good input fields.

 

[Image: heavily templated short description and description values]
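A quick way to quantify this is to compare distinct values to total records; a very low ratio suggests heavy templating. A minimal sketch, assuming incident and short_description:

```javascript
// Ratio of distinct short descriptions to total records; values near 0
// indicate templated, low-variability text.
var ga = new GlideAggregate('incident');
ga.addAggregate('COUNT');
ga.addAggregate('COUNT(DISTINCT', 'short_description');
ga.query();
if (ga.next()) {
    var total = parseInt(ga.getAggregate('COUNT'), 10);
    var unique = parseInt(ga.getAggregate('COUNT(DISTINCT', 'short_description'), 10);
    gs.info('Distinct/total ratio: ' + (unique / total).toFixed(2));
}
```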

 

Did you select input fields that have zero correlation to the output field you are trying to predict?

If precision is poor on the model, it may be that you don’t have the right inputs to predict the output; you might have selected too many or too few.  For example, to predict assignment group you will typically use at minimum a description as input; additional inputs such as location, category, and priority might help improve the prediction.  You can determine which inputs are highly correlated to the output you are predicting by using the steps in this tuning article - the closer the value is to one, the more highly correlated it is to the field you are trying to predict.

 

[Image: correlation values between candidate input fields and the output field]

 

Do your input fields have a lot of empty values?

If so, remove them from your training set.  You can also use the script in this article to identify input fields that are mostly empty or NULL.

 

[Image: input fields with a high percentage of empty values]
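In the same spirit as the linked script, here is a minimal sketch that reports how often candidate input fields are empty; the table and field list are examples:

```javascript
// Report the percentage of empty/NULL values for each candidate input field.
var fields = ['short_description', 'description', 'category', 'cmdb_ci'];
var totalGa = new GlideAggregate('incident');
totalGa.addAggregate('COUNT');
totalGa.query();
var total = totalGa.next() ? parseInt(totalGa.getAggregate('COUNT'), 10) : 0;
fields.forEach(function (field) {
    var ga = new GlideAggregate('incident');
    ga.addNullQuery(field);
    ga.addAggregate('COUNT');
    ga.query();
    if (ga.next() && total > 0) {
        var empty = parseInt(ga.getAggregate('COUNT'), 10);
        gs.info(field + ': ' + (100 * empty / total).toFixed(1) + '% empty');
    }
});
```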

 

Do your input fields have mixed Languages, special characters, or HTML tags?

In the example below, the short description has a combination of English and Japanese. Predictive/Task Intelligence can only be trained against one language, so incidents like this will lower the model’s effectiveness.  Regarding multi-language support: Predictive Intelligence Similarity, when used in Agent Assist and Knowledge Demand Insights, can only support one language at a time.  That means if you are in a workspace and you access Similar Resolved Incidents/Similar Closed Cases from Agent Assist, that similarity model will only be trained against one language.

 

Any special characters not removed by pre-processing (see KB0862310), HTML tags, or images in your input fields should be removed to reduce noise when training.   At a high level, you can create a custom stop word list to ignore the special characters. To remove HTML tags or images in the description fields, you would need to write a JavaScript step that removes them, writes the cleansed data into a custom column, and uses that column as your new input.

 

[Image: short description containing a mix of English and Japanese text]
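Here is a minimal sketch of that cleansing step. The custom column u_clean_description is hypothetical (you would create it yourself), and the character whitelist is an example to adapt:

```javascript
// Strip HTML tags and special characters from description and store the
// cleansed text in a custom column to use as the model input.
var gr = new GlideRecord('incident');
gr.addNotNullQuery('description');
gr.query();
while (gr.next()) {
    var text = gr.getValue('description') || '';
    text = text.replace(/<[^>]*>/g, ' ');        // drop HTML tags
    text = text.replace(/[^\w\s.,?!-]/g, ' ');   // drop special characters (example whitelist)
    text = text.replace(/\s+/g, ' ').trim();     // collapse whitespace
    gr.setValue('u_clean_description', text);    // hypothetical custom column
    gr.setWorkflow(false);                       // avoid firing business rules/notifications
    gr.update();
}
```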

 

Did you select any inputs field with a large number of distinct values?

A common cause of training failures and poor performance is selecting an input field that has many distinct values when only a fraction of those values actually appear in your records.  For example, in the classification solution definition below I use configuration item as an input – configuration item has 5M+ distinct values, but only 5% of those configuration items are used in my incident list.  If your model is performing poorly, you may need to limit the input to configuration items that are actually used by the incidents in your training set.

 

[Image: classification solution definition using configuration item as an input]

[Image: distinct configuration item values compared with those referenced by incidents]
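To see how big that gap is on your instance, here is a minimal sketch comparing CIs referenced by incidents against the full CI table:

```javascript
// Compare distinct configuration items referenced by incidents with the
// total number of CIs in the CMDB.
var used = new GlideAggregate('incident');
used.addAggregate('COUNT(DISTINCT', 'cmdb_ci');
used.query();
if (used.next()) {
    gs.info('Distinct CIs referenced by incidents: ' + used.getAggregate('COUNT(DISTINCT', 'cmdb_ci'));
}
var all = new GlideAggregate('cmdb_ci');
all.addAggregate('COUNT');
all.query();
if (all.next()) {
    gs.info('Total CIs in the CMDB: ' + all.getAggregate('COUNT'));
}
```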

 

Is your input field a supported data type?

Input fields can be string, reference, choice, or HTML.  Journal fields such as work notes are not supported.  I’ve also found that if you have paragraphs of text with HTML tags, those tags need to be removed in a JavaScript pre-processing step like the sketch shown earlier.

 

Do not pick the same field as the group-by, input, and purity field when configuring a cluster solution

If you want to use group-by and purity fields in your cluster solution, make sure you don’t use the same field in multiple places, as shown in the image below.  The assignment group should only show up in ONE place; otherwise, your clustering may fail to train.  If you want to learn more about clustering techniques, go to the AI community > Advanced Topics > Predictive Intelligence for details on clustering, insight tables, and clustering algorithms.

 

[Image: cluster solution definition with assignment group incorrectly selected in multiple places]

 

Do you have machine generated/event data as inputs?

If you are trying to automate agent tasks with auto-routing, you'll want to exclude from your training set any incidents/cases generated by machine/event systems.
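How you identify machine-generated tickets varies by instance. A minimal sketch, assuming (hypothetically) that event-generated incidents are created by a dedicated integration account:

```javascript
// Exclude tickets created by an event/monitoring integration account.
// 'event.integration' is a hypothetical user name; use whatever marks
// machine-generated records on your instance.
var gr = new GlideRecord('incident');
gr.addQuery('sys_created_by', '!=', 'event.integration');
gr.query();
gs.info('Human-generated candidate records: ' + gr.getRowCount());
```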

 

This list is not exhaustive, and I'll update this article with other data quality issues to be aware of. -Lener

 

Comments
mahajanravish5
Tera Contributor

Hello @Lener Pacania1 ,

 

Great insights!!

How did you calculate AVG Reassignment time?
