
What is the SLA Engine? What is it made up of? Is it really an engine? All these questions and more are answered from the perspective of a developer working on and supporting the SLA Engine.
The SLA Engine is not actually a real SNC engine (the Legacy SLA implementation did use the Escalation Engine, but that is out of scope for this article).
Rather, the SLA Engine is a grouping of Script Includes, Scheduled Jobs, flows/workflows, two key tables (SLA Definition: contract_sla and Task SLA: task_sla) and a Business Rule, which is where all SLA processing starts.
Business Rule: Run SLAs
This is split into two distinct pieces of logic: either use Script Include: SLAOfflineUpdate to add an insert/update entry to the sla_offline_update table for task records whose table has the attribute offline_timestamp_field, or use Script Include: TaskSLAController to process the insert/update for a given task. Note that "task" here includes any extension of task, e.g. incident, problem, change etc.
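A minimal sketch of that branching is shown below. It is not the shipped "Run SLAs" business rule script: the sys_dictionary lookup for the table attribute and, in particular, the TaskSLAController constructor/method call are assumptions made purely for illustration.
// Sketch only: illustrates the branch described above; NOT the shipped "Run SLAs" script.
// Assumes 'current' is the task (or task-extended) GlideRecord being inserted/updated.
(function runSLAsSketch(taskGr) {
    // Table-level attributes live on the "collection" row of sys_dictionary
    // (this sketch ignores attribute inheritance from parent tables)
    var dict = new GlideRecord('sys_dictionary');
    dict.addQuery('name', taskGr.getTableName());
    dict.addQuery('internal_type', 'collection');
    dict.query();
    var attrs = dict.next() ? (dict.getValue('attributes') || '') : '';

    if (attrs.indexOf('offline_timestamp_field') > -1)
        new SLAOfflineUpdate().queue(taskGr);   // queue into sla_offline_update for later processing
    else
        new TaskSLAController(taskGr).run();    // constructor/method shown as an assumption
})(current);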
Script Include: SLAOfflineUpdate
This has one core function:
queue (calling _queue in SelfCleaningMutex 'Process SLA Offline Update Mutex')
Given a task insert/update on a table with the attribute offline_timestamp_field, it populates the sla_offline_update table with DOC/DOC_ID (i.e. table/sys_id) for asynchronous processing.
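A rough sketch of that queuing step is below. The doc/doc_id column names are inferred from the DOC/DOC_ID description above and should be verified against the sla_offline_update table on your instance.
// Sketch only: record the task reference for later asynchronous SLA processing.
function queueOfflineUpdate(taskGr) {
    var entry = new GlideRecord('sla_offline_update');
    entry.initialize();
    entry.setValue('doc', taskGr.getTableName());      // table of the task record (field name assumed)
    entry.setValue('doc_id', taskGr.getUniqueValue()); // sys_id of the task record (field name assumed)
    entry.insert();
}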
Script Include: TaskSLAController
The TaskSLAController has two core functions:
_processNewSLAs (calling _processNewSLAs_criticalSection in SelfCleaningMutex 'Process New SLAs Mutex')
Given the task insert/update, it collects the start conditions of all SLA Definitions and checks whether the task matches any of them. It does this while holding a mutex against the task record, so that parallel processing cannot generate duplicate task_sla records for a given SLA Definition. A start condition match for a given SLA Definition results in a new task_sla record being inserted for that SLA Definition.
There is of course more complexity here: if reset, cancel, stop or pause conditions also match, the behaviour depends on configuration, and contract management (if enabled and configured), domains and so on can all affect the attachment of the task_sla record. However, all of this happens within this thread of execution.
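Stripped of the mutex, duplicate protection, contract and domain handling described above, the core attach loop looks roughly like the sketch below. The contract_sla and task_sla fields are real, but the logic is heavily simplified and is not the TaskSLAController implementation.
// Sketch only: the essence of matching start conditions and attaching task_sla records.
function attachNewSLAs(taskGr) {
    var def = new GlideRecord('contract_sla');
    def.addActiveQuery();
    def.addQuery('collection', taskGr.getTableName());
    def.query();
    while (def.next()) {
        // Start conditions are encoded queries, evaluated via the platform filter API
        var startCondition = def.getValue('start_condition');
        if (!startCondition || !SNC.Filter.checkRecord(taskGr, startCondition))
            continue;

        var taskSLA = new GlideRecord('task_sla');
        taskSLA.initialize();
        taskSLA.setValue('task', taskGr.getUniqueValue());
        taskSLA.setValue('sla', def.getUniqueValue());
        taskSLA.setValue('stage', 'in_progress');
        taskSLA.setValue('start_time', new GlideDateTime().getValue());
        taskSLA.insert(); // the real engine also calculates planned_end_time etc. before insert
    }
}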
_processExistingSLAs (calling _processExistingSLAs_criticalSection in SelfCleaningMutex 'Process Existing SLAs Mutex')
Given the task insert/update, it evaluates all the SLA Definition conditions (stop, reset, cancel, pause/unpause) for all associated task_sla records. Again, as above, there is more complexity than this summary suggests.
Both of these functions hand off the condition matching to another script include.
Script Include: SLAConditionBase
The first thing to note is that SNC ships two script includes, SLAConditionBase and SLAConditionSimple, and that the one to use is selected in TaskSLAController via a system property, so customers can replace it with their own script include.
Its purpose is to define the task_sla state machine, i.e. from which state to which state a task_sla record can move.
SLAConditionBase state machine can be found here: https://www.servicenow.com/docs/bundle/xanadu-it-service-management/page/product/service-level-manag...
SLAConditionSimple state machine can be found here: https://www.servicenow.com/docs/bundle/xanadu-it-service-management/page/product/service-level-manag...
Note that all conditions are GlideRecord encoded queries, so the matching logic is handed off to the platform API: SNC.Filter.checkRecord(task, condition).
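For example, evaluating a pause condition against the task of an existing task_sla record boils down to a single checkRecord call over the encoded query. The helper below is a sketch for illustration, not the SLAConditionBase implementation.
// Sketch only: evaluate one SLA Definition condition against a task record.
// 'taskGr' is the task GlideRecord, 'slaDefGr' the contract_sla (SLA Definition) record.
function matchesPauseCondition(taskGr, slaDefGr) {
    var condition = slaDefGr.getValue('pause_condition');
    if (!condition)
        return false;
    return SNC.Filter.checkRecord(taskGr, condition) == true;
}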
Script Include: SLACalculatorNG
This evaluates all the task_sla record timings: Elapsed Time, Pause Time and Time Left (when the SLA Definition has no associated Schedule [cmn_schedule]), and Business Elapsed Time, Business Pause Time and Business Time Left, in and out of schedule (when a schedule is associated), as well as the Planned End Date, a.k.a. the breach datetime. This is the script include to debug when task_sla calculations give undesired results. It uses Script Include: DurationCalculator and does the bulk of the timings calculation logic.
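The schedule-aware arithmetic underneath is plain DurationCalculator usage. The sketch below shows the two calculations that matter most, business elapsed time and a planned end (breach) time; the schedule sys_id is a placeholder, and while these are the commonly documented DurationCalculator methods, verify them against your release.
// Sketch only: schedule-aware timing calculations of the kind SLACalculatorNG performs.
var scheduleSysId = 'your_cmn_schedule_sys_id'; // hypothetical placeholder
var dc = new DurationCalculator();
dc.setSchedule(scheduleSysId, 'US/Pacific');

// Business elapsed time: seconds of in-schedule time between start and now
var start = new GlideDateTime('2024-01-02 09:00:00');
var now = new GlideDateTime();
var businessElapsedSeconds = dc.calcScheduleDuration(start, now);
gs.info('Business elapsed seconds: ' + businessElapsedSeconds);

// Planned end (breach) time: walk an 8 hour duration forward through the schedule
dc.setStartDateTime(start);
if (dc.calcDuration(8 * 60 * 60))
    gs.info('Planned end: ' + dc.getEndDateTime());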
Script Include: TaskSLA
Once a task insert/update matches either a new SLA Definition condition or a condition on an existing task_sla's SLA Definition, the insert or update of the task_sla record is done by this script include. Prior to insert/update the timings for the task_sla are calculated as explained above; this script include also does a small amount of datetime calculation for the task_sla record itself, again using Script Include: DurationCalculator. Finally, it initiates and interacts with the flow/workflow associated with the SLA Definition, via TaskSLAFlow/TaskSLAworkflow respectively.
Script Include: TaskSLAFlow/TaskSLAworkflow
The task_sla records often have either a flow or a workflow associated with them. Mapping SLA start/cancel/pause states to the equivalent Flow or Workflow states is done here, calling the requisite Flow (sn_fd.FlowAPI) or Workflow (SNC.WorkflowScriptAPI) API.
Script Include: SLABreakdownProcessor
SLA Breakdown records are buckets of time, each marked with both an Assignment Group [reference to sys_user_group] and an Assigned To [reference to sys_user], such that [empty + empty] is a bucket, as is [Group 1 + empty], [empty + Person 1], [Group 1 + Person 1] and so on. However, these buckets are also update based, i.e. each task update is its own bucket. For example, if the task was assigned to Person 1 and Group 1 (1 record), then to empty and Group 1 (2 records), then back to Person 1 and Group 1 (3 records), the time is not added back to the first bucket/record, because there was a sequence of events. Only visually, in bar charts, are identical buckets that occurred in different updates (interspersed by different ownership buckets) amalgamated to show total time. If/when an SLA breaches, the current active bucket is marked as breached. Finally, since task records may have multiple references to sys_user and sys_user_group, there is a mapping definition that lets customers choose which fields to use as the sys_user (Assigned To) and sys_user_group (Assignment Group).
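The update-based bucketing is easier to see in a toy example. The sketch below has nothing to do with the real SLABreakdownProcessor internals; it only illustrates why the three updates above produce three buckets even though ownership returns to an earlier combination.
// Illustration only: one bucket per contiguous ownership combination, in update order.
function toBuckets(updates) {
    var buckets = [];
    for (var i = 0; i < updates.length; i++) {
        var u = updates[i];
        var last = buckets[buckets.length - 1];
        if (last && last.group == u.group && last.user == u.user)
            last.seconds += u.seconds;   // same ownership as the previous update: same bucket
        else
            buckets.push({ group: u.group, user: u.user, seconds: u.seconds });
    }
    return buckets;
}

var buckets = toBuckets([
    { group: 'Group 1', user: 'Person 1', seconds: 600 },
    { group: 'Group 1', user: '',         seconds: 300 },
    { group: 'Group 1', user: 'Person 1', seconds: 900 }
]);
// => 3 buckets: the first and third have identical ownership but are not merged,
//    because a different ownership bucket occurred between them.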
Scheduled Jobs to maintain accuracy of task_sla records:
- SLA update (breach within 30 days)
- SLA update (breach within 1 day)
- SLA update (breach within 1 hour)
- SLA update (breach within 10 min)
- SLA update (already breached)
- SLA update (breach after 30 days)
They simply call SLACalculatorNG against a filtered list of task_sla records to update their timing values.
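Conceptually, each of these jobs does something like the sketch below: pick up the active Task SLAs in its breach window and recalculate them. The static SLACalculatorNG.calculateSLA call is how single-record recalculation is commonly done in community examples, but treat the exact method and signature as an assumption to verify against your instance.
// Sketch only: recalculate timings for in-progress Task SLAs breaching within the next hour.
var inOneHour = new GlideDateTime();
inOneHour.addSeconds(60 * 60);

var slaGr = new GlideRecord('task_sla');
slaGr.addActiveQuery();
slaGr.addQuery('stage', 'in_progress');
slaGr.addQuery('planned_end_time', '<=', inOneHour);
slaGr.query();
while (slaGr.next())
    SLACalculatorNG.calculateSLA(slaGr); // assumed static entry point; see note above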
Summary Interlude
Business Rule: Run SLAs calls Script Include: SLAOfflineUpdate for task records with the attribute offline_timestamp_field, OR Script Include: TaskSLAController for new and existing task_sla records, which in turn calls Script Include: SLAConditionBase to match the task insert/update to conditions, Script Include: SLACalculatorNG to calculate timings, Script Include: TaskSLA to insert/update the task_sla records, and Script Include: SLABreakdownProcessor to update the breakdown time bucket records.
SLA Engine - Asynchronous mode
The SLA Engine can be run in asynchronous mode, as set by the property com.snc.sla.engine.async = true. Rather than processing each insert/update for a given task immediately in the current user session, the task record's table, sys_id, sys_mod_count and sys_audit.internal_checkpoint are stored in sla_async_queue. The Scheduled Job: SLA Async Delegator runs every 5 seconds and triggers the processing of those task updates by the SLA Engine, as the system user, on a worker node.
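A quick way to see whether an instance is in this mode, and how much work is waiting, is a background-script check of the property named above and the queue depth, sketched below.
// Sketch only: report the SLA Engine mode and the current async queue depth.
var asyncMode = gs.getProperty('com.snc.sla.engine.async', 'false') == 'true';
gs.info('SLA Engine asynchronous mode: ' + asyncMode);

var agg = new GlideAggregate('sla_async_queue');
agg.addAggregate('COUNT');
agg.query();
if (agg.next())
    gs.info('sla_async_queue entries: ' + agg.getAggregate('COUNT'));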
Script Include: SLAAsyncUtils
An encapsulated block of utility logic that provides functionality such as checking whether the SLA Engine is running in asynchronous mode, checking whether asynchronous processing is currently active, and enabling or disabling asynchronous processing.
Script Include: SLAAsyncQueue
This has two distinct logic paths.
First, it puts the task record insert/update details (table, sys_id, sys_mod_count and sys_audit.internal_checkpoint) into the sla_async_queue.
Second, it triggers processing of these task updates when called from a running sys_trigger record, passing each task insert/update to TaskSLAController for actual processing.
Script Include: SLAAsyncDelegator
This is called by the Scheduled Job: SLA Async Delegator. Its purpose is to take entries from the sla_async_queue, create new sys_trigger records (Scheduled Jobs) and assign batches of task updates to be processed. There are several properties here to help tune the creation, priority and quantity of scheduled jobs. The defaults are: 4 sys_trigger records created, each with a priority of 100, each allowed to process a maximum of 20 task updates. Updates for the same task are sequenced and always put into the same sys_trigger record to ensure no parallel and/or out-of-order processing (which could cause deadlocks, since the SLA Engine uses a mutex per task). Each time the SLAAsyncDelegator is called, it either creates new sys_trigger records or re-fills existing ones with tasks to process. These run-once sys_trigger records call Script Include: SLAAsyncQueue.processQueue with themselves as the argument (see the second part of SLAAsyncQueue).
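The important design point is the per-task sequencing. The sketch below is not the SLAAsyncDelegator code; it only illustrates one way to keep every update of a given task in the same batch so it can never be processed in parallel or out of order. The helper and its names are hypothetical.
// Illustration only: route all updates of the same task to the same worker batch.
function assignToBatch(taskSysId, batchCount) {
    var hash = 0;
    for (var i = 0; i < taskSysId.length; i++)
        hash = (hash * 31 + taskSysId.charCodeAt(i)) % batchCount;
    return hash; // index of the sys_trigger "worker" this task's updates always go to
}

// e.g. with the default of 4 scheduled jobs:
var batchIndex = assignToBatch('46e18c0fa9fe19810066a0083f76bd56', 4);
gs.info('Task updates assigned to batch: ' + batchIndex);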
Script Include: SLAAsyncQueueHealthCheck
This is triggered every 5 minutes by the Scheduled Job: SLA async queue health check and checks the health of the sla_async_queue, for example whether entries marked as processing really have a corresponding sys_trigger record.
SLA Repair and SLA Timeline
Script Include: SLARepair
Provides functions to repair Task SLAs based on a GlideRecord, filter or sys_id of records from the contract_sla, task_sla or task tables. Each Task SLA identified for repair is removed and then recreated based on the audit history of the Task (using the History Walker) and the current set of conditions in each SLA Definition. As a result, the repair process can remove Task SLAs that no longer match the attach conditions or create Task SLAs that did not previously exist.
Script Include: SLATimeLineV2SNC
Provides a visual representation of the life of a task_sla, showing when it matched the conditions in each SLA Definition (start, pause/unpause, complete, cancel), as well as in/out of schedule and a breach indicator. The Task SLA is recreated in memory only, from the task audit history, using the History Walker.
History Walker
The History Walker uses the audit/history tables to generate a historical version of an existing GlideRecord. It provides the ability to move forward and backward through update numbers. This enables SLA Repair to be run again on each replayed update of a Task record and recreate a Task SLA record. That Task SLA record is either used in memory only, by the SLA Timeline view, or inserted into the database as part of a repair.
API: sn_hw.HistoryWalker(table, sys_id);
The API loads a task record and replays the entire audit history of that record at each update/checkpoint, so that it can be re-evaluated by the SLA Engine. For example:
var gr = new GlideRecord('sc_req_item');
gr.get('number', 'RITM0010065');

var hw = new sn_hw.HistoryWalker(gr.getTableName(), gr.getUniqueValue());
hw.setWithVariables(true);     // Optional: if walked variables are required
hw.setWithJournalFields(true); // Optional: if walked journal fields are required
hw.setWithSysFields(true);     // Optional: if walked system fields are required

while (hw.walkForward())
    printChangedFields(hw);

function printChangedFields(hw) {
    var walkedGr = hw.getWalkedRecord();
    var fields = GlideScriptRecordUtil.get(walkedGr).getChangedFieldNames();
    gs.print('Fields changed at update ' + hw.getUpdateNumber() + ' were:');
    for (var j = 0; j < fields.size(); j++) {
        var fieldName = fields.get(j) + '';
        if (fieldName !== 'variables')
            gs.print(walkedGr.getValue('sys_updated_on') + ' ' + fieldName + '=' + walkedGr.getValue(fieldName));
    }

    // Optional: if walked variables are required
    var variables = walkedGr.variables;
    for (var variableName in variables) {
        if (variables[variableName].changes())
            gs.print(walkedGr.getValue('sys_updated_on') + ' ' + variableName + '=' + variables[variableName].getValue());
    }

    gs.print('');
}
Two additional notes about the SLA Engine and other Platform features
Flows and Workflows
SLA Definitions commonly have a flow or workflow associated with them. The primary use case of these flows/workflows is to support milestone notifications with the Default SLA flow, which provides the 50% and 75% event triggers as well as the 100% (i.e. breached) event trigger. These are often mapped to outbound email notifications to the Assigned To and/or the group manager, depending on the event percentage.
Domain Separation
SLA Definitions have both a sys_domain field and a sys_overrides field when domain separation is enabled, meaning they are configuration rather than data. This enables MSPs to set domain-specific SLA Definitions as well as have SLAs running in the TOP/global domain. In asynchronous mode, task updates are processed by the system user rather than the user who made the update, as happens in synchronous mode. As such, the SLA Engine sets the domain of the task record as the current running domain and identifies SLA Definitions in the bottom/leaf domain first, as well as matching the flow/workflow, schedules and corresponding schedule spans in that domain.