Understanding How and When SLAs Are Calculated, and How It Might Affect Your Reports

NikEng1 · ‎03-29-2021

SLA Recalculations

Have you ever viewed an SLA record, only to have the values for elapsed percentage and time suddenly increase by multiple units? Or have you ever refreshed a list of SLAs without seeing the elapsed time change at all, even though time has passed since you last refreshed the list?

This is because calculated values in SLA records, like elapsed time and elapsed percentage, are not calculated on the fly. In fact, some SLA records only have their values updated once every 5 days (!). Depending on how big your backlog of open tasks is, and what you measure in terms of SLA metrics, this could have a big impact on your KPIs.

So how are SLAs recalculated, and can it be changed?

By "SLA Recalculation", I am referring to the updates of values like "Percentage Elapsed" or "Time Left". These values are always recalculated when an update happens in the SLA record, such as it being paused, cancelled or completed. But apart from those events, there are three things that recalculate and update the SLA:

A unique scheduled job for each SLA that runs at the breach time
A business rule named “Calc SLAs on Display” on the task table
Scheduled calculation jobs

The breach-time scheduled job

Each time an SLA is created, a job is created and scheduled to run at the breach time of the SLA. This scheduled job recalculates the SLA when the breach time is reached. If an SLA is changed or added, this job is rescheduled. The job runs once and is then removed. This means that an SLA is always updated and recalculated at the time it breaches.

The on-display business rule

The business rule “Calc SLAs on Display” runs when a record on the task table, or on a table that extends task, is displayed. This means that each time someone opens an incident in a form view, all related SLAs are recalculated. Note that the business rule does not run on display of the actual task_sla record, so opening the SLA record itself does not recalculate it.

Here is a video of this happening. Notice how the elapsed percentage jumps from 202 to 206 when we open the incident:

This is the business rule responsible for the recalculation:

The business rule runs the following script:

(function executeRule(current, previous /*null when async*/) {
    // if this Task has unprocessed records in the "sla_async_queue" then do not call SLACalculatorNG
    if (new SLAAsyncQueue().isTaskQueued(current.getUniqueValue()))
        return;
    
    var task_sla = new GlideRecord("task_sla");
    task_sla.addQuery("task", current.sys_id);
    task_sla.addActiveQuery();
    task_sla.addQuery('stage','!=','paused');
    task_sla.query();
    while (task_sla.next()) {
        //Disable running of workflow for recalculation of sla.
        task_sla.setWorkflow(false);
        if (gs.getProperty("com.snc.sla.engine.version", "2010") === "2011")
            SLACalculatorNG.calculateSLA(task_sla);
        else {
            var slac = new SLACalculator();
            slac.calcAnSLA(task_sla);
        }
    }
})(current, previous);

What this script does is query the task_sla table for all active and unpaused SLA records related to the task that was just displayed, unless the task still has unprocessed records in the sla_async_queue, in which case it returns without recalculating. It then calls the appropriate SLA calculation script based on your version of the SLA engine (most likely 2011).

The Scheduled Jobs

But what if a record isn’t displayed? Does that mean the SLA is never recalculated? The answer is no, thanks to scheduled jobs. There are multiple jobs running at different time intervals to recalculate the SLAs. Understanding how these work can be important for making sure your metrics are correct.

To view the jobs, navigate to System Scheduler > Scheduled Jobs and search for jobs with a name starting with “SLA Update”:

There are multiple jobs running at different intervals. Each job calculates a subset of active SLAs based on how much time is left before the breach time. Let’s look at “SLA Update (breach within 1 hour)”:

We can see that this job is scheduled to run every 10 minutes. It calls the calculateSLArange method of the SLACalculatorNG class with two parameters, start and end. The start is a date and time 10 minutes into the future, and the end is 60 minutes into the future. This means we are calculating all Task SLA records with a breach time between 10 minutes and 60 minutes from now.
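
As a rough illustration of that window, the sketch below computes the start and end bounds in plain JavaScript. The helper name breachWindow and its minute-based parameters are assumptions made for this example and are not part of SLACalculatorNG; the real job builds its dates with platform APIs.

```javascript
// Hypothetical helper: returns the breach-time window for a job, given
// offsets in minutes from "now". Not part of the SLA engine.
function breachWindow(now, startOffsetMinutes, endOffsetMinutes) {
    var MS_PER_MINUTE = 60 * 1000;
    return {
        start: new Date(now.getTime() + startOffsetMinutes * MS_PER_MINUTE),
        end: new Date(now.getTime() + endOffsetMinutes * MS_PER_MINUTE)
    };
}

// "SLA Update (breach within 1 hour)": 10 to 60 minutes from now
var now = new Date("2021-03-29T12:00:00Z");
var win = breachWindow(now, 10, 60);
console.log(win.start.toISOString()); // 2021-03-29T12:10:00.000Z
console.log(win.end.toISOString());   // 2021-03-29T13:00:00.000Z
```

Any SLA whose breach time falls between those two timestamps would be picked up by that run of the job.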

What’s important to note is that by default, these scheduled jobs stop updating the SLA once it has surpassed a certain value for “actual elapsed percentage”. That value is defined by the system setting “Percentage at which scheduled jobs stop refreshing Task SLA timings”, which by default is 1000.

It should be noted that this is only true for the scheduled jobs, which use the calculateSLArange method. The on-display business rule mentioned previously calls calculateSLA (or calcAnSLA on the older 2010 engine), which does not take the maximum value into account and recalculates the SLA regardless.
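
The difference between the two code paths can be pictured like this. It is a simplified model, not the engine's code: the function shouldRefresh, its trigger labels, and the hard-coded 1000 are illustrative assumptions based only on the behaviour described above.

```javascript
// Simplified model of the behaviour described above; not the SLA engine's code.
var MAX_PERCENTAGE = 1000; // assumed to mirror the out-of-the-box setting

// trigger: "scheduled_job" or "on_display" (hypothetical labels)
function shouldRefresh(trigger, actualElapsedPercentage) {
    if (trigger === "on_display")
        return true; // the business rule recalculates regardless of the cap
    // Exact handling of the boundary value itself is an assumption here
    return actualElapsedPercentage <= MAX_PERCENTAGE;
}

console.log(shouldRefresh("scheduled_job", 1200)); // false
console.log(shouldRefresh("on_display", 1200));    // true
console.log(shouldRefresh("scheduled_job", 250));  // true
```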

In total there are six scheduled jobs, each with its own range. Although the job names suggest overlapping ranges, the actual ranges pick up where the previous one ends, which is why, for example, the “breach within 1 hour” job doesn’t look for SLAs breaching within 10 minutes: those are already covered by the “breach within 10 min” job.

Job Name	Interval / Range of SLAs covered by job	Runs every:
SLA update (already breached)	Breach time between now and 1 year ago	1 day
SLA update (breach after 30 days)	Breach time between 30 days from now and 1 year from now	5 days
SLA update (breach within 10 min)	Breach time between 1 minute from now and 10 minutes from now	1 minute
SLA update (breach within 1 hour)	Breach time between 10 minutes from now and 60 minutes from now	10 minutes
SLA update (breach within 1 day)	Breach time between 1 hour from now and 24 hours from now	1 hour
SLA update (breach within 30 days)	Breach time between 1 day from now and 30 days from now	1 day
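
To make the table easier to reason about, here is a small sketch that encodes the ranges and looks up which job would cover an SLA a given number of minutes from breach. The names SLA_JOBS and jobFor are made up for this example, a year is assumed to be 365 days, and the ranges are treated as half-open; how the real jobs handle the exact boundaries may differ.

```javascript
// Ranges copied from the table above, in minutes until breach.
// Negative values mean the breach time has already passed.
var HOUR = 60, DAY = 24 * HOUR, YEAR = 365 * DAY; // 365-day year is an assumption

var SLA_JOBS = [
    { name: "SLA update (already breached)",      from: -YEAR,    to: 0,        runsEvery: DAY },
    { name: "SLA update (breach within 10 min)",  from: 1,        to: 10,       runsEvery: 1 },
    { name: "SLA update (breach within 1 hour)",  from: 10,       to: HOUR,     runsEvery: 10 },
    { name: "SLA update (breach within 1 day)",   from: HOUR,     to: DAY,      runsEvery: HOUR },
    { name: "SLA update (breach within 30 days)", from: DAY,      to: 30 * DAY, runsEvery: DAY },
    { name: "SLA update (breach after 30 days)",  from: 30 * DAY, to: YEAR,     runsEvery: 5 * DAY }
];

// Hypothetical lookup: which job covers an SLA this many minutes from breach?
// Ranges are treated as [from, to); returns null when no job covers the value.
function jobFor(minutesToBreach) {
    for (var i = 0; i < SLA_JOBS.length; i++) {
        var job = SLA_JOBS[i];
        if (minutesToBreach >= job.from && minutesToBreach < job.to)
            return job;
    }
    return null;
}

console.log(jobFor(45).name);                  // SLA update (breach within 1 hour)
console.log(jobFor(45 * DAY).runsEvery / DAY); // 5
console.log(jobFor(400 * DAY));                // null
```

The runsEvery value (in minutes) is the longest an SLA in that range can go without a scheduled refresh, and a null result means no scheduled job covers that breach time at all.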

What does this mean for my reports?

First of all, realize that running queries against the task_sla table, or a database view like incident_sla, does not constitute a display of the actual record. This means the business rule to recalculate SLA records is not triggered. If no one opens the task, we are relying on the scheduled jobs to keep the SLA values updated until the task is closed.

If an SLA record is more than 30 days away from breaching, and no one has opened the task form, the elapsed percentage and elapsed time fields might be up to 5 days off in their calculated values.
If an SLA has surpassed an actual elapsed percentage of 1000%, and no one opens the task form, the SLA values will stop updating.
If an SLA is due to breach within a day, the calculated values are only updated once an hour unless someone opens the task.
If an SLA breached more than one year ago, it stops being updated.
If an SLA has breached and no one opens the form, it is only updated once per day.

If we are just reporting on metrics such as “Number of open incidents with a breached SLA” or “% of incidents resolved within SLA”, the limitations above will likely not matter. An SLA will always be shown as breached if the breach time has passed, thanks to the unique scheduled job created for that specific SLA record. It also won’t affect metrics like “average elapsed SLA percentage in resolved incidents”. That metric looks at completed SLAs, which are updated when the related task is closed or resolved.

However, if you are reporting on metrics like "Average Elapsed SLA in Open Incidents" or "Open Incidents With >150% SLA Elapsed", your metrics could be off. This will be an even bigger problem if you have a big backlog of old open tasks, or if you have very long SLAs. For example, an SLA that is due to breach more than 1 year in the future falls outside the range of every scheduled job in the table above, so it simply won’t be updated by them until its breach time is less than a year away.

Changing the frequency

You can change the recalculation frequency, if you feel your use cases depend on it.

A simple thing to change would be the system property that sets the maximum elapsed percentage up to which the scheduled jobs keep recalculating. As stated above, it is set to 1000% out of the box. If you know you have a lot of old tasks exceeding this, it could be a good idea to turn it up.

You could also increase the frequency of some of the scheduled jobs. Maybe you want the breached SLAs to update more than once every day, which could be accomplished by changing the “SLA update (already breached)” job frequency.
Or maybe you want SLAs breaching in 30 days or more to be recalculated more than once every 5 days; then change the frequency of the “SLA update (breach after 30 days)” job.

Also, note that the SLACalculatorNG script include has a method called “calculateAll”, which will recalculate all active and unpaused SLAs with an elapsed percentage less than the maximum value in the system property.

Be aware of the potential performance hit any change could have on your instance, which largely depends on how many active SLAs you have.

/Niklas Engren

performinganalytics.com

YuvrajS17678636 · ‎02-27-2024

How can I disable the SLA task getting created for Issue?