SLA split transaction

andrew_och · ‎01-11-2024

Why did my SLA not attach? Why did my SLA not pause? A common reason SLAs do not behave as expected is that a transaction was split. But what does that mean, how do I identify the cause, and how do I prevent it?

There are of course many ways a transaction can be split, so the way presented below is not definitive, and it is very obviously a bad practice. It does, however, serve as a good example.

The bad practice example setup.

1. I modified the property com.snc.sla.get_task_from_db, ensuring it is empty.

2. I modified the Service Catalog Item Request Workflow, adding an activity that sets both backordered to true and priority to 3, and put it before the create catalog task activity. See below.

3. I created an on-insert Business Rule on sc_task, which is triggered when the above workflow creates a catalog task. The Business Rule queries the sc_req_item record (the one associated with the workflow in the previous step) and sets the state to 2 (Work in Progress) and the priority to 2. (Wait, didn't we set the priority to 3 in the workflow? Yep, this is a very bad practice indeed.) See below.

4. I created an SLA Definition on table sc_req_item, and set my pause condition to be: state is 2 (Work in Progress), backordered is true, priority is 3. Based on my implementation in step #3, will my SLA pause? See below.

5. My sc_req_item record has state set to 2 (Work in Progress), backordered set to true, and priority set to 3. Well, I guess based on my implementation in step #3 the workflow won and the Business Rule lost, so my task_sla should pause... OMG, my task_sla did not pause. WHY? See below.

OK, so when we look at the values in the sc_req_item record we see that they match the SLA pause condition, but let's look at the history of the record. See below.

Oh, the priority is set to both 2 - High and 3 - Moderate, all in update 3. We can also see that there are two different record internal checkpoint values: one for the set of changes made by the BR (Priority and State), and another for the set made by the workflow (Approval set, Stage, Backordered, Approval, Priority). So which set of values was passed to the SLA Engine when update 3 was made? Well, it did not pause, so obviously not the desired values.
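The check described above (one update number, more than one internal checkpoint) can be sketched in plain JavaScript. This is only an illustration, not how the platform does it: the row shape is hypothetical, the field labels are the ones listed in the post, the pairing of each checkpoint value with the BR or the workflow is an assumption, and the update 4 checkpoint is made up.

```javascript
// Hypothetical history rows: one row per changed field.
// Which checkpoint value belongs to the BR and which to the workflow is an
// assumption made for illustration; the post does not say.
const historyRows = [
  { update: 3, field: "Priority", checkpoint: "18cf422b6700000001" },     // BR
  { update: 3, field: "State", checkpoint: "18cf422b6700000001" },        // BR
  { update: 3, field: "Approval set", checkpoint: "18cf422b89b0000001" }, // workflow
  { update: 3, field: "Stage", checkpoint: "18cf422b89b0000001" },        // workflow
  { update: 3, field: "Backordered", checkpoint: "18cf422b89b0000001" },  // workflow
  { update: 3, field: "Approval", checkpoint: "18cf422b89b0000001" },     // workflow
  { update: 3, field: "Priority", checkpoint: "18cf422b89b0000001" },     // workflow
  { update: 4, field: "Comments", checkpoint: "made-up-checkpoint" },     // invented value
];

// Returns the update numbers that carry more than one distinct checkpoint,
// which is the signature of a split transaction described in the post.
function findSplitUpdates(rows) {
  const checkpointsByUpdate = new Map();
  for (const row of rows) {
    if (!checkpointsByUpdate.has(row.update)) {
      checkpointsByUpdate.set(row.update, new Set());
    }
    checkpointsByUpdate.get(row.update).add(row.checkpoint);
  }
  return [...checkpointsByUpdate]
    .filter(([, checkpoints]) => checkpoints.size > 1)
    .map(([update]) => update);
}

console.log(findSplitUpdates(historyRows)); // only update 3 is flagged
```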

6. Let's add some comments to the sc_req_item, in update 4. Wow, the task_sla paused at 16:18:14, when processing update 4 at 16:18:17. So the database updated 3 seconds after processing; that is just my local instance being slow. But that is the completely wrong time, it should have paused on update 3 at 16:09:25. What is going wrong with the SLA Engine? Well, each update is passed to the SLA Engine, which processes it (checks the start, stop, pause, reset and cancel conditions) and updates the task_sla accordingly. The sc_req_item DB values match the pause condition, so the SLA Engine pauses the task_sla; it is update 4 and the time is 16:18:17. Unfortunately the implementation done in step #3 causes data corruption. Is the priority 2 or 3? Well, during one transaction (the update of the sc_req_item by the workflow) it is 3, but during the other transaction (the insert of the sc_task, where the sc_req_item was read from the DB prior to the changes made in the workflow) it is 2. However, by update 4 all data is settled in the DB and now we get a single result for priority: 3, and that matches the pause condition. See below.
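To make the mechanism in step 6 concrete, here is a minimal model in plain JavaScript. It is a sketch under assumptions: the values stored before update 3 (dbBefore) are invented, and, as noted above, the post does not establish which snapshot the SLA Engine actually received. The model only shows that neither partial view satisfies the pause condition from step 4, while the settled record does.

```javascript
// Pause condition from step 4 of the post.
function matchesPauseCondition(ritm) {
  return ritm.state === 2 && ritm.backordered === true && ritm.priority === 3;
}

// Assumed values stored in the DB before update 3 (not given in the post).
const dbBefore = { state: 1, backordered: false, priority: 4 };

// Changes made by each writer, as described in steps 2 and 3.
const workflowChanges = { backordered: true, priority: 3 };
const businessRuleChanges = { state: 2, priority: 2 };

// Each side of the split transaction sees only its own changes.
const workflowView = { ...dbBefore, ...workflowChanges };
const businessRuleView = { ...dbBefore, ...businessRuleChanges };

// Once everything is settled, the workflow's priority has won (step 5).
const settled = { ...dbBefore, ...businessRuleChanges, ...workflowChanges };

console.log(matchesPauseCondition(workflowView));     // false: state is still 1
console.log(matchesPauseCondition(businessRuleView)); // false: priority 2, not backordered
console.log(matchesPauseCondition(settled));          // true: what update 4 sees
```

In this model the pause condition first becomes true only when a later update (update 4 here) reads the settled record, which matches the late pause time observed above.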

Let's run the SLA data collector and see what we get.

*** Script: SLA recalculation: 
Task:	RITM0010007
SLA:	Dev Laptop Backorder Priority 3
	hasAuditData: true
	isAuditModInSync: true
	auditCount: 4
	modCount: 4
	hasBadHistoryUpdate: {
  "different_dates": {
    "3": "2024-01-10 16:09:25 != 2024-01-10 16:09:24"
  },
  "different_internal_checkpoints": {
    "3": "18cf422b89b0000001 != 18cf422b6700000001"
  },
  "hasBadUpdates": true
}

Oh dear, hasBadUpdates is true. There are two issues with update 3. First, there are two different record internal checkpoints, which means a split transaction. Second, not only is update 3 split, it also has two different times: 16:09:25 and 16:09:24. OK, but that is only one second out, right? No big deal? Unfortunately that could be the difference between an SLA breaching or not. It will of course yield different calculations on SLA Timeline and SLA Repair, depending on which date/time is returned from the history records. So which is it? In short, I don't know, but fundamentally the root cause is data corruption caused by a split transaction, as created by the implementation in step #3.

The point here is that when a transaction is split, it is not just the data that is potentially corrupt or inconsistent, but also the metadata: the time the system believes the update took place is inconsistent too. The solution is to prevent any inconsistencies from occurring, and hopefully this post will help identify causes of data inconsistencies.
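If you want to read the hasBadHistoryUpdate output programmatically, a small helper along these lines could summarise it. This is a sketch in plain JavaScript, not part of the SLA data collector: the function names are hypothetical, the object is copied from the output above, and the timestamps are parsed as UTC purely so that the difference can be computed (the real time zone is not stated).

```javascript
// Copied from the hasBadHistoryUpdate output shown in the post.
const report = {
  different_dates: { "3": "2024-01-10 16:09:25 != 2024-01-10 16:09:24" },
  different_internal_checkpoints: { "3": "18cf422b89b0000001 != 18cf422b6700000001" },
  hasBadUpdates: true,
};

// Parse "YYYY-MM-DD HH:MM:SS" as UTC. The zone is an assumption; only the
// difference between the two timestamps matters here.
function toMillis(stamp) {
  return Date.parse(stamp.replace(" ", "T") + "Z");
}

// Hypothetical helper: one summary entry per affected update number.
function summarizeBadUpdates(rep) {
  const dates = rep.different_dates || {};
  const checkpoints = rep.different_internal_checkpoints || {};
  const updates = new Set([...Object.keys(dates), ...Object.keys(checkpoints)]);
  return [...updates]
    .sort((a, b) => Number(a) - Number(b))
    .map((update) => {
      let skewSeconds = null;
      if (dates[update]) {
        const [first, second] = dates[update].split(" != ");
        skewSeconds = Math.abs(toMillis(first) - toMillis(second)) / 1000;
      }
      return {
        update: Number(update),
        splitTransaction: Boolean(checkpoints[update]), // two checkpoints reported
        skewSeconds, // difference between the two recorded times
      };
    });
}

console.log(summarizeBadUpdates(report));
```

For the output above this reports update 3 as a split transaction with a one second difference between its two recorded times.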