Uncle Rob
Kilo Patron

There once was a customer who had tasks with unusually long lifecycles. These tasks would frequently stay open and relevant for between 6 months and 1 year. The tasks were associated with a complex approval / data population workflow that was, by comparison, very short. The effect looks like this...

[Image: a long task lifecycle compared with its much shorter workflow]

THE FIRST HINT THAT SOMETHING WAS WRONG

For the first few months, the customer was elated at the business value this solution provided (it helped them ace a government audit). But suddenly a large number of tickets mysteriously "started over". A second set of approvals launched. Data overwritten. Notifications sent. On further inspection, it appeared the workflow had launched a second time. Digging deeper revealed that a large number of older Tasks simply had no workflow. Some of you may remember this from my "The mysterious case of the disappearing workflows" thread.

WORKFLOWS ARE FLUSHED ON OLD INSTANCES!!

The horrifying root cause of this issue is that on older instances of ServiceNow, completed workflows are flushed after a few months. Take a look at the visualization above. That workflow completes in a couple of days. Here's the culprit: the workflow context table (wf_context) flushes "ended" workflows after 15 million seconds.

[Image: the sys_auto_flush record for wf_context]
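To put that number in perspective, here's a quick back-of-the-envelope conversion (Python, purely illustrative). The only input taken from the flush record is the 15,000,000 second age; the 30.44-day average month is an approximation added for the example.

```python
# Flush age quoted above for "ended" workflow contexts.
FLUSH_AGE_SECONDS = 15_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Convert seconds to days, then to approximate calendar months.
flush_age_days = FLUSH_AGE_SECONDS / SECONDS_PER_DAY
flush_age_months = flush_age_days / 30.44  # average month length (approximation)

print(f"{flush_age_days:.1f} days, about {flush_age_months:.1f} months")
# prints "173.6 days, about 5.7 months"
```

That is a little under six months, so a task that stays open for 6 to 12 months can easily outlive its completed workflow context.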

The extra scary fact is that those workflows are GONE. There is *no* hope of repair. So imagine a large, complex task, which is critical for audits, that instantly has months of history wiped out. No joke, people, this is the kind of thing that mobilizes whole teams of lawyers. The good news is that this only appears to exist on OLD instances of ServiceNow. On fresh instances of Geneva or Helsinki, this auto-flush record does not appear to exist.

HARDENING YOUR SOLUTIONS

After discovering the issue, our intrepid customer took two courses of action to harden their long-lifecycle Tasks against workflow flush/reset:

1) De-activate sys_auto_flush records for wf_context

This appears to be a safe option, since later versions of ServiceNow don't even have the record.

2) Place a "now" condition on your Workflow conditions

Since de-activating the sys_auto_flush record only prevented future context flushes, the customer still had a terrifying problem: plenty of tasks with no workflow. They still had to prevent new workflow launches on existing Tasks with purged contexts! To accomplish this, they added a condition to the Workflow's trigger conditions: created = Now. Since the tasks in question would already be months old, this condition would be false, and no new workflow would be run. It guarantees that a workflow "only runs once ever" no matter what other legacy issues exist.
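Here's a toy model of why that condition helps, sketched in Python. It is not ServiceNow's workflow engine: the Task class, on_update, simulate, and the one-minute tolerance standing in for "created = Now" are all hypothetical. The only behavior assumed is the one described in this post (an update to a task with no context launches the workflow if its condition passes).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    created: datetime
    has_context: bool = False  # stands in for a wf_context row (hypothetical)
    launches: int = 0          # how many times the workflow has started


def on_update(task, now, condition):
    """Toy engine: launch when no context exists and the condition passes."""
    if not task.has_context and condition(task, now):
        task.has_context = True
        task.launches += 1


def always(task, now):
    """An unhardened workflow condition."""
    return True


def created_now(task, now, tolerance=timedelta(minutes=1)):
    """Rough stand-in for 'created = Now': true only right after creation."""
    return now - task.created <= tolerance


def simulate(condition):
    created = datetime(2016, 1, 1)
    task = Task(created=created)
    on_update(task, created, condition)                        # initial launch
    task.has_context = False                                   # context flushed
    on_update(task, created + timedelta(days=200), condition)  # late update
    return task.launches


print(simulate(always))       # 2: the workflow starts over
print(simulate(created_now))  # 1: the late update launches nothing
```

With the unhardened condition the late update relaunches the workflow; with the creation-time condition it does not, which is the "only runs once ever" behavior the customer was after.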

TL;DR

- Sometimes tasks outlive their wf_contexts

- workflow contexts get flushed on old instances

- Check your sys_auto_flush table for a wf_context record, and deactivate it.
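For the check in the last bullet, here's a minimal sketch in Python that builds a REST Table API query for the flush rule and filters a response for rules that are still active. Assumptions to confirm on your own instance: the column names tablename, age, and active on sys_auto_flush, the example host name, and the sample response are all illustrative. Nothing here sends a request or changes a record.

```python
from urllib.parse import urlencode


def build_flush_check_url(instance_host, table="wf_context"):
    """Build a Table API URL listing sys_auto_flush rules for one table.

    Column names (tablename, age, active) are assumptions; verify them
    against the sys_auto_flush dictionary on your instance.
    """
    query = urlencode({
        "sysparm_query": f"tablename={table}",
        "sysparm_fields": "sys_id,tablename,age,active",
    })
    return f"https://{instance_host}/api/now/table/sys_auto_flush?{query}"


def active_flush_rules(records):
    """Return only the rules whose 'active' value reads as true."""
    return [r for r in records if str(r.get("active", "")).lower() == "true"]


# Hypothetical response body, shaped like a Table API result list.
sample = {
    "result": [
        {"sys_id": "abc123", "tablename": "wf_context",
         "age": "15000000", "active": "true"},
        {"sys_id": "def456", "tablename": "wf_context",
         "age": "15000000", "active": "false"},
    ]
}

url = build_flush_check_url("example.service-now.com")
print(url)
print(active_flush_rules(sample["result"]))
```

If the filtered list comes back non-empty, the instance still has an active flush rule for wf_context and is worth a closer look before anything is deactivated.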

10 Comments
Uncle Rob
Kilo Patron

Quick update:   After sleeping on this I realized the Task doesn't necessarily have to have a long life cycle.   It just has to have a plausible update scenario some time in the future.   This is really common in many of the non-IT apps that I've built.   Tasks can be "closed", but have some kind of post-closure update that's meaningful to the business.   Remember:   *any* update to a Task with an abandoned workflow will cause the WF to start all over again.



TL;DR


- make sure your workflow contexts don't get flushed


- harden your workflow conditions so that they evaluate against the creation date of the Task.


GoBucks
Mega Sage

Good post. We've recently encountered this scenario-- staff updating old Change Requests (with long gone workflow contexts).   However, Support has informed me NOT to remove...


Is it OK to deactivate the Auto Flush (sys_auto_flush) record for workflow contexts?


hugoruano
Tera Contributor

Great information!


peter_z
ServiceNow Employee

Hi All, just a small add-on to this thread to bring it up to the Jakarta state. The sys auto flush was disabled after the Dublin release and removed from Geneva onwards, but the unrestrained growth in the size of the WF tables can lead to performance issues in some cases. With the release of Jakarta, the table cleaner function is reintroduced and is enabled by default ( https://docs.servicenow.com/bundle/jakarta-release-notes/page/release-notes/servicenow-platform/work... ).



The original issue highlighted here, of workflows potentially being retriggered if their WF contexts have been removed, is now managed through a change to the cleaner routine: it copies the records to be removed to a separate table, and the WF engine will query this new table if it does not find a record in WF contexts for a related record before it triggers/retriggers the WF.


ytrottier
Tera Contributor

Hi Peter.

Can you clarify which table will hold the deleted WF context records that the improved WF engine looks at before starting a new WF context in place of a deleted one? I tried to find it but could not.

I guess that table too should be cleaned up after some time (most likely a longer period) since it would grow very large after years of receiving all the deleted/completed WF context records.

Thanks.

haninger_3
Tera Guru

Excellent posts here! We started in 2009-2010, pre-Aspen. We have built functionality that handles inbound email replies to closed RITMs.

One of these emails retriggered a workflow that has orchestration. If the orchestration fails, the support team gets a child RTASK to manually complete the orchestrated work. Then the end-user gets a notification that their new account is ready.

One of these RITMs sat completed for 6 months and then received a reply from the user. Of course the workflow context was gone so it restarted, orch failed, manual RTASK, email to customer, followed by much confusion. Then we received a RITM suggesting that there might be a ServiceNow bug.

*Gasp!* How dare they!

Oh. Oh I see. I guess there was a bug. But they fixed it. Only for customers going forward. Argh.

DirkRedeker
Mega Sage

Hi Rob

Thanks a lot for this article.

This is really an eye-catcher and eye-opener.

What about Task SLA and other engines racing around..

I guess the table cleaner is something I need to put an eye on in more depth.

That is the "sadness of removing data from databases".

Best practices in my 275 years of IT always was

a) never recycle IDs AND

b) never remove records with any meaning.

Great article, thanks.

Enjoy and

BR Dirk

Uncle Rob
Kilo Patron

You're most welcome, Dirk.  I BELIEVE the issue isn't a thing anymore.  Somewhere way back, they kept workflow instances off the table cleaner by default. 

This will only be a problem in the oldest of old instances.

haninger_3
Tera Guru

Is it wf_context_binding?

Currently working with HI on this and noticed a reference to wf_context_binding in an OOB cascade Delete BR on wf_context.

We have ~450K records in wf_context. All completed in the last 6 months.

We have 2+M records in wf_context_binding.

Our instance is 10+ years old. I don't see a cleanup record or rotation for wf_context_binding, but they must be getting deleted somehow. Probably I'm missing something simple.

Nolan Strait
Kilo Contributor

Auto Flushes still appearing active here for me in San Diego version!!!