05-22-2025 05:38 PM - edited 05-29-2025 11:55 AM
Hi all,
After encountering issues during environment recovery, including a zBoot and a clone from production, we performed a system restore. We’re now facing persistent background job failures in our sub-production environment and would appreciate insights from the community.
Observed Behavior:
- Background and async jobs are no longer executing
- Plugin installations hang when using the new Application Manager
- The sys_trigger table (sys_trigger_list) is heavily backlogged with thousands of stuck jobs (a quick count sketch follows this list)
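For anyone hitting the same symptoms, here is a minimal background-script sketch (System Definition > Scripts - Background) of the kind of check we used to size the backlog. It only assumes the standard sys_trigger fields next_action and name; treat it as a rough health check rather than an official diagnostic.

```javascript
// Rough backlog check: count scheduled jobs whose next_action is already
// in the past but that have not been picked up.
var overdue = new GlideAggregate('sys_trigger');
overdue.addQuery('next_action', '<', gs.nowDateTime());
overdue.addAggregate('COUNT');
overdue.query();
if (overdue.next()) {
    gs.info('Overdue scheduled jobs: ' + overdue.getAggregate('COUNT'));
}

// Break the backlog down by job name to see what is piling up.
var byName = new GlideAggregate('sys_trigger');
byName.addQuery('next_action', '<', gs.nowDateTime());
byName.addAggregate('COUNT');
byName.groupBy('name');
byName.query();
while (byName.next()) {
    gs.info(byName.getValue('name') + ': ' + byName.getAggregate('COUNT'));
}
```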
Investigation Findings:
- Node records were recreated with new node IDs (visible in sys_cluster_state) at the time of zBoot
- All nodes were assigned the Maintenance type
- No entries appear in sys_scheduler_assignment_view: no schedulers are assigned to any active nodes, and none were restored (the query sketch after this list shows how we checked)
- Nodes are online, reporting healthy heartbeats and no visible errors in the logs, which made the issue much harder to detect
- Our other environments (Production and another sub-production box), using the Generic type, continue processing jobs without issues
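To make the node and scheduler checks above repeatable, here is a sketch along the same lines. The node classification field on sys_cluster_state (shown here as node_type) and the base table behind sys_scheduler_assignment_view (shown here as sys_scheduler_assignment) are assumptions that can differ by release, so verify both names against your instance's dictionary before relying on the output.

```javascript
// List every cluster node with its status and (assumed) type field,
// so a Maintenance-only configuration stands out immediately.
var node = new GlideRecord('sys_cluster_state');
node.query();
while (node.next()) {
    gs.info('Node ' + node.getValue('node_id') +
            ' | status: ' + node.getValue('status') +
            ' | type: ' + node.getValue('node_type')); // field name assumed
}

// Check whether any scheduler assignments exist at all; in our case this
// came back empty, matching what sys_scheduler_assignment_view showed.
var assign = new GlideRecord('sys_scheduler_assignment'); // table name assumed
assign.query();
gs.info('Scheduler assignment records: ' + assign.getRowCount());
```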
Additional Notes:
- HI Support suggested using the Classic Application Manager to install plugins, which works because it bypasses the async job queue; it masks the deeper issue while they continue to investigate
- Restarting nodes and the instance (post-restore) had no effect, as these actions do not reset node roles or regenerate scheduler assignments.
Questions for the Community:
- Does a zBoot reset node roles and remove scheduler assignments?
- In cases where nodes are set to the Maintenance role, are new schedulers required to be manually created and assigned?
- Or is it more appropriate to revert the nodes to Generic post-zBoot?
We haven’t found documentation outlining this behavior or any post-zBoot checklist covering this. Any experience, guidance, or official documentation would be incredibly helpful.
Thanks in advance!
#InstanceManagement, #PlatformAdministration, #zBoot, #Scheduler, #BackgroundJobs, #NodeConfiguration, #sys_cluster_state, #sys_scheduler_assignment, #Maintenancenodetype, #PluginInstallation, #AsyncJobs, #EnvironmentRecovery
05-23-2025 06:20 PM - edited 05-26-2025 04:29 AM
Update:
After weeks of investigation, we asked HI Support whether having all nodes set to “maintenance” could be the root cause of our background job backlog. The next day, they responded confirming that they had updated the node records in the sys_cluster_state table to match the out-of-box (OOB) configuration, changing the node type to Generic. Almost immediately after this change, the 6,000+ pending jobs were flushed and system operations returned to normal.
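For anyone who wants to confirm a recovery like this from the data rather than just the symptoms, a minimal sketch is below. It assumes sys_trigger records the node that picked up each job in the claimed_by field; if that field name differs on your release, adjust accordingly.

```javascript
// Group pending scheduled jobs by the node that has claimed them.
// After the fix, jobs should show up against the Generic worker nodes
// instead of sitting entirely unclaimed.
var byNode = new GlideAggregate('sys_trigger');
byNode.addAggregate('COUNT');
byNode.groupBy('claimed_by'); // field name assumed
byNode.query();
while (byNode.next()) {
    var claimedBy = byNode.getValue('claimed_by') || '(unclaimed)';
    gs.info(claimedBy + ': ' + byNode.getAggregate('COUNT'));
}
```

Re-running this a few minutes apart, alongside the overdue count from the first sketch, should show the backlog draining.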
This raises important questions: If a zBoot is intended to revert the system to OOB configurations, when and how did our nodes end up in “maintenance” mode instead of “generic” by default? Was it immediately after the zBoot, after the system clone, or after the production restore?
For context, our sequence of events—while very unusual—was:
- zBoot,
- system clone (without installing plugins beforehand),
- followed by a production restore.
Disclaimer: I was not involved in the execution of these activities and had no prior knowledge about nodes. I’ve simply stepped in to help troubleshoot and understand the root cause.
I’m just happy to be back in business now that everything is running again, but hope sharing this experience saves others weeks of frustration and confusion. If anyone has insight into how node types are set during these processes, I’d love to learn more.
Update #2:
“Why Nothing Was Running: A Bus Stop Analogy”
Imagine our ServiceNow system is like a big city transit network.
• Jobs are the passengers.
• Schedulers are like route planners—they decide which buses (nodes) need to go where and when.
• Nodes are the actual buses that carry out the jobs. Some buses are meant for passengers (background jobs), while others are “Not in Service.”
When we had our issue:
• All the buses (nodes) in staging were set to “maintenance” mode—this is like putting up a “Not in Service” sign.
• Passengers (jobs) kept arriving at the bus stops (job queues), but no buses came to pick them up.
• The schedulers were still on the clock, planning the routes, but not assigning buses to any stops.
• Over time, the bus stops got packed with waiting passengers—thousands of them.
• Nothing moved. Everyone (plugins, metrics, background tasks) just stood in line.
Finally, after we realized the issue and HI support changed the configuration of the nodes:
• It’s like someone flipped the signs on the buses to “In Service” (changing from maintenance to application/generic).
• Suddenly, buses started picking up passengers again.
• The lines cleared up almost instantly—over 6,000 waiting jobs were gone in a flash.
a month ago
Hi LPK_GU,
May I ask why this activity was done?
By that I mean:
- zBoot,
- system clone (without installing plugins beforehand),
- followed by a production restore.
What was the intended outcome of this and did you achieve it?
Thanks,
Iain