If you’ve worked on large-scale ServiceNow implementations, you’ve probably been there —
it’s 11:30 PM, a major integration has failed again, and the same team is manually re-running payloads, checking logs, and praying that the external system doesn’t time out this time.
I’ve seen this cycle repeat across multiple projects: integrations work perfectly during testing, but the moment they hit production, reality sets in — intermittent network issues, expired tokens, partial payloads, and missing responses.
And every time something failed, someone on the team had to log in, dig through the syslog, reconstruct the payload, and re-send it manually.
It wasn’t sustainable.
That’s what pushed us to build a Centralized Outbound API Maintenance module — a custom ServiceNow capability designed to bring order, visibility, and automation to how outbound integrations behave.
Why We Built It
We realized that most integration failures weren’t caused by bad design — they were caused by a lack of governance and traceability.
Every script include handled its own retries, logging, and error handling in slightly different ways. When something broke, it was impossible to answer simple questions like:
- Which APIs are failing most often?
- Are these temporary or recurring issues?
- Who retried what, and when?
We needed a system-wide approach — one place to track every outbound API call, retry failed ones automatically, and give administrators the power to fix issues without touching code.
The Core Idea
Every time a ServiceNow integration makes an outbound API call, the system creates a log entry.
This record captures everything — the endpoint, payload, response, HTTP code, retry count, and the result.
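To make that concrete, here is a minimal sketch of what such a shared logging wrapper might look like as a script include. The table and field names (u_outbound_api_log, u_endpoint, u_payload, and so on) are illustrative assumptions, not the actual schema of the module:

```javascript
// Sketch of a shared outbound wrapper (ServiceNow server-side JavaScript).
// Table and field names (u_outbound_api_log, u_endpoint, ...) are
// illustrative assumptions, not the actual implementation.
var OutboundApiLogger = Class.create();
OutboundApiLogger.prototype = {
    initialize: function() {},

    send: function(endpoint, method, payload) {
        var request = new sn_ws.RESTMessageV2();
        request.setEndpoint(endpoint);
        request.setHttpMethod(method);
        request.setRequestHeader('Content-Type', 'application/json');
        request.setRequestBody(payload);

        var statusCode = 0;
        var responseBody = '';
        try {
            var response = request.execute();
            statusCode = response.getStatusCode();
            responseBody = response.getBody();
        } catch (e) {
            responseBody = e.message;
        }

        // One record per outbound call: endpoint, payload, response,
        // HTTP code, retry count and result all live in one place.
        var log = new GlideRecord('u_outbound_api_log');
        log.initialize();
        log.setValue('u_endpoint', endpoint);
        log.setValue('u_payload', payload);
        log.setValue('u_response', responseBody);
        log.setValue('u_http_code', statusCode);
        log.setValue('u_retry_count', 0);
        log.setValue('u_state', (statusCode >= 200 && statusCode < 300) ? 'success' : 'failed');
        log.insert();

        return statusCode;
    },

    type: 'OutboundApiLogger'
};
```

Because every integration routes through the same wrapper, the log table becomes the single source of truth for questions like which APIs fail most often and whether the failures are temporary or recurring.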
If the call fails, it’s automatically picked up by a Scheduled Retry Job, which keeps retrying based on the integration’s configuration (for example, up to 3 times, every 30 minutes).
If all retries fail, the system marks the call as Maxed Out and flags it for human review.
In other words — no more manual re-sends at midnight.
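A simplified version of the Scheduled Retry Job could look like the sketch below. It assumes the same hypothetical u_outbound_api_log table; the retry ceiling and interval are hard-coded here to mirror the example configuration (3 attempts, 30 minutes), but in the real module they come from each integration's configuration record:

```javascript
// Scheduled job sketch: re-send failed outbound calls, then mark
// exhausted ones as Maxed Out. Field names are illustrative assumptions.
var MAX_RETRIES = 3;          // e.g. up to 3 attempts per the integration's config
var RETRY_INTERVAL_MIN = 30;  // e.g. retry every 30 minutes

var cutoff = new GlideDateTime();
cutoff.addSeconds(-RETRY_INTERVAL_MIN * 60);

var failed = new GlideRecord('u_outbound_api_log');
failed.addQuery('u_state', 'failed');
failed.addQuery('sys_updated_on', '<=', cutoff);
failed.query();

while (failed.next()) {
    if (parseInt(failed.getValue('u_retry_count'), 10) >= MAX_RETRIES) {
        failed.setValue('u_state', 'maxed_out');   // flag for human review
        failed.update();
        continue;
    }

    var request = new sn_ws.RESTMessageV2();
    request.setEndpoint(failed.getValue('u_endpoint'));
    request.setHttpMethod('POST');
    request.setRequestHeader('Content-Type', 'application/json');
    request.setRequestBody(failed.getValue('u_payload'));

    var statusCode = 0;
    try {
        statusCode = request.execute().getStatusCode();
    } catch (e) {
        gs.error('Outbound retry failed for ' + failed.getUniqueValue() + ': ' + e.message);
    }

    failed.setValue('u_retry_count', parseInt(failed.getValue('u_retry_count'), 10) + 1);
    failed.setValue('u_state', (statusCode >= 200 && statusCode < 300) ? 'success' : 'failed');
    failed.update();
}
```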
When the Circuit Breaker Saved Us
One of the biggest pain points we had before was API storms.
Imagine a downstream system like a CRM going offline — suddenly, hundreds of retries start hitting the same dead endpoint.
Our outbound queues would flood, and the logs would be useless.
That’s when we introduced the Health Check Job — a circuit breaker.
It pings each configured integration endpoint regularly. If the target system is down, it marks all pending retries as Skipped and temporarily halts retry attempts.
When the system comes back online, those skipped calls are reactivated automatically.
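A rough sketch of that circuit-breaker logic follows, assuming a hypothetical configuration table (u_outbound_integration) that stores each integration's health-check endpoint, and a reference field (u_integration) on the log table; the real module may model this differently:

```javascript
// Health-check / circuit-breaker sketch. Table, field and state names
// are illustrative assumptions.
var integrations = new GlideRecord('u_outbound_integration');
integrations.query();

while (integrations.next()) {
    var healthy = false;
    try {
        var ping = new sn_ws.RESTMessageV2();
        ping.setEndpoint(integrations.getValue('u_health_endpoint'));
        ping.setHttpMethod('GET');
        ping.setHttpTimeout(5000);   // fail fast with a 5-second timeout
        healthy = ping.execute().getStatusCode() == 200;
    } catch (e) {
        healthy = false;
    }

    // Endpoint down: park pending retries as 'skipped' so the retry job ignores them.
    // Endpoint back up: move 'skipped' records back to 'failed' so retries resume.
    var fromState = healthy ? 'skipped' : 'failed';
    var toState = healthy ? 'failed' : 'skipped';

    var logs = new GlideRecord('u_outbound_api_log');
    logs.addQuery('u_integration', integrations.getUniqueValue());
    logs.addQuery('u_state', fromState);
    logs.query();
    while (logs.next()) {
        logs.setValue('u_state', toState);
        logs.update();
    }
}
```

The key design choice is that the breaker only flips states: the retry job stays oblivious and simply never sees records that are parked as Skipped.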
It sounds simple now, but this single change stopped hundreds of wasted retries and gave us clear visibility into system health — before users even noticed.
Manual Control, But Smarter
While most issues resolved automatically, there were times when we needed to step in manually — for example, after fixing credentials or updating an API key.
So, we added “Retry Now” and “Bulk Retry” buttons to the admin UI (sketched after the list below).
These allowed integration admins to:
- Retry failed API calls instantly (without waiting for the next scheduled run).
- Bulk reprocess multiple failures linked to a single incident or outage.
- Add contextual notes or link retries to problem records for auditability.
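For illustration, here is roughly what the server-side script of a “Retry Now” UI action could look like on the hypothetical log table used in the earlier sketches; a Bulk Retry list action would loop over the selected records in the same way. The u_notes field and other names are assumptions:

```javascript
// "Retry Now" UI action sketch (server-side form button on the log table).
// Field names are illustrative assumptions.
(function retryNow(current) {
    var request = new sn_ws.RESTMessageV2();
    request.setEndpoint(current.getValue('u_endpoint'));
    request.setHttpMethod('POST');
    request.setRequestHeader('Content-Type', 'application/json');
    request.setRequestBody(current.getValue('u_payload'));

    var statusCode = 0;
    try {
        statusCode = request.execute().getStatusCode();
    } catch (e) {
        gs.addErrorMessage('Retry failed: ' + e.message);
    }

    current.setValue('u_retry_count', parseInt(current.getValue('u_retry_count'), 10) + 1);
    current.setValue('u_state', (statusCode >= 200 && statusCode < 300) ? 'success' : 'failed');
    // Contextual note for auditability (assumes a plain notes field on the table).
    current.setValue('u_notes', 'Manual retry by ' + gs.getUserName() + ', HTTP ' + statusCode);
    current.update();

    action.setRedirectURL(current);   // return to the record after the retry
})(current);
```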
The technical design and implementation details will follow in part two of this blog.