Managing Outages within a Service Management Environment

Chris Shakespea · ‎12-04-2023

An IT outage occurs when computer systems or networks stop working correctly, often due to hardware issues, software glitches, or other unforeseen events. This disruption can lead to downtime, impacting productivity and business operations. To minimize such issues, a well-established process for managing outages is essential.

This article takes you through:

- What is an outage?

- Common Use Cases

- Outage relationship to Services

- Creating Outage records

- Who should be involved

- Reporting Outages

Outage Overview

An Outage represents CI unavailability. The outage types are:

- Outage
- Planned Outage: usually the result of a routine maintenance schedule or upgrade action
- Degradation: partial, slow, or intermittent

CI unavailability, or outage, is the actual downtime of a CI. [1]

ServiceNow provides the capability to:

- Create a stand-alone outage record
- Associate an outage record to a task
- Create an outage record from a task

Outages have a key relationship to Incident Management and Major Incident Management.

[1] Whenever there is an outage for any of the CI items, the outage information is stored in the Outage [cmdb_ci_outage] table. The Task-Outage table [task_outage] maintains the mapping between the Task [task] table and the Outage [cmdb_ci_outage] table.
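The mapping described in the footnote can be pictured as a simple many-to-many link. The sketch below is a minimal, in-memory illustration only: the table names come from the article, but the class names, field names, and the `outages_for_task` helper are assumptions made for this example and are not the platform's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Outage:
    """Illustrative stand-in for a row in cmdb_ci_outage (field names assumed)."""
    outage_id: str
    ci: str
    outage_type: str  # "outage", "planned", or "degradation"
    begin: datetime
    end: Optional[datetime] = None  # None while the outage is still open


@dataclass(frozen=True)
class TaskOutage:
    """Illustrative stand-in for a row in task_outage: one task linked to one outage."""
    task_number: str
    outage_id: str


def outages_for_task(task_number: str,
                     links: List[TaskOutage],
                     outages: List[Outage]) -> List[Outage]:
    """Return the outages linked to a task, in the order the links appear."""
    by_id = {o.outage_id: o for o in outages}
    return [by_id[link.outage_id]
            for link in links
            if link.task_number == task_number and link.outage_id in by_id]
```

Because the link lives in its own table, one incident can reference several outages and one outage can be shared by several tasks.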

Outage Use Case

Outages on their own are data points rather than information. For example, knowing that database_server123@mycompany.com is offline helps the IT staff work the issue, but knowing that Finance Services are unavailable at month-end is far more informative.

Consider a simple outage case in which a single CI's outage impact relates to four services.

How could each of those services be affected differently by the outage?

Service 1	Demands of the Service on the CI are still able to be met by the CI degradation
Service 2	Demands of the Service on the CI are unable to be met by the CI degradation & outage
Service 3	Demands of the Service on the CI cannot be met by the CI degradation and the outage is not impactful. One possible scenario is CI failover meant service availability was unaffected.
Service 4	Demands of the Service on the CI are still able to be met by the CI degradation and outage was not impactful (similar to service 3). An alternative is that the service was not operational at the time; therefore no impact.

The key point is that outages cannot be considered independently of services.
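One way to reason about the four scenarios above is to treat service impact as a function of both the CI state and the service's own circumstances. The sketch below is a simplified model under stated assumptions: the `ServiceDependency` fields, the idea of expressing demand as a fraction of CI capacity, and the `is_impacted` rule are all hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDependency:
    """Hypothetical description of how one service depends on one CI."""
    name: str
    demand: float               # fraction of the CI's normal capacity the service needs (0.0-1.0)
    has_failover: bool = False  # True if another CI can take over transparently
    operational: bool = True    # False if the service was not in use during the CI event


def is_impacted(dep: ServiceDependency, remaining_capacity: float) -> bool:
    """Decide whether a CI event impacts the service.

    remaining_capacity is the fraction of normal capacity the CI still
    delivers: 0.0 for a full outage, something like 0.5 for a degradation.
    """
    if not dep.operational or dep.has_failover:
        return False
    return dep.demand > remaining_capacity
```

For example, a service needing 30% of the CI's capacity is unaffected while the CI is degraded to 50%, but is impacted once the CI is fully down, unless failover exists or the service was not operational at the time.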

Service Relationship to Outages

Examining the CSDM shows the relationships between CIs and services/service offerings.

Depending on the organization's needs, there will be technical services, business services, and offerings in the service portfolio. An example of a service is shown below.

Mapping services/service offerings to the CIs ensures that outage records and associated task (predominantly incident) records provide the most value to the business.

When considering how to configure IT services within your portfolio, work on those that provide the most value to the organization. It is also possible to represent IT services within a request catalog.

Each service in the portfolio can have a criticality assigned, allowing:

- The impact of a CI outage to be related to the affected services
- A proportionate response based on the criticality of those affected services

For example, the company retail website would require a higher criticality than office print services.
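As a rough illustration of a proportionate response, the sketch below picks the most critical of the services affected by a CI outage. The numeric scale (1 is most critical, 4 is least), the example service names, and the `highest_criticality` helper are assumptions for this example only.

```python
from typing import Dict, Iterable, Optional

# Hypothetical criticality scale: 1 is most critical, 4 is least critical.
SERVICE_CRITICALITY: Dict[str, int] = {
    "Retail Website": 1,
    "Office Print": 4,
}


def highest_criticality(affected_services: Iterable[str],
                        criticality: Dict[str, int]) -> Optional[int]:
    """Return the most critical rank among the affected services.

    Services without a recorded criticality are ignored; None is returned
    when no affected service has a criticality assigned.
    """
    ranks = [criticality[s] for s in affected_services if s in criticality]
    return min(ranks) if ranks else None
```

An outage that affects both the retail website and office printing would then be handled at the retail website's level of urgency.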

Outage relationship to Service Portfolio Management / Digital Portfolio Management

Outages affect Service availability. The roll-up of the outages through service availability is viewed in service offerings.

View availability results for commitments on service offerings and application services using Service Portfolio Management.
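To make the roll-up concrete, the sketch below computes availability over a reporting window from a list of outage intervals. It assumes a common definition (availability is the share of the window not covered by outages) and merges overlapping intervals so the same minute is not counted twice. The function name and inputs are hypothetical, and this is not a description of how the platform calculates its own availability results.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def availability_percent(window_start: datetime,
                         window_end: datetime,
                         outages: List[Interval]) -> float:
    """Percentage of the window not covered by any outage interval."""
    # Keep only intervals that touch the window, clipped to its bounds.
    clipped = sorted(
        (max(begin, window_start), min(end, window_end))
        for begin, end in outages
        if end > window_start and begin < window_end
    )

    downtime = timedelta()
    current_begin = current_end = None
    for begin, end in clipped:
        if current_end is None or begin > current_end:
            # A gap: close the previous merged interval and start a new one.
            if current_end is not None:
                downtime += current_end - current_begin
            current_begin, current_end = begin, end
        else:
            # Overlap: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        downtime += current_end - current_begin

    return 100.0 * (1 - downtime / (window_end - window_start))
```

Which outage types are passed in is a policy decision; for instance, an organization may choose to pass only unplanned outages.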

For more details, see Service Portfolio Management - Process Workshop and Digital Portfolio Management - Process Workshop Presentation.

Outage Creation

Outages can be a stand-alone record or associated with one or more tasks. Outage records typically contain:

- Outage CI
- Outage Type (Outage, Degradation, or Planned Outage)
- Beginning and End time
- Related Task
- Description text

Outages can be created manually by an agent/operator or automatically. For example, every P1 incident could have an outage created automatically, while outages for P2 and below are created manually.

When considering whether an outage should be created automatically, the population of the fields in the outage record needs consideration, especially those around timing. As an example, it would be possible to create the outage automatically with the start date/time of the incident and then update the outage record on the close/resolution of the P1. Consider, though, the example use cases previously given: would this accurately represent the outage period?

Where the Outage record is created manually, the timings may be set as part of the RCA (Root Cause Analysis). Typically, the RCA process is managed by the person who is fulfilling the role of Major Incident Manager or Service Delivery Manager.
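The timing question can be illustrated with a small sketch. It drafts an outage only for P1 incidents, defaulting the outage window to the incident's opened and resolved times, and lets RCA findings override those times when the real downtime differed. The `Incident` and `OutageDraft` classes, their fields, and the `draft_outage` function are hypothetical names for this example; this is not the platform's automation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    """Hypothetical, minimal view of an incident."""
    number: str
    priority: int
    ci: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None


@dataclass
class OutageDraft:
    """Hypothetical draft of the fields an outage record typically contains."""
    ci: str
    outage_type: str
    begin: datetime
    end: Optional[datetime]
    related_task: str


def draft_outage(incident: Incident,
                 rca_begin: Optional[datetime] = None,
                 rca_end: Optional[datetime] = None) -> Optional[OutageDraft]:
    """Draft an outage for a P1 incident; return None for lower priorities.

    Incident times are only a first approximation of the outage window,
    so times established during RCA take precedence when supplied.
    """
    if incident.priority != 1:
        return None
    begin = rca_begin if rca_begin is not None else incident.opened_at
    end = rca_end if rca_end is not None else incident.resolved_at
    return OutageDraft(ci=incident.ci, outage_type="outage",
                       begin=begin, end=end, related_task=incident.number)
```

Keeping the RCA override explicit makes it visible when the recorded outage period differs from the incident's lifetime, for example when the CI failed before anyone raised the incident.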

More details on this topic are found in Task Outage and Log Outages.

Minimizing Outages

As well as creating outage records, it is important to consider ways of minimizing outages, providing a long-term, sustainable approach to delivering service availability. Approaches to take are shown below:

Preventative Maintenance

Hardware and software should be regularly maintained to prevent failures and issues such as security vulnerabilities.

Redundancy

Have redundant and backup systems and solutions to provide continuity of service in the event of a failure.

Monitoring

Use monitoring tools to help detect potential issues before they can cause outages, and take proactive measures to address them.

Incident Response

It is important that there is a well-defined incident response plan for handling outages to minimize the impact and resolve the issue as quickly as possible.

Roles and Responsibilities

Role Name	Service Desk Agent (1st Line)
Description	The Service Desk Agent (SDA) is responsible for raising incidents and associating CIs with them. If required, they will create outage records and associate those records with the incident. The Service Desk Agent is responsible for assigning tasks to the IT Support Teams and assists in resolving the incident.

Role Name	Operator
Description	The operator, as part of the IT Operations Management team, is likely to be monitoring the IT systems and, therefore, create outage records based on CI status from their monitoring systems.

Role Name	IT Support Teams (2nd / 3rd Line)
Description	The IT Support Teams are responsible for providing specialist knowledge and skills in resolving the incident.

Role Name	Major Incident Manager
Description	The Major Incident Manager is concerned entirely with major incidents. They are the coordinator responsible for resolving a major incident as soon as possible and ensuring it does not reoccur. If the outage is severe enough, e.g., disrupting critical service availability, a major incident may be raised.

Role Name	Incident Management Process Owner
Description	The Incident Management Process Owner’s primary objective is to own and maintain the Incident Management process. The Process Owner is usually a senior manager with the ability and authority to ensure the process is rolled out and used by all stakeholders. Part of their responsibility is reporting on Outages.

Reporting

Operational Reporting

From an operational perspective, Outages have a significant influence on cost and risk.

Reporting of incidents generally falls under the responsibility of the Incident Management Process owner and forms part of their KPIs.[1] Examples of these are:

Cost	Optimize Major Incident Response
Reduce Outage Volume Worked	# of Unplanned Outages
Reduce Outage Response Effort	Unplanned Outage MTTR

Risk	Ensure High Availability
Reduce Business Disruption from Outages (Volume)	# of Unplanned Outages
Reduce Business Disruption from Outages (Duration)	Unplanned Outage MTTR
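The two KPIs named above can be computed from outage records along the following lines. This is a sketch under assumptions: each outage is represented as a hypothetical `(outage_type, begin, end)` tuple, unplanned outages carry the type string `"outage"`, and MTTR is taken as the mean duration of unplanned outages that have an end time.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

OutageRow = Tuple[str, datetime, Optional[datetime]]


def unplanned_outage_kpis(outages: Iterable[OutageRow]) -> Tuple[int, Optional[timedelta]]:
    """Return (# of unplanned outages, unplanned outage MTTR).

    Open outages (end is None) are counted in the volume but excluded from
    MTTR, which is None when no unplanned outage has been closed yet.
    """
    unplanned = [(begin, end) for kind, begin, end in outages if kind == "outage"]
    durations = [end - begin for begin, end in unplanned if end is not None]
    mttr = sum(durations, timedelta()) / len(durations) if durations else None
    return len(unplanned), mttr
```

Planned outages and degradations are deliberately left out here; they could be reported with the same function by filtering on a different type.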

Service Status on Service Portal

The Service Portal provides an essential method of communicating outages and service availability to users.

There are several widgets provided. Review them here: Service Portal service status widgets

These can provide status to both the service owners and service consumers.

Service Overall Status

Service Status over time

mikesisson · ‎01-25-2024

Excellent Read.

In your Service Scenarios you refer to "Meeting the Demand" of the Service. Where is that demand expected to be documented? My thought is in the Service Level Requirement (SLR) of the Offering. The idea would be when looking at an incident if the offering is meeting that Service level requirement and if not then the unplanned outage would equate to the time we were below that requirement, if it wasn't but there was still impact then we would record a Degradation Outage.

However, I'm currently trying to solve for an ask to account for outages that are sporadic in nature and the ask is to seek a % of impact and have that reflected in the end result of the availability. Meaning if 1 in 10 transactions fail then they only want 10% of the minutes of the outage to count. (I am translating that back that this is measuring apples and oranges and not correct math, but open to scrutiny there)

The current stance is that the duration of the outage is either all or none, we don't modify the calculation of the outage because 1 in 10 transactions were impacted. We are asking for the Service Level Requirement to identify the threshold of acceptable demand (ex. 95% of successful transactions per minute)

So the sequence of the incident goes like this:

1. Major Incident Established and noted impact to one or many service offerings.

2. Analysis of impact to those offerings and decide whether any of those offerings should take an outage

3. Manually (unfortunately) establish whether what is stated in the SLR was breached and if so create an outage for the time period of not meeting the SLR, thus impacting Availability KPI.

4. If there is impact but it is within the SLR threshold then we would put that period of time against degradation and have a KPI measurement of degraded minutes measured.

Is this a best practice approach or are others looking to measure availability based on a percentage of impact in relation to the overall outage?

Alex Rathwyn · ‎02-02-2024

In your "Outage Use Case" section you highlight the importance of services to outages and how different services can be affected differently by the same outage. In the example provided there are 4 services each affected slightly differently by the CI's status. How would an organization translate the status of the CI into the status of the Service? Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)? Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4? Translating this status correctly up to the Service level is essential for accurate reporting and communication to users, such as via the Service Status page on the Portal.

mikesisson · ‎02-12-2024

Thanks for engaging.

For context, we are adhering to CSDM relationships (the Business Service Offering has a dependent relationship to the Application Service, which may have a dependent relationship to a Technical Service and its infra components).

Replaying your Questions:

How would an organization translate the status of the CI into the status of the Service?
1. The CI according to Best Practice would be the Application Service itself or an infra component or Tech Service that the Application Service is dependent on. Because we rarely have an entire App Service down, we are usually in a degraded state, but at the component level we may have either an unplanned outage or a degraded outage. The Business Service Offering that is dependent on the App Service could be either fully unavailable or degraded. The problem we have is exactly where your question is at: how does the business perceive the service that they are offered? Often they are seeing a sporadic level of service, but what is the line in the sand of when that degradation is beyond acceptable and should be considered a full unplanned outage affecting availability?
Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)?
1. Most of the time we have a 1:1 relationship from the Business Service Offering to the Application Service. However, there is a one-to-many relationship from the App Service to the Service Offering.
Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4?
1. At this time we would have separate outages (we are not putting outages on the App Service yet, but will be down the road). So oftentimes we go back to the scenario where the App Service is displaying degradation while the offerings that depend on it are a mixed bag of Degradation and Unplanned Outage.

In the end, I'm still left looking for a location to define what is an acceptable level of service to determine if the offering should qualify for a degradation or an unplanned outage. Or are we trying to split hairs unnecessarily? However, it is important to measure both, and the data quality of the differences is important. At what point is the offering deemed as not meeting the expected level of service to qualify for Degradation vs Unplanned Outage? Currently we are looking to be more detailed in the "Service Level Requirement" free text field to note a general statement of acceptable performance between degradation and unplanned outage.

I was curious if others are experiencing something similar and what they have done as well as recognizing the importance of measuring both Availability and Degradation Minutes. Which leads us further down the road of finding a way to set a threshold of acceptable degradation before breach of threshold.

#csdm #spm #digital portfolio management #dpm #cmdb #outage

Peter102 · ‎08-08-2024

Hey,

Thank you for this article, it was useful. I am still struggling with the link between CI and Service. Based on your reply to Alex, am I correct in assuming that if I have an outage on Switch123 or App123 then I have to post separate outages for each service that is also affected for them to display in the portal as down?

Is there a way to use the Affected CI section of the outage to automate this?

Peter

mikesisson · ‎02-11-2025

Yes, you can leverage Affected CI's to help identify areas of impact at the service level and perhaps even automate if you get to that level of maturity and trust in your level of granularity and relationships.

The function you would use is "Refresh Impacted Services". You add the app service(s) as the affected CI, then you click "refresh impacted services" and you can then have a list of services potentially impacted, based on your dependency relationships that, if you are using service builder, have the correct relationships for this to work.

Ash42 · ‎02-19-2025

A portion of the following video demos a custom solution for automating a Service outage from a dependent CI outage (38.25 min mark). A fully developed CSDM may not be enough. They have a feature to capture whether an outage of one or more related CIs causes an outage to the Service. That mapping is used to automatically create an outage on the Service when an outage on a related CI is created. This mapping could get very complex to cover various scenarios.

Chris Shakespea · ‎02-20-2025

There are some great questions and for me it highlights that outages, on their own, are not a useful construct.

A good place to start for all of these is the walk of technical service > application service > Business service

This is a big area and includes CMDB / CSDM and capabilities such as Service Portfolio Management and Digital Portfolio Management.

Trying to answer the questions -

Mike - not sure on that logic of outage to count. My thoughts are that outage is at a CI level. The duration and nature of that outage should be determined. Ultimately these might lead to a change e.g. sporadic overheating may cause transaction failures with no clear pattern but the server may need replacement.

To me the scenario talks to service availability and achieving the commitments rather than outage, e.g. if you have a 100% commitment on transactions, then 1 in 10 failing means you are operating at 90%.

Alex - you raise a good point in that one CI may affect multiple services (e.g. a network switch fault could affect a range of services). So when the switch fails, it should have an outage record. This affects the technical service, which is related to the offering. It’s the relationship between the CI and those Services which is key. The other way to think about it is that the network switch failure is the root cause - there is only one CI. The Outage record is at the bottom of the hierarchy.

Mike - to your point 3 (similar to above). Having 3 outages on the same CI at the same time feels like duplication. In my opinion, having multiple Outages with different statuses depending on the service is stretching the purpose of an outage. It's effectively translating higher-level constructs (such as services) down onto the CI.

Have you looked at using Service Commitments ?

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=8455d65a9789de906eedb30e6253aff9&nc_s...

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=6b633af647b8dad0123f3975d36d4320&nc_s...

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=c7f18f439346fdd02fac74096cba109f&nc_s...

snowdev8 · ‎03-12-2025

@Chris Shakespea What are your thoughts on Outage creation and Change Management, especially Normal Changes?
Change Request by default does not have a checkbox to dictate whether downtime is required. I have always ended up adding a checkbox before the concept of outages. Is there a better way to handle this process?

I would love to tie the existence of an Outage Record, the type of outage and the duration of outage to Risk Calculation .

mikesisson · ‎04-23-2025

@Chris Shakespea yes we use commitments at the offering level. Our solution that we have recommended is to establish a threshold from Degradation to Unplanned outage and get that agreed upon by the business. So if we exceed 5% of transactions it will be considered an outage for example. Below 5% would be a degradation. And we measure both.

Next we are trying to solve for KPI's where we can measure and display the amount of time we neither have a degradation nor an unplanned outage and target a commitment around that as a red flag to act upon. We haven't missed our availability commitment however we also have Degradation commitment and a combined measurement where both degradation and availability is considered.

GarethH1 · ‎05-16-2025

Unfortunately, one glaring issue on this system status page is that the Subscription Widget yields no notifications to the service subscriber. The expectation is that the notifications that are created OOB, and are active and subscribable, should be sent out when those outages start and end. From what I can see, this is a recurring issue in the community: the Service Outage Begin and End notifications, at least, are not sent.

kmattoon · ‎10-29-2025

Please explain the difference between "Planned outage" and "Planned maintenance." Planned maintenance could result in a degradation or an outage depending upon the maintenance work to be completed.

The inconsistency in the use of the terminology is problematic, and we are working to determine the best way to handle this in our environment. The option we are considering is updating the label to reflect "Planned maintenance" and be prepared for the skip records when they inevitably come through. Any guidance would be welcome.

mikesisson · ‎11-10-2025

Planned Maintenance vs. Planned Outage...What’s the Difference?

Planned Maintenance
This refers to scheduled work like upgrades, patches, or inspections. It may or may not cause downtime. Sometimes the system stays up but runs slower or has limited functionality. It's the activity.
Planned Outage
This is a specific type of planned maintenance where the system is intentionally taken offline. It’s scheduled, approved, and communicated in advance. It's the impact outcome you should be communicating to the consume domain (impact to the service offering stakeholders).

Impact on Availability Score in ServiceNow (Yokohama Release)

Only unplanned outages (unexpected downtime) affect the availability score of a Business Service Offering.
Planned outages, even if they cause full downtime, do not reduce the availability score, because the business agreed to the downtime ahead of time.

Why This Matters

Using “Planned Maintenance” as the standard label is fine, as long as you track the expected impact (e.g., degradation vs. full outage) and log it correctly in the system. This ensures:

Accurate reporting
No false hits to availability SLAs
Clear communication across teams

Ideally, leverage your agreed-upon Maintenance Windows. That way, depending on your policy and procedures, your changes can auto-approve even with planned outages, you are covered for communication to your impacted consumers, and you have streamlined change control.

kmattoon · ‎11-10-2025

@mikesisson I should have been more clear. I understand the meaning/purpose of Planned Maintenance vs Planned Outage, but I do not understand the reason behind the inconsistency in the terms' usage throughout the portals and workspaces. It seems that the terms are used interchangeably within the platform.

In the CSM portal, both terms appear, while on the back end, only Planned Outage appears. How are you handling the inconsistent messaging to end users?

mikesisson · ‎11-30-2025

Not sure what you mean by back end. Maybe a scenario would help to show where the confusion is in the communication.

1. You can plan maintenance with a Planned Outage. Just because you are planning maintenance does not give you a free ticket to cause an unplanned outage. If you do have an outage during maintenance where an outage was not planned, then the outage needs to be recorded (Outage or Degradation).

2. Maybe that is where the logic lies on your backend. You cannot just put in a planned outage operationally as a result of the change (planned maint). As a result of the change you either have an outage or a degradation, you cannot then have a planned outage, because it has to be recorded during the planning of the change, the change is approved with the fact that there is going to be an outage (planned). So when that occurs you do not record the outage because it has already been accounted for.

Matthew_13 · ‎12-18-2025

Excellent Read.

Reginald U_ · ‎01-13-2026

@Chris Shakespea

Thank you for the article, as it was helpful. However, I am still a little unclear on something. The below quotes from the article seem to imply that in a perfect world, e.g. if Service Mapping is fully available and accurate, the purpose of the Outage record is to capture the more granular Configuration Item as opposed to the higher-level Service that may depend on it. It seems that the Outage record is not intended to tell the whole story but instead is intended to be used as a singular data point, i.e. "Server123 is down" as opposed to "Application123 is down as a result". This seems intuitive to me. When working in a PDI, on the out-of-box Change application, the 'Create Outage' UI Action copies the Configuration item field value over to the Configuration item field value of the new Outage record. Where I'm getting lost is in the fact that what displays in the Service Portal is:

"Planned maintenance - [Configuration item], service will be unavailable [Outage.Begin] to [Outage.End] "

See screenshot:

It seems that this alert is tied directly to the value that is in the Configuration item field value of the Outage record.

Am I correct in my statement that the most granular Configuration item is generally what should be listed in the Configuration item field of an Outage record and NOT the higher level Service CI?

If so, should the goal be that once an organization has mature Service mapping in place, they should reconfigure any out-of-box Service portal alerts that may point directly to the Configuration Item field value of an Outage record, and instead make the alerts more dynamic so that they instead point to the appropriate Service(s) related to the Configuration item. The quotes from your article that I am focused on are below:

Quote #1 - “An Outage represents CI unavailability.”
Quote #2 - “Outages on their own are data points rather than informational e.g., knowing database_server123@mycompany.com is offline helps the IT staff work the issue and knowing that Finance Services are unavailable.”
Quote #3 – “Mapping services/service offerings to the CI’s will ensure that outage records, and associated task (predominately incident) records will provide the most value to the business.”

Chris Shakespea · ‎02-03-2026

Thanks, lots to think about. Firstly, ServiceNow has moved on since the article was put together, so it is worth looking at features around ML, HLA, etc. that can help.

This article created a lot of great discussion. From that, it's clear there is no black-and-white answer.

It's been some time since I dug into this area.

It seems that the Outage record is not intended to tell the whole story but instead is intended to be used as a singular data point

- I agree, hence at the service level there can be various impact levels (e.g. a single CI may or may not break a service)

The widget you show gathers information from the cmdb_ci_outage table. Any planned maintenance within the following five days appears in the Planned Maintenance widget.

From memory, it's showing only class=service offering.

So I don't see the need to reconfigure (but would need to look into the code to check).

On the third quote I see customers start at different points e.g. some organizations start at key business services , others start at technical service (now Technology Management Service) or app services.