
‎12-04-2023 10:15 AM - edited ‎05-16-2025 08:40 AM
An IT outage is when computer systems or networks stop working correctly, often due to hardware issues, software glitches, or other unforeseen events. This disruption can lead to downtime, impacting productivity and business operations. To minimize such issues, a well-established process for managing outages is essential.
This article aims to take you through:
- What is an outage?
- Common Use Cases
- Outage relationship to Services
- Creating Outage records
- Who should be involved
- Reporting Outages
Outage Overview
An Outage represents CI unavailability. The types are:
- Outage
- Planned Outage
  - Usually the result of a routine maintenance schedule or an upgrade action
- Degradation
  - Partial, slow, or intermittent
CI unavailability, or outage, is the actual downtime of a CI. [1]
ServiceNow provides the capability to
- Create a stand-alone outage record
- Associate an outage record to a task
- Create an outage record from a task
Outages have a key relationship to Incident Management and Major Incident Management.
[1] Whenever there is an outage for any of the CI items, the outage information is stored in the Outage [cmdb_ci_outage] table. The Task-Outage table [task_outage] maintains the mapping between the Task [task] table and the Outage [cmdb_ci_outage] table.
Outage Use Case
Outages on their own are data points rather than information. For example, knowing that database_server123@mycompany.com is offline helps the IT staff work the issue, but knowing that Finance Services are unavailable at month-end is far more informative.
Consider a simple outage case: a single CI whose outage impact relates to four Services.
Looking through those, how could each of the services be affected differently by the outage?
Service | Effect of the CI degradation and outage |
Service 1 | Demands of the Service on the CI are still able to be met by the CI degradation |
Service 2 | Demands of the Service on the CI are unable to be met by the CI degradation & outage |
Service 3 | Demands of the Service on the CI cannot be met by the CI degradation and the outage is not impactful. One possible scenario is CI failover meant service availability was unaffected. |
Service 4 | Demands of the Service on the CI are still able to be met by the CI degradation and outage was not impactful (similar to service 3). An alternative is that the service was not operational at the time; therefore no impact. |
The key point here is that outages cannot be considered independently of services.
Service Relationship to Outages
Examining the CSDM shows the relationships between CIs and services/service offerings.
Depending on the organization's needs, there will be technical services, business services, and offerings in the service portfolio. An example of a service is shown below.
Mapping services/service offerings to the CIs will ensure that outage records and associated task (predominantly incident) records provide the most value to the business.
When considering how to configure IT services within your portfolio, work on those that provide the most value to the organization. It is also possible to represent IT services within a request catalog.
Each service in the portfolio can have a criticality assigned, allowing:
- The impact of a CI outage to be related to the affected services
- A proportionate response based on the criticality of those affected services
For example, the company retail website would require a higher criticality than office print services.
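The idea of a proportionate response can be sketched in a few lines of Python. This is only an illustration: the criticality scale, the CI names, the CI-to-service mapping, and the response labels are all hypothetical, and in practice the mapping would come from your CMDB/CSDM relationships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    criticality: int  # hypothetical scale: 1 = most critical, 4 = least critical


# Hypothetical CI-to-service mapping, standing in for CMDB/CSDM relationships.
CI_TO_SERVICES = {
    "web-server-01": [Service("Retail Website", 1), Service("Office Print", 4)],
    "print-server-01": [Service("Office Print", 4)],
}


def response_level(ci_name: str) -> str:
    """Choose a response level from the most critical service the CI supports."""
    services = CI_TO_SERVICES.get(ci_name, [])
    if not services:
        return "no mapped services"
    most_critical = min(service.criticality for service in services)
    if most_critical == 1:
        return "major incident review"
    return "standard incident handling"


print(response_level("web-server-01"))
print(response_level("print-server-01"))
```

The response is driven by the most critical affected service rather than by the CI itself, which is why the same kind of server can warrant very different handling.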
Outage relationship to Service Portfolio Management / Digital Portfolio Management
Outages affect Service availability. The roll-up of the outages through service availability is viewed in service offerings.
View availability results for commitments on service offerings and application services using Service Portfolio Management.
For more details, see Service Portfolio Management - Process Workshop and Digital Portfolio Management - Process Workshop Presentation.
Outage Creation
Outages can be a stand-alone record or associated with one or more tasks. Outage records typically contain:
- Outage CI
- Outage Type: Outage, Degradation, or Planned Outage
- Beginning and End time
- Related Task
- Description text
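As a rough illustration of the fields listed above, the following Python sketch models an outage record. The class and attribute names are illustrative assumptions only; they are not the actual column names of the Outage [cmdb_ci_outage] table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class OutageType(Enum):
    OUTAGE = "Outage"
    DEGRADATION = "Degradation"
    PLANNED = "Planned Outage"


@dataclass
class OutageRecord:
    ci: str                             # the CI that is unavailable
    outage_type: OutageType
    begin: datetime
    end: Optional[datetime] = None      # None while the outage is ongoing
    related_task: Optional[str] = None  # e.g. an incident number, if any
    description: str = ""

    def duration(self) -> Optional[timedelta]:
        """Return the outage duration, or None if the outage has not ended."""
        if self.end is None:
            return None
        if self.end < self.begin:
            raise ValueError("end time is before begin time")
        return self.end - self.begin


record = OutageRecord(
    ci="db-server-01",
    outage_type=OutageType.OUTAGE,
    begin=datetime(2024, 1, 31, 9, 0),
    end=datetime(2024, 1, 31, 11, 0),
    description="Database unavailable during month-end processing",
)
print(record.duration())
```

Keeping the related task optional reflects that an outage can be stand-alone or associated with a task.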
Outages can be created manually by an agent/operator or automatically. For example, every P1 incident could have an outage created automatically, while outages for P2 and below are created manually.
When considering whether an outage should be created automatically, the population of the fields in the outage record needs consideration, especially those around timing. As an example, it would be possible to create the outage automatically with the start date/time of the incident and then update the outage record on the close/resolution of the P1. Consider, though, the example use cases previously given: would this accurately represent the outage period?
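The timing concern can be made concrete with a small Python sketch. It assumes a hypothetical rule (only P1 incidents get an automatic outage) and hypothetical field names. The outage is first populated from the incident times and flagged as unconfirmed, then corrected once the real outage period is known.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    number: str
    priority: int
    ci: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None


@dataclass
class Outage:
    ci: str
    begin: datetime
    end: Optional[datetime]
    related_task: str
    timing_confirmed: bool = False  # hypothetical flag: True once the times are verified


def outage_for_incident(incident: Incident) -> Optional[Outage]:
    """Create an outage for P1 incidents only, using the incident times as a first guess."""
    if incident.priority != 1:
        return None
    return Outage(
        ci=incident.ci,
        begin=incident.opened_at,
        end=incident.resolved_at,
        related_task=incident.number,
    )


def confirm_timing(outage: Outage, actual_begin: datetime, actual_end: datetime) -> Outage:
    """Return a copy of the outage with the verified outage period."""
    if actual_end < actual_begin:
        raise ValueError("actual_end is before actual_begin")
    return replace(outage, begin=actual_begin, end=actual_end, timing_confirmed=True)
```

The incident may be opened after the CI actually failed and resolved after it recovered, so the automatically populated period is treated here as provisional rather than as the true outage window.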
Where the Outage record is created manually, the timings may be set as part of the RCA (Root Cause Analysis). Typically, the RCA process is managed by the person fulfilling the role of Major Incident Manager or Service Delivery Manager.
More details on this topic are found in Task Outage and Log Outages.
Minimizing Outages
As well as the creation of outages, it is important to consider ways of minimizing them, providing a long-term, sustainable approach to delivering service availability. Approaches to take are shown below:
Preventative Maintenance
Hardware and software should be regularly maintained to prevent failures and issues such as security vulnerabilities.
Redundancy
Have redundant and backup systems and solutions to provide continuity of service in the event of a failure.
Monitoring
Use monitoring tools to help detect potential issues before they can cause outages, and take proactive measures to address them.
Incident Response
It is important that there is a well-defined incident response plan for handling outages to minimize the impact and resolve the issue as quickly as possible.
Roles and Responsibilities
Role Name | Description |
Service Desk Agent (1st Line) | The Service Desk Agent (SDA) is responsible for raising incidents and associating CIs to them. If required, they will create outage records and associate them with the incident. The Service Desk Agent is responsible for assigning tasks to the IT Support Teams and assists in resolving the incident. |
Operator | The operator, as part of the IT Operations Management team, is likely to be monitoring the IT systems and, therefore, to create outage records based on CI status from their monitoring systems. |
IT Support Teams (2nd / 3rd Line) | The IT Support Teams are responsible for providing specialist knowledge and skills in resolving the incident. |
Major Incident Manager | The Major Incident Manager is concerned entirely with major incidents. They are the coordinator responsible for resolving a major incident as soon as possible and ensuring it does not recur. If the outage is severe enough, e.g., disrupting critical service availability, a major incident may be raised. |
Incident Management Process Owner | The Incident Management Process Owner’s primary objective is to own and maintain the Incident Management process. The Process Owner is usually a senior manager with the ability and authority to ensure the process is rolled out and used by all stakeholders. Part of their responsibility is reporting on Outages. |
Reporting
Operational Reporting
From an operational perspective, Outages have a significant influence on cost and risk.
Reporting of incidents generally falls under the responsibility of the Incident Management Process owner and forms part of their KPIs.[1] Examples of these are:
[1] Task-Outage table [task_outage] maintains the mapping between the Task [task] table and the Outage [cmdb_ci_outage] table.
Area | Goal | Objective | KPI |
Cost | Optimize Major Incident Response | Reduce Outage Volume Worked | # of Unplanned Outages |
Cost | Optimize Major Incident Response | Reduce Outage Response Effort | Unplanned Outage MTTR |
Risk | Ensure High Availability | Reduce Business Disruption from Outages (Volume) | # of Unplanned Outages |
Risk | Ensure High Availability | Reduce Business Disruption from Outages (Duration) | Unplanned Outage MTTR |
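The two KPIs that recur above, # of Unplanned Outages and Unplanned Outage MTTR, could be derived from outage records roughly as follows. This is a minimal sketch with made-up data. It assumes that an unplanned outage is any record of type "Outage" (as opposed to "Planned Outage" or "Degradation") and it approximates MTTR as the mean duration of closed unplanned outages.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Made-up outage records for illustration only.
OUTAGES = [
    {"type": "Outage", "begin": datetime(2024, 3, 1, 9, 0), "end": datetime(2024, 3, 1, 10, 0)},
    {"type": "Outage", "begin": datetime(2024, 3, 8, 14, 0), "end": datetime(2024, 3, 8, 17, 0)},
    {"type": "Planned Outage", "begin": datetime(2024, 3, 9, 1, 0), "end": datetime(2024, 3, 9, 3, 0)},
    {"type": "Degradation", "begin": datetime(2024, 3, 12, 8, 0), "end": datetime(2024, 3, 12, 8, 30)},
]


def unplanned_outage_kpis(outages: list) -> Tuple[int, Optional[timedelta]]:
    """Return (number of closed unplanned outages, mean duration as an MTTR proxy)."""
    unplanned = [o for o in outages if o["type"] == "Outage" and o["end"] is not None]
    count = len(unplanned)
    if count == 0:
        return 0, None
    total = sum((o["end"] - o["begin"] for o in unplanned), timedelta())
    return count, total / count


count, mttr = unplanned_outage_kpis(OUTAGES)
print(count, mttr)
```

Planned outages and degradations are excluded here so that routine maintenance does not distort the unplanned figures.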
Service Status on Service Portal
The Service Portal provides an essential method of communicating outages and service availability to users.
There are several widgets provided. Review them here: Service Portal service status widgets
These can provide status to both the service owners and service consumers.
Service Overall Status
Service Status over time
Excellent Read.
In your Service Scenarios you refer to "Meeting the Demand" of the Service. Where is that demand expected to be documented? My thought is in the Service Level Requirement (SLR) of the Offering. The idea would be, when looking at an incident, to check whether the offering is meeting that Service Level Requirement. If not, the unplanned outage would equate to the time we were below that requirement; if the requirement was met but there was still impact, we would record a Degradation Outage.
However, I'm currently trying to solve for an ask to account for outages that are sporadic in nature and the ask is to seek a % of impact and have that reflected in the end result of the availability. Meaning if 1 in 10 transactions fail then they only want 10% of the minutes of the outage to count. (I am translating that back that this is measuring apples and oranges and not correct math, but open to scrutiny there)
The current stance is that the duration of the outage is either all or none, we don't modify the calculation of the outage because 1 in 10 transactions were impacted. We are asking for the Service Level Requirement to identify the threshold of acceptable demand (ex. 95% of successful transactions per minute)
So the sequence of the incident goes like this:
1. Major Incident Established and noted impact to one or many service offerings.
2. Analysis of impact to those offerings and decide whether any of those offerings should take an outage
3. Manually (unfortunately) establish whether what is stated in the SLR was breached and if so create an outage for the time period of not meeting the SLR, thus impacting Availability KPI.
4. If there is impact but it is within the SLR threshold then we would put that period of time against degradation and have a KPI measurement of degraded minutes measured.
Is this a best practice approach or are others looking to measure availability based on a percentage of impact in relation to the overall outage?
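To make the two positions easier to compare, here is a small Python sketch of both calculations. The figures are invented for illustration: a 30-day period containing a 100-minute sporadic outage in which 1 in 10 transactions failed. The function names are hypothetical and this is not how any platform calculates availability.

```python
def availability_all_or_none(period_minutes: float, outage_minutes: float) -> float:
    """Availability (%) when every outage minute counts in full."""
    return 100.0 * (period_minutes - outage_minutes) / period_minutes


def availability_weighted(period_minutes: float, outage_minutes: float,
                          failed_fraction: float) -> float:
    """Availability (%) when outage minutes are scaled by the share of failed transactions."""
    if not 0.0 <= failed_fraction <= 1.0:
        raise ValueError("failed_fraction must be between 0 and 1")
    effective_downtime = outage_minutes * failed_fraction
    return 100.0 * (period_minutes - effective_downtime) / period_minutes


# Illustrative figures only.
PERIOD_MINUTES = 30 * 24 * 60
print(round(availability_all_or_none(PERIOD_MINUTES, 100), 3))
print(round(availability_weighted(PERIOD_MINUTES, 100, 0.1), 3))
```

With these numbers the all-or-none view gives about 99.77% and the weighted view about 99.98%. The weighted figure mixes a time-based measure with a transaction-based one, which is the "apples and oranges" concern raised above.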
In your "Outage Use Case" section you highlight the importance of services to outages and how different services can be affected differently by the same outage. In the example provided there are 4 services each affected slightly differently by the CI's status. How would an organization translate the status of the CI into the status of the Service? Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)? Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4? Translating this status correctly up to the Service level is essential for accurate reporting and communication to users, such as via the Service Status page on the Portal.
Thanks for engaging.
For context, we are adhering to CSDM relationships (the Business Service Offering has a dependent relationship to the Application Service, which may have a dependent relationship to a Technical Service and its infra components).
Replaying your Questions:
- How would an organization translate the status of the CI into the status of the Service?
  - The CI, according to best practice, would be the Application Service itself, or an infra component or Tech Service that the Application Service is dependent on. Because we rarely have an entire App Service down, we are at a degraded state, but at the component level we may have either an unplanned outage or a degraded outage. The Business Service Offering that is dependent on the App Service could be either fully unavailable or degraded. The problem we have is exactly where your question is at: how does the business perceive the service that they are offered? Often they are seeing a sporadic level of service, but what is the line in the sand at which that degradation is beyond acceptable and should be considered a full unplanned outage affecting availability?
- Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)?
  - Most of the time we have a 1:1 relationship from the Business Service Offering to the Application Service. However, there is a one-to-many relationship from the App Service to the Service Offering.
- Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4?
  - At this time we would have separate outages (we are not putting outages on the App Service yet, but will be down the road). So we often go back to the scenario where the App Service is displaying degradation while the offerings that depend on it are a mixed bag of Degradation and Unplanned Outage.
In the end, I'm still left looking for a location to define what an acceptable level of Service is, to determine whether the offering should qualify for a degradation or an unplanned outage. Or are we trying to split hairs unnecessarily? However, it is important to measure both, and the data quality of the differences is important. At what point is the offering deemed as not meeting the expected level of service to qualify for Degradation vs Unplanned Outage? Currently we are looking to be more detailed in the "Service Level Requirement" free text field to note a general statement of acceptable performance between degradation and unplanned outage.
I was curious if others are experiencing something similar and what they have done as well as recognizing the importance of measuring both Availability and Degradation Minutes. Which leads us further down the road of finding a way to set a threshold of acceptable degradation before breach of threshold.
#csdm #spm #digital portfolio management #dpm #cmdb #outage
Hey,
Thank you for this article, it was useful. I am still struggling with the link between CI and Service. Based on your reply to Alex, am I correct in assuming that if I have an outage on Switch123 or App123 then I have to post separate outages for each service that is also affected for them to display in the portal as down?
Is there a way to use the Affected CI section of the outage to automate this?
Peter
Yes, you can leverage Affected CIs to help identify areas of impact at the service level, and perhaps even automate this if you get to that level of maturity and trust in your level of granularity and relationships.
The function you would use is "Refresh Impacted Services". You add the app service(s) as the affected CI, then you click "Refresh Impacted Services", and you then have a list of services potentially impacted, based on your dependency relationships. If you are using service builder, the relationships should already be correct for this to work.
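The general idea of deriving potentially impacted services from dependency relationships can be sketched as a simple graph walk in Python. This is an illustration of the concept only, not how "Refresh Impacted Services" is implemented, and the service names, CI names, and relationships are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical "depends on" relationships: each key depends on every item in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "Online Banking Offering": ["Banking App Service"],
    "Banking App Service": ["db-server-01", "Network Technical Service"],
    "Network Technical Service": ["switch-123"],
    "Payroll Offering": ["Payroll App Service"],
    "Payroll App Service": ["db-server-02"],
}


def potentially_impacted(affected_ci: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Walk upward from the affected CI and collect everything that depends on it."""
    # Invert the relationships so we can look up "who depends on this item".
    dependents: Dict[str, List[str]] = {}
    for parent, children in depends_on.items():
        for child in children:
            dependents.setdefault(child, []).append(parent)

    impacted: Set[str] = set()
    queue = deque([affected_ci])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(potentially_impacted("switch-123", DEPENDS_ON)))
```

The result is only a list of candidates: as the article's four-service example shows, a service that depends on the CI is not necessarily impacted by its outage.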
A portion of the following video demos a custom solution for automating a Service outage from a dependent CI outage (38.25 min mark). A fully developed CSDM may not be enough. They have a feature to capture whether an outage of one or more related CIs causes an outage to the Service. That mapping is used to automatically create an outage on the Service when an outage on a related CI is created. This mapping could get very complex to cover various scenarios.

There are some great questions and for me it highlights that outages, on their own, are not a useful construct.
A good place to start for all of these is the walk of technical service > application service > Business service
This is a big area and includes CMDB / CSDM and capabilities such as Service Portfolio Management and Digital Portfolio Management.
Trying to answer the questions -
Mike - not sure on that logic of outage to count. My thoughts are that outage is at a CI level. The duration and nature of that outage should be determined. Ultimately these might lead to a change e.g. sporadic overheating may cause transaction failures with no clear pattern but the server may need replacement.
To me the scenario talks to service availability and achieving the commitments rather than outage, e.g. if you have a 100% commitment on transactions, then 1 in 10 failing means you are operating at 90%.
Alex - you raise a good point in that one CI may affect multiple services (e.g. a network switch fault could affect a range of services). So if the switch fails, that should have an outage record. This affects the technical service, which is related to the offering. It’s the relationship between the CI and those Services which is key. The other way to think about it is that the network switch failure is the root cause - there is only one CI. The outage record is at the bottom of the hierarchy.
Mike - to your point 3 (similar to above). Having 3 outages on the same CI at the same time feels like duplication. In my opinion, having multiple Outages with different statuses depending on the service is stretching the purpose of outage. It's effectively translating higher-level constructs (such as services) down onto the CI.
Have you looked at using Service Commitments?
@Chris Shakespea What are your thoughts on Outage creation and Change Management, especially Normal Changes?
Change Request by default does not have a checkbox to dictate whether downtime is required. I have always ended up adding a checkbox before the concept of outages. Is there a better way to handle this process?
I would love to tie the existence of an Outage Record, the type of outage, and the duration of the outage to Risk Calculation.
@Chris Shakespea yes we use commitments at the offering level. Our solution that we have recommended is to establish a threshold from Degradation to Unplanned outage and get that agreed upon by the business. So if we exceed 5% of transactions it will be considered an outage for example. Below 5% would be a degradation. And we measure both.
Next we are trying to solve for KPIs where we can measure and display the amount of time in which we have neither a degradation nor an unplanned outage, and target a commitment around that as a red flag to act upon. We haven't missed our availability commitment; however, we also have a Degradation commitment and a combined measurement where both degradation and availability are considered.
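A threshold approach like the one described could be sketched as follows in Python. The per-minute failure rates are invented, the 5% threshold is just the example figure mentioned above, and treating a rate of exactly 5% as degradation is an assumption made here for illustration.

```python
from collections import Counter
from typing import Iterable


def classify_minutes(failure_rates: Iterable[float],
                     outage_threshold: float = 0.05) -> Counter:
    """Count healthy, degraded, and outage minutes from per-minute failure rates."""
    counts = Counter({"healthy": 0, "degraded": 0, "outage": 0})
    for rate in failure_rates:
        if rate > outage_threshold:
            counts["outage"] += 1    # above the agreed threshold: unplanned outage
        elif rate > 0:
            counts["degraded"] += 1  # some impact, but within the threshold
        else:
            counts["healthy"] += 1   # neither degradation nor outage
    return counts


# Invented sample: one failure rate per minute.
sample = [0.0, 0.02, 0.10, 0.05, 0.0]
print(classify_minutes(sample))
```

Counting all three states separately supports an availability KPI, a degraded-minutes KPI, and a combined view of time with neither degradation nor outage.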
Unfortunately, one glaring issue on this system status page is that the Subscription Widget yields no notifications to the service subscriber. The expectation is that the notifications that are created OOB, and are active and subscribable, should be sent out when those outages start and end. This seems to be a recurring issue in the community, as far as I can see: the Service Outage Begin and End notifications, at least, are not sent out.