Chris Shakespea
ServiceNow Employee

An IT outage occurs when computer systems or networks stop working correctly, often due to hardware issues, software glitches, or other unforeseen events. This disruption can lead to downtime, impacting productivity and business operations. To minimize such issues, a well-established process for managing outages is essential.

 

This article takes you through:

- What is an outage?

- Common Use Cases

- Outage relationship to Services

- Creating Outage records

- Who should be involved

- Reporting Outages

 

Outage Overview 

 

An Outage represents CI unavailability. The outage types are:

  • Outage
  • Planned Outage
    • Usually the result of a routine maintenance schedule or an upgrade action
  • Degradation
    • Partial, Slow, Intermittent

 

CI unavailability, or outage, is the actual downtime of a CI. [1]

 

ServiceNow provides the capability to

  • Create a stand-alone outage record
  • Associate an outage record to a task
  • Create an outage record from a task

 

Outages have a key relationship to Incident Management and Major Incident Management.

 

 

[1] Whenever a CI has an outage, the outage information is stored in the Outage [cmdb_ci_outage] table. The Task-Outage [task_outage] table maintains the mapping between the Task [task] table and the Outage [cmdb_ci_outage] table.
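
To make the relationship between these tables concrete, here is a minimal Python sketch of the mapping. It is an in-memory illustration only, not the ServiceNow schema or API: the class names, field names, and the outages_for_task helper are assumptions chosen for readability.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Outage:
    """Illustrative stand-in for a row in cmdb_ci_outage (field names are assumed)."""
    outage_id: str
    ci: str
    outage_type: str                 # "outage", "planned", or "degradation"
    begin: datetime
    end: Optional[datetime] = None   # None while the outage is still open


@dataclass
class TaskOutage:
    """Illustrative stand-in for a row in task_outage: links one task to one outage."""
    task_number: str
    outage_id: str


def outages_for_task(task_number: str,
                     links: List[TaskOutage],
                     outages: List[Outage]) -> List[Outage]:
    """Return the outages linked to a task, in the order the links are listed."""
    by_id = {o.outage_id: o for o in outages}
    return [by_id[link.outage_id]
            for link in links
            if link.task_number == task_number and link.outage_id in by_id]
```

Because the link lives in its own table, one outage can be attached to several tasks and one task can reference several outages, while a stand-alone outage simply has no link rows.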

 

Outage Use Case

 

Outages on their own are data points rather than information. For example, knowing that database_server123@mycompany.com is offline helps the IT staff work the issue, but knowing that Finance Services are unavailable and that it is month-end is far more informative.

 

Consider a simple outage case: a single CI whose outage impact relates to four Services.

 

ChrisShakespea_0-1701702272720.png

 

 

Looking through those, how could each service be affected differently by the outage?

Service 1

The CI can still meet the demands of the Service while degraded.

Service 2

The CI cannot meet the demands of the Service during the degradation or the outage.

Service 3

The CI cannot meet the demands of the Service while degraded, but the outage is not impactful. One possible scenario is that CI failover kept service availability unaffected.

Service 4

The CI can still meet the demands of the Service while degraded, and the outage is not impactful (similar to Service 3). An alternative is that the Service was not operational at the time, so there was no impact.

 

 

The key point is that outages cannot be considered independently of services.
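
The four scenarios above can be sketched in Python as follows. This is one possible reading of the example, not platform behavior: the ServiceProfile class, its two flags, and the impacting_events helper are hypothetical, and Service 1 is assumed to be impacted by the full outage even though it tolerates the degradation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceProfile:
    name: str
    tolerates_degradation: bool   # demands can still be met while the CI is degraded
    protected_from_outage: bool   # e.g. failover, or the service was not operational


def impacting_events(service: ServiceProfile, ci_events: List[str]) -> List[str]:
    """Return the CI events ("degradation" or "outage") that actually impact the service."""
    impacted = []
    for event in ci_events:
        if event == "degradation" and not service.tolerates_degradation:
            impacted.append(event)
        elif event == "outage" and not service.protected_from_outage:
            impacted.append(event)
    return impacted


# One CI, one degradation followed by one outage, four dependent services.
CI_EVENTS = ["degradation", "outage"]
SERVICES = [
    ServiceProfile("Service 1", tolerates_degradation=True, protected_from_outage=False),
    ServiceProfile("Service 2", tolerates_degradation=False, protected_from_outage=False),
    ServiceProfile("Service 3", tolerates_degradation=False, protected_from_outage=True),
    ServiceProfile("Service 4", tolerates_degradation=True, protected_from_outage=True),
]
```

The same two CI events produce four different impact lists, which is why a CI outage record alone does not describe the business impact.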

 

 

Service Relationship to Outages

 

Examining the CSDM shows the relationships between CIs and services/service offerings.

 

ChrisShakespea_5-1701702551337.png

 

 

 

Depending on the organization's needs, there will be technical services, business services, and offerings in the service portfolio. An example of a service  is shown below.

 

ChrisShakespea_4-1701702515836.png

 

Mapping services and service offerings to CIs ensures that outage records, and the associated task (predominantly incident) records, provide the most value to the business.

When considering how to configure IT services within your portfolio, work on those that provide the most value to the organization. It is also possible to represent IT services within a request catalog.

 

Each service in the portfolio can have a criticality assigned, allowing:

  • The impact of a CI outage to be related to the affected services
  • A proportionate response based on the criticality of those affected services

 

For example, the company retail website would require a higher criticality than office print services.
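
As a rough illustration of a proportionate response, the Python sketch below picks the most critical affected service for a given CI. The criticality scale (1 is most critical, 4 is least), the CI names, and both lookup tables are hypothetical; in practice this information would come from the service portfolio and the CMDB relationships.

```python
from typing import Dict, List, Optional

# Hypothetical criticality scale: 1 is most critical, 4 is least critical.
SERVICE_CRITICALITY: Dict[str, int] = {
    "Retail Website": 1,
    "Office Print Services": 4,
}

# Hypothetical mapping of CIs to the services that depend on them.
CI_SERVICES: Dict[str, List[str]] = {
    "web-server-01": ["Retail Website"],
    "print-server-01": ["Office Print Services"],
    "core-switch-01": ["Retail Website", "Office Print Services"],
}


def outage_response_level(ci: str) -> Optional[int]:
    """Return the criticality of the most critical service affected by a CI outage.

    Returns None when no known service depends on the CI.
    """
    services = CI_SERVICES.get(ci, [])
    levels = [SERVICE_CRITICALITY[s] for s in services if s in SERVICE_CRITICALITY]
    return min(levels) if levels else None
```

A shared CI such as the switch inherits the response level of its most critical dependent service, so the retail website drives the response even when print services are also affected.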

 

Outage relationship to Service Portfolio Management / Digital Portfolio Management

 

Outages affect service availability. Outages roll up through service availability, and the result is viewed on service offerings.

 

ChrisShakespea_6-1701702666304.png

 

 

View availability results for commitments on service offerings and application services using Service Portfolio Management.
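
To show how outage records can feed an availability figure, here is a minimal Python sketch. It is an assumed calculation for illustration, not the formula Service Portfolio Management uses: the availability_percent function is hypothetical, outages are clipped to the reporting window, and overlapping outages are counted only once.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def availability_percent(window_start: datetime,
                         window_end: datetime,
                         outages: List[Tuple[datetime, datetime]]) -> float:
    """Percentage of the window not covered by any (begin, end) outage interval."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")

    # Keep only outages that touch the window, clipped to its bounds, sorted by begin.
    clipped = sorted(
        (max(begin, window_start), min(end, window_end))
        for begin, end in outages
        if end > window_start and begin < window_end
    )

    downtime = timedelta()
    cursor = window_start          # end of the downtime counted so far
    for begin, end in clipped:
        begin = max(begin, cursor)  # skip any part already counted
        if end > begin:
            downtime += end - begin
            cursor = end

    total = window_end - window_start
    return 100.0 * ((total - downtime) / total)
```

For example, two overlapping one-hour outages that together span 90 minutes in a 24-hour window give 93.75% availability, not the lower figure that simply adding both durations would produce.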

For more details on Service Portfolio Management, see Service Portfolio Management - Process Workshop. For Digital Portfolio Management, see Digital Portfolio Management - Process Workshop Presentation.

 

 

Outage Creation

 

Outages can be a stand-alone record or associated with one or more tasks. Outage records typically contain:

  • Outage CI
  • Outage Type
    • Outage, Degradation, Planned Outage
  • Beginning and End time
  • Related Task
  • Description text

 

Outages can be created manually by an agent/operator or automatically, e.g., every P1 incident has an outage created automatically, while outages for P2 and below are created manually.

When considering whether an outage should be created automatically, the population of the fields in the outage record needs consideration, especially those around timing. As an example, it would be possible to create the outage automatically with the start date/time of the incident and then update the outage record on the close/resolution of the P1. Consider, though, the example use cases given previously: would this accurately represent the outage period?
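
The automatic case can be sketched in Python as below. This is a hypothetical illustration of the rule described above, not a ServiceNow business rule or API: the Incident class, the draft_outage_for function, and the timings_confirmed flag are all assumptions. The flag is there because the incident's opened and resolved times are only a first approximation of the real outage period.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    number: str
    priority: int                        # 1 = P1, 2 = P2, and so on
    ci: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None


def draft_outage_for(incident: Incident) -> Optional[dict]:
    """Draft an outage record for a P1 incident; return None for P2 and below."""
    if incident.priority != 1:
        return None  # lower priorities are left for manual outage creation
    return {
        "ci": incident.ci,
        "type": "outage",
        "begin": incident.opened_at,    # provisional: the CI may have failed earlier
        "end": incident.resolved_at,    # provisional: None until the incident resolves
        "task": incident.number,
        "timings_confirmed": False,     # to be reviewed, e.g. during root cause analysis
    }
```

Keeping the timings marked as unconfirmed makes it explicit that someone still needs to check whether the incident window matches the actual period of CI unavailability.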

Where the Outage record is created manually, the timings may be set as part of the RCA (Root Cause Analysis). Typically, the RCA process is managed by the person fulfilling the role of Major Incident Manager or Service Delivery Manager.

More details on this topic can be found in Task Outage and Log Outages.

 

Minimizing Outages

As well as creating outage records, it is important to consider ways of minimizing outages, providing a long-term, sustainable approach to delivering service availability. Approaches to take are shown below:

 

Preventative Maintenance

Hardware and software should be regularly maintained to prevent failures and issues such as security vulnerabilities.

 

Redundancy

Have redundant and backup systems and solutions to provide continuity of service in the event of a failure.

 

Monitoring

Use monitoring tools to help detect potential issues before they cause outages, and take proactive measures to address them.

 

Incident Response

It is important that there is a well-defined incident response plan for handling outages to minimize the impact and resolve the issue as quickly as possible.

 

Roles and Responsibilities

 

Role Name

Service Desk Agent (1st Line)

Description

The Service Desk Agent (SDA) is responsible for raising incidents and associating CIs with them. If required, they create outage records and associate them with the incident. The SDA is also responsible for assigning tasks to the IT Support Teams and assists in resolving the incident.

 

Role Name

Operator

Description

The Operator, as part of the IT Operations Management team, is likely to be monitoring the IT systems and therefore creates outage records based on CI status from their monitoring systems.

 

Role Name

IT Support Teams (2nd / 3rd Line)

Description

The IT Support Teams are responsible for providing the specialist knowledge and skills needed to resolve the incident.

 

Role Name

Major Incident Manager

Description

The Major Incident Manager is concerned entirely with major incidents. They are the coordinator responsible for resolving a major incident as soon as possible and ensuring it does not reoccur. If the outage is severe enough, e.g., disrupting critical service availability, a major incident may be raised.

 

Role Name

Incident Management Process Owner

Description

The Incident Management Process Owner’s primary objective is to own and maintain the Incident Management process. The Process Owner is usually a senior manager with the ability and authority to ensure the process is rolled out and used by all stakeholders.

Part of their responsibility is reporting on Outages.

 

 

Reporting

 

Operational Reporting

From an operational perspective, Outages have a significant influence on cost and risk.

Reporting of incidents generally falls under the responsibility of the Incident Management Process Owner and forms part of their KPIs. [1] Examples of these are:

 

 

[1] Task-Outage table [task_outage] maintains the mapping between the Task [task] table and the Outage [cmdb_ci_outage] table.

 

Cost: Optimize Major Incident Response

  • Reduce Outage Volume Worked
    • # of Unplanned Outages
  • Reduce Outage Response Effort
    • Unplanned Outage MTTR

 

Risk: Ensure High Availability

  • Reduce Business Disruption from Outages (Volume)
    • # of Unplanned Outages
  • Reduce Business Disruption from Outages (Duration)
    • Unplanned Outage MTTR
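
Both KPIs can be derived from the outage records themselves. The Python sketch below is a minimal illustration under stated assumptions: the record layout and the unplanned_outage_kpis function are hypothetical, only records of type "outage" are counted as unplanned (planned outages and degradations are excluded), and MTTR is taken as the mean duration of the unplanned outages that have an end time.

```python
from datetime import datetime
from statistics import mean
from typing import List


def unplanned_outage_kpis(outages: List[dict]) -> dict:
    """Count unplanned outages and compute their mean time to restore in minutes.

    Each record is a dict with "type", "begin", and an optional "end".
    """
    unplanned = [o for o in outages if o["type"] == "outage"]
    durations = [
        (o["end"] - o["begin"]).total_seconds() / 60
        for o in unplanned
        if o.get("end") is not None   # open outages have no duration yet
    ]
    return {
        "unplanned_outages": len(unplanned),
        "mttr_minutes": mean(durations) if durations else None,
    }
```

Open outages are included in the count but left out of the MTTR, so the average is not distorted by records whose end time has not yet been set.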

 

Service Status on Service Portal

The Service Portal provides an essential method of communicating outages and service availability to users.

There are several widgets provided. Review them here: Service Portal service status widgets

These can provide status to both the service owners and service consumers.

 

Service Overall Status

ChrisShakespea_7-1701703495620.png

 

Service Status over time

ChrisShakespea_8-1701703540213.png

 

ChrisShakespea_9-1701703571283.png

 

 

 

 

 

 

 

 

Comments
mikesisson
Mega Guru

Excellent Read.

In your Service Scenarios you refer to "Meeting the Demand" of the Service.  Where is that demand expected to be documented?  My thought is in the Service Level Requirement (SLR) of the Offering.  The idea would be when looking at an incident if the offering is meeting that Service level requirement and if not then the unplanned outage would equate to the time we were below that requirement, if it wasn't but there was still impact then we would record a Degradation Outage.

 

However, I'm currently trying to solve for an ask to account for outages that are sporadic in nature and the ask is to seek a % of impact and have that reflected in the end result of the availability.  Meaning if 1 in 10 transactions fail then they only want 10% of the minutes of the outage to count.  (I am translating that back that this is measuring apples and oranges and not correct math, but open to scrutiny there) 

 

The current stance is that the duration of the outage is either all or none, we don't modify the calculation of the outage because 1 in 10 transactions were impacted.  We are asking for the Service Level Requirement to identify the threshold of acceptable demand (ex. 95% of successful transactions per minute)  

 

So the sequence of the incident goes like this:

 

1. Major Incident Established and noted impact to one or many service offerings.

2. Analysis of impact to those offerings and decide whether any of those offerings should take an outage

3. Manually (unfortunately) establish whether what is stated in the SLR was breached and if so create an outage for the time period of not meeting the SLR, thus impacting Availability KPI.

4. If there is impact but it is within the SLR threshold then we would put that period of time against degradation and have a KPI measurement of degraded minutes measured.

 

Is this a best practice approach or are others looking to measure availability based on a percentage of impact in relation to the overall outage?

Alex Rathwyn
Tera Contributor

In your "Outage Use Case" section you highlight the importance of services to outages and how different services can be affected differently by the same outage. In the example provided there are 4 services each affected slightly differently by the CI's status. How would an organization translate the status of the CI into the status of the Service? Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)? Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4? Translating this status correctly up to the Service level is essential for accurate reporting and communication to users, such as via the Service Status page on the Portal.

mikesisson
Mega Guru

Thanks for engaging.

 

For context, we are adhering to CSDM relationships (the Business Service Offering has a dependent relationship to the Application Service, which may have a dependent relationship to a Technical Service and its infra components).

 

Replaying your Questions:

  1. How would an organization translate the status of the CI into the status of the Service? 
    1. According to best practice, the CI would be the Application Service itself, or an infra component or Tech Service that the Application Service depends on. Because we rarely have an entire App Service down, we are usually in a degraded state, but at the component level we may have either an unplanned outage or a degraded outage. The Business Service Offering that depends on the App Service could be either fully unavailable or degraded. The problem we have is exactly where your question is: how does the business perceive the service they are offered? Often they see a sporadic level of service, but where is the line in the sand at which that degradation is beyond acceptable and should be considered a full unplanned outage affecting availability?
  2. Is this a function of the relationship in the CMDB between the components and the service (which assumes the type of relationship is absolute)?  
    1. Most of the time we have a 1:1 relationship from the Business Service Offering to the Application Service. However, there is a 1-to-many relationship from the App Service to the Service Offering.
  3. Or is it instead that the organization would create separate outage records for each status of each component, i.e. two outage records for the CI, one for Service 1, two for Service 2, 1 for Service 3, and none for Service 4?
    1. At this time we would have separate outages (we are not putting outages on the App Service yet, but will be down the road). So we often go back to the scenario where the App Service is displaying degradation while the offerings that depend on it are a mixed bag of Degradation and Unplanned Outage.

 

In the end, I'm still left looking for a location to define what is an acceptable level of Service, to determine whether the offering should qualify for a degradation or an unplanned outage.  Or are we trying to split hairs unnecessarily?  However, it is important to measure both, and the data quality of the differences is important.  At what point is the offering deemed as not meeting the expected level of service to qualify for Degradation vs Unplanned Outage?  Currently we are looking to be more detailed in the "Service Level Requirement" free text field to note a general statement of acceptable performance between degradation and unplanned outage.

 

I was curious if others are experiencing something similar and what they have done as well as recognizing the importance of measuring both Availability and Degradation Minutes.  Which leads us further down the road of finding a way to set a threshold of acceptable degradation before breach of threshold.

 

#csdm #spm #digital portfolio management #dpm #cmdb #outage

Peter102
Tera Contributor

Hey,

 

Thank you for this article, it was useful. I am still struggling with the link between CI and Service. Based on your reply to Alex, am I correct in assuming that if I have an outage on Switch123 or App123 then I have to post separate outages for each service that is also affected for them to display in the portal as down?

 

Is there a way to use the Affected CI section of the outage to automate this?

 

Peter

mikesisson
Mega Guru

Yes, you can leverage Affected CI's to help identify areas of impact at the service level and perhaps even automate if you get to that level of maturity and trust in your level of granularity and relationships.  

 

The function you would use is "Refresh Impacted Services".  You add the app service(s) as the affected CI, then you click "refresh impacted services" and you can then have a list of services potentially impacted, based on your dependency relationships that, if you are using service builder, have the correct relationships for this to work.

Ash42
Tera Expert

A portion of the following video demos a custom solution for automating a Service outage from a dependent CI outage (38.25 min mark). A fully developed CSDM may not be enough. They have a feature to capture whether an outage of one or more related CIs causes an outage to the Service. That mapping is used to automatically create an outage on the Service when an outage on a related CI is created. This mapping could get very complex to cover various scenarios.

 

Chris Shakespea
ServiceNow Employee

There are some great questions and for me it highlights that outages, on their own, are not a useful construct.

 

A good place to start for all of these is the walk of technical service > application service > Business service

This is a big area and includes CMDB / CSDM and capabilities such as Service Portfolio Management and Digital Portfolio Management.

 

Trying to answer the questions -

Mike - not sure on that logic of outage to count. My thoughts are that outage is at a CI level. The duration and nature of that outage should be determined. Ultimately these might lead to a change e.g. sporadic overheating may cause transaction failures with no clear pattern but the server may need replacement.

To me the scenario speaks to service availability and achieving the commitments rather than outage, e.g. if you have a 100% commitment on transactions, then 1 in 10 failing means you are operating at 90%.

 

 

Alex - you raise a good point in that one CI may affect multiple services (e.g. a network switch fault could affect a range of services). So if the switch fails, it should have an outage record. This affects the technical service, which is related to the offering. It’s the relationship between the CI and those Services which is key. The other way to think about it is that the network switch failure is the root cause - there is only one CI. The Outage record is at the bottom of the hierarchy.

 

Mike - to your point 3 (similar to above). Having 3 outages on the same CI at the same time feels like duplication. In my opinion, having multiple Outages with different statuses depending on the service is stretching the purpose of outage. It's effectively translating higher-level constructs (such as services) down onto the CI.

Have you looked at using Service Commitments ?

 

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=8455d65a9789de906eedb30e6253aff9&nc_s...

 

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=6b633af647b8dad0123f3975d36d4320&nc_s...

 

https://learning.servicenow.com/nowcreate?id=nc_asset&asset_id=c7f18f439346fdd02fac74096cba109f&nc_s...

snowdev8
Tera Expert

@Chris Shakespea What are your thoughts on Outage creation and Change Management, especially Normal Changes?
Change Request by default does not have a checkbox to dictate whether downtime is required. I have always ended up adding a checkbox before the concept of outages. Is there a better way to handle this process?

I would love to tie the existence of an Outage Record, the type of outage, and the duration of the outage to Risk Calculation.

mikesisson
Mega Guru

@Chris Shakespea  yes we use commitments at the offering level.  Our solution that we have recommended is to establish a threshold from Degradation to Unplanned outage and get that agreed upon by the business.  So if we exceed 5% of transactions it will be considered an outage for example.  Below 5% would be a degradation.  And we measure both.

 

Next we are trying to solve for KPI's where we can measure and display the amount of time we neither have a degradation nor an unplanned outage and target a commitment around that as a red flag to act upon. We haven't missed our availability commitment however we also have Degradation commitment and a combined measurement where both degradation and availability is considered.

GarethH1
Tera Contributor

Unfortunately, one glaring issue on this system status page is that the Subscription Widget yields no notifications to the service subscriber. The expectation is that the notifications that are created OOB, and are active and subscribable, should be sent out when those outages start and end. This seems to be a recurring issue in the community: the Service Outage Begin and End notifications, at least, are not sent out.

Version history
Last update:
‎05-16-2025 08:40 AM
Updated by: