on 01-20-2017 02:55 AM
Incident Management is an ITSM process area. The primary goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within the service level agreement (SLA). It is one process area within the broader ITIL and ISO 20000 environment.
Incidents that cannot be resolved quickly by the help desk will be assigned to specialist technical support groups. A resolution or work-around should be established as quickly as possible in order to restore the service.
ITIL 2011 defines an incident as:
An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an incident, for example failure of one disk from a mirror set. The ITIL incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized. (Source: ITIL Service Operation.)
Without effective incident management an incident can rapidly disrupt business operations, information security, IT systems, employees or customers and other vital business functions.
In private organizations, usually as part of the wider management process, incident management is followed by post-incident analysis, which determines why the incident happened despite precautions and controls. This analysis is normally overseen by the leaders of the organization, with a view to preventing repetition of the incident through precautionary measures and, often, changes in policy. The findings are then used as feedback to further develop the security policy and/or its practical implementation. This often results in a higher level of contingency planning, exercise and training, as well as an evaluation of the management of the incident. In the United States, the National Incident Management System, developed by the Department of Homeland Security, integrates effective practices in emergency management into a comprehensive national framework.
Incident Management Process as defined by ITIL
ITIL V3 defines an incident as an unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident. An example of this would be failure of one disk from a mirror set.
ITIL V2 defines an incident as an event which is not part of the standard operation of a service and which causes, or may cause, disruption to or a reduction in the quality of services and customer productivity. The objective of incident management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.
The incident manager is a functional role rather than a position of employment, although both may be true depending on the hiring organization. Incident management provides the external customer with a focal point for leadership and drive during an event by ensuring follow-up on commitments and adequate information flow.
The objective of incident management during an incident is service restoration as quickly as possible; the objective is not to make a system perfect. If service can be restored by a temporary workaround more quickly than by correcting the underlying root cause of the issue, then that is acceptable. After service restoration, correction of underlying root causes is done by the problem management team using a process called root cause analysis (RCA). A well-known example of service restoration by temporary workaround is the one carried out on Apollo 13.
The primary focus of incident management is to ensure a prompt recovery of the system, supervising and directing the internal or external resources. Prompt system recovery and minimization of any impact to customers has priority over unreasonably long and intensive data collection for the event root cause investigation.
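The trade-off between a workaround and a root-cause fix can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the names `RestorationPlan` and `plan_restoration`, the time estimates in minutes, and the `PRB-for-...` identifier format are all hypothetical and do not come from ITIL or from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestorationPlan:
    action: str                           # "workaround" or "root-cause fix"
    minutes_to_restore: int
    problem_record: Optional[str] = None  # follow-up for problem management


def plan_restoration(incident_id: str,
                     workaround_minutes: Optional[int],
                     root_fix_minutes: int) -> RestorationPlan:
    """Pick whichever option is estimated to restore service sooner."""
    if workaround_minutes is not None and workaround_minutes < root_fix_minutes:
        # Service comes back first; the root cause is handed over to
        # problem management as a separate record (hypothetical ID format).
        return RestorationPlan("workaround", workaround_minutes,
                               problem_record=f"PRB-for-{incident_id}")
    return RestorationPlan("root-cause fix", root_fix_minutes)


plan = plan_restoration("INC-1", workaround_minutes=15, root_fix_minutes=240)
print(plan.action, plan.problem_record)
```

The point of the sketch is that choosing the workaround does not close the matter: it creates follow-up work for problem management rather than delaying restoration.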
Incidents can be classified into three primary categories: software (applications), hardware, and service requests. (Note that service requests are not always regarded as incidents, but rather requests for change. However, the handling of failures and the handling of service requests are similar and therefore are included in the definition and scope of the process of incident management.)
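As a rough illustration of the three categories, a ticket model might look like the sketch below. The `Category` and `Ticket` names and the sample tickets are invented for this example; real tools use their own category schemes.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    SOFTWARE = "software"
    HARDWARE = "hardware"
    SERVICE_REQUEST = "service request"


@dataclass
class Ticket:
    summary: str
    category: Category

    @property
    def is_failure(self) -> bool:
        # Service requests go through the same process but are not failures.
        return self.category is not Category.SERVICE_REQUEST


# Hypothetical sample tickets, all handled in the same queue.
tickets = [
    Ticket("Mail client crashes on start", Category.SOFTWARE),
    Ticket("Disk failed in mirror set", Category.HARDWARE),
    Ticket("New laptop for a new hire", Category.SERVICE_REQUEST),
]
failures = [t.summary for t in tickets if t.is_failure]
print(failures)
```

Keeping service requests in the same structure, with a flag to tell them apart, mirrors the idea that they share the process without being failures.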
ITIL V3 separates incident management into six basic components:
- Incident detection and recording
- Classification and initial support
- Investigation and diagnosis
- Resolution and recovery
- Incident closure
- Ownership, monitoring, tracking, and communication (monitoring the progress of the resolution of the incident and keeping those who are affected by the incident up to date with the status)
Activities of Incident Management as defined by ITIL V3
- Identification - the incident is detected or reported
- Registration - the incident is registered in an ICM System
- Categorization - the incident is categorized by attributes such as priority and SLA
- Prioritization - the incident is prioritized for better utilization of the resources and the Support Staff time
- Diagnosis - the full symptoms of the incident are revealed
- Escalation - the incident is escalated should the Support Staff need support from other organizational units
- Investigation and diagnosis - if no existing solution from the past can be found, the incident is investigated and the root cause is found
- Resolution and recovery - once the solution is found the incident is resolved
- Incident closure - the registry entry of the incident in the ICM System is closed by providing the end-status of the incident
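The activities listed above can be read as a simple lifecycle. The following Python sketch models them as ordered stages; it is an interpretation, not an ITIL-defined state machine. In particular, treating escalation and investigation as skippable stages is an assumption drawn from the wording of the list ("should the Support Staff need support", "if no existing solution ... can be found"), and the names `Stage`, `IncidentRecord`, and `allowed_next` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    REGISTERED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    DIAGNOSED = 5
    ESCALATED = 6
    INVESTIGATED = 7
    RESOLVED = 8
    CLOSED = 9


# Assumption: these two stages happen only when needed, so they may be skipped.
OPTIONAL = {Stage.ESCALATED, Stage.INVESTIGATED}


def allowed_next(current: Stage) -> list[Stage]:
    """Return the stages an incident may move to from `current`."""
    ordered = list(Stage)
    result = []
    for stage in ordered[ordered.index(current) + 1:]:
        result.append(stage)
        if stage not in OPTIONAL:
            break  # a mandatory stage cannot be skipped
    return result


class IncidentRecord:
    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, target: Stage) -> None:
        if target not in allowed_next(self.stage):
            raise ValueError(
                f"cannot move from {self.stage.name} to {target.name}")
        self.stage = target
        self.history.append(target)


record = IncidentRecord("Disk failed in mirror set")
for step in (Stage.REGISTERED, Stage.CATEGORIZED, Stage.PRIORITIZED,
             Stage.DIAGNOSED, Stage.RESOLVED, Stage.CLOSED):
    record.advance(step)
print([s.name for s in record.history])
```

In this run the incident is resolved from a known solution, so it goes straight from diagnosis to resolution; an attempt to jump from registration to closure would raise `ValueError`.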
Incident Manager Responsibilities
- understand any incident/fault on a basic level (at least) in order to use the appropriate competences (resources)
- drive the restoration team to gather sufficient information to start an analysis
- maintain a general overview of the incident (keeping the focus on the restoration via a workaround)
- understand the functionality of multiple areas (RAN, Core Network, VAS, BSS/OSS)
- give guidance on priorities to the teams starting the immediate, urgent, unexpected recovery work
Incident Management Software Systems
Incident management software systems are designed for collecting consistent, time-sensitive, documented incident report data. Many of these products include features to automate the approval process of an incident report or case investigation. These products may also be able to collect real-time incident information such as time and date data. Additionally, incident report systems can automatically send notifications and assign tasks and escalations to appropriate individuals depending on the incident type, priority, time, status and custom criteria. Modern products let administrators configure the incident report forms as needed, create analysis reports and set access controls on the data. The report forms can often be customized to suit the organizations using the systems. Some of these products can collect images, video, audio and other data. Some incident management software systems are tailored to specific industries.
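To make the idea of automatic notification and escalation concrete, here is a minimal rule-based routing sketch in Python. Everything in it is an assumption for illustration: the field names, the convention that priority 1 is the most urgent, the 30-minute threshold, and the recipient names are invented and do not reflect the behavior of any specific product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IncidentReport:
    incident_type: str
    priority: int      # assumption: 1 is the most urgent
    status: str
    age_minutes: int


@dataclass
class Rule:
    name: str
    matches: Callable[[IncidentReport], bool]
    notify: str        # hypothetical recipient or group name


# Rules are checked in order; the first match wins.
RULES = [
    Rule("stale high priority",
         lambda r: r.priority == 1 and r.status == "open" and r.age_minutes > 30,
         "duty-manager"),
    Rule("high priority", lambda r: r.priority == 1, "on-call-engineer"),
    Rule("hardware", lambda r: r.incident_type == "hardware", "hardware-team"),
]
DEFAULT_RECIPIENT = "help-desk"


def route(report: IncidentReport) -> str:
    """Return who should be notified about this report."""
    for rule in RULES:
        if rule.matches(report):
            return rule.notify
    return DEFAULT_RECIPIENT


print(route(IncidentReport("software", 1, "open", 45)))
print(route(IncidentReport("hardware", 3, "open", 5)))
```

Ordering the rules from most to least specific keeps escalation simple: a high-priority incident that has been open too long reaches a manager before the general high-priority rule is considered.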
Physical Incident Management
Incident management should be considered to be much more than just the analysis of perceived threats and hazards towards an organization in order to work out the risk of an event occurring, and therefore the ability of that organization to conduct business-as-usual activities during the incident. It should be remembered that, as well as being an important part of the risk management process and business resilience planning, incident management is a real-time physical activity.
The planning done to formulate the response to an incident, be it a disaster, emergency, crisis or accident, exists so that effective business resilience can take place and loss or damage to the organization's tangible or intangible assets is kept to a minimum. That planning can only be implemented through efficient physical management of the incident: making best use of the time and resources available, and understanding how to obtain more resources from outside the organization when needed through clear and timely liaison.
The National Fire Protection Association describes incident management as follows: "When an emergency occurs or there is a disruption to the business, organized teams will respond in accordance with established plans. Public emergency services may be called to assist. Contractors may be engaged and other resources may be needed. Inquiries from the news media, the community, employees and their families and local officials may overwhelm telephone lines. How should a business manage all of these activities and resources? Businesses should have an incident management system (IMS). An IMS is 'the combination of facilities, equipment, personnel, procedures and communications operating within a common organizational structure, designed to aid in the management of resources during incidents'" (National Fire Protection Association (NFPA), 2013).
Physical incident management is very much the real-time response, which may last for hours, days or longer. The United Kingdom Cabinet Office has produced the National Recovery Guidance (NRG), which is aimed at local responders as part of the implementation of the Civil Contingencies Act 2004 (CCA). It describes the response as follows: "Response encompasses the actions taken to deal with the immediate effects of an emergency. In many scenarios, it is likely to be relatively short and to last for a matter of hours or days - rapid implementation of arrangements for collaboration, co-ordination and communication are, therefore, vital. Response encompasses the effort to deal not only with the direct effects of the emergency itself (eg fighting fires, rescuing individuals) but also the indirect effects (eg disruption, media interest)" (NRG, 2007).
The International Organization for Standardization (ISO), which is the world's largest developer of international standards, also makes a point in the description of its risk management principles and guidelines document, ISO 31000:2009: "Using ISO 31000 can help organizations increase the likelihood of achieving objectives, improve the identification of opportunities and threats and effectively allocate and use resources for risk treatment". This again shows the importance of not just good planning but effective allocation of resources to treat the risk.
ServiceNow is one tool used for incident management.