Has anyone implemented any kind of SLA/OLA for Problem Management

cmacfadyen
Kilo Contributor

Has anyone implemented any kind of SLA/OLA for Problem Management? If you have, what have you done? If not, have you thought about it and decided against it, and if so, what was the reason? Or, for that matter, what was the reason for doing it?

7 REPLIES

ianm_clayton
Tera Contributor

Cornelia



I may need some additional context before I can answer you properly - but let's start things off...


We need to step back a bit - what exactly is the issue you are trying to address? Are there situations where 'problems' happen and teams don't engage/disengage properly, or don't prioritize activity as they should? Or is there a problem team that works at a pace that doesn't match what is required to respond to a situation?



Generally, I recommend that SLAs be built at the request-response level - the service request, with incidents recorded as a type of request. They govern a specific situation, or one as specific as can be defined.



As a rule, escalation, and thereby the concept of an OLA, works at the same level - governing the pace of a response to a service request.


Problem management and problems are not subject to escalation, because when they start they are investigations of the unknown.


Yes, problem management may be required to authorize a workaround in flight, but again, no escalation is in play for them because their mission is the antithesis of incident management's (fix it quick versus fix it permanently).



That said, there is no reason why an OLA could not be crafted for a problem management function - perhaps to put governance boundaries around how they are involved in incident management, and how they conduct problem management activities at large.



Is this helping? I need your guidance and further context on the issue before going much further...


Ian,



Yes, that is helpful. I am looking at the idea of governance boundaries. I do like the direction you are heading.



For instance, we have due dates on our problem tickets and on their tasks as well. Those due dates give us a timeframe for when the ticket moves from Open to either Known Error or Resolved/Closed. That was one thought I had, but I'd like to hear your thoughts on that latter point of yours.


Cornelia



Do you have formal definitions of the boundary - what represents an incident versus a problem - and are they based upon ITIL? I ask because ITIL is a start - but in my experience (I ran the first ITIL Manager certification class here in the US in 1996 and the first Expert class outside of the UK in 2007) its definitions are 'basic'. I've written profusely on this - a problem is different in that it is suspected or known to have a 'significant impact'.



The concept of a major incident blurs the lines - but the missions of incident management and problem management are opposite, as mentioned: fix it quick, as per what we contracted, versus find out what's happening, report in, and then let management decide how much resource to invest to mitigate the situation or make it go away.



When you say problem tickets - do you mean incidents - or actual problems? Problems to me have a statement of the problem (a factual or hypothetical statement of what's happening, not why), and one or more impact statements - detailing the parties impacted and to what extent (what they cannot do and the consequences). Stage 1 of problem management?


Ian,



I am talking about problem management as distinct from Incident Management, which includes Major Incidents. When I talk with people about the distinction between incident and problem management, I use the analogy of the Emergency Room versus being moved upstairs to a bed for the long-term fix. The incident is about getting everything back up and running quickly, while the problem is more about looking at the long-term solution. For instance, you have a workaround in place on the incident in order to be up and running. In that case I would expect to see a problem ticket to get at the long-term solution. Now yes, it is up to management to decide whether we spend resources or document a known error and workaround. I'm looking for how we can do better governance around problem management. We have OLAs for incident management.