KPI based on slow patterns (queries, scripts, etc)

Max Nowak · ‎03-07-2023

Hi,

I'm working with a customer who's very focused on instance health and performance. We're operating their platform, and they've come up with a few KPIs regarding performance.

Among them, something like "Percentage of slow queries / scripts resolved in 14 days", or something like it. I can't remember the exact wording, but in essence, they want to measure how long it takes us to eliminate slow patterns, and they want us to eliminate at least 90%.

While I really like that the customer has perfomance in mind, I'm not sure how I would even measure this, and if it's even feasible. As of now, I'm using the slow patterns log sporadically, or if I know that I've built something that's potentially slow. There always seems to be lots of slow queries and scripts in there, even OOTB ones.

I'm not sure if it's even possible to eliminate 90% of all slow queries and script executions. In my mind, sometimes things might just take long, and you'd have to check on an individual basis if that's acceptable or not (for example, a quarterly running scheduled job that takes a while would be more acceptable than a before query business rule that gets executed multiple times per second, worst case).

Adding to this, even if I wanted to go through with it, I wouldn't know how to measure it. I mean, the slow patterns logs are just logs, I'm not sure if I could measure something like "List of slow queries that have not been resolved in 15 days" with the information given in the log. I guess I could report on "Slow queries where First Sighting > 15 days and Last Sighting < 15 days", or something like that. Another Idea was to create Incidents for every slow query entry, but I'm not sure if I really want that.

I'd like to hear your thoughts on that, and if anyone of you also measures KPIs based on the slow pattern logs.

CezaryBasta · ‎03-14-2023

Hi,

As you said, slow queries happen every now and then, here and there, you won't avoid it. Maybe what your customer really needs is some kind of slow query analytics? Like, is there a common pattern? I can't imagine reporting on that though.

In ServiceNow instance we don't have a lot control over the queries itself, we can't alter the SQL syntax. So we are left with observing and some manual actions. I guess it also depends on the target instance - I've just checked what's there in Utah and you can even find index suggestions there.

So although it has to be checked, creating KPI on slow queries makes no sense to me, it's too dependant on random events that we don't have control on (DB load, node load, server load, DB version, DB query, ...) to mitigate

--
See more of my content here.