Anne Marie Duff
ServiceNow Employee

How do I Troubleshoot Service Graph Connector Performance Problems?

 

This article outlines how to identify the bottleneck for Service Graph Connector Scheduled Import Jobs that are taking longer than anticipated to run. Throughout this article we will use a sample SG-Tanium Hardware and Software Scheduled Import Job run for the Tanium Service Graph Connector for demonstration purposes. The following topics are covered in this article:

 

  • Determine whether Service Graph Connector is Load Rate Limited or Transform Rate Limited
  • Determine Import Set Transform %Processing Time breakdown 
  • Troubleshooting the Root Cause of a Performance Problem

 

Determine whether Service Graph Connector is 'Load Rate Limited' or 'Transform Rate Limited'

 

At a high level, the Import Job is broken up into the 2 processing sections below:

 

  • Loading the data  - Single Threaded
  • Transforming the data  - Multi-Threaded

 

So to start, you want to determine which of these 2 processing sections contains the performance bottleneck. In other words, you want to determine whether the Service Graph Connector is Load Rate Limited or Transform Rate Limited.

 

Load Rate Limited

 

The rate at which data is loaded from the external data source is the main factor determining overall end-to-end process time.  If your process is Load Rate Limited, improving the Transform Rate will not substantially improve the end-to-end time.  Instead, you will want to focus on improving the load rate/time, either by improving the throughput or reducing the amount of data that has to be loaded and processed.

 

Transform Rate Limited

 

The rate at which data is being transformed and updated into the CMDB is the main factor determining overall end-to-end process time.  If your process is Transform Rate Limited, you may be able to improve end-to-end process time by finding and optimizing slow executing transformations, or adding more import set transformer jobs for greater concurrency.

 

To determine whether the Service Graph Connector is Load Rate Limited or Transform Rate Limited, do the following for the Last Loaded Import Set in the Concurrent Import Set Record associated with the last Scheduled Import Job run:

 

  1. Determine the Load Completed Time for the Import Set
  2. Determine the Transform Started Time for the Import Set
  3. Determine if the Transform Started Time is < 1 Minute later or > 1 Minute later than the Load Completed Time

 

If the Transform Started Time is < 1 Minute later than the Load Completed Time, then the Service Graph Connector is Load Rate Limited.

- There was an Import Set Transformer job sitting idle, waiting for an Import Set so that it could begin Transforming its data.

 

If the Transform Started Time is > 1 Minute later, then the Service Graph Connector is Transform Rate Limited.

- There was an Import Set that was Loaded and Ready to be processed, but there was no Import Set Transformer Job that was available to process it.
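The check above can be sketched as a small script. This is illustrative only — the function name is hypothetical, and the two timestamps would come from the Concurrent Import Set Record and its Transform Histories tab:

```javascript
// Classify an Import Job run as Load or Transform Rate Limited from the gap
// between the Last Loaded Import Set's Load Completed time and its Transform
// Started time, using the one-minute rule of thumb described above.
function classifyRateLimit(loadCompleted, transformStarted) {
  var gapMs = new Date(transformStarted) - new Date(loadCompleted);
  return gapMs < 60 * 1000 ? "Load Rate Limited" : "Transform Rate Limited";
}
```

For example, a Transform that starts 14 seconds after its Load completes indicates a Load Rate Limited run, while a gap of many minutes or hours indicates a Transform Rate Limited run.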

 

If it is determined that the bottleneck is in the Loading Data sub process, then you know that you need to investigate what is causing the Data Loading slowness, e.g. is there a latency problem connecting to the Data Source?

 

If, on the other hand, the performance problem is deemed to be in the Transform sub process, then you need to delve deeper into what is causing the slow Transforms. How to do this is covered in the Determine 'Import Set Transform %Processing Time' breakdown section below.

 

The screen shot below shows a CISET0010034 Concurrent Import Set Record for the Tanium Service Graph Connector SG-Tanium Hardware and Software Scheduled Data Import Job run. You can see that the Load Completed Time for the Last Loaded Import Set (ISET0016457) in this Concurrent Record is 2024-05-22 08:21:26.

 

[Screenshot: AnneMarieDuff_1-1716808872267.png]

 

The screen shot below shows the Transform Histories Tab for this CISET0010034 Concurrent Import Set Record. You can see that the Transform Started Time for the Last Loaded Import Set (ISET0016457) in this Concurrent Record is 2024-05-23 21:46:04.

 

[Screenshot: AnneMarieDuff_2-1716809384470.png]

 

For our Tanium Import Job run example, the Transform for the Last Loaded Import Set ISET0016457 started at 2024-05-23 21:46:04, which is 37 hours, 24 minutes & 38 seconds AFTER its Data Load was completed at 2024-05-22 08:21:26.
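As a quick sanity check, the gap between the two timestamps above can be computed with a short script (the function name is illustrative):

```javascript
// Return the gap between a Load Completed time and a Transform Started time,
// formatted as hours, minutes, and seconds.
function loadToTransformGap(loadCompleted, transformStarted) {
  var totalSecs = Math.floor((new Date(transformStarted) - new Date(loadCompleted)) / 1000);
  var h = Math.floor(totalSecs / 3600);
  var m = Math.floor((totalSecs % 3600) / 60);
  var s = totalSecs % 60;
  return h + "h " + m + "m " + s + "s";
}
```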

 

This shows us that there was no performance problem associated with Loading the Data, and that the bulk of the overall Scheduled Import Job processing time was spent Transforming the Data. In other words, our Tanium Service Graph Connector SG-Tanium Hardware and Software Scheduled Data Import Job run was Transform Rate Limited as opposed to Load Rate Limited.

 

We now know that we need to delve into the Transform sub process to see where the time is being spent, thus allowing us to see if there is a performance problem associated with Transforming the data. This is covered in the next section.

 

 

Determine 'Import Set Transform %Processing Time' breakdown

 

The following topics are covered in this section:

 

  • Determine overall Import Set Transform %Processing Time breakdown 
  • Determine RTE %Processing Time breakdown
  • Determine IRE %Processing Time breakdown

 

Determine overall Import Set Transform %Processing Time breakdown

 

The Import Set Processing Concepts screen shot below shows how the Import Set Transform sub process Processing Time is broken out:

 

[Screenshot: AnneMarieDuff_0-1715881352160.png]

 

You can see from this Screen shot that the Transform sub Process is broken out into the below parts:

 

  • RTE (Robust Transform Engine)
  • IRE (Identification & Reconciliation Engine)

 

We want to determine the overall Import Set Transform %Processing Time across the RTE & IRE in order to determine whether the Import Set Transform Rate is limited by the RTE, the IRE, or neither.

 

It is possible to capture RTE & IRE Processing Statistics in the RTE Stats[sys_rte_pattern] & IRE Stats[sys_ire_pattern] Tables respectively. These Tables capture RTE & IRE Processing Statistics as CPU Time rather than Actual Elapsed Time, at both the RTE/IRE Summary Level and the more detailed RTE/IRE Sub Process Level. Set the below System Properties in order to enable the capturing of these RTE & IRE Processing Statistics:

 

  • glide.robust_transform_engine.rte_stats.enabled = true
  • glide.identification_engine.ire_stats.enabled = true

 

In order to get an accurate reading of the overall Import Set Transform %Processing Time for one of your Import Jobs, follow the steps below:

 

  • Set the RTE/IRE Stats System Properties as described above.
  • Ensure that no other Jobs are running, to allow us to focus on capturing data associated with our Import Job run only.
  • Clear out the following Tables
  1. Slow Patterns[sys_pattern]
  2. Slow Queries[sys_query_pattern]
  3. Slow Scripts[sys_script_pattern]
  4. Partial Payloads[cmdb_ire_partial_payloads]
  5. Incomplete Payloads[cmdb_ire_incomplete_payloads]
  • Kick off your Import Job

 

At the end of the Import Job run do the following:

 

  • Navigate to the RTE Stats[sys_rte_pattern] Table and look up the Total Time(ms) value for RTE::transform_time_whole.

This gives the Total Time spent by the RTE in Transforming the Data. The RTE::transform_time_whole value was 2,799,560,349 Milliseconds in CPU Time for our SG-Tanium Hardware and Software Scheduled Data Import Job run.

 

  • Navigate to the IRE Stats[sys_ire_pattern] Table and look up the Total Time(ms) value for Payload Execution Time.

This gives the Total Time spent by the IRE in processing the Transformed Payload Data being passed to it by the RTE. The Payload Execution Time value was 1,739,556,598 Milliseconds in CPU Time for our SG-Tanium Hardware and Software Scheduled Data Import job run.

 

Having the Total RTE Time and Total IRE Time associated with the respective RTE & IRE sub processes for your Scheduled Import job run will allow you to determine the overall Import Set Transform %Processing Time breakdown across these sub processes by using the formulas below:

 

Total Import Set Transform Processing Time = Total RTE Time + Total IRE Time

%RTE processing Time                                     = (RTE Time/Total Transform Processing Time) * 100

%IRE processing Time                                      = (IRE Time/Total Transform Processing Time) * 100

 

Using these Total RTE Time (RTE::transform_time_whole) and Total IRE Time (Payload Execution Time) values for our Import Job run, the overall Import Set Transform %Processing Time breakdown across the RTE & IRE sub processes was as below:

 

Total Import Set Transform Processing Time = 4,539,116,947 ms (2,799,560,349 ms (RTE Time) + 1,739,556,598 ms (IRE Time) )

%RTE processing Time                                     = 61.68% ( 2,799,560,349/4,539,116,947 * 100)

%IRE processing Time                                      = 38.32% ( 1,739,556,598/4,539,116,947 * 100)
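The formulas above can be reproduced with a small helper. This is a sketch — the function and field names are illustrative:

```javascript
// Split the Total Import Set Transform Processing Time between RTE and IRE,
// returning each share as a percentage rounded to two decimal places.
function transformSplit(rteMs, ireMs) {
  var totalMs = rteMs + ireMs;
  return {
    totalMs: totalMs,
    rtePct: Number((rteMs / totalMs * 100).toFixed(2)),
    irePct: Number((ireMs / totalMs * 100).toFixed(2))
  };
}
```

Feeding in the RTE and IRE totals from this run reproduces the 61.68% / 38.32% split shown above.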

 

We can see that Time was split relatively evenly between the RTE & IRE sub processes for the Import Set Transform in our Import Job run, with slightly more time spent on the RTE than on the IRE. This is what we would expect to see, i.e. a relatively even split of Time between the RTE & IRE for an Import Job run.

 

Determine RTE %Processing Time breakdown

 

The RTE Stats screenshot below shows the RTE Statistics that were captured for our SG-Tanium Hardware and Software Scheduled Data Import job run:

 

[Screenshot: AnneMarieDuff_0-1716811872162.png]

 

The RTE::transform_time_whole Time shown in this screen shot is inclusive of the below 3 sub processes (shown in the Import Set Processing Concepts screen shot from the above Determine overall Import Set Transform %Processing Time breakdown section):

 

  • Conditional Evaluation - RTE::transform_time_slice1_condition_evaluation
  • Transform in Batch - RTE::transform_time_slice2_transform_in_batch
  • Output Result Formatting - RTE::transform_time_slice3_transformed_result_mapping

 

For our Import Job run, the Times for these RTE sub processes were as below:

 

RTE::transform_time_slice1_condition_evaluation              = 65,110,414 ms

RTE::transform_time_slice2_transform_in_batch                 = 2,734,167,639 ms

RTE::transform_time_slice3_transformed_result_mapping = 26,054 ms

 

Based on these numbers, the overall RTE %Processing Time breakdown for our Import Job run is below:

 

Total RTE Processing Time (RTE::transform_time_whole) = 2,799,560,349 ms

%Conditional Evaluation Processing Time                          = 2.32% ( 65,110,414/2,799,560,349 * 100)

%Transform in Batch Processing Time                                 =  97.66% (2,734,167,639/2,799,560,349 * 100)

%Output Result Formatting Processing Time                     = 0.02% (26,054/2,799,560,349 * 100)

 

You can see that the bulk of Time is spent in the Transform in Batch sub process, which takes 97.66% of the overall RTE Processing Time. This is a typical RTE Process breakdown that is to be expected.

 

Note: The IRE::Message Listener sub process shown in the above screen shot is the process that calls the IRE after the RTE has finished Transforming the Input Payload. It passes the Transformed Output Payload from the RTE to the IRE. It can be thought of as the total time that the IRE is executing. The screen shot below shows this flow.

 

[Screenshot: AnneMarieDuff_0-1716215421379.png]

 

 

Determine IRE %Processing Time breakdown

 

The IRE Stats screenshot below shows some of the IRE Statistics (20 out of 54 in Total) that were captured for our SG-Tanium Hardware and Software Scheduled Data Import job run. The Table is sorted on the Total Time column so the Longest running sub-processes are shown at the Top of this list.

 

[Screenshot: AnneMarieDuff_1-1716812469490.png]

 

At a High Level, the Payload Execution Time shown in this screen shot is inclusive of the below sub processes (shown in the Import Set Processing Concepts screen shot from the above Determine overall Import Set Transform %Processing Time breakdown section):

 

  • Acquire Mutexes (IdentificationEngine::acquireMutexes)
  • Process CI's (IdentificationEngine::process-processCIs)
  • Commit Records in Batches (IdentificationEngine::process-CommitRecordsinBatches)
  • Other

 

For our Import Job run, the Times for these IRE sub processes were as below:

 

Acquire Mutexes                                                             = 96,490,618 ms

Process CI's                                                                      = 338,146,604 ms

Commit Records in Batches                                           = 1,125,659,226 ms

 

Based on these numbers, the overall IRE %Processing Time breakdown for our Import Job run is below:

 

Total IRE Processing Time (Payload Execution Time) = 1,739,556,598 ms

%Acquire Mutexes Processing Time                            = 5.55%    (96,490,618/1,739,556,598 * 100)

%Process CI's Processing Time                                   =  19.44% (338,146,604/1,739,556,598 * 100)

%Commit Records in Batches Processing Time        =  64.71% (1,125,659,226/1,739,556,598 * 100)

%Other                                                                          =  10.3%
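The same arithmetic generalizes to any set of sub process timings. A sketch (function name illustrative) that also derives the implied "Other" share from whatever time the named slices do not account for:

```javascript
// Percentage of a parent processing time taken by each named sub process,
// plus the remainder ("Other") not accounted for by the named slices.
function subProcessBreakdown(totalMs, slices) {
  var pct = {};
  var accountedMs = 0;
  for (var name in slices) {
    pct[name] = Number((slices[name] / totalMs * 100).toFixed(2));
    accountedMs += slices[name];
  }
  pct.Other = Number(((totalMs - accountedMs) / totalMs * 100).toFixed(2));
  return pct;
}
```

Passing in the IRE timings above reproduces the 5.55% / 19.44% / 64.71% / 10.3% split shown.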

 

You can see that the Bulk of Time is spent in the Commit Records in Batches sub process, which takes 64.71% of the overall IRE Processing Time. This is a typical IRE Process breakdown that is to be expected.

 

If you see a percentage breakdown that is significantly different from this, then you know that you need to look at the sub process that is taking a larger than expected Processing Time. For example, if a large percentage of the Processing Time is spent in the Acquire Mutexes sub process, then you know that you need to investigate why so much time is being spent by your Threads acquiring an exclusive Lock on Shared Resources.

 

Commit Records in Batches sub process

 

The Commit Records in Batches Sub process is further broken down into the below sub processes:

 

  • Commit Inserts (IdentificationEngine::CommitRecordsinBatches-commitInserts)
  • Query and Commit Updates (IdentificationEngine::CommitRecordsinBatches-queryAndCommitUpdates)
  • Other

 

For our Import Job run, the Times for these sub processes were as below:

 

Commit Inserts                              = 32,771,768 ms

Query and Commit Updates       = 1,091,643,152 ms

 

Based on these numbers, the overall Commit Records in Batches %Processing Time breakdown for our Import Job run is below:

 

Total 'Commit Records in Batches' Processing Time = 1,125,659,226 ms

%Commit Inserts Processing Time                              = 2.91 % (32,771,768/1,125,659,226 * 100)

%Query and Commit Updates Processing Time        = 96.97 % (1,091,643,152/1,125,659,226 * 100)

%Other                                                                           = 0.12 %

 

You can see that the bulk of the Time was spent Committing Updates to already existing CI's in the CMDB, with a smaller percentage of Time spent Inserting new CI's. For our sample SG-Tanium Hardware and Software Scheduled Import Job run, most of the CI's coming in from the Import already existed in the CMDB, so this makes sense.

 

If, on the other hand, you are running an Import Job associated with your Service Graph Connector for the first time, then you can expect to see a greater percentage of the Commit Records in Batches Processing Time being given to the Commit Inserts sub process.

 

Troubleshooting 'Root Cause' of Performance Problem

 

If you determine that there is a Performance Problem in either the RTE or IRE sub processes, you can go to the Pattern Tables below to get more insight into where the Performance Problem lies:

 

  • Slow Scripts
  • Slow Queries

 

These Tables will contain Long Running Script or Long Running Query records respectively.

 

Note: The Slow Scripts[sys_script_pattern] and Slow Queries[sys_query_pattern] Tables are child Tables of the Slow Patterns[sys_pattern] Table. The Slow Patterns[sys_pattern] Table also contains the Long Running Script and Query records shown in these Tables, but with less detail.

 

Slow Scripts

 

The Slow Scripts[sys_script_pattern] Table has a detailed breakdown of Script sub processes spawned from the RTE & IRE sub processes described in the above sections. If there are any Long Running Scripts that are the source of the Performance Problem, they will be listed as Records in this Table. You can then drill into these records to get more details on what logic within these Scripts could potentially be causing the performance problem. Please refer to the ServiceNow Slow scripts log record detail Documentation Page for more details.

 

A screen shot of this Table, sorted on Total Execution Time, from our SG-Tanium Hardware and Software Scheduled Data Import job run is shown below:

 

[Screenshot: AnneMarieDuff_0-1717573006752.png]

 

Slow Queries

 

The Slow Queries[sys_query_pattern] Table has a detailed breakdown of Query sub processes spawned from the RTE & IRE sub processes described in the above sections. If there are any Long Running Queries that are the source of the Performance Problem, they will be listed as Records in this Table. You can then drill into these records to get more details on the Query that could potentially be causing the performance problem.

 

For example, there may be a long running Query being executed from an execute before script associated with the Transform Definition for your Service Graph Connector. Please refer to the ServiceNow Use a slow query log Documentation Page for more details.

 

A screen shot of this Table, sorted on Total Execution Time, from our SG-Tanium Hardware and Software Scheduled Data Import job run is shown below:

 

AnneMarieDuff_2-1717573958392.png

 

Version history
Last update:
‎07-24-2024 05:19 AM