Overview
This article addresses some concerns that occur when troubleshooting (or designing) an inbound integration. This article uses the term "inbound" to mean an integration where ServiceNow is the provider as opposed to the consumer. The primary audience of this document is customer developers who build integrations that pull data out of, or push data into, ServiceNow. This article is a supplemental resource to the Web Services Integrations Best Practice information available in the product documentation.
User and session management
Integration User Setup
Ensure that each of your integrations uses a separate user account. It may be convenient to use one generic account such as "sn.integration.user" for ALL integrations, but this is not a good practice. With a dedicated account per integration, it is easy to identify which integration is causing a problem, and if you need to switch off a single integration in an emergency, an administrator can simply lock out that account.
It seems obvious, but integrations should use a local user account as opposed to a remotely authenticated account type (for example, LDAP). Having remotely authenticated integration users can add excessive overhead to the integration request process and cause serious performance issues.
Session Handling
Each production instance of ServiceNow uses an application cluster to divide processing requirements across at least two physical application (app) servers and a variable number of nodes (Apache Tomcat instances wrapping ServiceNow code) hosted on those app servers. To ensure sessions are evenly distributed between the various Tomcats in the instance cluster, ServiceNow routes all incoming transactions through a load balancer inside our network.
If a transaction contains a certain cookie (provided in a previous response), then the load balancer sends the transaction to the application server specified in the cookie. If the cookie is not provided, then the transaction is arbitrarily routed to one of the nodes in the cluster based on an algorithm in the load balancer.
When the transaction reaches the node, the Tomcat servlet checks for the presence of a second cookie. This second cookie is used to determine if the user related to the given transaction has already been authenticated. If the cookie is present, then the transaction is associated with an existing session object. If the cookie is not present, then the transaction must be authenticated and a new session object is created.
Understanding how cookies are used in ServiceNow session management is an important factor in determining how to design your inbound integration.
In most cases, a simple web service client does not include cookies in each subsequent request. Because the cookies are not included, ServiceNow does not know to reuse the session from the previous request and creates a new Java session for each request. If too many sessions accumulate in Java, the application can run out of memory. For more information, see REST and other integration traffic often leads to stale session buildup and scarce available memory. To mitigate this risk, in Fuji Patch 7 and all later versions, ServiceNow sets a short session timeout value (5 minutes by default) for all integration traffic. The default integration timeout value can be overridden in the "GetIntegrationSessionTimeout" Installation Exit.
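To illustrate the difference on the client side, here is a minimal Python sketch using the widely available `requests` library. The instance URL, credentials, and endpoint are placeholders, not values prescribed by this article:

```python
import requests

BASE = "https://yourinstance.service-now.com"  # placeholder instance URL
AUTH = ("sn.my.integration", "secret")         # hypothetical integration account

# Without cookie reuse: each call omits the cookies from the previous
# response, so the instance authenticates again and builds a brand-new
# session object in Java for every request.
for _ in range(3):
    r = requests.get(BASE + "/api/now/table/incident",
                     params={"sysparm_limit": 1}, auth=AUTH)
    r.raise_for_status()

# With cookie reuse: requests.Session stores the cookies returned by the
# first response and replays them, so the load balancer routes follow-up
# calls to the same node and the instance reuses one session.
with requests.Session() as s:
    s.auth = AUTH
    for _ in range(3):
        r = s.get(BASE + "/api/now/table/incident",
                  params={"sysparm_limit": 1})
        r.raise_for_status()
```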
The ServiceNow product documentation recommends that you always include cookies so that your integration uses persistent sessions. This avoids excessive session creation and memory issues. However, this method also has potential drawbacks.
- If cookies are included to ensure persistent sessions, the integration "sticks" to a single node. This might mean that one of your nodes receives a lot more integration traffic than the others and may become overloaded. Users logged into that node may experience sub-optimal performance.
- Using persistent sessions can also potentially be a problem because it "un-parallelizes" the integration. The ServiceNow application only allows one transaction per session at a time. This behavior is called session synch. When integrations are not using persistent sessions, they use an asynchronous model. This allows parallel requests from the same integration to be processed at the same time. However, when integrations are using persistent sessions, session synch causes the integration to use a synchronous model where each request must wait for the previous one to complete.
Before creating a new integration, consider the impact of your session management configuration. The following is not an exhaustive list of considerations, but can demonstrate the principles involved.
What Session Management Option is Best for My Performance?
There is no one-size-fits-all answer to this question. To determine what model is going to work best for your situation, review your specific business case. Consider:
- how many requests will be sent from the integration per second (frequency)?
- how long will it take to process each request (duration)?
As stated earlier, if your integration includes the cookies to achieve persistent sessions, then session synch limits the integration to one operation at a time. This means that if you send requests faster than they can be processed, your requests start to accumulate wait time. Before you configure your integration for session persistence, consider whether your duration and frequency rates require more than one operation to be processed at a time. To keep pace, the average duration multiplied by the frequency must stay below 1; if the product is greater than 1, your integration will start to fall behind.
[Avg. duration in seconds] * [Requests per second] < 1
For example, suppose you have an integration with a peak frequency of 10 requests per second and a worst-case response time of 200 milliseconds per request. Under these assumptions, every second of peak activity builds up one additional second of wait time.
0.2 seconds avg. duration per request * 10 requests per second = 2 seconds of processing per 1 second of traffic
(This scenario would cause an integration with persistent sessions to fall 1 second further behind for every second that transactions continue at this rate.)
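The same arithmetic as a few lines of Python, using the hypothetical numbers from the example above:

```python
duration = 0.2    # average seconds of processing per request
frequency = 10    # requests arriving per second

backlog = 0.0
for second in range(1, 6):
    # Work arriving each second minus the 1 second of work a single
    # synchronous session can complete in that time.
    backlog += duration * frequency - 1.0
    print(f"after {second}s of peak traffic: {backlog:.0f}s of accumulated wait")
# Prints 1s, 2s, 3s, ...: one extra second of wait per second of traffic.
```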
If your integration exceeds the rate that a single thread can process, then subsequent transactions start to queue. This might not be a problem if the queueing only happens for short periods of time. However, if there are more than 5 "waiters" for a single session, or the currently executing transaction takes longer than 30 seconds, ServiceNow starts to automatically reject any additional requests to the same session. These requests are returned to the client with HTTP code 202. In addition to the session synch limitations, each semaphore pool has a queue of waiting transactions. This queue holds any requests that cannot be processed immediately by an available semaphore. If all 150 positions in the queue are taken, then any additional incoming requests are not processed and are immediately returned to the client with HTTP code 429.
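On the client side, it is worth handling these rejections explicitly instead of treating them as hard failures. The following is a rough Python sketch, assuming the `requests` library; the retry count and backoff intervals are illustrative choices, not values prescribed by ServiceNow:

```python
import time
import requests

def get_with_backoff(session, url, max_retries=5, **kwargs):
    """Retry when the instance rejects a request because of session synch
    (HTTP 202 returned for a rejected waiter) or a full semaphore queue
    (HTTP 429), backing off between attempts."""
    delay = 1.0
    for _ in range(max_retries):
        resp = session.get(url, **kwargs)
        if resp.status_code not in (202, 429):
            return resp
        time.sleep(delay)   # wait before retrying the same session
        delay *= 2          # exponential backoff; tune to your workload
    raise RuntimeError("request still rejected after %d retries" % max_retries)
```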
If the frequency/duration combination of your integration is too much to be handled synchronously, then your first step should be to try to improve the integration. Devise a way to process the requests more quickly or to reduce the frequency of requests per session. Determine if there is something that can easily be done to reduce the duration of your web service requests.
- Are you querying a specific date range based on the incident.sys_created_on, but have no database index on that field?
- Are you pulling back more data than you need? Can you reduce the number of records queried/updated? Can you reduce the number of fields supplied in the payloads?
- Is there an inefficient business rule being executed (see the slow business rule log in Geneva or perform the same query through the UI while Debug Business Rules is turned on)?
- Is your integration client set to re-use TCP connections (avoiding multiple SSL handshakes)?
If you have reviewed the efficiency of the operations being performed and determined that they are reasonably efficient, then you should look at ways to reduce the frequency of the requests. Often there is a way to reduce the frequency of requests at the web service client. Here are several options:
- If you can control the number of requests sent out per unit of time on the client-side, this might be an easy way to throttle the integration down to workable levels.
- Break the integration into smaller parts so that no single session carries the full request volume.
- Have multiple active client threads, each with its own session (see the sketch after this list). This can pose a maintenance challenge for the integration administrator, but might be worth considering. The more client threads you divide your web service requests between, the more you can make use of ServiceNow's load balancing feature.
- Do not reuse sessions. See the following section for more details about this option.
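As a sketch of two of these ideas combined (client-side throttling plus multiple threads, each with its own session), here is a hypothetical Python example. The thread count, sleep interval, URL, and credentials are all assumptions to be tuned for a real integration:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://yourinstance.service-now.com"  # placeholder instance URL
AUTH = ("sn.my.integration", "secret")         # hypothetical integration account

# Thread-local storage gives each worker thread its own cookie jar, and
# therefore its own instance session that the load balancer can pin to a
# different node.
local = threading.local()

def get_session():
    if not hasattr(local, "session"):
        local.session = requests.Session()
        local.session.auth = AUTH
    return local.session

def fetch(sys_id):
    time.sleep(0.1)  # crude per-thread throttle: at most ~10 requests/second
    r = get_session().get(BASE + "/api/now/table/incident/" + sys_id)
    r.raise_for_status()
    return r.json()

# Four parallel sessions instead of one synchronous session.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, ["sys_id_1", "sys_id_2"]))  # hypothetical IDs
```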
Do I Need to Reuse Sessions?
The best practice information in the product documentation encourages reuse of sessions. While this is the recommended best practice, many customers can safely implement their web services without reusing sessions by adjusting the factors that contribute to the average number of active sessions per node. Frequency and session timeout length are the main factors that affect the number of active sessions. For example, suppose you have an integration that sends approximately 5 requests every second. If you have a 1-hour session timeout on your instance, with 2 nodes, and you are not reusing sessions, this results in about 9,000 active sessions per node at any given point in time.
[Session timeout in seconds] * [Avg. requests per second] / [Number of nodes] = [Avg. active sessions per node]
(1 * 60 * 60) * 5 / 2 = 9,000
By lowering the global session timeout value of your instance (glide.ui.session_timeout), you can reduce the number of active sessions at any given point in time. The base system value for this property is 30 minutes. In Fuji Patch 7 and later versions of ServiceNow, an independent timeout value for integration users is available. By default, the integration timeout value is set to 5 minutes and can be configured independently of the global glide.ui.session_timeout using the GetIntegrationSessionTimeout Installation Exit. For more information, see REST and other integration traffic often leads to stale session buildup and scarce available memory.
Another question is: how many active sessions are too many? You can estimate whether your system will run out of memory due to a high number of sessions. To see how many sessions are in use on each node of your instance, view the ServiceNow Performance homepage. Check the 30-day view and note the trend of maximum session counts per node.
For more information, see ServiceNow Servlet. Also on the ServiceNow Performance homepage, check how close you are to the heap memory threshold. Generally, memory usage should not spike above 80% (about 1.6 GB of the 2 GB heap fixed per node) on a regular basis. There have been cases where 7,000 active sessions have taken 85% of heap memory. From these figures, we can roughly estimate that every 820 sessions represent about 10% of available memory. The overall memory usage of your instance also depends on other aspects of your ServiceNow usage. If your normal memory garbage collection goes from 50% to 70% (about 1 GB to 1.4 GB), then adding another 10% of consistent memory usage (820 active user sessions per node) might put you over the edge.
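Putting the two rules of thumb together, here is a back-of-the-envelope estimator in Python. The 820-sessions-per-10%-of-heap ratio is the rough figure quoted above, not a guaranteed constant:

```python
def sessions_per_node(timeout_secs, requests_per_sec, nodes):
    """Average concurrent sessions per node when sessions are not reused."""
    return timeout_secs * requests_per_sec / nodes

def heap_pct(sessions, sessions_per_10_pct=820):
    """Very rough share of the fixed 2 GB node heap those sessions consume."""
    return sessions / sessions_per_10_pct * 10

print(heap_pct(sessions_per_node(3600, 5, 2)))  # 1-hour timeout: ~110%, unsustainable
print(heap_pct(sessions_per_node(300, 5, 2)))   # 5-minute timeout: ~9%
```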
Data imports
When planning go-live and ongoing large scale data imports, consider the points in this section. These are often overlooked and can cause import processing delays as well as performance degradation across the instance:
Text indexing
For initial large scale data imports (more than 500k records in quick succession), do the following:
- Turn off text indexing on the target table. For each record being imported, a text_index event is inserted into the sysevent table. As the sysevent table floods, the inserts get progressively slower (5 to 6 seconds per row), which dramatically increases the time your import takes to run. More importantly, it also severely impacts the "normal" operation of an instance (remember that the sysevent table holds text index events, metric update events, and all "regular" events in the default event queue). The most common symptom is that notifications are not generated, but many other actions can be delayed.
- After the import is complete, re-enable the text index property on the collection record in sys_dictionary. Take this opportunity to fine-tune the fields that are text searchable. This is especially important for CMDB tables, sys_user, and any table that is likely to have multiple or frequent updates to inconsequential fields. Adding the no_text_index=true attribute to all fields you do not want to be searchable by Zing improves performance when searching for artifacts you do want searched, and also reduces the overhead of event processing.
Note: To make data that was imported while text indexing was turned off available through text search, run a separate re-indexing operation for the relevant table(s). For more information, see Index a single table.
Scripts triggered from transform maps or insert/update operations
Be aware of business rule or transform map logic that is triggered by the import. If there are synchronous business rules or transforms that perform sub-optimal GlideRecord queries (for example, pulling data from large un-indexed record sets), this increases the execution time of importing each record and gets progressively slower as the datasets grow (see the section on testing a full import run). Continual requests for large un-indexed datasets also flush the buffer pool on the database, which has a negative impact on the entire instance. If a slow operation like this is identified, see if it can be improved by following best practices for scripting and query execution.
One way to improve the speed of an import is to move slow script execution to an Asynchronous Business Rule. The import itself can complete without having to wait for the slower scripts to complete. However, be very careful about using Asynchronous Business Rules because you are essentially creating a multi-threaded situation. When moving some operation to a multi-threaded design, ask yourself if anything is dependent on the completion of the operation. If something is dependent on the completion of an asynchronous operation, then there is the potential for a race condition. For example, suppose you are importing tasks and filling in the category and sub-category fields with an asynchronous business rule. Then, suppose you have a business rule that fires on insert and assigns the incident to a certain assignment group based on the incident category and sub-category. If the asynchronous business rule takes long enough to complete, this could result in unexpected behavior. This is just a simplified example, but you should consider the pattern. What could happen if the operation you are running asynchronously takes 10 seconds to complete?
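The race is easy to reproduce outside ServiceNow. Here is a deliberately contrived Python sketch of the same pattern; the field names and delay are illustrative only:

```python
import threading
import time

record = {"category": None, "assignment_group": None}

def async_enrichment():
    """Stands in for the asynchronous business rule that fills in category."""
    time.sleep(0.5)  # simulated slow lookup
    record["category"] = "network"

def on_insert_routing():
    """Stands in for the insert rule that routes based on category."""
    if record["category"] == "network":
        record["assignment_group"] = "Network Ops"
    else:
        record["assignment_group"] = "Service Desk"  # category not set yet!

t = threading.Thread(target=async_enrichment)
t.start()
on_insert_routing()  # runs before the enrichment finishes
t.join()
print(record)  # {'category': 'network', 'assignment_group': 'Service Desk'}
```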
One last thing to consider is the Run Business Rules option. If you are using a transform map for your import, take notice of the Run Business Rules option on the Transform Map form. Whenever possible, this option should be cleared. Clearing the option tells the system to bypass all scripts, engines, auditing, and the customer update tracking mechanism (usually not applicable to tables that are the target of an import). Before doing this, of course, make sure that you do not need any of those to run. If you need only one or two business rules to run, you might want to replicate their logic in a transform map script so you can clear the option and avoid the other costly operations of the scripts and engines. Clearing Run Business Rules often saves 50-90% of execution time for an import. To see the full list of items that are skipped by clearing the Run Business Rules option, see Execution order of scripts and engines.
Note: Clearing the Run Business Rules option does not bypass the update of the sys_ fields (sys_created_on, sys_created_by, sys_updated_on, sys_updated_by, sys_mod_count).
Make an implementation-specific assessment of third-party import tools (for example, Perspectium or SnowMirror)
Third-party tools that successfully export data from ServiceNow also advertise the capability to import data into another instance. Their products are typically only tested at a generic level and do not take into account the text indexing policies and custom logic that may be implemented in the target instance. Even if you have successfully leveraged one of these replication buses in another project, take time to check how the engine will behave with a particular instance.
If the tool of choice relies on scheduled jobs to poll for or subscribe to information from an external source, check the number of scheduled jobs that have been provisioned for this purpose. Remember that each node in the cluster checks in for 'past due' jobs every 30 seconds. If you have 20 subscriber jobs configured, the first node to check the queue picks all of them up (as long as there is space in its scheduler queue). They then sit in the scheduler queue on that node, waiting for the 8 scheduler workers to process the first 8 jobs.
Also, verify that the priority of critical scheduled jobs, such as the events processors (various), SMTP sender, and POP reader, are set to 25. This ensures that core platform functionality can proceed normally if the scheduler workers are conducting import-related activities.
Finally, check for any custom table that is used as part of the replication bus to 'stage' the incoming data. Staging tables (those that extend sys_import_set_row) are cleaned by the scheduled job named Import Set Deleter. You can access the configuration of this job by navigating to System Import Sets > Scheduled Cleanup in the application navigator. The data retention period for this job is set to 7 days by default, but often needs to be shortened. If the retention period is set to, say, 7 days and you perform an import of millions of records over a couple of days, the import staging tables grow very large (perhaps multiple GB), causing insert/update transactions to take longer and longer as more data is processed.
Test a Full Import Run
Performing a full import is an important part of verifying that data transforms correctly and in a timely fashion without affecting other platform components. It is possible to pass over something that seems inconsequential, or to simply miss something, that ultimately causes issues when you run your import in production.