Maik Skoddow
Tera Patron


In my "Titans of ServiceNow" session with @Uncle Rob I talked about a self-implemented solution I'm proud of: a custom importer that was able to import roughly 4 million records in about 3 hours. I promised I would write an article about it, and here it is. Please forgive me for only describing the conceptual approach instead of providing a ready-made solution. On the one hand, the solution is very specific to our use case, and on the other hand, it was developed as part of a customer project, so I can't simply share code that the customer has paid for.

 

But I think the approach should still be interesting, because the idea behind it can be transferred well to many other use cases.


Initial Situation & Constraints

 

For my current FSM-based project, various types of Foundation Data must be imported on an ongoing basis from two different company-internal databases so that Work Orders and Work Order Tasks can reference them, for example network elements (CMDB) or locations (table cmn_location). We don't have direct access to these databases; however, there is a large Elastic cluster that already imports them (and many more), preprocesses their data, and makes it available to consumers via a REST-based API.

 

Elastic is not a database but a so-called index, and it doesn't store database records but index documents. With each GET request to the REST-based API, you get at most 10,000 documents returned. That means pagination is mandatory, but in Elastic this cannot be done via simple URL parameters. Instead, you have to request a so-called PIT ID, which represents a frozen state of a temporary version of the underlying data. That PIT ID has to be provided in the payload of each request, and for pagination additionally the pointer to the last document from the previous batch. It is not so important here how exactly this works; I only mention it to make clear that an OOTB data source is not sufficient here. Maybe it would have worked with Data Stream Actions from Integration Hub, but according to our contract we only have the "Starter" version available.
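
To make the mechanics a bit more tangible, here is a strongly simplified sketch of how such a PIT-based loop can be implemented in a ServiceNow script. The endpoint, index name, and batch handling are placeholders, authentication and error handling are omitted, and this is explicitly not the code from our project:

// Open a point in time (PIT) to get a frozen view of the index
var ES_BASE = 'https://elastic.example.com';   // placeholder endpoint
var INDEX   = 'locations';                     // placeholder index name
var BATCH   = 10000;                           // Elastic returns at most 10,000 documents per request

var pitRequest = new sn_ws.RESTMessageV2();
pitRequest.setEndpoint(ES_BASE + '/' + INDEX + '/_pit?keep_alive=5m');
pitRequest.setHttpMethod('POST');
var pitId = JSON.parse(pitRequest.execute().getBody()).id;

// Page through the index: every request carries the PIT ID and - starting with the
// second batch - the sort values of the last document from the previous batch
var searchAfter = null;
var hits;
do {
  var body = {
    size: BATCH,
    pit:  { id: pitId, keep_alive: '5m' },
    sort: [{ _shard_doc: 'asc' }]
  };
  if (searchAfter) {
    body.search_after = searchAfter;
  }

  var searchRequest = new sn_ws.RESTMessageV2();
  searchRequest.setEndpoint(ES_BASE + '/_search');
  searchRequest.setHttpMethod('POST');
  searchRequest.setRequestHeader('Content-Type', 'application/json');
  searchRequest.setRequestBody(JSON.stringify(body));

  hits = JSON.parse(searchRequest.execute().getBody()).hits.hits;
  if (hits.length > 0) {
    searchAfter = hits[hits.length - 1].sort;  // pointer for the next batch
    // hand the documents over to the event queues (see "Challenge 1" below)
  }
} while (hits.length === BATCH);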


Challenges

 

Let's assume we implemented a naive solution in which requesting the data in nearly 400 batches (4 million divided by 10,000) and storing it in one of the target tables is done within the same script. Let's further assume that the creation of a new data record takes an average of 100 milliseconds, which I think is a realistic value for the CMDB or the cmn_location table, since several business rules are fired after each insertion. The whole import would then take about 4.6 days (4,000,000 records × 100 ms ≈ 400,000 seconds ≈ 111 hours). This is not only unacceptable, but would also lead to considerable technical difficulties on both sides. So one of the two challenges is to consume the data as fast as possible and cache it in a table that is inherently optimized for performance.

 

On the other hand, importing in just one script is very inefficient. The ServiceNow architecture is designed to run many actions simultaneously in the background. Furthermore, depending on the contract, each customer has at least 2 application nodes available that can be used for load balancing. The second challenge is therefore the parallelization of inserting the data into the corresponding target tables.


Solution

 

We agreed with the Elastic team that the data fields of the index documents are named exactly like our table fields in ServiceNow. This way, all the data can be transferred one-to-one into GlideRecords without time-consuming transformations. All additional calculations on the data, such as generating true ServiceNow references between two tables, were instead moved from the importer to a downstream process I've called the "Afterburner", which we won't look at further here. Eliminating all dependencies between index documents is a crucial prerequisite for the high degree of parallelization described below.
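
As a purely invented example, a location document delivered by Elastic could look like this, with every field named exactly like a column of cmn_location:

{
  "name":      "Berlin Data Center 01",
  "street":    "Example Street 1",
  "city":      "Berlin",
  "zip":       "10115",
  "country":   "Germany",
  "latitude":  "52.5200",
  "longitude": "13.4050"
}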

 

For both challenges, the solution uses the same artifact: the sysevent table with custom queues. Due to its central role in ServiceNow as a kind of message bus, it is very performant and designed to decouple data producers and data consumers from each other.


Challenge 1

 

The sysevent table is therefore used by the import script to cache blocks of Elastic documents. In field parm1 the name of the target table is stored, and in field parm2 a JSON-stringified array of 1 to x index documents, where x depends on the average size of an index document delivered by Elastic.
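
A heavily reduced sketch of this caching step could look like the following. The event name, the queue names, and the block size are illustrative assumptions, not the values from the project:

// Spread the received index documents over several custom event queues
var EVENT_NAME = 'custom.import.load';                            // placeholder event from the Event Registry
var QUEUES     = ['import_queue_1', 'import_queue_2', 'import_queue_3'];
var BLOCK_SIZE = 50;                                              // documents per event, depends on the document size

function cacheDocuments(targetTable, documents) {
  var queueIndex = 0;
  for (var i = 0; i < documents.length; i += BLOCK_SIZE) {
    var block = documents.slice(i, i + BLOCK_SIZE);

    // parm1 = name of the target table, parm2 = JSON-stringified block of index documents
    gs.eventQueue(
      EVENT_NAME,
      null,
      targetTable,
      JSON.stringify(block),
      QUEUES[queueIndex]          // the fifth parameter selects the custom queue
    );

    // fill the registered queues evenly (round robin)
    queueIndex = (queueIndex + 1) % QUEUES.length;
  }
}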


Challenge 2

 

The actual import does not take place in this script, but in registered Script Actions which are executed by ServiceNow whenever one of the import events occurs. The first step towards parallelization is to register multiple event queues, which are filled evenly by the import script. The second step towards parallelization is to use all existing application nodes. That can only be achieved by directly creating a record in table sys_trigger with "System ID" = "ALL NODES".
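
Before looking at the sys_trigger setup shown in the screenshots below, here is a strongly simplified sketch of what such a Script Action could look like. It relies on the one-to-one field naming described above; error handling and the hand-over to the "Afterburner" are omitted:

// Script Action registered for the import event:
// event.parm1 = name of the target table, event.parm2 = JSON-stringified array of index documents
(function processImportEvent() {
  var targetTable = event.parm1.toString();
  var documents   = JSON.parse(event.parm2.toString());

  for (var i = 0; i < documents.length; i++) {
    var doc = documents[i];
    var gr  = new GlideRecord(targetTable);
    gr.initialize();

    // thanks to the agreed naming convention the fields can be copied one-to-one
    for (var fieldName in doc) {
      if (gr.isValidField(fieldName)) {
        gr.setValue(fieldName, doc[fieldName]);
      }
    }

    gr.insert();
  }
})();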

MaikSkoddow_0-1686492078578.png

The field "Job context" of these parent records must contain the invocation of the method GlideEventManager('QUEUE NAME').process().
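
For a hypothetical custom queue named import_queue_1, the Job context (visible in the screenshot below) would therefore simply contain:

// Job context of the parent sys_trigger record ('import_queue_1' is a placeholder queue name)
GlideEventManager('import_queue_1').process();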

MaikSkoddow_2-1686493364272.png

As a result, ServiceNow creates child records automatically - one for each node:

MaikSkoddow_1-1686492998694.png

In summary and visualized, the solution is as follows:

MaikSkoddow_3-1686494417181.png


Result

 

With the help of the described approach, the time for the initial import could be reduced to roughly 3 hours. That was good enough for us, because we also wanted to make sure that users could work in ServiceNow as usual during the import without any performance losses, which was indeed the case.

 

It makes no sense to derive a universally valid speed factor for your own scenarios, because the real duration depends on many other conditions. In our case, for example, our ServiceNow instances are operated on-premise and the provider has optimized the database and network to the maximum, which means that our instances are much faster than those in ServiceNow's own cloud. Furthermore, other performance-hungry jobs (other imports, Discovery, Service Mapping, etc.) can also have a strong impact on the import speed. And last but not least, the import speed depends on the total amount of data as well as on the type of the target tables. For example, in the CMDB a lot of background actions are triggered after an insert, which would not be the case with a simple custom table.

Comments
Scott Halverson
ServiceNow Employee

@Phuong Nguyen This is the same design pattern we used for getting data out of ScienceLogic.

Jacques Clement
Kilo Sage

Thanks @Maik Skoddow. My understanding is that your custom Data Loader inserts entries directly in sys_trigger, which itself posts events in different queues. Is this correct?


I am a little bit confused about the sequence because in your summarized drawing, I don’t see where you insert the sys_trigger entries.

 

Thanks!

Maik Skoddow
Tera Patron

Hi @Jacques Clement 

I'm a bit confused by your question and I think you are mixing something up.

In the above image you can see that the Data Loader is inserting records into the sysevent table. The sys_trigger table contains the processors that are described in the section "Challenge 2".

Maik

Muhammad Awais1
Tera Contributor

Hi @Maik Skoddow
Please can you describe the sequence of the whole process a little bit?

1) According to Challenge 1, I created an incident.load event in the Event Registry table:

MuhammadAwais1_0-1687764364758.png

I'm using a Scripted REST API (POST method). Once the data is received from the third party, I trigger the incident.load event and pass in parm1 the name of the target table and in parm2 a JSON-stringified array of 1 to x index documents.

 

2) Also I created a record at table sys_trigger with "System ID" = "ALL NODES". 

MuhammadAwais1_1-1687765540961.png

but I'm confused about the import events and the import script, as well as about the next steps in the process.

 

Thanks
Hafiz Awais.

Maik Skoddow
Tera Patron

Hi @Muhammad Awais1 

Did you consider implementing a Script Action that consumes the events?

Muhammad Awais1
Tera Contributor

Hi @Maik Skoddow ,

Yes, I can consider that. Please can you tell me what needs to be done in the Script Action after the event is triggered, and what comes next?

ryanlitwiller
Giga Guru

@Maik Skoddow great article, thanks for sharing. I'm curious what mechanism you have in place for the data loader to distribute to all the queues?

 

Also, any tips on the following:

  1. Selecting an appropriate priority and repeat interval for the sys_trigger record?
  2. The number of custom queues to go with?
  3. Any benefit to creating more sys_trigger jobs?
svani
Tera Contributor

Hi @Muhammad Awais1 

We have the same issue: we need to import 2 million records from a 3rd party using REST messages and a scheduled job, and it is taking 1 day to load. If you have implemented this, please let me know the steps so that I can also go ahead and make the changes.

 

Thanks,

Svani

VishalBhati
Tera Explorer

@svani you can also go with scheduled data imports along with transform maps for the tables where you want to publish the data. I achieved 5 million records this way.

svani
Tera Contributor

Hi @VishalBhati,

Can you please elaborate on how we can achieve this? Please let me know the steps to follow; that would be very helpful.

 

Thanks,

Svani
