
Quite a lot of customers use the Table API via REST to extract information out of ServiceNow for various reasons, like consolidating it in local data warehouses. While the API itself is quite optimized and can handle larger data volumes pretty fast, we still limit the number of records returned per call to a manageable amount for the instance.
Questions like 'Is it Practical to Query around 250,000 Records in a Single REST Transaction?' show that getting more than the per-call limit is sometimes required. I wanted to document a way this can be achieved, keeping in mind that the data you query might be updated while you extract it.
Chunking of large data sets
For the sake of this article, assume we have a set of data and focus on sys_id as the unique identifier plus an update timestamp. To make it a bit easier, we also assume that all records have been updated in the last 24 hours, so we can focus on the hours only. The data could look something like this:
sys_id | sys_updated_on (ASC)
------ | --------------------
aaaa | 1:00
bbbb | 1:30
cccc | 2:00
dddd | 2:30
eeee | 3:00
ffff | 3:30
gggg | 4:00
hhhh | 4:30
iiii | 5:00
kkkk | 22:00
Based on the principle of chunking data, we could use several strategies to split the volume. An obvious one is by sys_id, but this would mean we need some form of query which returns all sys_ids at once to make sure we know the whole data set. If new sys_ids are generated in between, we only get them in the next run - the next day. Also, the long list of sys_ids might already be too much for one call. So let's try another approach and use the timestamp.
Using sysparm_limit we can define how many records we want returned. Let's take 4 for this example, order by sys_updated_on and run the query. We get these records:
sys_id | sys_updated_on (ASC)
------ | --------------------
aaaa | 1:00
bbbb | 1:30
cccc | 2:00
dddd | 2:30
If we now take the highest sys_updated_on we see in the data set (2:30) and use it as the starting point of our next query, the next call will get us these records:
sys_id | sys_updated_on (ASC)
------ | --------------------
dddd | 2:30
eeee | 3:00
ffff | 3:30
gggg | 4:00
We can loop this until we get no further records - this allows us to cycle through large volumes easily.
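To make this concrete, here is a minimal sketch of such a loop in JavaScript (Node 18+ with the built-in fetch). The instance URL, table, fields and credentials are placeholders and not part of the article; it simply demonstrates the pattern described above: remember the highest sys_updated_on, query again with ">=", and coalesce boundary duplicates by sys_id.

```javascript
// Minimal sketch of the chunked extraction loop described above.
// Instance, table, fields and credentials are placeholders.
const INSTANCE = 'https://<your-instance>.service-now.com';
const TABLE = 'incident';
const LIMIT = 1000; // chunk size, must be bigger than the max. records changed per second

async function extractSince(startTimestamp) {
  const records = new Map();   // keyed by sys_id, so boundary duplicates coalesce
  let since = startTimestamp;  // 'YYYY-MM-DD hh:mm:ss', already floored to the second

  while (true) {
    const query = `sys_updated_on>=${since}^ORDERBYsys_updated_on`;
    const url = `${INSTANCE}/api/now/table/${TABLE}` +
      `?sysparm_query=${encodeURIComponent(query)}` +
      `&sysparm_limit=${LIMIT}` +
      `&sysparm_fields=sys_id,sys_updated_on,short_description`;

    const res = await fetch(url, {
      headers: {
        Accept: 'application/json',
        Authorization: 'Basic ' + Buffer.from('user:password').toString('base64')
      }
    });
    const { result } = await res.json();
    if (result.length === 0) break;            // nothing in the window at all

    for (const record of result) records.set(record.sys_id, record);

    const maxSeen = result[result.length - 1].sys_updated_on;
    if (result.length < LIMIT) break;          // partial chunk: we reached the end
    if (maxSeen === since) break;              // no progress: chunk smaller than one second's updates
    since = maxSeen;                           // highest timestamp becomes the next starting point
  }
  return records;
}
```

A call like `extractSince('2020-03-08 00:00:00')` would then cycle through everything updated since that point, with the last record of each chunk (like dddd above) returned twice and simply overwritten via its sys_id.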
Now what happens when data changes in between?
Let's assume record 'eeee' gets modified while we do our thing. As the record gets a new sys_updated_on timestamp, it simply moves to the end of our list and will be returned again once we reach that chunk of data. So the only thing we need to worry about is to overwrite - or update - the information we got for 'eeee' in the previous call. All our other calls still work fine, as none of the other records moves around. If we used a simple limit and offset, all records would shift and we might actually miss some updates.
So, what can we summarize as learnings from this?
- The remembered sys_updated_on needs to be rounded down ('floor') to the nearest second
  - Needed because more than one record can fall within the same second; unlucky chunk splits could then lose records
  - Duplicate records will be returned because of this; use sys_id to coalesce the information
- Do not use the offset, as this will lead to missing records
- The chunk size must be bigger than the max. number of records changed per second
- The stop condition for a run is an empty record set
- This provides only a snapshot per execution (mostly daily); multiple updates in between are not captured
- Quite a lot of tables have an index on sys_updated_on, but you should still check this for your bespoke scenario. If there is no index, this will heavily impact the query
Scaling it up - how to run multiple threads?
Now assume we really have to read millions of records in every cycle. If we do this with the method sketched above, it could still take a significant amount of time. Can we have multiple threads running in parallel?
Yes, with good planning this can work. Here is an example of how to split the workload across three threads:
Time window | Thread | Calls
----------- | ------ | -----
0:00 - 2:00 | Thread 1 | Call 1.1, Call 1.2
2:00 - 4:00 | Thread 2 | Call 2.1, Call 2.2
4:00 - 6:00 | Thread 3 | Call 3.1, Call 3.2
6:00 - 8:00 | Thread 1 | Call 1.3, Call 1.4
As you can see, we simply split the time span we want to cover into smaller windows. In this example I used 2 hours per thread. Again, this works great, as all records getting modified would just be added to the end.
As we cannot fully guarantee when the last batch is being read and some records might move into that window, we should define our windows in such a way that the upper limit is always lower than anything that gets modified while we work.
I.e. limit your whole time window to something like up to last night 24:00. This way everything that gets modified will move to the next cycle and not just to the last thread.
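As a rough sketch of that window planning (again JavaScript, with hypothetical helper names, not code from the article), a fixed day could be split into per-thread windows like this; each resulting query can then be fed into the chunked loop from above and the workers run concurrently. Timezone handling is simplified here and is an assumption you should verify for your instance.

```javascript
// Hypothetical sketch: split a fixed day (e.g. yesterday) into windows of a few
// hours each and build one encoded query per worker. Each window has a hard
// upper bound, so records modified during the run fall into the next cycle.
function formatGlideDateTime(d) {
  return d.toISOString().slice(0, 19).replace('T', ' '); // 'YYYY-MM-DD hh:mm:ss'
}

function buildWindowQueries(dayStartUtc, hoursPerWindow) {
  const queries = [];
  for (let h = 0; h < 24; h += hoursPerWindow) {
    const start = new Date(dayStartUtc.getTime() + h * 3600 * 1000);
    const end = new Date(dayStartUtc.getTime() + (h + hoursPerWindow) * 3600 * 1000);
    queries.push(
      `sys_updated_on>=${formatGlideDateTime(start)}` +
      `^sys_updated_on<${formatGlideDateTime(end)}` +
      `^ORDERBYsys_updated_on`
    );
  }
  return queries;
}

// Three workers, eight hours each; run them e.g. with Promise.all() over the
// chunked loop shown earlier, then merge the resulting Maps by sys_id.
const queries = buildWindowQueries(new Date(Date.UTC(2020, 2, 8)), 8);
```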
Nice article! What do you think about filtering by sys_updated_on but sorting by sys_created_on?

Hm... you mean filtering by sys_updated_on and sorting by sys_created_on... indeed, this should be a good option to avoid the problem entirely. A record would only be reread if it is changed again, which should be outside our current read window.
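A minimal sketch of what that variant's query might look like (the cursor variables are hypothetical, not from the thread): the delta window still filters on sys_updated_on, but paging uses sys_created_on, which never changes once a record exists.

```javascript
// Hypothetical sketch of the variant discussed above: filter the delta window on
// sys_updated_on, but page on sys_created_on as the cursor, which stays stable
// even if a record is updated while we read.
const windowStart = '2020-03-08 00:00:00';   // start of the delta window
let createdCursor = '1970-01-01 00:00:00';   // highest sys_created_on seen so far

const query =
  `sys_updated_on>=${windowStart}` +
  `^sys_created_on>${createdCursor}` +
  `^ORDERBYsys_created_on`;
// After each chunk, advance createdCursor to the last record's sys_created_on.
```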
Hi
- If we want to use sys_updated_on to split the data, the precondition is that the results are already ordered by sys_updated_on. Is it ordered by default in ServiceNow, or do we need to order it ourselves? Because I imagine ordering the whole data set by sys_updated_on is also very heavy for the system.
- Why did you say "Do not use the offset as this will lead to missing records"?
Thanks a lot for your support.

Hi @kun,
indeed, you will need to order by sys_updated_on; the default query would not sort automatically.
The problem with the offset is that records can move from one call to another. Assume you order by sys_updated_on and, while you query for the first 1000 records, the 1001st record gets updated. This moves the record to the end of the list again. While you will still get the record at the end, there is a small chance it gets updated again while you move through the offsets, making it disappear from your view every time.
Just to understand better: when we talk about the offset, I imagine there is an index for each record.
- Is this index generated once when a record is written to the ServiceNow table and never changed, even when the record is updated, or is the index regenerated for each request and each update?
- If a record gets updated, how does the index change?
And I didn't quite get why sys_updated_on needs to be rounded down ('floor') to the nearest second. I mean, if sys_updated_on has millisecond accuracy, how would the chunk size cause a loss of records?
Thanks a lot 🙂

Yes, every record has a unique key / index. This is the sys_id - however, sorting by it is difficult if you need a delta list.
The rounding... good one 🙂 The field itself will store the information with milliseconds, but when you query, I believe it cuts off the milliseconds. Hence, if you have a lot of records in the same second, that second might span multiple chunks. With the 'floor' I reduce this problem, as the likelihood of having lots of records in the same millisecond is very low.
Hello

Hi Kun,
yes, sure. A problem could happen when either
a) you have multiple streams reading in parallel and the timestamp-wise newer ones complete before the older ones - a moving record could move into a window that was already processed, or
b) a record is modified often, so it could slip past your read as it moves every time you try to read it (a small chance, but it can happen).

Hi Daniel
Thanks for sharing this post; however, I see some drawbacks in your solution:
- multi-threading - multi-threading on the timestamp requires good knowledge of the data and how it is spread over the day
- empty page - an empty page is not an indicator that there is no more data; there is a good chance that an ACL is simply causing an empty page
- identical timestamp - there is a risk that all records in a page have the same timestamp value; in that case the query for the next page (timestamp > lastrow(timestamp)) will miss some records
- rolling end - if records are created/updated at high frequency, the number of rows exceeds the window size and the job never ends
Therefore, I propose to:
multi-threading - separate the jobs into threads by querying for a range of sys_id values. Each thread will query the same pages with a different range like:
- sys_id<40
- sys_id>=40^sys_id<80
- sys_id>=80^sys_id<c0
- sys_id>=c0
empty page - as an empty page can be caused by an ACL, it is not an indicator that the last page has been reached. Also, if the query below is used, there is a certain risk that there are more rows on the next page which cannot be reached. To solve this problem, follow the NEXT url from the LINK header, which contains the same query but a higher sysparm_offset, until there is no more data or no next link (a rough sketch of following the next link follows below).
identical timestamp - additionally sort by sys_id and query the next page as follows:
(sys_updated_on > last_page_max(sys_updated_on))
OR
(sys_updated_on == last_page_max(sys_updated_on) && sys_id > last_page_max(sys_id))
rolling end - a threshold must be set after which the load ends. The x-total-count header on the first request can be used to calculate the expected number of pages.
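As an illustration of the "follow the next link" idea, here is a minimal sketch, assuming the Table API response carries the standard Link header with a rel="next" entry when more pages exist. Instance, table and credentials are placeholders, and this is not code taken from the repository below.

```javascript
// Sketch: page through a query by following the Link header's rel="next" URL
// until it disappears, instead of treating an empty page as the end.
const INSTANCE = 'https://<your-instance>.service-now.com';
const AUTH = 'Basic ' + Buffer.from('user:password').toString('base64');

function nextLink(linkHeader) {
  // e.g. <https://...&sysparm_offset=1000&sysparm_limit=1000>;rel="next",...
  if (!linkHeader) return null;
  const next = linkHeader.split(',').find(part => part.includes('rel="next"'));
  return next ? next.slice(next.indexOf('<') + 1, next.indexOf('>')) : null;
}

async function pageThrough(query, limit = 1000) {
  let url = `${INSTANCE}/api/now/table/incident` +
    `?sysparm_query=${encodeURIComponent(query)}&sysparm_limit=${limit}`;
  const rows = [];
  while (url) {
    const res = await fetch(url, {
      headers: { Accept: 'application/json', Authorization: AUTH }
    });
    const { result } = await res.json();
    rows.push(...result);
    url = nextLink(res.headers.get('link')); // null on the last page
  }
  return rows;
}
```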
We are using this approach now at Swiss Re to reliably load very large amounts of data into Azure Data Lake with maximum throughput.
I created a GitHub project https://github.com/bmoers/sn-using-table-api-right-to-extract-large-volumes which describes all the details and provides a reference implementation in JavaScript.
- demo-increment.js - increment load demo with 2 parallel threads (using client.js)
- demo-snapshot.js - snapshot load demo with 2 parallel threads (using client.js)
- demo-tread.js - generate thread load information for 4 parallel threads (using client.js)
Feel free to fork and contribute!
Cheers
Boris

Hi Boris,
thanks for your feedback - great how you have taken the approach and detailed it out. I also like your openness with regard to sharing sample code on GitHub - very helpful 😄
As for your comments:
- multi-threading
Correct, any form of multi-threading involves careful planning to make sure nothing gets lost due to the chunking of the data. Something I would check in your approach is how performant the query hitting the database is. In your case it is more complex, as it filters on date and sys_id, which is probably not an index we have prepared. In my example the filter is only on the updated column, and hence I would expect it to be faster.
- empty page
You are correct. I would anticipate that these kinds of data extracts are run with a high level of privileges to be able to read all records on the given table. Usually this is used to download large volumes to an external data warehouse, and for that you want to get all records anyway. But true, I should have mentioned this above.
- identical timestamp
Correct, there is a risk when more records are modified in the same timestamp than fit into one of the chunks. This comes back to the multi-threading topic of knowing your data.
- rolling end
This can be an issue if your window definition includes a criterion which still allows data to flow into the window. The intention above is that your window content is fixed when the process starts, i.e. anything modified yesterday. This is to be preferred over anything modified in the last 24 hours, as the last 24 hours is dynamic and allows the platform to add records to the window. It needs to be ensured that any record being modified while the extraction runs is taken out of the window and 'queued' for the next run.
Cheers,
Daniel