High Performance Background Jobs

saschawildg · 3 weeks ago

Whenever large amounts of data or continuous streams of data need to be processed,
solutions must be built with execution performance in mind.

Here are some generic guardrails that should help make better solution design decisions:

1. Scripted Scheduled Jobs vs. Flows
Use Scripted Scheduled Jobs instead of Flows. Flows add processing overhead to
every single step. The ease of use of Flow Designer comes at a price.
Flows may be easier to build from scratch, but performance optimization is all about
refactoring and rearranging parts of the logic. This is easier and faster with plain vanilla
JavaScript than with Flow Designer.

2. Script Includes
Implement the actual complex logic in Script Includes. This allows for flexibility in how the
complex logic is triggered (Scheduled Jobs, Business Rules, Flows, etc.). Ideally the
Scheduled Job Script is a one-liner calling a single function from a Script Include.
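
A minimal sketch of that split, written in plain JavaScript so that it runs outside the platform. RawDataProcessor and processPending are hypothetical names chosen for illustration; on an instance the object would live in a Script Include, and the final call would be the entire Scheduled Job script.

```javascript
// Hypothetical Script Include body: all logic lives here, not in the job.
// Plain JavaScript stand-in; the object and method names are assumptions.
var RawDataProcessor = {
    // Processes a batch of raw items and returns how many were handled.
    processPending: function (items) {
        var handled = 0;
        for (var i = 0; i < items.length; i++) {
            items[i].processed = true; // placeholder for the real transformation
            handled++;
        }
        return handled;
    }
};

// Hypothetical Scheduled Job script: a single call that delegates everything.
var pending = [{ id: 'a' }, { id: 'b' }];
var handledCount = RawDataProcessor.processPending(pending);
```

Because the job only delegates, the same function could also be called from a Business Rule or a Flow without copying any logic.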

3. Idempotency
All steps must be implemented so that re-processing the same data yields the
same result every time. A raw data item that is erroneously processed twice
should not produce duplicate records of processed data.
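
One common way to get there is to write processed records with an upsert keyed on a stable identifier of the raw item. The sketch below uses an in-memory object in place of a table and assumes each raw item carries a unique key, here called sourceId (a hypothetical field name).

```javascript
// In-memory stand-in for the table of processed records (assumption:
// each raw item carries a stable unique key, here called sourceId).
var processedByKey = {};

// Upsert: create the record on first sight, overwrite it on a repeat.
function processItem(rawItem) {
    var key = rawItem.sourceId;
    processedByKey[key] = {
        sourceId: key,
        total: rawItem.quantity * rawItem.unitPrice
    };
    return processedByKey[key];
}

var item = { sourceId: 'ORD-1', quantity: 3, unitPrice: 5 };
processItem(item);
processItem(item); // erroneous second run: same record, no duplicate

var recordCount = Object.keys(processedByKey).length;
```

After both calls there is still exactly one processed record for ORD-1, which is the property the guideline asks for.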

4. Parallel Processing
Design for parallel processing of large data sets. One way to do this is to set up a
Scheduled Job that is triggered more often than a single run takes to complete, so that
multiple instances of the same Scheduled Job run in parallel. E.g. a job that is triggered
every minute but runs for 5 minutes results in up to 5 instances of the
job running in parallel at any time.
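
The arithmetic behind that example can be written down as a small helper. This is a rough upper bound only, assuming every trigger fires on time and every run takes the same time; maxParallelInstances is a hypothetical name.

```javascript
// Upper bound on overlapping runs, assuming every trigger fires on time
// and each run takes the same duration (both are simplifying assumptions).
function maxParallelInstances(runtimeMinutes, intervalMinutes) {
    return Math.ceil(runtimeMinutes / intervalMinutes);
}

var example = maxParallelInstances(5, 1); // the case from the text
```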
Alternatively, a Scheduled Job can start multiple backend processes using the
GlideScriptedHierarchicalWorker API.
Either way, the multiple instances of a single Scheduled Job or process must know in
real time which parts of the larger data set have already been processed and which have
not. Whenever a process picks a chunk of unprocessed data, that data must be flagged so
that other instances of the same process do not start working on the same chunk. This
allows multiple instances to work on the same data set, each on a different chunk.
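
The flagging step could look roughly like the sketch below. It uses an in-memory array in place of a table, and the field names state and claimedBy are hypothetical. In this single-threaded simulation the check and the flag cannot be interrupted; against a real database the claim has to be made atomic by other means, which this sketch does not cover.

```javascript
// In-memory stand-in for a table of raw data rows (assumption: each row
// has a 'state' and a 'claimedBy' field; both names are hypothetical).
var rawRows = [];
for (var n = 1; n <= 10; n++) {
    rawRows.push({ id: n, state: 'new', claimedBy: null });
}

// Flags up to chunkSize unclaimed rows for one worker and returns them.
function claimChunk(workerId, chunkSize) {
    var claimed = [];
    for (var i = 0; i < rawRows.length && claimed.length < chunkSize; i++) {
        var row = rawRows[i];
        if (row.state === 'new') {
            row.state = 'in_progress';
            row.claimedBy = workerId;
            claimed.push(row);
        }
    }
    return claimed;
}

// Two job instances take turns claiming and never share a row.
var chunkA = claimChunk('job-A', 4);
var chunkB = claimChunk('job-B', 4);
var chunkA2 = claimChunk('job-A', 4); // only 2 rows are left
```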

5. Transaction Caching
Processing multiple raw data items may require additional data retrieval. If multiple
processes are working on different data chunks in parallel, the same existing data records
may be retrieved from the database repeatedly.
In some cases, these duplicate data retrievals can be avoided by caching the results in
memory and looking them up there instead of asking the database again for the same
information.
Caching in memory can be done by storing retrieved data in a key-value JavaScript
object that remains in memory during the runtime of a single process instance.
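
A minimal sketch of such a key-value cache. loadFromDatabase is a hypothetical stand-in for the real lookup and only counts how often it is called, so the saving is visible.

```javascript
// Stand-in for an expensive database lookup; counts how often it is hit.
// (Assumption: lookups are keyed by a unique id such as a record id.)
var dbCalls = 0;
function loadFromDatabase(userId) {
    dbCalls++;
    return { id: userId, name: 'User ' + userId };
}

// Key-value cache that lives only as long as this process instance.
var cache = {};
function getUser(userId) {
    if (!Object.prototype.hasOwnProperty.call(cache, userId)) {
        cache[userId] = loadFromDatabase(userId);
    }
    return cache[userId];
}

// Five raw items reference only two distinct users.
var userIds = ['u1', 'u2', 'u1', 'u1', 'u2'];
var names = userIds.map(function (id) { return getUser(id).name; });
```

Here five lookups cause only two database calls. The trade-off is that a cached value can become stale if the underlying record changes while the process is still running.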

6. Stream Processing
If the processing of data can be split into different steps, it might be feasible to separate
these steps into separate Scheduled Jobs, each of which may or may not run in parallel.
E.g. if data is first received from an external system, separating the retrieval step may
reduce the time that processing jobs spend idle waiting on network latency between
systems.
Another example: if one step depends mostly on retrieving a lot of data from the
local database, while another step is more expensive in terms of computation,
separating these steps can help each of the processes work more efficiently.
This pattern may require storing raw, partly processed, and processed data in separate
tables. E.g. the data received could first be stored in a table that represents its
unprocessed form (which may cause additional processing time); multiple processes
then pick chunks of data from that table and transform them into their final state.
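
A rough sketch of the two-stage idea, with in-memory arrays standing in for the two tables. The names stagingTable, finalTable, receive and transformBatch are hypothetical, and the trim-and-uppercase step is only a placeholder for the real transformation.

```javascript
// In-memory stand-ins for two tables (names are hypothetical).
var stagingTable = []; // raw data exactly as received
var finalTable = [];   // transformed data

// Stage 1 (its own job): only stores what the external system delivered.
function receive(payloads) {
    payloads.forEach(function (p) {
        stagingTable.push({ payload: p, state: 'raw' });
    });
}

// Stage 2 (its own job): transforms up to batchSize raw rows per run.
function transformBatch(batchSize) {
    var done = 0;
    for (var i = 0; i < stagingTable.length && done < batchSize; i++) {
        var row = stagingTable[i];
        if (row.state !== 'raw') { continue; }
        finalTable.push({ value: row.payload.trim().toUpperCase() });
        row.state = 'processed';
        done++;
    }
    return done;
}

receive([' alpha ', 'beta', ' gamma']);
var firstRun = transformBatch(2);
var secondRun = transformBatch(2);
```

The receiving stage never waits for the transformation, and the transforming stage never waits for the network; each can be scheduled and tuned on its own.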

Read the full story here:

The Whitepaper