Introduction to Document Intelligence Data Extraction and Best Practices

Loic1 · ‎05-19-2022

This guide provides a more detailed overview of Document Intelligence for Data Extraction.

By the end of this guide, you’ll:

understand what a Document Intelligence Use Case is,
understand what Document Tasks are,
have captured values using the Document Intelligence workspace,
be able to locate the values,
and understand the different extraction modes.

This guide is the first chapter of a series. For a general overview and the link to the other chapters, go to the Quick Start Guide.

In this guide, we'll use Document Intelligence v3.1 with Document Intelligence Admin in a Vancouver instance.

Before starting

We assume here that you have a general understanding of what Document Intelligence does, from the Quick Start Guide or the docs, and that you have gathered some documents (you can use the documents attached to this article).

Best practice: Good input is key for AI training. Make sure you understand the documents you intend to process, and check with Subject Matter Experts if needed. Gather enough documents: at least 10 for each layout. The documents should be real-life examples of the documents to extract.

Use Case

The first step is to create a Use Case. As its name indicates, the Use Case reflects the business use case that you are addressing with Document Intelligence. It will be used for all the documents that share a common set of values to be extracted but are not necessarily in the same format. For example, Invoice is one use case, Identity Document is another one.

The Use Case also contains an AI model that is trained and improved over time.

Create a Use Case

Navigate to Document Intelligence > Document Data Extraction Administration > Use Cases and click New use case. Provide a descriptive Name and click Save.

You can set the value of the Target Table if you plan on using Flow Designer. For now, we'll focus on the Use Case itself.

Open the Use Case.

Add the Fields

Then, we create the Fields. Each field is a value to extract from the documents.

Ultimately, fields will be used to update fields in tables on the Now Platform to drive the end-to-end workflow.

Best practice: Only extract fields that are needed and will provide value for the workflow that is being automated.

Create Fields

Create the Fields from the Fields tab.

There are different types of Fields, with different purposes:

Single Fields are used to extract a single piece of information in the document. For example, a document number or a customer name.
Tables are used to extract lists or tables of items. A Table can have multiple columns. The number of items does not have to be known in advance: one document can have 2 Line Items and another can have 5.
Check box Lists are meant to extract a group of one or multiple check boxes. Each check box can either be checked or unchecked.
Single field groups are meant to extract values grouped together in the document (for example a Location with Address, City and Country). In a way, they are similar to a Table field except only one item can be extracted.

Fields also behave differently with regard to how they are handled by the Flow. We go into more detail in the next chapter of the series: Using Document Intelligence with Flow Designer.

Each Field can also be assigned a Data Type.

Selecting a Data Type greatly simplifies the conversion (or normalization) of a piece of text into typed data. The supported Data Types are:

Text
Date
Integer
Decimal
Float
Checkbox

Best practice: Assign a Data Type when possible. It reduces ambiguity.
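To make the idea of normalization concrete, here is a minimal Python sketch of turning a piece of extracted text into typed data. It does not show how Document Intelligence performs the conversion internally. The `normalize` function, the MM/DD/YYYY date format and the accepted checkbox spellings are all assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal


def normalize(raw: str, data_type: str):
    """Convert extracted text into a typed value (illustrative only)."""
    text = raw.strip()
    if data_type == "Text":
        return text
    if data_type == "Date":
        # Assumption: the date is written as MM/DD/YYYY; real documents vary.
        return datetime.strptime(text, "%m/%d/%Y").date()
    if data_type == "Integer":
        # Assumption: commas are thousands separators.
        return int(text.replace(",", ""))
    if data_type == "Decimal":
        return Decimal(text.replace(",", ""))
    if data_type == "Float":
        return float(text.replace(",", ""))
    if data_type == "Checkbox":
        # Assumption: these spellings all mean "checked".
        return text.lower() in {"checked", "true", "yes", "x"}
    raise ValueError(f"Unsupported data type: {data_type}")
```

The sketch shows why a declared Data Type helps: the same text "1,250" is ambiguous on its own, but becomes a single well-defined value once the Field is known to be an Integer.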

A Field can be marked as Required. During validation, the agent can then easily identify the Required Fields and is warned if any of them has no value. Additionally, only Required Fields are taken into consideration for the Fully Automated mode.

Let's take a look at an example. Attached to this article is a set of invoices.

We create:

a Single Field "Invoice Date", of type Date
a Single Field "Invoice Number", of type Text
a Table Field, with the following Columns:
- Item, Text
- Quantity, Integer
- Line Total, Decimal

Once created, the Fields appear on the Fields tab of the Use Case.
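The example Fields above can also be written down as a small data structure, which is handy when planning a Use Case or writing a labeling guide. This is only a sketch in Python: the `SingleField` and `TableField` classes are hypothetical and do not correspond to any platform API, and the Table name "Line Items" is an assumption, since the example does not name the Table Field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SingleField:
    name: str
    data_type: str = "Text"
    required: bool = False


@dataclass
class TableField:
    name: str
    columns: List[SingleField] = field(default_factory=list)


# The invoice example from this guide, expressed as plain data.
invoice_fields = [
    SingleField("Invoice Date", "Date"),
    SingleField("Invoice Number", "Text"),
    TableField(
        "Line Items",  # assumed name; the guide leaves the Table unnamed
        [
            SingleField("Item", "Text"),
            SingleField("Quantity", "Integer"),
            SingleField("Line Total", "Decimal"),
        ],
    ),
]
```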

Document Task

It is now time to test our use case. A Document Task can be created for each document to process. It is a container for the document and provides status information for the different stages of the processing.

The stages of the processing are:

Overview of the processing flow

Stage	Trigger mechanism	Document Task after completion
1. A task is created and the document uploaded	Manually (or with automation - See Using Document Intelligence with Flow Designer)	Status = New; Is Processed = False; Is Trained = False
2. The OCR + AI model runs on the document, finding values for the Fields	Automated when “Process Task” is clicked (or with automation - See Using Document Intelligence with Flow Designer)	Status = New; Is Processed = True; Is Trained = False
3. User validation (started)	The user opens the Document Intelligence workspace with the “Show in DocIntel” button	Status = In Progress; Is Processed = True; Is Trained = False
3. User validation (submitted)	The user validates data in the Document Intelligence workspace and clicks the “Submit” button	Status = Done; Is Processed = True; Is Trained = False
4. The AI model is automatically trained for continuous improvement	Automated when “Submit” is clicked, invisible to the user	Status = Done; Is Processed = True; Is Trained = True

The status information is contained in the Status, Is Processed and Is Trained fields and is used for monitoring or automating workflows (using Flow Designer).
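For monitoring purposes, the combinations of Status, Is Processed and Is Trained listed in the processing flow can be mapped back to a stage. The Python sketch below illustrates that mapping. The `DocumentTask` class and the `stage` function are hypothetical names, and the stage labels are paraphrased from the processing flow rather than taken from the product.

```python
from dataclasses import dataclass


@dataclass
class DocumentTask:
    status: str = "New"
    is_processed: bool = False
    is_trained: bool = False


def stage(task: DocumentTask) -> str:
    """Return the processing stage implied by the three status fields."""
    if task.status == "New" and not task.is_processed:
        return "1. Created, waiting for processing"
    if task.status == "New" and task.is_processed:
        return "2. Processed, waiting for validation"
    if task.status == "In Progress":
        return "3. Validation in progress"
    if task.status == "Done" and not task.is_trained:
        return "3. Validated, training pending"
    if task.status == "Done" and task.is_trained:
        return "4. Validated and trained"
    raise ValueError(f"Unexpected combination: {task}")
```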

Create a task

Navigate to the Document tasks tab on the Use Case and click on New document task. Provide the Document task name, attach one document by clicking on +Add File or by dragging and dropping a file, then click on Add Extraction.

Technical note: Document Task [sys_di_task] is not a child of the Task [task] table used for incidents, etc.

Once created, the Document Task appears on the Document tasks tab on the Use Case.

Click on the name of the task and you'll see a new button “Open in Document Intelligence” to start the validation (stage 3 from the processing flow).

It takes a few minutes for the automated AI model to run on the document. Until it is complete, a warning message is shown in the Document Intelligence workspace.

When the task is ready, we can start the validation. For each Field, start typing the first letters of the value to extract, then select the value by identifying the right candidate in the document.

If multiple candidates have the same value, select the value that reflects the meaning of the field. For example, the same email address could be used for two semantically different fields.
If there are multiple values with the same meaning, select the value that is the most likely to be commonly found across all the documents. For example, an invoice might have the date at the beginning of the document and in the footer, but it is most likely to be found at the beginning of the document in other invoices, so select that value.

If the value is not present in the document, select “missing in the document”. This will train the model to recognize when a field is missing in the document.

Avoid selecting values from logos unless there is no alternative (because some logos are more legible than others).

For checkboxes, select the checkbox from the document and ensure that the value (checked or unchecked) is correct.

For tables, add as many rows as visible in the document and select the values for each of the columns.

Best practice: If multiple members of a team are doing the validation, create a labeling guide to ensure everybody has the same understanding of the fields and values to extract.

The experience is extensively documented in the docs, Use the Document Intelligence workspace to extract fields.

When all the values are validated, click on Submit.

After the task is completed, the AI model runs in the background, invisibly to the user, and learns from the validation in order to improve its predictions for the next task.

While creating Document Tasks manually is a good way to test your Use Case, this process should be automated when building end-to-end workflows using Flow Designer.

Also, note that in this example we attached the document to the Document Task directly, but this is not the only way. If the attachment is on a different record, you can link the Document Task to that record using the Source Record field and its attachment will be used.

You'll have to use the classic view by navigating to Document Intelligence > Document Data Extraction > Document Tasks or ... > Create Document Tasks.

When creating Document Tasks from the classic view, make sure to click Process Task to start the process.

Confidence level

You can repeat the task creation process with other documents. Over time, predictions will get better, making the processing cycle faster.

Best practice: Don't create all your Document Tasks at the same time (or in a batch) before doing the validation. The AI model learns after the validation is done and the learning applies when the next Document Task is created.

Best practice: Do NOT train with the same document multiple times. Forcing the training with a single or too few documents can skew (or overfit) the model and negatively impact accuracy.

Field Values

After the task is completed, the extracted values are stored in the Field Value table [sys_di_extracted_value]. To access them, navigate to Document Intelligence > Document Data Extraction > Field Values and filter by Document Task.
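If you prefer to read the values from a script rather than from the list view, the same filter can be expressed as a request to the platform's REST Table API. The Python sketch below only builds the request URL and does not call the instance. The column name `document_task` is an assumption (check the actual column name on the Field Value table in your instance), and `field_values_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def field_values_url(instance: str, document_task_sys_id: str, limit: int = 100) -> str:
    """Build a Table API URL listing the Field Values of one Document Task."""
    # Assumption: the reference to the Document Task is stored in a column
    # named "document_task"; verify this in the table's dictionary.
    query = f"document_task={document_task_sys_id}"
    params = urlencode({"sysparm_query": query, "sysparm_limit": limit})
    return (
        f"https://{instance}.service-now.com"
        f"/api/now/table/sys_di_extracted_value?{params}"
    )
```

The URL would then be requested with the credentials of a user who has read access to the table.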

Extraction Modes

So far, we have reviewed the documents in Recommendation mode, which is the default mode when starting. After we have reviewed a few documents, we can enable Auto-fill mode, in which values are populated in the Document Intelligence workspace, making the review process faster. To decide whether a Field is auto-filled, the confidence level of that Field value is compared to the Auto-fill threshold.

To bypass manual validation altogether, you can enable Fully Automated mode.

How does it work?

Every value for every Field to extract is predicted with a certain confidence level.

For Auto-fill mode: if the Auto-fill threshold is met for a Field, the value is pre-populated in the Document Intelligence workspace to be reviewed by an agent.

If the confidence level is lower than the Warning threshold, a warning sign is displayed next to the Field in the Document Intelligence workspace.

For Fully Automated mode: if the Fully-automated threshold is met for all Required Fields, the values are automatically extracted and the document doesn't need to be validated by an agent.
Documents that don’t meet the threshold can still be routed to agents for review to further train the AI model.

Verify that all the Fields marked as Required are indeed required for the Fully Automated mode.

When a Document Task is extracted without agent review, the Status changes directly to Done, and the Is Straight Through Processed field is true.

You can keep track of the Document Tasks that are fully automated by navigating to Document Intelligence > Document Data Extraction > Document Tasks and filtering on Is Straight Through Processed is true.
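The threshold rules described in this section can be summarized as three small checks. This Python sketch is a reading of the rules as stated in this guide, not the product's implementation. The `Prediction` class and the function names are hypothetical, and confidence levels are expressed as fractions between 0.0 and 1.0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    field_name: str
    confidence: float  # 0.0 to 1.0
    required: bool = False


def is_auto_filled(p: Prediction, auto_fill_threshold: float) -> bool:
    """Auto-fill mode: the value is pre-populated when the threshold is met."""
    return p.confidence >= auto_fill_threshold


def shows_warning(p: Prediction, warning_threshold: float) -> bool:
    """A warning sign is displayed when confidence is below the threshold."""
    return p.confidence < warning_threshold


def is_fully_automated(predictions: List[Prediction], threshold: float) -> bool:
    """Fully Automated mode: every Required Field must meet the threshold."""
    required = [p for p in predictions if p.required]
    # Assumption: with no Required Fields this returns True; the guide does
    # not say how the product behaves in that case.
    return all(p.confidence >= threshold for p in required)
```

Note that non-required Fields never block full automation in this sketch, which is why it matters to verify which Fields are marked as Required.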

How to change the extraction mode?

The Data Extraction mode can be changed from the Use Case.

The process owner stays in control by defining the confidence threshold that matches their use case. This value ranges from 0% to 100%. For example, an Auto-fill threshold value of 80% means that when a value is predicted with at least an 80% confidence level, the value is auto-filled.

Best practice: Move incrementally. Start with Recommendation, then Auto-fill, before enabling Fully Automated mode.

Best practice: Train each layout one by one. Label a few documents from the same layout before introducing a new layout.

What threshold value to choose?

Observing the current confidence levels is a good starting point for knowing your current state. From there you can decide if the current state is mature for automation and set the value of the threshold to the confidence level value, or if more documents need to be processed to further train the model.
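One way to reason about a candidate threshold is to replay it against the confidence levels you have already observed and see how many tasks would have been fully automated. The Python sketch below does that on made-up numbers. The `automation_rate` function is hypothetical, the sample confidence levels are invented for illustration, and each inner list is assumed to hold the confidence levels of the Required Fields of one Document Task.

```python
from typing import List


def automation_rate(task_confidences: List[List[float]], threshold: float) -> float:
    """Share of tasks whose Required Field confidences all meet the threshold."""
    if not task_confidences:
        return 0.0
    automated = sum(
        1 for confs in task_confidences if all(c >= threshold for c in confs)
    )
    return automated / len(task_confidences)


# Invented sample: Required Field confidence levels for four reviewed tasks.
observed = [[0.95, 0.90], [0.85, 0.70], [0.99, 0.97], [0.60, 0.92]]

for candidate in (0.70, 0.90):
    print(candidate, automation_rate(observed, candidate))
```

A higher threshold automates fewer tasks but leaves less room for incorrect values to pass unreviewed, so the rate should be read alongside how often the reviewed values were actually correct.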

Best practice: Choose Threshold values based on the confidence levels that you can reach in your manual review.

Choose the threshold value carefully: a threshold that is too low lets through predictions that might not be accurate, which can lead reviewers to think the model “is wrong”.

This decision should also consider business objectives, results/variability and benchmarking against previous methods.

Can I continue to improve the model?

Any Document Task that doesn't meet the threshold will need to be reviewed. This is helpful when a new document layout is introduced for example. Document Tasks that are manually reviewed are used to further improve the model.

Typically, Document Tasks processed with Fully Automated mode will not be used for training. Fully Automated mode is triggered when all Required fields meet the confidence threshold. It is still possible to continue reviewing the non-required fields to further improve the model. If a user reviews a document that was automated and resubmits it, that document can be used for training (the field Agent Input = True).

Exporting a Use Case

You can export the Use Case to migrate it to another instance. Use the Add to update set button to quickly capture all the Fields, Flows and the related trained AI models.

This process, along with duplicating and deleting a Use Case, is described in the docs: Manage document extraction use cases.

Congratulations, you have completed the Document Intelligence introduction!

Make sure to review the current limits, such as file formats, size limits, page count, languages, etc., on the documentation site.

As a next step, you can learn how to build an end-to-end process using Document Intelligence with Flow Designer.

References

Documentation

Configuring Document Intelligence

FAQ

Document Intelligence FAQ

DorianK · ‎08-11-2022

This guide is great for retrieving and extracting data!

Do you think you can provide the "next use case here" or expand this example to use a Flow with this data? I imagine a customer has trained their model and have the extracted values in the di_extracted_value table, but the next question becomes "then what"?

Do you envision an admin uploading these docs or an agent? Or maybe the attachments come in from email? How does the E2E flow look once you have a completed model?

Let's say in your example with Purchase Order -> Is it going to be using an internal table to get that Purchase Order or someone is manually uploading that Purchase Order (and not using ITAM/Enterprise Asset/PSM)? After that, let's say you wanted to trigger an approval when a specific value extracted is greater than 100 (is this a valid use case)? Or is this just used for reporting and collecting information for now?

Looking at the generated flow from "Integration Steps" there are target table / fields, key groups (handling tables), and source records that come into play.

DorianK · ‎08-11-2022

Note: https://youtu.be/2R-gq-_q53s is a great video and touches on the flow a bit but there are some concepts that seem to be glossed over such as an orchestration table (invoice task). In the video, it takes the values and stores it in flow and then places the values on the orchestration table (and then you create a flow on that orchestration table to do things with the data). I feel like a flow diagram would be useful for these pieces or better understand if this process will change in the future.

Loic1 · ‎08-29-2022

Maybe you'd find this new article helpful (with a flow diagram) https://community.servicenow.com/community?id=community_article&sys_id=448028f1dbad9150d0dc3feb68961...