Extracting Text from PDF Attachments with Apache PDFBox on a ServiceNow MID Server
Every ServiceNow implementation eventually faces the same challenge: structured data arrives as unstructured PDF attachments. Purchase orders, invoices, shipping manifests, compliance documents — they arrive via email as PDFs, and someone has to manually re-key that data into ServiceNow records.
External OCR services like Google Vision, AWS Textract, or Azure Form Recognizer solve this, but they introduce cost, latency, data residency concerns, and external dependencies. What if you could do it entirely within ServiceNow?
This article shows you how. Using Apache PDFBox deployed as a JAR on the MID Server, you can extract text from text-based PDF attachments with no external services, no API keys, and no data leaving your environment. Scanned, image-only PDFs contain no text layer, so they still need OCR.
What You Will Build
- A MID Server Script Include that extracts text from PDF files page by page
- A cross-scope bridge pattern for reading attachment data from scoped applications
- A solution for ServiceNow’s gzip-compressed, chunked attachment storage
- Flow Designer integration for automated processing
- A text parser that converts raw extracted text into structured table records
Prerequisites
- ServiceNow instance with admin access
- A MID Server with status Up (Windows or Linux)
- A scoped application (or willingness to create one)
- Basic familiarity with Script Includes and Background Scripts
Architecture Overview
The Pipeline
The end-to-end flow has six stages:
# | Stage | What Happens |
1 | Email Arrives | Inbound email with PDF attachment lands in sys_email |
2 | Attachment Read | Global Script Include reads base64 chunks from sys_attachment_doc, stores in scoped wrapper table |
3 | Probe Sent | JavascriptProbe sends chunks to MID Server via ECC Queue |
4 | MID Processing | MID Server decodes base64, decompresses gzip, loads PDF via PDFBox, extracts text per page |
5 | Response Return | JSON result returns via ECC Queue with page text, total pages, and status |
6 | Record Creation | Scoped app parses text into structured records and inserts into target table |
Why This Architecture?
Three design decisions drive this architecture:
MID Server for processing: PDFBox is a Java library. The MID Server runs a JVM and supports custom JARs. This is the only place in ServiceNow where you can run arbitrary Java code.
ECC Queue for transport: The ECC Queue is ServiceNow’s built-in mechanism for communicating with MID Servers. JavascriptProbe wraps this cleanly.
Scoped app with global bridge: Scoped applications cannot access sys_attachment_doc or JavascriptProbe directly. A thin global Script Include bridges this gap while keeping all business logic in the scoped app.
Step 1: Install Apache PDFBox on the MID Server
Download the JAR
You need the PDFBox “app” bundle, which packages all dependencies into a single JAR:
Download URL:
https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.31/pdfbox-app-2.0.31.jar
File size: ~12 MB
Place the JAR
Copy the file to your MID Server’s extlib directory:
Windows: C:\MIDServer\agent\extlib\pdfbox-app-2.0.31.jar
Linux: /opt/servicenow/mid/agent/extlib/pdfbox-app-2.0.31.jar
Register in ecc_agent_jar (Critical)
🔍 Discovery: The MID Server’s FileSync process automatically deletes any JAR from extlib that is not registered in the ecc_agent_jar table. Simply copying the JAR is not enough — it will be removed on the next sync cycle.
To register:
- Navigate to ecc_agent_jar.do in your instance
- Create a new record with Name: pdfbox-app-2.0.31.jar
- Attach the JAR file to the record
- Save
Restart and Verify
# Restart MID Server
net stop "ServiceNow MID Server"
net start "ServiceNow MID Server"
# Verify JAR survives restart (PowerShell)
Get-ChildItem "C:\MIDServer\agent\extlib\pdfbox*"
Validate Class Loading (Background Script)
var probe = new JavascriptProbe("YOUR_MID_SERVER");
probe.setName("PDFBoxValidation");
// Instantiate PDDocument instead of only referencing it: in Rhino, a missing
// class under Packages resolves to a package object and does not throw.
probe.setJavascript(
    "try {" +
    "  new Packages.org.apache.pdfbox.pdmodel.PDDocument().close();" +
    "  '[OK] PDFBox loaded';" +
    "} catch(e) { '[FAIL] ' + e; }"
);
var eccId = probe.create();
gs.info("ECC ID: " + eccId);
// Check the response after about 15 seconds.
// If you run this part as a separate script, replace eccId with the ID logged above.
var ecc = new GlideRecord("ecc_queue");
ecc.addQuery("response_to", eccId);
ecc.addQuery("queue", "input");
ecc.query();
if (ecc.next()) {
    var xmlDoc = new XMLDocument2();
    xmlDoc.parseXML("" + ecc.payload);
    gs.info(xmlDoc.getNodeText("//results/result/output"));
}
✅ Expected output: [OK] PDFBox loaded
Step 2: Understanding ServiceNow Attachment Storage
Before you can send a PDF to the MID Server, you need to understand how ServiceNow stores attachment data. This is the step most likely to cause failures, because the stored data is not a plain base64 copy of the file.
How Attachments Are Stored
Attachment binary data lives in sys_attachment_doc, not sys_attachment. The data is:
- Split into chunks (multiple rows per attachment, ordered by position)
- Each chunk is independently base64 encoded
- The combined data is gzip compressed before encoding
🔍 Discovery 1: Each chunk in sys_attachment_doc is independently base64 encoded. You cannot concatenate the base64 strings and decode them as one — you must decode each chunk separately, then concatenate the raw bytes.
🔍 Discovery 2: ServiceNow gzip-compresses attachment data before base64 encoding. The telltale sign is a base64 string starting with “H4sI”, which is the base64 encoding of the first three gzip header bytes 0x1F 0x8B 0x08 (the two magic bytes plus the deflate method byte). You must decompress after decoding.
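Both discoveries can be checked outside ServiceNow. The following is a minimal Node.js sketch, not platform code: it simulates the storage scheme described above (gzip, split into chunks, base64 each chunk) using a hypothetical 10-byte chunk size, then checks for the “H4sI” prefix and for the mid-string padding that makes a joined string invalid for a strict decoder.

```javascript
const zlib = require("zlib");

// Simulate the storage scheme: gzip the whole payload, split the compressed
// bytes into chunks, then base64-encode each chunk on its own.
// CHUNK_SIZE is a hypothetical value chosen so that padding appears.
const CHUNK_SIZE = 10;
const original = Buffer.from("%PDF-1.4 sample payload for illustration only");
const compressed = zlib.gzipSync(original);

const chunks = [];
for (let i = 0; i < compressed.length; i += CHUNK_SIZE) {
  chunks.push(compressed.subarray(i, i + CHUNK_SIZE).toString("base64"));
}

// Discovery 2: the first chunk starts with "H4sI" (bytes 0x1F 0x8B 0x08).
const startsWithGzipMarker = chunks[0].startsWith("H4sI");

// Discovery 1: joining the strings leaves "=" padding in the middle,
// which is not valid inside a single base64 string.
const joined = chunks.join("");
const firstPad = joined.indexOf("=");
const hasMidStringPadding = firstPad !== -1 && firstPad < joined.length - 2;

console.log(startsWithGzipMarker, hasMidStringPadding); // prints: true true
```

A strict decoder such as java.util.Base64 rejects padding in the middle of its input, which is why the MID Server script decodes chunk by chunk.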
The Correct Reading Sequence
- Query sys_attachment_doc where sys_attachment = attachment sys_id, ordered by position
- Collect base64 strings from each chunk
- Decode each chunk independently to raw bytes
- Concatenate all raw byte arrays
- Check first two bytes for gzip signature (0x1F, 0x8B)
- If gzipped, decompress using GZIPInputStream
- Result: the original PDF binary
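The sequence above can be sketched in standalone Node.js to make the order of operations concrete. This is an illustration under assumptions, not ServiceNow code: reassembleAttachment and simulateStorage are hypothetical names, and the 16-byte chunk size is arbitrary.

```javascript
const zlib = require("zlib");

// Rebuild the original file from independently base64-encoded chunks.
function reassembleAttachment(base64Chunks) {
  // Decode each chunk on its own, then join the raw bytes.
  const raw = Buffer.concat(base64Chunks.map((c) => Buffer.from(c, "base64")));
  // Gzip streams start with the magic bytes 0x1F 0x8B.
  const isGzip = raw.length >= 2 && raw[0] === 0x1f && raw[1] === 0x8b;
  return isGzip ? zlib.gunzipSync(raw) : raw;
}

// Build test input the way the article describes storage: gzip, chunk, encode.
function simulateStorage(fileBytes, chunkSize) {
  const compressed = zlib.gzipSync(fileBytes);
  const out = [];
  for (let i = 0; i < compressed.length; i += chunkSize) {
    out.push(compressed.subarray(i, i + chunkSize).toString("base64"));
  }
  return out;
}

const pdfLike = Buffer.from("%PDF-1.4\nhello from a fake PDF body\n%%EOF");
const restored = reassembleAttachment(simulateStorage(pdfLike, 16));
console.log(restored.equals(pdfLike)); // prints: true
```

The gzip check runs only after all bytes are joined, because the magic bytes appear once at the start of the combined stream, not at the start of every chunk.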
Why Not Use GlideSysAttachment.getBytes()?
You might expect to use GlideSysAttachment.getBytes() or getContentStream(), but in scoped applications these methods are blocked by security restrictions. The only reliable cross-scope approach is reading sys_attachment_doc directly from a global Script Include.
Step 3: Create the Global Bridge Script Include
Navigate to System Definition > Script Includes (in Global scope):
Name: GlobalAttachmentHelper
Accessible from: All application scopes
Script
var GlobalAttachmentHelper = Class.create();
GlobalAttachmentHelper.prototype = {
    initialize: function() {},

    /*
     * Read attachment data as JSON array of base64 chunks
     * This runs in global scope where sys_attachment_doc is accessible
     */
    getAttachmentChunksJson: function(attachmentSysId) {
        var doc = new GlideRecord("sys_attachment_doc");
        doc.addQuery("sys_attachment", attachmentSysId);
        doc.orderBy("position");
        doc.query();
        var chunks = [];
        while (doc.next()) {
            chunks.push("" + doc.data);
        }
        return JSON.stringify(chunks);
    },

    /*
     * Submit a JavascriptProbe to the MID Server
     * JavascriptProbe is package-private and cannot be called from scoped apps
     */
    submitMIDProbe: function(midServerName, probeName, script, paramsJson) {
        var probe = new JavascriptProbe(midServerName);
        probe.setName(probeName);
        probe.setJavascript(script);
        if (paramsJson) {
            var params = JSON.parse(paramsJson);
            for (var key in params) {
                probe.addParameter(key, params[key]);
            }
        }
        return probe.create();
    },

    /*
     * Retrieve probe result from ECC Queue
     */
    getProbeResult: function(eccOutputSysId) {
        var ecc = new GlideRecord("ecc_queue");
        ecc.addQuery("response_to", eccOutputSysId);
        ecc.addQuery("queue", "input");
        ecc.query();
        if (ecc.next()) {
            var xmlDoc = new XMLDocument2();
            xmlDoc.parseXML("" + ecc.payload);
            return xmlDoc.getNodeText("//results/result/output");
        }
        return null;
    },

    type: "GlobalAttachmentHelper"
};
Why Three Methods?
Method | Why It Must Be Global |
getAttachmentChunksJson | sys_attachment_doc has cross-scope read restriction. Scoped apps get access denied. |
submitMIDProbe | JavascriptProbe is package-private in the ServiceNow JVM. Cannot be instantiated from any scoped app. |
getProbeResult | Convenience method. Could technically run from scoped app but keeps all ECC Queue logic centralized. |
Step 4: Create the MID Server Text Extraction Script
Navigate to MID Server > Script Includes > New:
Name: PDFTextExtractor
Active: Checked
Script
var PDFTextExtractor = Class.create();
PDFTextExtractor.prototype = {
    initialize: function() {
        this.PDDocument = Packages.org.apache.pdfbox.pdmodel.PDDocument;
    },

    /*
     * Extract text from PDF using base64-encoded chunks
     * Handles: independent chunk decoding, gzip decompression, per-page extraction
     */
    extractText: function(chunksJson) {
        var response = { status: "success", pages: [], fullText: "", totalPages: 0 };
        var document = null;
        try {
            // Step 1: Decode each chunk independently and concatenate bytes
            var chunks = JSON.parse(chunksJson);
            var decoder = Packages.java.util.Base64.getDecoder();
            var baos = new Packages.java.io.ByteArrayOutputStream();
            for (var i = 0; i < chunks.length; i++) {
                var bytes = decoder.decode(chunks[i]);
                baos.write(bytes, 0, bytes.length);
            }
            var allBytes = baos.toByteArray();

            // Step 2: Check for gzip compression and decompress
            // Java bytes are signed, so mask with 0xFF before comparing
            var isGzip = (allBytes.length >= 2
                && (allBytes[0] & 0xFF) == 0x1F
                && (allBytes[1] & 0xFF) == 0x8B);
            var pdfBytes;
            if (isGzip) {
                var gzis = new Packages.java.util.zip.GZIPInputStream(
                    new Packages.java.io.ByteArrayInputStream(allBytes));
                var out = new Packages.java.io.ByteArrayOutputStream();
                // Allocate a Java byte[4096] read buffer
                var buf = Packages.java.lang.reflect.Array.newInstance(
                    Packages.java.lang.Byte.TYPE, 4096);
                var n;
                while ((n = gzis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                gzis.close();
                pdfBytes = out.toByteArray();
            } else {
                pdfBytes = allBytes;
            }

            // Step 3: Load PDF and extract text page by page
            // (PDDocument.load is the PDFBox 2.x entry point)
            document = this.PDDocument.load(
                new Packages.java.io.ByteArrayInputStream(pdfBytes));
            var stripper = new Packages.org.apache.pdfbox.text.PDFTextStripper();
            var pageCount = document.getNumberOfPages();
            response.totalPages = pageCount;
            var allText = [];
            for (var p = 1; p <= pageCount; p++) {
                stripper.setStartPage(p);
                stripper.setEndPage(p);
                var pageText = "" + stripper.getText(document);
                response.pages.push({ page: p, text: pageText });
                allText.push(pageText);
            }
            response.fullText = allText.join("\n");
        } catch (e) {
            response.status = "error";
            // Fall back to the exception itself when no message is set
            response.error = "" + (e.message || e);
        } finally {
            if (document != null) document.close();
        }
        return JSON.stringify(response);
    },

    type: "PDFTextExtractor"
};
How It Works
# | Operation | Detail |
1 | Parse chunks JSON | Receives the array of base64 strings from the probe parameter |
2 | Decode independently | Each chunk decoded separately via java.util.Base64.getDecoder(), bytes concatenated |
3 | Detect gzip | Checks first two bytes for 0x1F 0x8B magic number |
4 | Decompress | GZIPInputStream reads compressed bytes, outputs original PDF binary |
5 | Load PDF | PDDocument.load() from ByteArrayInputStream (no temp file needed for text-only) |
6 | Extract text | PDFTextStripper processes each page independently, capturing text content |
7 | Return JSON | Structured response with status, per-page text, full concatenated text, total pages |
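On the receiving side, the JSON response can be inspected before any parsing is attempted. The following Node.js sketch assumes only the response shape built by extractText (status, pages, fullText, totalPages, and error on failure); summarizeExtraction is a hypothetical helper and the sample data is made up for illustration.

```javascript
// Summarize the JSON string returned by extractText().
function summarizeExtraction(json) {
  const res = JSON.parse(json);
  if (res.status !== "success") {
    return { ok: false, error: res.error || "unknown error", emptyPages: [] };
  }
  // Pages with no text are often scanned images that would need OCR.
  const emptyPages = res.pages
    .filter((p) => p.text.trim() === "")
    .map((p) => p.page);
  return { ok: true, totalPages: res.totalPages, emptyPages: emptyPages };
}

// Made-up sample: page 2 contains only whitespace.
const sample = JSON.stringify({
  status: "success",
  totalPages: 2,
  pages: [
    { page: 1, text: "PO: PO-6003 - Mayo Surgical\n" },
    { page: 2, text: "   \n" }
  ],
  fullText: "PO: PO-6003 - Mayo Surgical\n\n   \n"
});

console.log(summarizeExtraction(sample));
```

Flagging empty pages early gives you a place to route image-only documents to a different process instead of silently creating no records.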
Step 5: Create the Scoped Application Components
Wrapper Table
Create a table in your scoped app to store attachment data pre-read from sys_attachment_doc. This eliminates repeated cross-scope access:
Label | Column Name | Type | Purpose |
Email Sys ID | email_sys_id | String (32) | Source email |
File Name | file_name | String (255) | Original filename |
File Type | file_type | String (20) | pdf / image |
Status | status | String (20) | Processing state |
ECC Queue ID | ecc_queue_id | String (32) | Probe tracking |
Chunks JSON | chunks_json | String (5M) | Base64 data |
Scoped Script Include: DocumentProcessor
var DocumentProcessor = Class.create();
DocumentProcessor.prototype = {
    initialize: function() {
        this.MID_SERVER = gs.getProperty("your_app.mid_server_name", "MIDServer");
    },

    submitTextExtraction: function(wrapperSysId) {
        var gr = new GlideRecord("your_app_wrapper_table");
        if (!gr.get(wrapperSysId)) return null;
        var chunksJson = "" + gr.chunks_json;
        var helper = new global.GlobalAttachmentHelper();
        var script = 'var ext = new PDFTextExtractor();' +
            'var result = ext.extractText(probe.getParameter("chunks"));' +
            'result;';
        var params = JSON.stringify({ chunks: chunksJson });
        var eccId = helper.submitMIDProbe(
            this.MID_SERVER, "PDFTextExtractor", script, params);
        gr.ecc_queue_id = eccId;
        gr.status = "processing";
        gr.update();
        return eccId;
    },

    getResult: function(eccOutputSysId) {
        var helper = new global.GlobalAttachmentHelper();
        var output = helper.getProbeResult(eccOutputSysId);
        if (output) {
            try { return JSON.parse(output); }
            catch (e) { return { status: "error", error: "" + e.message }; }
        }
        return null;
    },

    type: "DocumentProcessor"
};
⚠ Note the global. prefix when calling GlobalAttachmentHelper. Without it, scoped apps cannot find global Script Includes.
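One detail in submitTextExtraction is easy to miss: the chunk array is JSON-encoded twice, once as chunks_json and again inside the params object. The Node.js sketch below traces that round trip with placeholder chunk values; it mirrors the calls in the listings but is not itself ServiceNow code.

```javascript
// Scoped app side: chunksJson is already a JSON string (an array of base64 chunks).
const chunksJson = JSON.stringify(["H4sIAAAA", "AAAAAA=="]); // placeholder chunk values
const paramsJson = JSON.stringify({ chunks: chunksJson });

// Global bridge side (submitMIDProbe): one parse yields the parameter map.
const params = JSON.parse(paramsJson);
const chunksParam = params.chunks; // still a string, as passed to probe.addParameter

// MID Server side (extractText): a second parse yields the array.
const chunks = JSON.parse(chunksParam);

console.log(typeof chunksParam, Array.isArray(chunks), chunks.length);
// prints: string true 2
```

Keeping the parameter a string until it reaches the MID Server means the bridge never has to know what the payload contains.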
Step 6: Parse Extracted Text into Records
Raw extracted text is unstructured. You need a parser tailored to your document format. Here is an example for pipe-delimited purchase order documents:
Sample Extracted Text
PO: PO-6003 - Mayo Surgical
Mike Ross ID-9667
Surgical Gloves - Type C3 | 1000373 | 0373-298G | 44
Surgical Forceps - Type C1 | 1000030 | 0030-585W | 43
Sarah Chen ID-2379
Sterile Syringe - Type D4 | 1000441 | 0441-826M | 45
Parser Script
// Method fragment: add it inside a Script Include prototype such as DocumentProcessor
parseExtractedText: function(fullText) {
    var records = [];
    var lines = fullText.split(/\r?\n/);
    var currentPerson = "", currentId = "";
    var poNumber = "", orgName = "";
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].trim();
        if (!line) continue;

        // Parse PO header: "PO: PO-6003 - Mayo Surgical"
        var poMatch = line.match(/^PO:\s*([\w-]+)\s*-\s*(.+)/);
        if (poMatch) {
            poNumber = poMatch[1].trim();
            orgName = poMatch[2].trim();
            continue;
        }

        // Parse person: "Mike Ross ID-9667"
        // The $ anchor keeps item lines that merely contain " ID-" from matching
        var personMatch = line.match(/^(.+?)\s+ID-(\S+)$/);
        if (personMatch) {
            currentPerson = personMatch[1].trim();
            currentId = "ID-" + personMatch[2].trim();
            continue;
        }

        // Parse line item: "Name | UPN | Batch | Qty" (Qty may be missing)
        var parts = line.split("|");
        if (parts.length >= 3) {
            records.push({
                po_number: poNumber,
                org_name: orgName,
                person_name: currentPerson,
                person_id: currentId,
                item_name: (parts[0] || "").trim(),
                upn: (parts[1] || "").trim(),
                batch: (parts[2] || "").trim(),
                quantity: (parts[3] || "").trim()
            });
        }
    }
    return records;
},
✅ This parser handles the specific format shown. Adapt the regex patterns and field mappings to match your document format.
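To see what the parser produces, it helps to run it against the sample text. The following is a standalone Node.js adaptation of the same logic (a plain function rather than a Script Include method), intended only as a quick way to test regex changes before deploying them.

```javascript
// Standalone adaptation of the parser so it can run outside a Script Include.
function parseExtractedText(fullText) {
  const records = [];
  let poNumber = "", orgName = "", currentPerson = "", currentId = "";
  for (const rawLine of fullText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line) continue;
    // PO header: "PO: PO-6003 - Mayo Surgical"
    const poMatch = line.match(/^PO:\s*([\w-]+)\s*-\s*(.+)/);
    if (poMatch) {
      poNumber = poMatch[1].trim();
      orgName = poMatch[2].trim();
      continue;
    }
    // Person line: "Mike Ross ID-9667"
    const personMatch = line.match(/^(.+?)\s+ID-(\S+)$/);
    if (personMatch) {
      currentPerson = personMatch[1].trim();
      currentId = "ID-" + personMatch[2];
      continue;
    }
    // Line item: "Name | UPN | Batch | Qty"
    const parts = line.split("|").map((s) => s.trim());
    if (parts.length >= 3) {
      records.push({
        po_number: poNumber, org_name: orgName,
        person_name: currentPerson, person_id: currentId,
        item_name: parts[0], upn: parts[1], batch: parts[2],
        quantity: parts[3] || ""
      });
    }
  }
  return records;
}

const sampleText = [
  "PO: PO-6003 - Mayo Surgical",
  "Mike Ross ID-9667",
  "Surgical Gloves - Type C3 | 1000373 | 0373-298G | 44",
  "Surgical Forceps - Type C1 | 1000030 | 0030-585W | 43",
  "Sarah Chen ID-2379",
  "Sterile Syringe - Type D4 | 1000441 | 0441-826M | 45"
].join("\n");

const records = parseExtractedText(sampleText);
console.log(records.length, records[2].person_name, records[2].quantity);
// prints: 3 Sarah Chen 45
```

Each line item inherits the most recent PO header and person line, so the order of lines in the extracted text matters.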
Step 7: Automate with Flow Designer
Subflow Design
# | Step | Detail |
1 | Submit Extraction | Custom action: reads wrapper table, sends probe to MID Server |
2 | Wait | 60 seconds for MID Server processing |
3 | Poll for Result | Do-the-following-until loop: check ECC Queue for response, wait 15s if pending |
4 | Process Result | Parse text, create structured records, update wrapper status |
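The wait-and-poll logic in steps 2 and 3 is built from Flow Designer components, but the control flow can be sketched in plain JavaScript. In this Node.js illustration, pollForResult, fetchResult, and wait are hypothetical names, and maxAttempts is an assumed cap that the table does not specify; the 60-second and 15-second delays come from the table.

```javascript
// Poll for a probe result with a bounded number of attempts.
// fetchResult and wait are injected so the logic can be exercised anywhere;
// in Flow Designer the equivalents are an ECC Queue lookup and a Wait step.
function pollForResult(fetchResult, wait, options) {
  wait(options.initialDelaySec);
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    const result = fetchResult();
    if (result !== null) {
      return { status: "done", attempts: attempt, result: result };
    }
    // Only wait again if another attempt will follow.
    if (attempt < options.maxAttempts) wait(options.retryDelaySec);
  }
  return { status: "timeout", attempts: options.maxAttempts, result: null };
}

// Simulated run: the result appears on the third check; wait() only records time.
let waitedSec = 0;
let checks = 0;
const outcome = pollForResult(
  function () { checks += 1; return checks >= 3 ? '{"status":"success"}' : null; },
  function (sec) { waitedSec += sec; },
  { initialDelaySec: 60, retryDelaySec: 15, maxAttempts: 5 }
);
console.log(outcome.status, outcome.attempts, waitedSec); // prints: done 3 90
```

Whatever form the loop takes, a maximum attempt count is worth adding so that a MID Server outage ends in a clear timeout status rather than a flow that never finishes.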
Trigger: Business Rule on sys_email
Why not use the Inbound Email trigger in Flow Designer?
🔍 Discovery: Other email processing flows (like AI Agent Email Analyzer) can issue stop-processing directives that block Flow Designer Inbound Email triggers. A Business Rule on sys_email runs independently of the email action pipeline.
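(function executeRule(current, previous) {
    if (gs.getProperty("your_app.master_switch") !== "active") return;
    if (current.type != "received") return;
    if (current.subject.toString().toLowerCase().indexOf("trigger_word") < 0) return;

    // The global. prefix works from both global and scoped Business Rules
    var helper = new global.GlobalAttachmentHelper();
    // NOTE: populateEmailAttachments() is not part of the Step 3 listing.
    // Add it to GlobalAttachmentHelper; it presumably copies each PDF
    // attachment's chunks into the wrapper table.
    helper.populateEmailAttachments("" + current.sys_id);

    // Run in the background: the subflow waits on the MID Server, which
    // would otherwise block this Business Rule's transaction
    sn_fd.FlowAPI.getRunner()
        .subflow("your_scope.your_subflow_name")
        .inBackground()
        .withInputs({ "email_sys_id": "" + current.sys_id })
        .run();
})(current, previous);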
Step 8: Test the Complete Pipeline
8A: Quick Validation (Background Script)
// 1. Find a PDF attachment
var att = new GlideRecord("sys_attachment");
att.addQuery("content_type", "application/pdf");
att.orderByDesc("sys_created_on");
att.setLimit(1);
att.query();
if (att.next()) gs.info("Attachment: " + att.sys_id + " | " + att.file_name);
// 2. Read chunks
var helper = new global.GlobalAttachmentHelper();
var chunksJson = helper.getAttachmentChunksJson("ATTACHMENT_SYSID");
var chunks = JSON.parse(chunksJson);
gs.info("Chunks: " + chunks.length);
// 3. Send to MID Server
var script = 'var ext = new PDFTextExtractor();' +
'var result = ext.extractText(probe.getParameter("chunks"));' +
'result;';
var eccId = helper.submitMIDProbe("YOUR_MID", "TextTest",
script, JSON.stringify({ chunks: chunksJson }));
gs.info("ECC ID: " + eccId);
// 4. Check result (run after 30 seconds)
var output = helper.getProbeResult("ECC_ID_HERE");
gs.info("Result: " + output);
8B: Full Email Test
- Send an email with subject containing your trigger keyword
- Attach a PDF document
- Wait 60-90 seconds
- Check Flow Designer > Executions
- Check your target table for created records
Troubleshooting
Problem | Solution |
ClassNotFoundException for PDFBox | JAR not in extlib, or not registered in ecc_agent_jar. Register and restart MID. |
JAR disappears after MID restart | FileSync deleting unregistered JARs. Must create ecc_agent_jar record with file attached. |
IllegalArgumentException on Base64 decode | Chunks have been concatenated as strings before decoding. Decode each chunk independently. |
ZipException: not in GZIP format | Chunks decoded independently but bytes not concatenated before decompression. Ensure bytes are joined first. |
GlobalAttachmentHelper undefined | Missing global. prefix. Use: new global.GlobalAttachmentHelper() |
JavascriptProbe access denied | Package-private class. Must be called from global scope Script Include, not from scoped app. |
sys_attachment_doc access denied | Cross-scope restriction. Read from global Script Include, store in scoped wrapper table. |
ECC Queue no response | MID Server may be processing. Increase wait time. Check MID Server logs for errors. |
Empty text extracted | PDF may be a scanned image with no text layer. PDFTextStripper only extracts embedded text; it does not perform OCR. Consider combining with barcode scanning for image-based PDFs. |
Flow not triggering on email | Another email action issues stop-processing. Use Business Rule on sys_email instead. |
Key Takeaways
- Platform-native PDF processing is possible: Apache PDFBox on the MID Server eliminates external OCR dependencies for text-based PDFs.
- Attachment storage has hidden complexity: Gzip compression and independent chunk encoding are undocumented behaviors that you must handle explicitly.
- Cross-scope bridging is essential: A thin global Script Include wrapping sys_attachment_doc reads and JavascriptProbe calls keeps the scoped app clean while bypassing platform restrictions.
- FileSync will delete your JARs: The ecc_agent_jar table is the only way to persist custom JARs across MID Server restarts.
- Business Rules beat email triggers: For reliable inbound email processing, use a Business Rule on sys_email rather than Flow Designer Inbound Email triggers.
