Extracting Text from PDF Attachments Using Apache PDFBox on the ServiceNow MID Server

karthik65 · 04-15-2026

Many ServiceNow implementations eventually face the same challenge: structured data arrives locked inside PDF attachments. Purchase orders, invoices, shipping manifests, and compliance documents come in via email as PDFs, and someone has to manually re-key that data into ServiceNow records.

External OCR services like Google Vision, AWS Textract, or Azure Form Recognizer solve this, but they introduce cost, latency, data residency concerns, and external dependencies. What if you could do it entirely within ServiceNow?

This article shows you how. Using Apache PDFBox deployed as a JAR on the MID Server, you can extract text from text-based PDF attachments with no external services, no API keys, and no data leaving your environment.

What You Will Build

A MID Server Script Include that extracts text from PDF files page by page
A cross-scope bridge pattern for reading attachment data from scoped applications
A solution for ServiceNow’s gzip-compressed, chunked attachment storage
Flow Designer integration for automated processing
A text parser that converts raw extracted text into structured table records

Prerequisites

ServiceNow instance with admin access
A MID Server with status Up (Windows or Linux)
A scoped application (or willingness to create one)
Basic familiarity with Script Includes and Background Scripts

Architecture Overview

The Pipeline

The end-to-end flow has six stages:

#	Stage	What Happens
1	Email Arrives	Inbound email with PDF attachment lands in sys_email
2	Attachment Read	Global Script Include reads base64 chunks from sys_attachment_doc, stores in scoped wrapper table
3	Probe Sent	JavascriptProbe sends chunks to MID Server via ECC Queue
4	MID Processing	MID Server decodes base64, decompresses gzip, loads PDF via PDFBox, extracts text per page
5	Response Return	JSON result returns via ECC Queue with page text, total pages, and status
6	Record Creation	Scoped app parses text into structured records and inserts into target table

Why This Architecture?

Three design decisions drive this architecture:

MID Server for processing: PDFBox is a Java library. The MID Server runs a JVM and supports custom JARs. This is the only place in ServiceNow where you can run arbitrary Java code.

ECC Queue for transport: The ECC Queue is ServiceNow’s built-in mechanism for communicating with MID Servers. JavascriptProbe wraps this cleanly.

Scoped app with global bridge: Scoped applications cannot access sys_attachment_doc or JavascriptProbe directly. A thin global Script Include bridges this gap while keeping all business logic in the scoped app.

Step 1: Install Apache PDFBox on the MID Server

Download the JAR

You need the PDFBox “app” bundle, which packages all dependencies into a single JAR:

Download URL:

https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.31/pdfbox-app-2.0.31.jar

File size: ~12 MB

Place the JAR

Copy the file to your MID Server’s extlib directory:

Windows: C:\MIDServer\agent\extlib\pdfbox-app-2.0.31.jar

Linux: /opt/servicenow/mid/agent/extlib/pdfbox-app-2.0.31.jar

Register in ecc_agent_jar (Critical)

🔍 Discovery: The MID Server’s FileSync process automatically deletes any JAR from extlib that is not registered in the ecc_agent_jar table. Simply copying the JAR is not enough — it will be removed on the next sync cycle.

To register:

Navigate to ecc_agent_jar.do in your instance
Create a new record with Name: pdfbox-app-2.0.31.jar
Attach the JAR file to the record
Save

Restart and Verify

# Restart MID Server (Windows; the service name may differ in your installation)

net stop "ServiceNow MID Server"

net start "ServiceNow MID Server"

# Verify JAR survives restart (PowerShell)

Get-ChildItem "C:\MIDServer\agent\extlib\pdfbox*"

Validate Class Loading (Background Script)

var probe = new JavascriptProbe("YOUR_MID_SERVER");

probe.setName("PDFBoxValidation");

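probe.setJavascript(

// Instantiate the class: in Rhino a bare Packages reference does not throw when the class is missing

"try {" +

" var d = new Packages.org.apache.pdfbox.pdmodel.PDDocument(); d.close();" +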

" '[OK] PDFBox loaded';" +

"} catch(e) { '[FAIL] ' + e; }"

);

var eccId = probe.create();

gs.info("ECC ID: " + eccId);

// Check the response: run the lines below as a separate script after about 15 seconds, replacing eccId with the ECC ID logged above

var ecc = new GlideRecord("ecc_queue");

ecc.addQuery("response_to", eccId);

ecc.addQuery("queue", "input");

ecc.query();

if (ecc.next()) {

var xmlDoc = new XMLDocument2();

xmlDoc.parseXML("" + ecc.payload);

gs.info(xmlDoc.getNodeText("//results/result/output"));

}

✅ Expected output: [OK] PDFBox loaded

Step 2: Understanding ServiceNow Attachment Storage

Before you can send a PDF to the MID Server, you need to understand how ServiceNow stores attachment data. This is where most implementations fail.

How Attachments Are Stored

Attachment binary data lives in sys_attachment_doc, not sys_attachment. The data is:

Gzip-compressed as a whole before it is stored
Split into chunks (multiple rows per attachment, ordered by position)
Base64-encoded one chunk at a time, so each row holds an independent base64 string

🔍 Discovery 1: Each chunk in sys_attachment_doc is independently base64 encoded. You cannot concatenate the base64 strings and decode them as one — you must decode each chunk separately, then concatenate the raw bytes.

🔍 Discovery 2: ServiceNow gzip-compresses attachment data before base64 encoding. The telltale sign is a first chunk whose base64 string starts with “H4sI” (the base64 encoding of the gzip header bytes 0x1F 0x8B 0x08). You must decompress after decoding.

The Correct Reading Sequence

Query sys_attachment_doc where sys_attachment = attachment sys_id, ordered by position
Collect base64 strings from each chunk
Decode each chunk independently to raw bytes
Concatenate all raw byte arrays
Check first two bytes for gzip signature (0x1F, 0x8B)
If gzipped, decompress using GZIPInputStream
Result: the original PDF binary
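
The sequence above can be tried outside ServiceNow before touching the MID Server. The following is a minimal Node.js sketch, not platform code. The function names simulateStoredChunks and rebuildFile are hypothetical, the 16-byte chunk size is arbitrary (it is not ServiceNow's real chunk size), and the simulation only assumes the storage layout described in this section: gzip the whole file, split it, then base64 each chunk.

```javascript
var zlib = require("zlib");

// Simulate the storage layout: gzip the whole file, split, base64 each chunk.
// chunkSize is an illustrative value only.
function simulateStoredChunks(fileBytes, chunkSize) {
  var gz = zlib.gzipSync(fileBytes);
  var chunks = [];
  for (var i = 0; i < gz.length; i += chunkSize) {
    chunks.push(gz.subarray(i, i + chunkSize).toString("base64"));
  }
  return chunks;
}

// The reading sequence: decode each chunk on its own, join the raw bytes,
// then gunzip only if the gzip magic bytes are present.
function rebuildFile(base64Chunks) {
  var decoded = base64Chunks.map(function (c) {
    return Buffer.from(c, "base64");
  });
  var joined = Buffer.concat(decoded);
  var isGzip = joined.length >= 2 && joined[0] === 0x1f && joined[1] === 0x8b;
  return isGzip ? zlib.gunzipSync(joined) : joined;
}

var original = Buffer.from("%PDF-1.4 fake body for illustration ".repeat(50));
var stored = simulateStoredChunks(original, 16);
var rebuilt = rebuildFile(stored);

console.log(stored[0].slice(0, 4));    // "H4sI"
console.log(rebuilt.equals(original)); // true
```

Because 16 is not a multiple of 3, every full chunk ends in base64 padding. That padding is why the chunk strings cannot simply be joined and decoded as one.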

Why Not Use GlideSysAttachment.getBytes()?

You might expect to use GlideSysAttachment.getBytes() or getContentStream(). getBytes() is not exposed to scoped applications, and in this setup getContentStream() was also blocked by security restrictions. Reading sys_attachment_doc directly from a global Script Include proved to be the reliable cross-scope approach.

Step 3: Create the Global Bridge Script Include

Navigate to System Definition > Script Includes (in Global scope):

Name: GlobalAttachmentHelper

Accessible from: All application scopes

Script

var GlobalAttachmentHelper = Class.create();

GlobalAttachmentHelper.prototype = {

initialize: function() {},

/*

* Read attachment data as JSON array of base64 chunks

* This runs in global scope where sys_attachment_doc is accessible

*/

getAttachmentChunksJson: function(attachmentSysId) {

var doc = new GlideRecord("sys_attachment_doc");

doc.addQuery("sys_attachment", attachmentSysId);

doc.orderBy("position");

doc.query();

var chunks = [];

while (doc.next()) {

chunks.push("" + doc.data);

}

return JSON.stringify(chunks);

},

/*

* Submit a JavascriptProbe to the MID Server

* JavascriptProbe is restricted to the global scope and cannot be called from scoped apps

*/

submitMIDProbe: function(midServerName, probeName, script, paramsJson) {

var probe = new JavascriptProbe(midServerName);

probe.setName(probeName);

probe.setJavascript(script);

if (paramsJson) {

var params = JSON.parse(paramsJson);

for (var key in params) {

probe.addParameter(key, params[key]);

}

}

return probe.create();

},

/*

* Retrieve probe result from ECC Queue

*/

getProbeResult: function(eccOutputSysId) {

var ecc = new GlideRecord("ecc_queue");

ecc.addQuery("response_to", eccOutputSysId);

ecc.addQuery("queue", "input");

ecc.query();

if (ecc.next()) {

var xmlDoc = new XMLDocument2();

xmlDoc.parseXML("" + ecc.payload);

return xmlDoc.getNodeText("//results/result/output");

}

return null;

},

type: "GlobalAttachmentHelper"

};

Why Three Methods?

Method	Why It Must Be Global
getAttachmentChunksJson	sys_attachment_doc has cross-scope read restriction. Scoped apps get access denied.
submitMIDProbe	JavascriptProbe is only available in the global scope. It cannot be instantiated from a scoped app.
getProbeResult	Convenience method. Could technically run from scoped app but keeps all ECC Queue logic centralized.

Step 4: Create the MID Server Text Extraction Script

Navigate to MID Server > Script Includes > New:

Name: PDFTextExtractor

Active: Checked

Script

var PDFTextExtractor = Class.create();

PDFTextExtractor.prototype = {

initialize: function() {

this.PDDocument = Packages.org.apache.pdfbox.pdmodel.PDDocument;

},

/*

* Extract text from PDF using base64-encoded chunks

* Handles: independent chunk decoding, gzip decompression, per-page extraction

*/

extractText: function(chunksJson) {

var response = { status: "success", pages: [], fullText: "", totalPages: 0 };

var document = null;

try {

// Step 1: Decode each chunk independently and concatenate bytes

var chunks = JSON.parse(chunksJson);

var decoder = Packages.java.util.Base64.getDecoder();

var baos = new Packages.java.io.ByteArrayOutputStream();

for (var i = 0; i < chunks.length; i++) {

var bytes = decoder.decode(chunks[i]);

baos.write(bytes, 0, bytes.length);

}

var allBytes = baos.toByteArray();

// Step 2: Check for gzip compression and decompress

var isGzip = (allBytes.length > 2

&& (allBytes[0] & 0xFF) == 0x1F

&& (allBytes[1] & 0xFF) == 0x8B);

var pdfBytes;

if (isGzip) {

var gzis = new Packages.java.util.zip.GZIPInputStream(

new Packages.java.io.ByteArrayInputStream(allBytes));

var out = new Packages.java.io.ByteArrayOutputStream();

var buf = Packages.java.lang.reflect.Array.newInstance(

Packages.java.lang.Byte.TYPE, 4096);

var n;

while ((n = gzis.read(buf)) != -1) {

out.write(buf, 0, n);

}

gzis.close();

pdfBytes = out.toByteArray();

} else {

pdfBytes = allBytes;

}

// Step 3: Load PDF and extract text page by page

document = this.PDDocument.load(

new Packages.java.io.ByteArrayInputStream(pdfBytes));

var stripper = new Packages.org.apache.pdfbox.text.PDFTextStripper();

var pageCount = document.getNumberOfPages();

response.totalPages = pageCount;

var allText = [];

for (var p = 1; p <= pageCount; p++) {

stripper.setStartPage(p);

stripper.setEndPage(p);

var pageText = "" + stripper.getText(document);

response.pages.push({ page: p, text: pageText });

allText.push(pageText);

}

response.fullText = allText.join("\n");

} catch (e) {

response.status = "error";

response.error = "" + e.message;

} finally {

if (document != null) document.close();

}

return JSON.stringify(response);

},

type: "PDFTextExtractor"

};

How It Works

#	Operation	Detail
1	Parse chunks JSON	Receives the array of base64 strings from the probe parameter
2	Decode independently	Each chunk decoded separately via java.util.Base64.getDecoder(), bytes concatenated
3	Detect gzip	Checks first two bytes for 0x1F 0x8B magic number
4	Decompress	GZIPInputStream reads compressed bytes, outputs original PDF binary
5	Load PDF	PDDocument.load() from ByteArrayInputStream (no temp file needed for text-only)
6	Extract text	PDFTextStripper processes each page independently, capturing text content
7	Return JSON	Structured response with status, per-page text, full concatenated text, total pages
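
On the instance side, the per-page structure of the response makes it easy to spot pages that produced no text, which is the usual sign of a scanned page. The following is a minimal sketch that assumes only the response shape built by extractText above (status, pages, fullText, totalPages). The helper name findEmptyPages is hypothetical.

```javascript
// Hypothetical helper: list page numbers whose extracted text is empty or whitespace.
function findEmptyPages(responseJson) {
  var response = JSON.parse(responseJson);
  if (response.status !== "success") {
    throw new Error("Extraction failed: " + (response.error || "unknown error"));
  }
  var empty = [];
  for (var i = 0; i < response.pages.length; i++) {
    var text = String(response.pages[i].text || "");
    if (!text.trim()) {
      empty.push(response.pages[i].page);
    }
  }
  return empty;
}

// Sample response in the shape returned by PDFTextExtractor.extractText
var sampleResponse = JSON.stringify({
  status: "success",
  totalPages: 2,
  pages: [
    { page: 1, text: "PO: PO-6003 - Mayo Surgical\n" },
    { page: 2, text: " \n" }
  ],
  fullText: "PO: PO-6003 - Mayo Surgical\n\n \n"
});

console.log(findEmptyPages(sampleResponse)); // page 2 has no text
```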

Step 5: Create the Scoped Application Components

Wrapper Table

Create a table in your scoped app to store attachment data pre-read from sys_attachment_doc. This eliminates repeated cross-scope access:

Label	Column Name	Type	Purpose
Email Sys ID	email_sys_id	String (32)	Source email
File Name	file_name	String (255)	Original filename
File Type	file_type	String (20)	pdf / image
Status	status	String (20)	Processing state
ECC Queue ID	ecc_queue_id	String (32)	Probe tracking
Chunks JSON	chunks_json	String (5M)	Base64 data

Scoped Script Include: DocumentProcessor

var DocumentProcessor = Class.create();

DocumentProcessor.prototype = {

initialize: function() {

this.MID_SERVER = gs.getProperty("your_app.mid_server_name", "MIDServer");

},

submitTextExtraction: function(wrapperSysId) {

var gr = new GlideRecord("your_app_wrapper_table");

if (!gr.get(wrapperSysId)) return null;

var chunksJson = "" + gr.chunks_json;

var helper = new global.GlobalAttachmentHelper();

var script = 'var ext = new PDFTextExtractor();' +

'var result = ext.extractText(probe.getParameter("chunks"));' +

'result;';

var params = JSON.stringify({ chunks: chunksJson });

var eccId = helper.submitMIDProbe(

this.MID_SERVER, "PDFTextExtractor", script, params);

gr.ecc_queue_id = eccId;

gr.status = "processing";

gr.update();

return eccId;

},

getResult: function(eccOutputSysId) {

var helper = new global.GlobalAttachmentHelper();

var output = helper.getProbeResult(eccOutputSysId);

if (output) {

try { return JSON.parse(output); }

catch (e) { return { status: "error", error: "" + e.message }; }

}

return null;

},

type: "DocumentProcessor"

};

⚠ Note the global. prefix when calling GlobalAttachmentHelper. Without it, scoped apps cannot find global Script Includes.

Step 6: Parse Extracted Text into Records

Raw extracted text is unstructured. You need a parser tailored to your document format. Here is an example for pipe-delimited purchase order documents:

Sample Extracted Text

PO: PO-6003 - Mayo Surgical

Mike Ross ID-9667

Surgical Gloves - Type C3 | 1000373 | 0373-298G | 44

Surgical Forceps - Type C1 | 1000030 | 0030-585W | 43

Sarah Chen ID-2379

Sterile Syringe - Type D4 | 1000441 | 0441-826M | 45

Parser Script

parseExtractedText: function(fullText) {

var records = [];

var lines = fullText.split(/\r?\n/);

var currentPerson = "", currentId = "";

var poNumber = "", orgName = "";

for (var i = 0; i < lines.length; i++) {

var line = lines[i].trim();

if (!line) continue;

// Parse PO header: "PO: PO-6003 - Mayo Surgical"

var poMatch = line.match(/^PO:\s*([\w-]+)\s*-\s*(.+)/);

if (poMatch) {

poNumber = poMatch[1].trim();

orgName = poMatch[2].trim();

continue;

}

// Parse person: "Mike Ross ID-9667"

var personMatch = line.match(/^(.+?)\s+ID-(\S+)/);

if (personMatch) {

currentPerson = personMatch[1].trim();

currentId = "ID-" + personMatch[2].trim();

continue;

}

// Parse line item: "Name | UPN | Batch | Qty"

var parts = line.split("|");

if (parts.length >= 3) {

records.push({

po_number: poNumber,

org_name: orgName,

person_name: currentPerson,

person_id: currentId,

item_name: (parts[0] || "").trim(),

upn: (parts[1] || "").trim(),

batch: (parts[2] || "").trim(),

quantity: (parts[3] || "").trim()

});

}

} // end of for loop over lines

return records;

},

✅ This parser handles the specific format shown. Adapt the regex patterns and field mappings to match your document format.
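
Before inserting parsed rows into a target table, it can help to separate usable records from incomplete ones. The following is a minimal sketch, not part of the original solution. The helper name validateRecords and the choice of required fields are assumptions; the record shape is the one produced by parseExtractedText above.

```javascript
// Hypothetical helper: split parsed records into valid rows and rejects.
// Required fields (po_number, item_name, whole-number quantity) are an assumption.
function validateRecords(records) {
  var valid = [];
  var rejected = [];
  for (var i = 0; i < records.length; i++) {
    var rec = records[i];
    var problems = [];
    if (!rec.po_number) problems.push("missing po_number");
    if (!rec.item_name) problems.push("missing item_name");
    var rawQty = String(rec.quantity || "");
    var qty = /^\d+$/.test(rawQty) ? parseInt(rawQty, 10) : NaN;
    if (isNaN(qty)) problems.push("quantity is not a whole number");

    if (problems.length > 0) {
      rejected.push({ index: i, problems: problems });
      continue;
    }
    // Copy the record so the caller's object is not modified
    var copy = {};
    for (var key in rec) {
      if (Object.prototype.hasOwnProperty.call(rec, key)) copy[key] = rec[key];
    }
    copy.quantity = qty;
    valid.push(copy);
  }
  return { valid: valid, rejected: rejected };
}

var checked = validateRecords([
  { po_number: "PO-6003", item_name: "Surgical Gloves - Type C3", upn: "1000373", batch: "0373-298G", quantity: "44" },
  { po_number: "PO-6003", item_name: "Surgical Forceps - Type C1", upn: "1000030", batch: "0030-585W", quantity: "" }
]);

console.log(checked.valid.length, checked.rejected.length); // 1 1
```

Rejected rows keep their index into the parser output, so they can be logged against the wrapper record instead of being silently dropped.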

Step 7: Automate with Flow Designer

Subflow Design

#	Step	Detail
1	Submit Extraction	Custom action: reads wrapper table, sends probe to MID Server
2	Wait	60 seconds for MID Server processing
3	Poll for Result	Do-the-following-until loop: check ECC Queue for response, wait 15s if pending
4	Process Result	Parse text, create structured records, update wrapper status
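
The decision made on each pass of the polling loop can be sketched as a small pure function, which is easier to test than the flow itself. This is only an illustration under stated assumptions: nextPollAction, the action names, and maxAttempts are hypothetical. The input is the string returned by getProbeResult (null while the MID Server has not answered), and the 15-second wait comes from the table above.

```javascript
// Hypothetical decision helper for one pass of the polling loop.
// output: string from getProbeResult, or null/empty while no response exists
// attempt: 1-based count of polls so far; maxAttempts: assumed upper bound
function nextPollAction(output, attempt, maxAttempts) {
  if (output === null || output === undefined || output === "") {
    if (attempt >= maxAttempts) {
      return { action: "timeout" };
    }
    return { action: "wait", seconds: 15 };
  }
  try {
    var parsed = JSON.parse(output);
    if (parsed.status === "success") {
      return { action: "process", result: parsed };
    }
    return { action: "fail", error: parsed.error || "unknown error" };
  } catch (e) {
    return { action: "fail", error: "unparseable response: " + e.message };
  }
}

console.log(nextPollAction(null, 1, 10).action);  // "wait"
console.log(nextPollAction(null, 10, 10).action); // "timeout"
```

Bounding the number of attempts keeps a stopped MID Server from leaving the subflow polling indefinitely.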

Trigger: Business Rule on sys_email

Why not use the Inbound Email trigger in Flow Designer?

🔍 Discovery: Other email processing flows (like AI Agent Email Analyzer) can issue stop-processing directives that block Flow Designer Inbound Email triggers. A Business Rule on sys_email runs independently of the email action pipeline.

(function executeRule(current, previous) {

if (gs.getProperty("your_app.master_switch") !== "active") return;

if (current.type != "received") return;

if (current.subject.toString().toLowerCase().indexOf("trigger_word") < 0) return;

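// The global. prefix works whether this rule lives in global scope or in the scoped app

var helper = new global.GlobalAttachmentHelper();

// populateEmailAttachments is not listed in Step 3; it is expected to copy each PDF's chunks into the wrapper table

helper.populateEmailAttachments("" + current.sys_id);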

sn_fd.FlowAPI.getRunner()

.subflow("your_scope.your_subflow_name")

.inBackground() // the subflow contains Wait steps, so run it asynchronously

.withInputs({ "email_sys_id": "" + current.sys_id })

.run();

})(current, previous);

Step 8: Test the Complete Pipeline

8A: Quick Validation (Background Script)

// 1. Find a PDF attachment

var att = new GlideRecord("sys_attachment");

att.addQuery("content_type", "application/pdf");

att.orderByDesc("sys_created_on");

att.setLimit(1);

att.query();

if (att.next()) gs.info("Attachment: " + att.sys_id + " | " + att.file_name);

// 2. Read chunks (replace ATTACHMENT_SYSID with the sys_id logged in step 1)

var helper = new global.GlobalAttachmentHelper();

var chunksJson = helper.getAttachmentChunksJson("ATTACHMENT_SYSID");

var chunks = JSON.parse(chunksJson);

gs.info("Chunks: " + chunks.length);

// 3. Send to MID Server

var script = 'var ext = new PDFTextExtractor();' +

'var result = ext.extractText(probe.getParameter("chunks"));' +

'result;';

var eccId = helper.submitMIDProbe("YOUR_MID", "TextTest",

script, JSON.stringify({ chunks: chunksJson }));

gs.info("ECC ID: " + eccId);

// 4. Check result (run separately after about 30 seconds; re-create helper and replace ECC_ID_HERE with the logged ECC ID)

var output = helper.getProbeResult("ECC_ID_HERE");

gs.info("Result: " + output);

8B: Full Email Test

Send an email with subject containing your trigger keyword
Attach a PDF document
Wait 60-90 seconds
Check Flow Designer > Executions
Check your target table for created records

Troubleshooting

Problem	Solution
ClassNotFoundException for PDFBox	JAR not in extlib, or not registered in ecc_agent_jar. Register and restart MID.
JAR disappears after MID restart	FileSync deleting unregistered JARs. Must create ecc_agent_jar record with file attached.
IllegalArgumentException on Base64 decode	Chunks have been concatenated as strings before decoding. Decode each chunk independently.
ZipException: not in GZIP format	Chunks decoded independently but bytes not concatenated before decompression. Ensure bytes are joined first.
GlobalAttachmentHelper undefined	Missing global. prefix. Use: new global.GlobalAttachmentHelper()
JavascriptProbe access denied	JavascriptProbe is only available in the global scope. Call it from a global Script Include, not from the scoped app.
sys_attachment_doc access denied	Cross-scope restriction. Read from global Script Include, store in scoped wrapper table.
ECC Queue no response	MID Server may be processing. Increase wait time. Check MID Server logs for errors.
Empty text extracted	PDF may be a scanned image (not text-based). PDFTextStripper only extracts embedded text and does not perform OCR. Image-based PDFs need a separate OCR or barcode-scanning step.
Flow not triggering on email	Another email action issues stop-processing. Use Business Rule on sys_email instead.
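
When a decode or decompression error shows up, a quick look at the first base64 chunk tells you what kind of data you are holding. The following is a minimal sketch. The helper name sniffFirstChunk is hypothetical, and it relies only on two fixed prefixes: gzip data encodes to a string starting with "H4sI", and an uncompressed PDF (which begins with "%PDF-") encodes to a string starting with "JVBERi".

```javascript
// Hypothetical diagnostic: classify attachment data by its first base64 chunk.
function sniffFirstChunk(firstChunkBase64) {
  var head = String(firstChunkBase64 || "").substring(0, 6);
  if (head.indexOf("H4sI") === 0) return "gzip";   // gzip header 0x1F 0x8B 0x08
  if (head.indexOf("JVBERi") === 0) return "pdf";  // "%PDF-" header, not compressed
  return "unknown";
}

console.log(sniffFirstChunk("H4sIAAAAAAAAA")); // "gzip"
console.log(sniffFirstChunk("JVBERi0xLjcK"));  // "pdf"
console.log(sniffFirstChunk("AAAA"));          // "unknown"
```

An "unknown" result on the first chunk usually means the chunks were read out of order or the wrong attachment sys_id was used.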

Key Takeaways

Platform-native PDF processing is possible: Apache PDFBox on the MID Server eliminates external OCR dependencies for text-based PDFs.
Attachment storage has hidden complexity: Gzip compression and independent chunk encoding are undocumented behaviors that you must handle explicitly.
Cross-scope bridging is essential: A thin global Script Include wrapping sys_attachment_doc reads and JavascriptProbe calls keeps the scoped app clean while bypassing platform restrictions.
FileSync will delete your JARs: The ecc_agent_jar table is the only way to persist custom JARs across MID Server restarts.
Business Rules beat email triggers: For reliable inbound email processing, use a Business Rule on sys_email rather than Flow Designer Inbound Email triggers.

Rafael Batistot · 04-20-2026

Hi @karthik65

Very interesting, how would I do the opposite? Write something into an already created PDF?
