Extracting Text from PDF Attachments Using Apache PDFBox on the ServiceNow MID Server

karthik65
Tera Guru

Every ServiceNow implementation eventually faces the same challenge: structured data arrives as unstructured PDF attachments. Purchase orders, invoices, shipping manifests, compliance documents — they arrive via email as PDFs, and someone has to manually re-key that data into ServiceNow records.

External OCR services like Google Vision, AWS Textract, or Azure Form Recognizer solve this, but they introduce cost, latency, data residency concerns, and external dependencies. What if you could do it entirely within ServiceNow?

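This article shows you how. Using Apache PDFBox deployed as a JAR on the MID Server, you can extract text from any text-based PDF attachment with no external services, no API keys, and no data leaving your environment. Scanned, image-only PDFs still need OCR, because PDFBox reads embedded text and does not recognize characters in images.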

What You Will Build

  • A MID Server Script Include that extracts text from PDF files page by page
  • A cross-scope bridge pattern for reading attachment data from scoped applications
  • A solution for ServiceNow’s gzip-compressed, chunked attachment storage
  • Flow Designer integration for automated processing
  • A text parser that converts raw extracted text into structured table records

Prerequisites

  • ServiceNow instance with admin access
  • A MID Server with status Up (Windows or Linux)
  • A scoped application (or willingness to create one)
  • Basic familiarity with Script Includes and Background Scripts

 

 

Architecture Overview

The Pipeline

The end-to-end flow has six stages:

| # | Stage | What Happens |
|---|---|---|
| 1 | Email Arrives | Inbound email with PDF attachment lands in sys_email |
| 2 | Attachment Read | Global Script Include reads base64 chunks from sys_attachment_doc, stores in scoped wrapper table |
| 3 | Probe Sent | JavascriptProbe sends chunks to MID Server via ECC Queue |
| 4 | MID Processing | MID Server decodes base64, decompresses gzip, loads PDF via PDFBox, extracts text per page |
| 5 | Response Return | JSON result returns via ECC Queue with page text, total pages, and status |
| 6 | Record Creation | Scoped app parses text into structured records and inserts into target table |


Why This Architecture?

Three design decisions drive this architecture:

MID Server for processing: PDFBox is a Java library. The MID Server runs a JVM and supports custom JARs. This is the only place in ServiceNow where you can run arbitrary Java code.

ECC Queue for transport: The ECC Queue is ServiceNow’s built-in mechanism for communicating with MID Servers. JavascriptProbe wraps this cleanly.

Scoped app with global bridge: Scoped applications cannot access sys_attachment_doc or JavascriptProbe directly. A thin global Script Include bridges this gap while keeping all business logic in the scoped app.

 

 

Step 1: Install Apache PDFBox on the MID Server

Download the JAR

You need the PDFBox “app” bundle, which packages all dependencies into a single JAR:

Download URL:

https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.31/pdfbox-app-2.0.31.jar

 

File size: ~12 MB

Place the JAR

Copy the file to your MID Server’s extlib directory:

Windows: C:\MIDServer\agent\extlib\pdfbox-app-2.0.31.jar

Linux:   /opt/servicenow/mid/agent/extlib/pdfbox-app-2.0.31.jar


Register in ecc_agent_jar (Critical)

🔍 Discovery: The MID Server’s FileSync process automatically deletes any JAR from extlib that is not registered in the ecc_agent_jar table. Simply copying the JAR is not enough — it will be removed on the next sync cycle.

To register:

  1. Navigate to ecc_agent_jar.do in your instance
  2. Create a new record with Name: pdfbox-app-2.0.31.jar
  3. Attach the JAR file to the record
  4. Save

Restart and Verify

# Restart MID Server
net stop "ServiceNow MID Server"
net start "ServiceNow MID Server"

# Verify JAR survives restart (PowerShell)
Get-ChildItem "C:\MIDServer\agent\extlib\pdfbox*"

Validate Class Loading (Background Script)

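// Instantiate the class rather than just referencing it: Rhino resolves an
// unknown Packages path to a package object without throwing, so a bare
// reference would report success even when the JAR is missing.
var probe = new JavascriptProbe("YOUR_MID_SERVER");
probe.setName("PDFBoxValidation");
probe.setJavascript(
    "var out;" +
    "try {" +
    "  var doc = new Packages.org.apache.pdfbox.pdmodel.PDDocument();" +
    "  doc.close();" +
    "  out = '[OK] PDFBox loaded';" +
    "} catch(e) { out = '[FAIL] ' + e; }" +
    "out;"
);
var eccId = probe.create();
gs.info("ECC ID: " + eccId);

// Run this part separately after about 15 seconds,
// replacing ECC_ID_HERE with the ID logged above:
var ecc = new GlideRecord("ecc_queue");
ecc.addQuery("response_to", "ECC_ID_HERE");
ecc.addQuery("queue", "input");
ecc.query();
if (ecc.next()) {
    var xmlDoc = new XMLDocument2();
    xmlDoc.parseXML("" + ecc.payload);
    gs.info(xmlDoc.getNodeText("//results/result/output"));
} else {
    gs.info("No response yet");
}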

Expected output: [OK] PDFBox loaded

 

 

Step 2: Understanding ServiceNow Attachment Storage

Before you can send a PDF to the MID Server, you need to understand how ServiceNow stores attachment data. This is where most implementations fail.

How Attachments Are Stored

Attachment binary data lives in sys_attachment_doc, not sys_attachment. Before the data reaches that table, it is:

  1. Gzip compressed as a whole file
  2. Split into chunks (multiple rows per attachment, ordered by position)
  3. Base64 encoded one chunk at a time

🔍 Discovery 1: Each chunk in sys_attachment_doc is independently base64 encoded. You cannot concatenate the base64 strings and decode them as one — you must decode each chunk separately, then concatenate the raw bytes.

🔍 Discovery 2: ServiceNow gzip-compresses attachment data before base64 encoding. The telltale sign is a first chunk whose base64 string starts with “H4sI”, which encodes the gzip magic bytes 0x1F 0x8B followed by the deflate method byte 0x08. You must decompress after decoding.

The Correct Reading Sequence

  1. Query sys_attachment_doc where sys_attachment = attachment sys_id, ordered by position
  2. Collect base64 strings from each chunk
  3. Decode each chunk independently to raw bytes
  4. Concatenate all raw byte arrays
  5. Check first two bytes for gzip signature (0x1F, 0x8B)
  6. If gzipped, decompress using GZIPInputStream
  7. Result: the original PDF binary
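The reading sequence can be tried outside ServiceNow. The following is a minimal sketch in plain Node.js, not platform code: the function names toStoredChunks and fromStoredChunks and the 10-byte chunk size are hypothetical, and the "storage" side only simulates the format described above (gzip the whole file, split, base64 each piece) so that the read side has something to reverse.

```javascript
// Node.js simulation of the storage format described above (not ServiceNow code).
var zlib = require("zlib");

// Simulate storage: gzip the whole file, split the compressed bytes,
// then base64 encode each piece on its own.
function toStoredChunks(fileBytes, chunkSize) {
    var gz = zlib.gzipSync(fileBytes);
    var chunks = [];
    for (var i = 0; i < gz.length; i += chunkSize) {
        chunks.push(gz.subarray(i, i + chunkSize).toString("base64"));
    }
    return chunks;
}

// Reverse it: decode each chunk independently, join the raw bytes,
// then gunzip only if the two-byte gzip signature is present.
function fromStoredChunks(chunks) {
    var parts = chunks.map(function (c) { return Buffer.from(c, "base64"); });
    var all = Buffer.concat(parts);
    var isGzip = all.length >= 2 && all[0] === 0x1f && all[1] === 0x8b;
    return isGzip ? zlib.gunzipSync(all) : all;
}

var original = Buffer.from("%PDF-1.4 sample payload for the round trip");
var stored = toStoredChunks(original, 10);
var restored = fromStoredChunks(stored);

console.log(stored[0].slice(0, 4));      // gzip marker at the start of the first chunk
console.log(restored.equals(original));  // true when the round trip is lossless
```

A chunk size that is not a multiple of 3 is used on purpose: each encoded chunk then ends in base64 padding, which is why the strings cannot simply be joined and decoded as one.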

Why Not Use GlideSysAttachment.getBytes()?

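You might expect to use GlideSysAttachment.getBytes() or getContentStream(). getBytes() is a global-scope method that scoped applications cannot call, and this article treats the scoped content methods as restricted too. If getContentStream() or getContentBase64() does work in your scope on your release, it may be the simpler option. The cross-scope approach used here is to read sys_attachment_doc directly from a global Script Include.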

 

 

Step 3: Create the Global Bridge Script Include

Navigate to System Definition > Script Includes (in Global scope):

Name: GlobalAttachmentHelper

Accessible from: All application scopes

Script

var GlobalAttachmentHelper = Class.create();
GlobalAttachmentHelper.prototype = {

    initialize: function() {},

    /*
     * Read attachment data as JSON array of base64 chunks
     * This runs in global scope where sys_attachment_doc is accessible
     */
    getAttachmentChunksJson: function(attachmentSysId) {
        var doc = new GlideRecord("sys_attachment_doc");
        doc.addQuery("sys_attachment", attachmentSysId);
        doc.orderBy("position");
        doc.query();
        var chunks = [];
        while (doc.next()) {
            chunks.push("" + doc.data);
        }
        return JSON.stringify(chunks);
    },

    /*
     * Submit a JavascriptProbe to the MID Server
     * JavascriptProbe is not available to scoped apps, so the call lives here
     */
    submitMIDProbe: function(midServerName, probeName, script, paramsJson) {
        var probe = new JavascriptProbe(midServerName);
        probe.setName(probeName);
        probe.setJavascript(script);
        if (paramsJson) {
            var params = JSON.parse(paramsJson);
            for (var key in params) {
                // Skip inherited properties
                if (params.hasOwnProperty(key)) {
                    probe.addParameter(key, params[key]);
                }
            }
        }
        return probe.create();
    },

    /*
     * Retrieve probe result from ECC Queue
     * Returns null while no response has arrived
     */
    getProbeResult: function(eccOutputSysId) {
        var ecc = new GlideRecord("ecc_queue");
        ecc.addQuery("response_to", eccOutputSysId);
        ecc.addQuery("queue", "input");
        ecc.query();
        if (ecc.next()) {
            var xmlDoc = new XMLDocument2();
            xmlDoc.parseXML("" + ecc.payload);
            return xmlDoc.getNodeText("//results/result/output");
        }
        return null;
    },

    type: "GlobalAttachmentHelper"
};

Why Three Methods?

| Method | Why It Must Be Global |
|---|---|
| getAttachmentChunksJson | sys_attachment_doc has a cross-scope read restriction. Scoped apps get access denied. |
| submitMIDProbe | JavascriptProbe is a global-scope API that scoped apps cannot instantiate. |
| getProbeResult | Convenience method. It could run from the scoped app, but keeping it here centralizes all ECC Queue logic. |

 

 

Step 4: Create the MID Server Text Extraction Script

Navigate to MID Server > Script Includes > New:

Name: PDFTextExtractor

Active: Checked

Script

var PDFTextExtractor = Class.create();
PDFTextExtractor.prototype = {

    initialize: function() {
        this.PDDocument = Packages.org.apache.pdfbox.pdmodel.PDDocument;
    },

    /*
     * Extract text from PDF using base64-encoded chunks
     * Handles: independent chunk decoding, gzip decompression, per-page extraction
     */
    extractText: function(chunksJson) {
        var response = { status: "success", pages: [], fullText: "", totalPages: 0 };
        var document = null;
        try {
            // Step 1: Decode each chunk independently and concatenate bytes.
            // The probe parameter may arrive as a Java String, so coerce it first.
            var chunks = JSON.parse("" + chunksJson);
            var decoder = Packages.java.util.Base64.getDecoder();
            var baos = new Packages.java.io.ByteArrayOutputStream();
            for (var i = 0; i < chunks.length; i++) {
                var bytes = decoder.decode(chunks[i]);
                baos.write(bytes, 0, bytes.length);
            }
            var allBytes = baos.toByteArray();

            // Step 2: Check for the two-byte gzip signature and decompress
            var isGzip = (allBytes.length >= 2
                && (allBytes[0] & 0xFF) == 0x1F
                && (allBytes[1] & 0xFF) == 0x8B);
            var pdfBytes;
            if (isGzip) {
                var gzis = new Packages.java.util.zip.GZIPInputStream(
                    new Packages.java.io.ByteArrayInputStream(allBytes));
                var out = new Packages.java.io.ByteArrayOutputStream();
                var buf = Packages.java.lang.reflect.Array.newInstance(
                    Packages.java.lang.Byte.TYPE, 4096);
                var n;
                while ((n = gzis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                gzis.close();
                pdfBytes = out.toByteArray();
            } else {
                pdfBytes = allBytes;
            }

            // Step 3: Load PDF and extract text page by page (pages are 1-based)
            document = this.PDDocument.load(
                new Packages.java.io.ByteArrayInputStream(pdfBytes));
            var stripper = new Packages.org.apache.pdfbox.text.PDFTextStripper();
            var pageCount = document.getNumberOfPages();
            response.totalPages = pageCount;

            var allText = [];
            for (var p = 1; p <= pageCount; p++) {
                stripper.setStartPage(p);
                stripper.setEndPage(p);
                var pageText = "" + stripper.getText(document);
                response.pages.push({ page: p, text: pageText });
                allText.push(pageText);
            }
            response.fullText = allText.join("\n");

        } catch (e) {
            response.status = "error";
            // Fall back to the exception itself when it carries no message
            response.error = "" + (e.message || e);
        } finally {
            // Close quietly so a failure here does not mask the result
            if (document != null) {
                try { document.close(); } catch (ignore) {}
            }
        }
        return JSON.stringify(response);
    },

    type: "PDFTextExtractor"
};

How It Works

| # | Operation | Detail |
|---|---|---|
| 1 | Parse chunks JSON | Receives the array of base64 strings from the probe parameter |
| 2 | Decode independently | Each chunk decoded separately via java.util.Base64.getDecoder(), bytes concatenated |
| 3 | Detect gzip | Checks first two bytes for 0x1F 0x8B magic number |
| 4 | Decompress | GZIPInputStream reads compressed bytes, outputs original PDF binary |
| 5 | Load PDF | PDDocument.load() from ByteArrayInputStream (no temp file needed for text-only) |
| 6 | Extract text | PDFTextStripper processes each page independently, capturing text content |
| 7 | Return JSON | Structured response with status, per-page text, full concatenated text, total pages |
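A consumer of this response can branch on the status field before touching the page data. The sketch below is plain JavaScript that runs outside ServiceNow. The function name summarizeExtraction and the sample payloads are hypothetical; the field names (status, pages, totalPages, error) are the ones set by extractText in the script above.

```javascript
// Hypothetical helper: turn the extractor's JSON output into a short summary.
function summarizeExtraction(output) {
    var result;
    try {
        result = JSON.parse(output);
    } catch (e) {
        return { ok: false, reason: "unparseable output: " + e.message };
    }
    if (result.status !== "success") {
        return { ok: false, reason: result.error || "unknown error" };
    }
    // Collect page numbers whose text is empty or whitespace only;
    // such pages are often scanned images with no embedded text.
    var emptyPages = result.pages
        .filter(function (p) { return !p.text || !p.text.trim(); })
        .map(function (p) { return p.page; });
    return { ok: true, totalPages: result.totalPages, emptyPages: emptyPages };
}

// Made-up payloads in the shape produced by extractText
var sample = JSON.stringify({
    status: "success",
    totalPages: 2,
    pages: [
        { page: 1, text: "PO: PO-6003 - Mayo Surgical\n" },
        { page: 2, text: "  \n" }
    ],
    fullText: "PO: PO-6003 - Mayo Surgical\n\n  \n"
});
console.log(summarizeExtraction(sample));
console.log(summarizeExtraction('{"status":"error","error":"bad header"}'));
```

Reporting empty pages separately makes it easier to tell a failed extraction from a document that simply contains scanned pages.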

 

 

Step 5: Create the Scoped Application Components

Wrapper Table

Create a table in your scoped app to store attachment data pre-read from sys_attachment_doc. This eliminates repeated cross-scope access:

| Label | Column Name | Type | Purpose |
|---|---|---|---|
| Email Sys ID | email_sys_id | String (32) | Source email |
| File Name | file_name | String (255) | Original filename |
| File Type | file_type | String (20) | pdf / image |
| Status | status | String (20) | Processing state |
| ECC Queue ID | ecc_queue_id | String (32) | Probe tracking |
| Chunks JSON | chunks_json | String (5M) | Base64 data |
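Because chunks_json has a size cap, it can help to estimate how long the JSON string will be before inserting a row. The sketch below is plain JavaScript runnable in Node.js, not ServiceNow code. The function names are hypothetical, and the arithmetic relies only on the fact that base64 turns every 3 bytes into 4 characters (padding included), applied per chunk because each chunk is encoded on its own.

```javascript
// Length of the base64 text for a chunk of n raw bytes (padding included).
function base64Length(n) {
    return 4 * Math.ceil(n / 3);
}

// Length of JSON.stringify([...base64 chunks...]) given each chunk's raw byte size.
function chunksJsonLength(chunkByteSizes) {
    if (chunkByteSizes.length === 0) return 2;          // "[]"
    var total = 2;                                       // opening and closing brackets
    for (var i = 0; i < chunkByteSizes.length; i++) {
        total += base64Length(chunkByteSizes[i]) + 2;    // chunk text plus its two quotes
    }
    return total + (chunkByteSizes.length - 1);          // commas between chunks
}

// Cross-check the arithmetic against real encoding for three small chunks.
var sizes = [10, 10, 5];
var actual = JSON.stringify(sizes.map(function (n) {
    return Buffer.alloc(n, 1).toString("base64");
}));
console.log(chunksJsonLength(sizes), actual.length);
```

The byte sizes here are those of the stored, gzip-compressed chunks, not of the original PDF.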

Scoped Script Include: DocumentProcessor

var DocumentProcessor = Class.create();
DocumentProcessor.prototype = {

    initialize: function() {
        this.MID_SERVER = gs.getProperty("your_app.mid_server_name", "MIDServer");
    },

    submitTextExtraction: function(wrapperSysId) {
        var gr = new GlideRecord("your_app_wrapper_table");
        if (!gr.get(wrapperSysId)) return null;

        var chunksJson = "" + gr.chunks_json;
        var helper = new global.GlobalAttachmentHelper();

        var script = 'var ext = new PDFTextExtractor();' +
            'var result = ext.extractText(probe.getParameter("chunks"));' +
            'result;';

        var params = JSON.stringify({ chunks: chunksJson });
        var eccId = helper.submitMIDProbe(
            this.MID_SERVER, "PDFTextExtractor", script, params);

        gr.ecc_queue_id = eccId;
        gr.status = "processing";
        gr.update();
        return eccId;
    },

    getResult: function(eccOutputSysId) {
        var helper = new global.GlobalAttachmentHelper();
        var output = helper.getProbeResult(eccOutputSysId);
        if (output) {
            try { return JSON.parse(output); }
            catch (e) { return { status: "error", error: "" + e.message }; }
        }
        return null;
    },

    type: "DocumentProcessor"
};

⚠ Note the global. prefix when calling GlobalAttachmentHelper. Without it, scoped apps cannot find global Script Includes.

 

 

Step 6: Parse Extracted Text into Records

Raw extracted text is unstructured. You need a parser tailored to your document format. Here is an example for pipe-delimited purchase order documents:

Sample Extracted Text

PO: PO-6003 - Mayo Surgical
Mike Ross ID-9667
Surgical Gloves - Type C3 | 1000373 | 0373-298G | 44
Surgical Forceps - Type C1 | 1000030 | 0030-585W | 43
Sarah Chen ID-2379
Sterile Syringe - Type D4 | 1000441 | 0441-826M | 45

Parser Script

parseExtractedText: function(fullText) {
    var records = [];
    var lines = fullText.split(/\r?\n/);
    var currentPerson = "", currentId = "";
    var poNumber = "", orgName = "";

    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].trim();
        if (!line) continue;

        // Parse PO header: "PO: PO-6003 - Mayo Surgical"
        var poMatch = line.match(/^PO:\s*([\w-]+)\s*-\s*(.+)/);
        if (poMatch) {
            poNumber = poMatch[1].trim();
            orgName = poMatch[2].trim();
            continue;
        }

        // Parse person: "Mike Ross ID-9667"
        var personMatch = line.match(/^(.+?)\s+ID-(\S+)/);
        if (personMatch) {
            currentPerson = personMatch[1].trim();
            currentId = "ID-" + personMatch[2].trim();
            continue;
        }

        // Parse line item: "Name | UPN | Batch | Qty"
        var parts = line.split("|");
        if (parts.length >= 3) {
            records.push({
                po_number: poNumber,
                org_name: orgName,
                person_name: currentPerson,
                person_id: currentId,
                item_name: (parts[0] || "").trim(),
                upn: (parts[1] || "").trim(),
                batch: (parts[2] || "").trim(),
                quantity: (parts[3] || "").trim()
            });
        }
    }
    return records;
},

This parser handles the specific format shown. Adapt the regex patterns and field mappings to match your document format.
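To see the output shape, the sketch below reruns the same parsing logic against the sample text as a standalone function in plain JavaScript, runnable outside ServiceNow. The function name parsePurchaseOrderText is hypothetical; the patterns and field names are copied from the parser above.

```javascript
// Standalone copy of the parsing logic for experimentation outside ServiceNow.
function parsePurchaseOrderText(fullText) {
    var records = [];
    var po = "", org = "", person = "", personId = "";
    fullText.split(/\r?\n/).forEach(function (raw) {
        var line = raw.trim();
        if (!line) return;
        // PO header sets the context for every following item
        var poMatch = line.match(/^PO:\s*([\w-]+)\s*-\s*(.+)/);
        if (poMatch) { po = poMatch[1].trim(); org = poMatch[2].trim(); return; }
        // Person line sets the context until the next person line
        var personMatch = line.match(/^(.+?)\s+ID-(\S+)/);
        if (personMatch) {
            person = personMatch[1].trim();
            personId = "ID-" + personMatch[2].trim();
            return;
        }
        // Pipe-delimited item line inherits the current PO and person
        var parts = line.split("|");
        if (parts.length >= 3) {
            records.push({
                po_number: po, org_name: org,
                person_name: person, person_id: personId,
                item_name: (parts[0] || "").trim(),
                upn: (parts[1] || "").trim(),
                batch: (parts[2] || "").trim(),
                quantity: (parts[3] || "").trim()
            });
        }
    });
    return records;
}

var sampleText = [
    "PO: PO-6003 - Mayo Surgical",
    "Mike Ross ID-9667",
    "Surgical Gloves - Type C3 | 1000373 | 0373-298G | 44",
    "Surgical Forceps - Type C1 | 1000030 | 0030-585W | 43",
    "Sarah Chen ID-2379",
    "Sterile Syringe - Type D4 | 1000441 | 0441-826M | 45"
].join("\n");

var records = parsePurchaseOrderText(sampleText);
console.log(records.length);
console.log(JSON.stringify(records[2], null, 2));
```

Each item record carries the PO and person that preceded it in the text, so the order of lines in the PDF matters: an item that appears before any person line gets empty person fields.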

 

 

Step 7: Automate with Flow Designer

Subflow Design

| # | Step | Detail |
|---|---|---|
| 1 | Submit Extraction | Custom action: reads wrapper table, sends probe to MID Server |
| 2 | Wait | 60 seconds for MID Server processing |
| 3 | Poll for Result | Do-the-following-until loop: check ECC Queue for response, wait 15s if pending |
| 4 | Process Result | Parse text, create structured records, update wrapper status |
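The poll step needs an exit condition for the case where the MID Server never answers. As a sketch of that logic in plain JavaScript (the names pollForResult, checkFn and maxAttempts are hypothetical, and the actual waiting would be done by the subflow's wait step, not by this code):

```javascript
// checkFn returns the probe output string, or null while the response is pending.
function pollForResult(checkFn, maxAttempts) {
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        var output = checkFn();
        if (output !== null) {
            return { status: "done", attempts: attempt, output: output };
        }
        // In the subflow, the 15-second wait would happen here.
    }
    // Give up so the wrapper record can be marked as failed instead of looping forever
    return { status: "timeout", attempts: maxAttempts, output: null };
}

// Simulated ECC Queue lookup that answers on the third check.
var calls = 0;
function fakeCheck() {
    calls++;
    return calls >= 3 ? '{"status":"success"}' : null;
}

console.log(pollForResult(fakeCheck, 8));
console.log(pollForResult(function () { return null; }, 4));
```

In this article's design, checkFn corresponds to getProbeResult, which returns null until a response row exists in the ECC Queue.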

Trigger: Business Rule on sys_email

Why not use the Inbound Email trigger in Flow Designer?

🔍 Discovery: Other email processing flows (like AI Agent Email Analyzer) can issue stop-processing directives that block Flow Designer Inbound Email triggers. A Business Rule on sys_email runs independently of the email action pipeline.

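(function executeRule(current, previous) {
    if (gs.getProperty("your_app.master_switch") !== "active") return;
    if (current.type != "received") return;
    if (current.subject.toString().toLowerCase().indexOf("trigger_word") < 0) return;

    // populateEmailAttachments is not listed in Step 3. Per stage 2 of the pipeline,
    // it reads each attachment's chunks and stores them in the scoped wrapper table.
    // The global. prefix is required if this rule lives in the scoped app.
    var helper = new global.GlobalAttachmentHelper();
    helper.populateEmailAttachments("" + current.sys_id);

    // Run in the background: the subflow waits and polls, so a foreground
    // run would hold this rule's transaction until the subflow finishes.
    sn_fd.FlowAPI.getRunner()
        .subflow("your_scope.your_subflow_name")
        .inBackground()
        .withInputs({ "email_sys_id": "" + current.sys_id })
        .run();
})(current, previous);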

 

 

Step 8: Test the Complete Pipeline

8A: Quick Validation (Background Script)

// 1. Find a PDF attachment
var att = new GlideRecord("sys_attachment");
att.addQuery("content_type", "application/pdf");
att.orderByDesc("sys_created_on");
att.setLimit(1);
att.query();
if (att.next()) gs.info("Attachment: " + att.sys_id + " | " + att.file_name);

// 2. Read chunks
var helper = new global.GlobalAttachmentHelper();
var chunksJson = helper.getAttachmentChunksJson("ATTACHMENT_SYSID");
var chunks = JSON.parse(chunksJson);
gs.info("Chunks: " + chunks.length);

// 3. Send to MID Server
var script = 'var ext = new PDFTextExtractor();' +
    'var result = ext.extractText(probe.getParameter("chunks"));' +
    'result;';
var eccId = helper.submitMIDProbe("YOUR_MID", "TextTest",
    script, JSON.stringify({ chunks: chunksJson }));
gs.info("ECC ID: " + eccId);

// 4. Check result (run after 30 seconds)
var output = helper.getProbeResult("ECC_ID_HERE");
gs.info("Result: " + output);

8B: Full Email Test

  1. Send an email with subject containing your trigger keyword
  2. Attach a PDF document
  3. Wait 60-90 seconds
  4. Check Flow Designer > Executions
  5. Check your target table for created records

 

 

Troubleshooting

| Problem | Solution |
|---|---|
| ClassNotFoundException for PDFBox | JAR not in extlib, or not registered in ecc_agent_jar. Register and restart MID. |
| JAR disappears after MID restart | FileSync deleting unregistered JARs. Must create ecc_agent_jar record with file attached. |
| IllegalArgumentException on Base64 decode | Chunks have been concatenated as strings before decoding. Decode each chunk independently. |
| ZipException: not in GZIP format | Chunks decoded independently but bytes not concatenated before decompression. Ensure bytes are joined first. |
| GlobalAttachmentHelper undefined | Missing global. prefix. Use: new global.GlobalAttachmentHelper() |
| JavascriptProbe access denied | JavascriptProbe is not available to scoped apps. Call it from a global scope Script Include. |
| sys_attachment_doc access denied | Cross-scope restriction. Read from global Script Include, store in scoped wrapper table. |
| ECC Queue no response | MID Server may be processing. Increase wait time. Check MID Server logs for errors. |
| Empty text extracted | PDF may be a scanned image (not text-based). PDFTextStripper reads embedded text only and does not perform OCR, so image-only PDFs need a separate OCR or barcode-scanning step. |
| Flow not triggering on email | Another email action issues stop-processing. Use Business Rule on sys_email instead. |

 

 

Key Takeaways

  1. Platform-native PDF processing is possible: Apache PDFBox on the MID Server removes the need for an external extraction service when the PDFs are text-based.
  2. Attachment storage has hidden complexity: Gzip compression and independent chunk encoding are undocumented behaviors that you must handle explicitly.
  3. Cross-scope bridging is essential: A thin global Script Include wrapping sys_attachment_doc reads and JavascriptProbe calls keeps the scoped app clean while bypassing platform restrictions.
  4. FileSync will delete your JARs: The ecc_agent_jar table is the only way to persist custom JARs across MID Server restarts.
  5. Business Rules beat email triggers: For reliable inbound email processing, use a Business Rule on sys_email rather than Flow Designer Inbound Email triggers.