Extract Data from PDF stored in sys_attachment table.

Community Alums · ‎04-18-2023

Is there a way to extract Data from PDF stored in sys_attachment table ?

Ankur Bawiskar · ‎04-18-2023

@Community Alums

Within ServiceNow platform there is nothing which allows you to read data from PDF file.

Regards,
Ankur
✨ Certified Technical Architect || ✨ 9x ServiceNow MVP || ✨ ServiceNow Community Leader

sahilnegi · ‎03-25-2025

its 2025 is there any way or tool to extract the data of pdf, or we can use any api to extract the data??

sahilnegi · ‎03-26-2025

Yes, it is possible to extract data from a PDF stored in the sys_attachment table. You can use the Google Gemini API for this purpose. Below are the steps to achieve this:

Get the Google Gemini API key (it's free).
Store the API key in the sys_properties table in your instance.
Run the following script in the background, replacing the sys_id with the ID of your PDF file:

var att = new GlideRecord('sys_attachment');
att.get('357118d18364ee10aaa4c8d6feaad3c6');

var rawBytes = new GlideSysAttachment().getBytes(att);
if (rawBytes) {
    var generateRequest = new sn_ws.RESTMessageV2();
    generateRequest.setEndpoint('https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=' + gs.getProperty('gemini_api_key'));
    generateRequest.setHttpMethod('POST');
    generateRequest.setRequestHeader('Content-Type', 'application/json');
    generateRequest.setRequestBody(JSON.stringify({
        "contents": [{
            "parts": [
                {
                    "inlineData": {
                        "mimeType": "application/pdf",
                        "data": new Packages.java.lang.String(Packages.org.apache.commons.codec.binary.Base64.encodeBase64(rawBytes)).toString()
                    }
                },
                {
                    "text": "Extract all text from this PDF and return it as plain text."
                }
            ]
        }]
    }));

    var generateResponse = generateRequest.execute();
    if (generateResponse.getStatusCode() == 200) {
        gs.info('Extracted Text: ' + JSON.parse(generateResponse.getBody()).candidates[0].content.parts[0].text);
    } else {
        gs.error('Failed: ' + generateResponse.getBody());
    }
} else {
    gs.error('No content retrieved');
}

This script will help you extract text from the PDF.

If you find this solution helpful, please accept it and mark it as useful.

Thanks,
Sahil Negi

AyanshN · ‎08-07-2025

Im not sure why this is return me a error

*** Script: Failed: {
  "error": {
    "code": 400,
    "message": "The document has no pages.",
    "status": "INVALID_ARGUMENT"
  }
}