Finding and deleting orphaned attachments

Janel

Looks like I can't create articles anymore so I guess I'll post it as a question.

 

2025-01-29 Edit: As of Washington, there is now an out-of-box way to delete orphaned attachments...

...which ServiceNow is keeping really quiet about.  Maybe this article will resurface one day.

https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB1706592

In the meantime, Mark Roethof was able to cover some of the out-of-box job here: https://www.servicenow.com/community/now-platform-articles/purge-orphan-attachments-custom-solution-...

 

Recently, we got to go through a lovely exercise of drastically reducing our storage footprint.  I'd like to share how we were able to identify orphaned records on sys_attachment and delete them.

 

The biggest issue with orphans on this (or any) table is that it can happen at any time to any attachment record.  A record from 2018 could be orphaned in 2024 and there is no way to know about it via the sys_attachment record.  Nothing on the record changes: it still has a table_name and table_sys_id populated, but that table_sys_id points to a deleted record.  Not a chance you could find those in a list view.
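
To make that concrete, here is a minimal sketch (run from a background script) of what checking a single attachment looks like; the sys_id is just a placeholder to swap for one of your own attachment records:

//Hypothetical single-record check: does this attachment still point at a real record?
var att = new GlideRecord('sys_attachment');
if (att.get('replace_with_an_attachment_sys_id')) { //Placeholder sys_id - substitute your own.
    var parentTable = att.getValue('table_name');
    if (gs.tableExists(parentTable) && new GlideRecord(parentTable).get(att.getValue('table_sys_id'))) {
        gs.print('Parent record still exists - this attachment is fine.');
    } else {
        gs.print('No parent record found - this attachment is orphaned.');
    }
}

The scripts below are just this same check run over every record.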

 

So how do you know what is legit and what should be deleted? 

Check every record.

 

First, we wanted to know where the orphans were coming from and fix those processes where we could.  What tables are we commonly finding orphans on? 

Since sys_attachment is so big, I manually ran this script in 6-month intervals, and at the end I had it update a record so I had somewhere to hold on to the data.

Unfortunately, because an attachment could be orphaned at any time, I had to keep repeating this from the oldest records to the newest to get a few days' worth of data.

 

 

 

//Capture how many attachments have a table_name and table_sys_id populated but no valid record behind them.
var start_date = '2023-06-27';
var end_date = '2023-07-06';

var valid_attachments = new GlideRecord('sys_attachment');
valid_attachments.addEncodedQuery("table_nameNOT LIKEZZ_YY^table_nameNOT LIKEinvisible^table_name!=placeholder^sys_created_onBETWEENjavascript:gs.dateGenerate('" + start_date + "','00:00:00')@javascript:gs.dateGenerate('" + end_date + "','23:59:59')");
valid_attachments.orderBy('sys_created_on');
valid_attachments.query();

var skipped = 0;
var deleted = 0;
var orphaned_tables = []; //Capture the table name of every orphan so we can count how many times each table appears
var orphaned_ids = ''; //Only use for testing
while (valid_attachments.next()) {
    if (!gs.tableExists(valid_attachments.table_name)) { //If the table_name is invalid, do nothing
        //orphaned_tables.push('!table-' +valid_attachments.table_name); //Or capture it!
        continue;
    }
    var record_check = new GlideRecord(valid_attachments.table_name);
    if (record_check.get(valid_attachments.table_sys_id) && record_check.isValidRecord()) {
        skipped++;
        continue;
    } else {
        deleted++;
        orphaned_tables.push(valid_attachments.getValue('table_name'));
        //orphaned_ids += valid_attachments.sys_id + ','; //Uncomment for a SMALL amount if you want to look at some specific test records.
        //valid_attachments.deleteRecord();  //We chose to delete the records as we went.
    }
}

orphaned_tables.sort();
//We want to know how many times the same tables appear in our orphaned_table list.
var table = null; //The table we want to count on.
var table_count = 0;
var final_list = '';
for (var i = 0; i < orphaned_tables.length; i++) {
    if (orphaned_tables[i] != table) { //If the current element is not equal to the currently set table we're checking on, that means we moved on to the next table and we capture the final count now.
        if (table_count > 0) {
            final_list += table + ': ' + table_count + '\n';
        }
        table = orphaned_tables[i]; //Reset this to the next table to start counting on.
        table_count = 1; //And go back to one
    } else {
        table_count++; //If the current element is the same table, add to the counter.
    }
}
if (table_count > 0) {
    final_list += table + ': ' + table_count + '\n';
}

var text = '\n' + gs.nowDateTime() + " Valid attachment check: " + start_date + " to " + end_date + "\nValid: " + skipped + '\nDeleted: ' + deleted + '\n\n' + final_list + '-------- -------- -------- -------- \n'; // +orphaned_ids;

var updateStory = new GlideRecord('rm_story'); //Instead of trying to log all this, update the Description in STRY0057314
updateStory.get('544229f7c3a2a9d4caa7b1deb00131db');
updateStory.description += text;
updateStory.update();

 

 

 

 

The above helped us learn where we were seeing orphans come from (mostly archiving and ml_model_artifact), and we were able to start addressing those processes.

 

 

Now that we had figured out and fixed where some of them were coming from, we knew this was just going to keep happening going forward and we needed a way to address it.

 

Next, I made a scheduled job that crawls the sys_attachment table in increments of 200,000 records every hour and verifies whether the record each attachment is linked to is still valid.  If it isn't, the attachment gets deleted.  This is very similar to the script above, just running in automated chunks (now that I think about it, I could have used this same setup below to automate the above with dates instead of manually adjusting them).
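
For reference, chooseWindow() only returns the slice of an ordered query between two row indexes, which is what lets the job walk the table in pieces instead of re-reading the same records.  Roughly, each hourly run looks like this (the window numbers here are illustrative; the real ones come from the properties described below):

//Illustration only: how the window moves across the table from run to run.
//Run 1: chooseWindow(0, 200000)      -> rows 1 to 200,000
//Run 2: chooseWindow(200000, 400000) -> rows 200,001 to 400,000
//Run 3: chooseWindow(400000, 600000) -> and so on, until the window passes the total and resets to the start
var gr = new GlideRecord('sys_attachment');
gr.orderBy('sys_created_on');
gr.chooseWindow(200000, 400000); //In the real job these two numbers come from the system properties.
gr.query();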

 

Adjust the values for what works for you and as always, test in a sub-prod first!

 

You will need 2 system properties to keep track of which chunk of records to look at (yes, this will shift slightly as records are added or deleted through normal means).

Both properties should be integers. 

Property 1 (called "orphaned.attachment.cleanup.window1") should be set to zero

Property 2 (called "orphaned.attachment.cleanup.window2") should be set to the size of the first chunk of records you want to check (I started ours at 200,000).
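
If you would rather create the two properties from a background script than through sys_properties.list, a quick sketch (names and starting values mirror the ones above):

//Sketch: create the two integer properties that track the pagination window.
var props = {
    'orphaned.attachment.cleanup.window1': '0',
    'orphaned.attachment.cleanup.window2': '200000'
};
for (var propName in props) {
    var prop = new GlideRecord('sys_properties');
    prop.initialize();
    prop.name = propName;
    prop.type = 'integer';
    prop.value = props[propName];
    prop.description = 'Tracks the current sys_attachment chunk for the orphaned attachment cleanup job.';
    prop.insert();
}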

 

Now you'll need the scheduled job.

Name: Delete orphaned attachments.

Run: Periodically

Repeat Interval: 1 hour

 

 

 

//We get many orphaned attachments on sys_attachment where table_sys_id is left with a dead sys_id.  Find these records and attempt to look them up.  If there is no valid record found, delete the attachment.
//We know archiving leaves behind attachments and so does ml_model_artifact (there is a PRB for that).

//When an attachment gets orphaned there is no update to the attachment record so we can't do something simple like, look for updates made yesterday and only check those.  An attachment from 2018 can get orphaned in 2024 and we'd never know about it unless we look at all the attachments.  So we do that hourly and in chunks using chooseWindow to do paging.  This is better than setLimit because setLimit will just look at the same first chunk of records over and over.  We need to paginate.

var attach_total = new GlideAggregate('sys_attachment');
attach_total.addEncodedQuery("table_nameNOT LIKEZZ_YY^table_nameNOT LIKEinvisible^table_name!=placeholder");
attach_total.addAggregate('COUNT');
attach_total.query();
var total = 0;
attach_total.next();
total = attach_total.getAggregate('COUNT'); //We need to get the total so we know when to reset our window sizes.


var chunky = 200000; //This is the number of records we will do at a time.
var window1 = Number(gs.getProperty('orphaned.attachment.cleanup.window1')); //This will be where we left off from the last run
var window2 = Number(gs.getProperty('orphaned.attachment.cleanup.window2')); //This will be the end of the current window, set by the last run.


var valid_attachments = new GlideRecord('sys_attachment');
valid_attachments.addEncodedQuery("table_nameNOT LIKEZZ_YY^table_nameNOT LIKEinvisible^table_name!=placeholder"); //ZZ_YY tables are fields that contain images (like catalog item list images), invisible is for OOB things, and placeholder has a ton of records that we have a table cleaner for already.
valid_attachments.chooseWindow(window1, window2);
valid_attachments.orderBy('sys_created_on');
valid_attachments.query();

while (valid_attachments.next()) {
    if (!gs.tableExists(valid_attachments.table_name)) { //If the table doesn't exist, skip it.  How you handle these is up to you.
        continue;
    }
    var record_check = new GlideRecord(valid_attachments.table_name);
    if (record_check.get(valid_attachments.table_sys_id) && record_check.isValidRecord()) { //If the record is valid...
        continue; //Do nothing and continue to the next one.
    } else {
        valid_attachments.deleteRecord();  //If the record is not valid, then delete it.
    }
} //End of while

//When window2 exceeds the total number of records, we reset the system properties.

//We get the properties via GlideRecord so we can use autoSysFields(false) and not flood the versions list with a zillion things since it runs hourly.
//If you don't care that the versions list will be huge, you can instead delete this and just do gs.setProperty('property.name', value);
var prop1 = new GlideRecord('sys_properties');
prop1.get('name', 'orphaned.attachment.cleanup.window1');
var prop2 = new GlideRecord('sys_properties');
prop2.get('name', 'orphaned.attachment.cleanup.window2');

if (total - window2 < 0) { //Once our total number of records minus our largest pagination number is less than zero, we restart everything.
    prop1.value = 0;
    prop2.value = chunky;
} else { //If we're not restarting, then set our properties to the updated windows.
    prop1.value = window2.toFixed(0);
    prop2.value = (window2 + chunky).toFixed(0);
}
//If you don't care about the versions list, you can just use setProperty above instead.
prop1.autoSysFields(false); //Don't track as an update
prop1.update();
prop2.autoSysFields(false); //Don't track as an update
prop2.update();

 

 

 

 

And this has been working pretty well for us.  200,000 records is quick to get through and we haven't seen any sort of performance impact.

 

I hope this helps you identify and manage orphaned attachments!

15 REPLIES

Something more like this.

 

//Looking on sys_attachment happens above here.

var checkOnThese = {};
while (valid_attachments.next()) {
    if (!gs.tableExists(valid_attachments.table_name)) {
        continue;
    }
    //gs.print("table: " +valid_attachments.table_name);
    if(!checkOnThese[valid_attachments.table_name]){
        checkOnThese[valid_attachments.table_name] = {
            'table': valid_attachments.table_name+'',
            'ids': []
        };
    }
    checkOnThese[valid_attachments.table_name].ids.push(valid_attachments.table_sys_id+'');
} //End of while

var arrayUtil = new ArrayUtil();
for(var k in checkOnThese){
    checkOnThese[k].ids = arrayUtil.unique(checkOnThese[k].ids); //A record could have multiple attachments so we sort these down to only unique values.
    gs.print("Before: " +k + ": " +checkOnThese[k].ids.length);
    var findValids = new GlideRecord(checkOnThese[k].table);
    findValids.addQuery('sys_id', "IN", checkOnThese[k].ids);
    findValids.query();
    while(findValids.next()){
        var index = checkOnThese[k].ids.toString().split(',').indexOf(findValids.sys_id+'');
        checkOnThese[k].ids.splice(index, 1);
    }

    if (checkOnThese[k].ids.length > 0) {
        var deleteBaddies = new GlideRecord('sys_attachment'); //The leftover ids only point at deleted records, so remove their attachments from sys_attachment.
        deleteBaddies.addQuery('table_name', checkOnThese[k].table); //Scope the delete to the table we just checked.
        deleteBaddies.addQuery('table_sys_id', 'IN', checkOnThese[k].ids);
        //deleteBaddies.deleteMultiple();
    }
}

 

A single record could have multiple attachments on it, resulting in it being pushed into the array twice.  I tried using ArrayUtil to sort that out before pushing, but .contains and indexOf were always returning false/-1.  So instead I deduplicated the array with .unique before looking anything up.
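
For what it's worth, the false/-1 results are likely because un-coerced GlideElement values (rather than plain strings) end up being compared, and indexOf never matches an object against a string.  A small alternative sketch (not what the script above does) that uses an object as a set, so duplicates never make it in at all:

//Alternative sketch: dedupe table_sys_ids by using an object as a set instead of unique().
var idSet = {};
var attachments = new GlideRecord('sys_attachment');
attachments.setLimit(1000); //Small sample, just for illustration.
attachments.query();
while (attachments.next()) {
    idSet[attachments.getValue('table_sys_id')] = true; //getValue() returns a plain string, and duplicate keys just overwrite.
}
var uniqueIds = Object.keys(idSet); //One entry per referenced record, no ArrayUtil needed.
gs.print(uniqueIds.length + ' unique parent records referenced.');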

 

I tried both in a background script and compared in production (larger sample size), without deleting any records. 

  • Against 200,000 records,
    • The multiple queries way was taking over 17 minutes (I was cancelling the transactions at 15 minutes because it was taking too long to run).
    • The new GlideRecord method was taking about 12 minutes (as a scheduled job it only takes anywhere from 2 to 8 minutes).
  • Against 100,000 records,
    • The multiple queries way was taking around 13 minutes.
    • The new GlideRecord method was taking about 6 minutes.
  • Against 75,000 records,
    • The multiple queries way was taking around 7 minutes 30 seconds.
    • The new GlideRecord method was taking about 4 minutes 30 seconds.
  • Against 50,000 records,
    • The multiple queries way was taking around 3 minutes.
    • The new GlideRecord method was taking about 3 minutes.
  • Against 25,000 records,
    • The multiple queries way was taking around 48 seconds.
    • The new GlideRecord method was taking about 1 minute 30 seconds.

 

Why 200k records?  For us, this allows us to get through our entire attachment table in about a week when we run it hourly (record total / 168 hours).
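
If you want to size the chunk for your own instance, the same arithmetic is easy to script; a quick sketch (168 is just the number of hourly runs in a week, so swap in whatever full-pass interval you want):

//Sketch: derive a chunk size that walks the whole attachment table in roughly a week of hourly runs.
var agg = new GlideAggregate('sys_attachment');
agg.addAggregate('COUNT');
agg.query();
agg.next();
var total = Number(agg.getAggregate('COUNT'));
var hoursPerPass = 24 * 7; //168 hourly runs per week
gs.print('Suggested chunk size: ' + Math.ceil(total / hoursPerPass));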

 

Based on the above, you can see where the differences between the methods happen and where the multiple queries way starts to take a performance hit.  I believe the arrays and query sizes are just too big after 50k records.

 

To sum it up, if you plan on looking at a small batch of records (less than 50k), doing the multiple queries way is the way to go.  If you plan on doing larger batches, the new GlideRecord method (original post) is the way to go.

 

@Edvin-Karalius Unless you have something to add to the updated script from yesterday, I'll update the original post to include the 2 methods.

Wow! Super cool tests! @Janel 
I find it weird though that a lower count of records was faster for multiple queries.
In all the runs I've done, even at like < 1000 records, a single query is always faster compared to a loop of queries.

Of course, you are doing a lot of other stuff in the script in general, so it might be something else holding the speed down.

Did you also try running the scripts multiple times to get a precise average time?
I've noticed each run can vary in time, depending on the load on the instance in general.  I wonder if that test with 25k records was maybe delayed by something else running on the instance.

Yeah, the tests were run multiple times during different periods (some at busier server times, some at calmer ones) and the results above were the averages.  The calmer time of day only shaved off about 30 seconds on both methods.

 

It sounds like a memory thing for the larger data sets with the multiple queries.  I have to do a bunch of queries because I don't know up front which tables I will need to look at, so I cannot define them ahead of time.  Instead, I have to store the table and the records to check in the object's arrays (I used what you posted above).

I believe it has to do with the amount of data being held in memory that leads to the eventual trade-off of performance.

Nah, memory on the server is not the issue; those servers have something like a minimum of 128 GB of RAM.  The 100k record query takes up anywhere between 50 and 500 MB (depending on the record size).  It's not even close.