07-26-2023 10:16 AM
I hope people are bookmarking/copying your object example.
So after you get everything into an array, sorted by table, the issue with the big query against all the sys_ids is that it only returns valid records, when we're looking for invalid ones.
I'd have to either keep track of the valid sys_ids in a new object and compare later, or remove each valid sys_id from the initial array (splice is your friend, but you still need the index) so I'm only left with the invalid ones.
Then after all that, I'd still need to take care of the invalid attachments. That means another query back to sys_attachment against the table_sys_ids, then deleting the records. If you grouped it, deleteMultiple() can run into performance issues (depending on how much there is to do, but it has not been my friend lately), and I've witnessed it leaving orphans on sys_attachment_doc.
I still think you're going to run into memory issues trying to hang on to and manipulate that much data.
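The "remove the valid sys_ids so only invalid ones remain" step described above is just a set difference. Here is a minimal sketch in plain JavaScript (outside ServiceNow; the function name and sample ids are hypothetical, and the `returnedIds` array stands in for the sys_ids the GlideRecord query actually returned):

```javascript
// Compute the invalid sys_ids: every candidate the query did NOT return.
function invalidIds(candidateIds, returnedIds) {
    // Index the returned ids in an object for O(1) lookups.
    var returned = {};
    for (var i = 0; i < returnedIds.length; i++) {
        returned[returnedIds[i]] = true;
    }
    // Keep only the candidates the query never found.
    return candidateIds.filter(function (id) {
        return !returned[id];
    });
}

// Example: three candidates, the query found two of them.
var leftovers = invalidIds(['a1', 'b2', 'c3'], ['a1', 'c3']);
// leftovers is ['b2'] -- the orphaned reference to clean up.
```

Building the lookup object first avoids the repeated `indexOf`/`splice` passes over the array, which matters once the id lists get large.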
07-26-2023 11:10 AM - edited 07-26-2023 11:40 AM
Something more like this:
//Looking on sys_attachment happens above here.
var checkOnThese = {};
while (valid_attachments.next()) {
    if (!gs.tableExists(valid_attachments.table_name)) {
        continue;
    }
    //gs.print("table: " + valid_attachments.table_name);
    if (!checkOnThese[valid_attachments.table_name]) {
        checkOnThese[valid_attachments.table_name] = {
            'table': valid_attachments.table_name + '',
            'ids': []
        };
    }
    checkOnThese[valid_attachments.table_name].ids.push(valid_attachments.table_sys_id + '');
} //End of while
var arrayUtil = new ArrayUtil();
for (var k in checkOnThese) {
    //A record could have multiple attachments, so reduce the ids to unique values.
    checkOnThese[k].ids = arrayUtil.unique(checkOnThese[k].ids);
    gs.print("Before: " + k + ": " + checkOnThese[k].ids.length);
    var findValids = new GlideRecord(checkOnThese[k].table);
    findValids.addQuery('sys_id', 'IN', checkOnThese[k].ids);
    findValids.query();
    while (findValids.next()) {
        //Each record the query returns is valid, so drop its sys_id from the list.
        var index = checkOnThese[k].ids.indexOf(findValids.sys_id + '');
        if (index > -1) {
            checkOnThese[k].ids.splice(index, 1);
        }
    }
    if (checkOnThese[k].ids.length > 0) {
        //Whatever is left points at records that no longer exist; the orphaned
        //attachments live on sys_attachment, keyed by table_sys_id.
        var deleteBaddies = new GlideRecord('sys_attachment');
        deleteBaddies.addQuery('table_sys_id', 'IN', checkOnThese[k].ids);
        //deleteBaddies.deleteMultiple();
    }
}
A single record can have multiple attachments on it, resulting in its sys_id being pushed into the array more than once. I tried using ArrayUtil to filter duplicates before pushing, but .contains and .indexOf kept returning false/-1. So instead I de-duplicate the array with .unique before looking anything up.
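Another option is to de-duplicate while pushing, using a plain object as a set, which skips the .unique pass entirely. A small sketch in plain JavaScript (the helper names are hypothetical; in the real script the pushed value would be `table_sys_id + ''`, since coercing the GlideElement to a primitive string is what makes equality comparisons like indexOf behave):

```javascript
// A collector that tracks which ids it has already seen.
function makeCollector() {
    return { ids: [], seen: {} };
}

// Push an id only if it has not been pushed before.
function pushUnique(collector, id) {
    if (!collector.seen[id]) {
        collector.seen[id] = true;
        collector.ids.push(id);
    }
}

var c = makeCollector();
pushUnique(c, 'rec1');
pushUnique(c, 'rec1'); // duplicate: same record, second attachment
pushUnique(c, 'rec2');
// c.ids is ['rec1', 'rec2']
```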
07-27-2023 08:33 AM
I tried both in a background script and compared in production (larger sample size), without deleting any records.
- Against 200,000 records,
- The multiple queries way was taking over 17 minutes (I was cancelling the transactions at 15 minutes because it was taking too long to run).
- The new GlideRecord method was taking about 12 minutes (as a scheduled job it only takes anywhere from 2 to 8 minutes).
- Against 100,000 records,
- The multiple queries way was taking around 13 minutes.
- The new GlideRecord method was taking about 6 minutes.
- Against 75,000 records,
- The multiple queries way was taking around 7 minutes 30 seconds.
- The new GlideRecord method was taking about 4 minutes 30 seconds.
- Against 50,000 records,
- The multiple queries way was taking around 3 minutes.
- The new GlideRecord method was taking about 3 minutes.
- Against 25,000 records,
- The multiple queries way was taking around 48 seconds.
- The new GlideRecord method was taking about 1 minute 30 seconds.
Why 200k records? For us, this allows us to get through our entire attachment table in about a week when we run it hourly (record total / 168 hours).
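That batch-size arithmetic can be sketched directly (the 33.6M figure below is only what a 200k hourly batch implies as an upper bound, not a number from our instance):

```javascript
// How many records an hourly job covers in a week of runs.
function weeklyCoverage(batchSize, runsPerWeek) {
    return batchSize * runsPerWeek;
}

// The batch size needed to cover a table of a given size in a week.
function batchFor(totalRecords, runsPerWeek) {
    return Math.ceil(totalRecords / runsPerWeek);
}

var coverage = weeklyCoverage(200000, 168); // 33,600,000 records per week
var batch = batchFor(33600000, 168);        // 200,000 records per run
```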
Based on the above, you can see where the differences between the methods show up and where the multiple-queries way starts to take a performance hit. I believe the arrays and query sizes just get too big after 50k records.
To sum it up: if you plan on looking at a small batch of records (less than 50k), the multiple-queries way is the way to go. For larger batches, the new GlideRecord method (original post) is the way to go.
@Edvin-Karalius Unless you have something to add to the updated script from yesterday, I'll update the original post to include the 2 methods.
07-27-2023 08:43 AM
Wow! Super cool tests! @Janel
I find it odd, though, that multiple queries came out faster at the lower record counts.
In all the runs I've done, even at < 1,000 records, a single query has always been faster than a loop of queries.
Of course, you're doing a lot of other work in the script in general, so something else might be holding the speed down.
Did you also run the scripts multiple times to get a reliable average time?
I've noticed run times can vary depending on the overall load on the instance. I wonder if that 25k-record test was delayed by something else running on the instance.
07-28-2023 05:48 AM
Yeah, the tests were run multiple times during different periods (some at busier server times, some at calmer ones), and the results above are the averages. The calmer time of day only shaved about 30 seconds off both methods.
It sounds like a memory thing for the larger data sets with the multiple queries. I have to do a bunch of queries because I don't know up front which tables I'll need to look at, so I can't define them ahead of time. Instead, I store each table and the record sys_ids to check in the object's array (I used what you posted above).
I believe it has to do with the amount of data being held in memory that leads to the eventual trade-off of performance.