
Dan Martinez
Tera Expert

Overview

Inevitably, at some point in the lifetime of your platform you will end up with duplicate records. If there are only a few, remove them manually, previewing what will be affected first to prevent unexpected collateral damage. If we are talking about the CMDB and you have configured the IRE properly for inserts, you should focus on de-duplication tasks. But if there are too many duplicates to remove safely by hand, and they are outside the CMDB scope, you can follow the strategy shown below to correct them in an effective and efficient way.

 

1.    The strategy

Before tackling the situation, we need to plan how to proceed and follow a structured approach that has proven successful. Then we need to define the order in which we will apply the solution and, finally, what that solution will be. We will break down these aspects in the sections below.

 

a)    Planning the solution

The strategy is to divide the solution into two actions: one that prevents future damage and one that corrects the damage caused so far:

 

  1. Preventing future damage: stop the records from being duplicated by patching the source of the duplicates
  2. Correcting the damage caused until now: remove or disable the duplicate records without losing the references created against them

If we only fix the damage caused so far, duplicates will keep being created. If we only patch the cause without correcting the existing data, we will leave garbage in our instance.

 

Once we have located the cause of the duplicates and developed the patch, we can focus on correcting the duplicates. Bear in mind that there might be several sources of duplicates, and all of them need to be fixed before correcting the data; otherwise, new duplicates will keep appearing.

 

b)   Organising the release

As we have just mentioned, always focus on patching the source or sources of the issue first. The bigger the damage caused, the more complex the correction and the longer it will take to revert. Once the source is located, create a patch (e.g. an update set containing the changes) and deploy it to production, after thorough testing, to stop the issue from occurring again.

 

The way we typically correct duplicates is by running a Fix Script, so that if the issue is ever caused again in the future we can easily rerun it. Nonetheless, depending on the situation we may need a different approach; for instance, for duplicate CIs we should aim to use de-duplication tasks. This document describes a generic duplication scenario, so we will focus on Fix Scripts, but there may be a more specific approach for your particular use case.

 

1.png
 

Fix Scripts can be found under “All > System Definition > Fix Scripts” and allow us to keep a library of scripts in the instance that can be reused regularly. Even if you think your duplicate issue is a one-off, be wary: sometimes we believe we have found the source of the duplication and it is either not the real cause or not the only one. Keeping the script as a Fix Script allows us to run it again should the issue happen again. The issue may also reappear, even if patched completely, because someone reverts our changes during an upgrade or simply re-enables the feature that caused it without being aware of the consequences. To prevent this last scenario, focus on governance so that nobody changes your instance without control.

 

Only once the patch is deployed in production should you run your Fix Script to correct the duplicate data. If you do it the other way around, there is a small chance of creating a few extra duplicate records in the interim.

For instance, let’s say you run your Fix Script at 9:00 and then deploy the patch at 9:30. Yes, the script will correct the old data issues, but any duplicate record created between 9:00 and 9:30 will stay in your instance. In the image shown below, you can see the red area (9:00 – 9:30) where duplicates may still be created.

 

2.png

 

However, if you patch first and wait for the patch to be completely committed before running the Fix Script, the script will locate and correct all issues. Given that no new issues will be created, because the source or sources were already corrected, all your duplicates will be fixed. This is the ideal situation, shown below:

3.png

 

It is important not to run the Fix Script until the patch is completely committed; otherwise the same situation described before may occur. Have a look at the representation below: there is a red area where duplicates may still be created that our script will not be able to correct.


4.png

 

 

c)   Designing the script algorithm

We have mentioned that you need a script, but not what it needs to do. If you thought all you needed was to look for duplicates and remove them, you are far from a good solution. The reason is that those duplicate records have existed for some time (minutes, hours, days, weeks…), so users or automated processes have almost certainly created references against them.

 

Imagine we have duplicate users, and some agents have added both user accounts to watchlists or have set the caller to both accounts (the right one and the duplicate one).

 

5.png
 
If we simply removed the duplicate account, we would lose those references, leaving fields pointing to nowhere and discarding information we already had. As you can see in the image below, this would be the end state of our records if we did that:

 


6.png

 

If someone went to INC98765, they would see that any reference to John Smith has been lost. If it was the watchlist, that might be minor damage; but if the lost reference was the “Caller”, we would no longer know who had the issue, and John Smith would be deeply disappointed with the service: he keeps having the issue, but nobody is helping him any longer.

 

Repointing those references is vital if we are to preserve the health of our instance. So, from the previous situation, we aim to reach the one below instead, redirecting all incidents that point to the duplicate user to the record we will preserve:


7.png
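As a minimal sketch of what that repointing looks like for a single reference field, assuming we already know the sys_id of the duplicate user and of the user we want to keep (the variable names and placeholder values are purely illustrative):

// Illustrative sketch: repoint the Caller field on Incident records
// from the duplicate user to the user we are keeping.
var duplicateUserSysId = '<sys_id of the duplicate user>'; // placeholder
var keeperUserSysId = '<sys_id of the user to keep>';      // placeholder

var inc = new GlideRecord('incident');
inc.addQuery('caller_id', duplicateUserSysId);
inc.query();
while (inc.next()) {
    inc.setValue('caller_id', keeperUserSysId);
    inc.setWorkflow(false); // optional: avoid firing business rules/notifications during the correction
    inc.update();
}

Of course, the Caller field is only one of many fields that could point at the duplicate; the sections below show how to find and handle all of them dynamically.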

 

At this point, you may be wondering how you are going to know how many records point to the one you want to remove. This is where you need to rely on the Dictionary. To do so, you need to go to “All > System Definition > Dictionary”.


8.png

 

Then filter all entries where the “Reference” field points to the type of record you need to fix; in our example, that is the User table (“sys_user”). This gives you a list of all the fields (References and Lists) in the system that could potentially be storing a reference to your record.
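The same Dictionary lookup can be done from a script. Below is a small sketch, assuming the duplicates live on sys_user, that simply prints every Reference and List field pointing to that table:

// Illustrative sketch: list every Reference and List field that points to sys_user.
var dict = new GlideRecord('sys_dictionary');
dict.addQuery('reference', 'sys_user');
dict.addQuery('internal_type', 'IN', 'reference,glide_list');
dict.query();
while (dict.next()) {
    gs.info(dict.getValue('name') + '.' + dict.getValue('element') +
            ' (' + dict.getValue('internal_type') + ')');
}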

 

Hardcoding these fields in the script is not a good idea. We obtain the reference/list fields dynamically precisely because the Fix Script will be preserved in the system and potentially used again at some point. If a new table is created, whether custom or installed by an application/plugin, a script with hardcoded fields would not pick up the new fields that could be storing references. Fields can also be removed, so a hardcoded script could crash because of a missing field.


9.png
 
Now we need to write a script that (see the sketch after this list):
  1. Locates the record to be kept per group of duplicate records
  2. Locates the duplicates of that record to be removed/disabled
  3. Queries the Dictionary, with the query stated above, to find the fields that could contain references to those duplicates
  4. Queries each of those tables checking whether it contains a reference to any of the duplicates from step 2 and, if so, replaces it with the record located in step 1
  5. Deletes/deactivates the duplicate records from step 2
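Below is a minimal sketch of such a Fix Script for the duplicate-user example. It assumes duplicates are grouped by email address, keeps the oldest record of each group and deactivates (rather than deletes) the others; the table, the grouping criterion and the keep/deactivate rules are assumptions you should adapt to your own case:

// Illustrative Fix Script sketch for duplicate sys_user records.
// Assumptions (adapt to your case):
//  - duplicates share the same email address
//  - the oldest record of each group is the one we keep
//  - duplicates are deactivated rather than deleted
var TABLE = 'sys_user';
var MATCH_FIELD = 'email';

// Step 3: collect every Reference/List field in the instance that points to TABLE.
var refFields = [];
var dict = new GlideRecord('sys_dictionary');
dict.addQuery('reference', TABLE);
dict.addQuery('internal_type', 'IN', 'reference,glide_list');
dict.query();
while (dict.next()) {
    refFields.push({
        table: dict.getValue('name'),       // note: inherited fields are found on the table that defines them (e.g. task)
        element: dict.getValue('element'),
        isList: dict.getValue('internal_type') == 'glide_list'
    });
}

// Step 1: locate the groups of duplicates.
var agg = new GlideAggregate(TABLE);
agg.addAggregate('COUNT');
agg.addNotNullQuery(MATCH_FIELD);
agg.groupBy(MATCH_FIELD);
agg.query();

while (agg.next()) {
    if (parseInt(agg.getAggregate('COUNT'), 10) < 2)
        continue; // not a duplicate group

    // Step 2: the oldest record is kept; the rest are treated as duplicates.
    var dup = new GlideRecord(TABLE);
    dup.addQuery(MATCH_FIELD, agg.getValue(MATCH_FIELD));
    dup.orderBy('sys_created_on');
    dup.query();

    if (!dup.next())
        continue;
    var keeperId = dup.getUniqueValue();

    while (dup.next()) {
        var duplicateId = dup.getUniqueValue();

        // Step 4: repoint every reference to the duplicate towards the keeper.
        for (var i = 0; i < refFields.length; i++) {
            var f = refFields[i];
            var ref = new GlideRecord(f.table);
            if (!ref.isValid())
                continue; // the table no longer exists

            if (f.isList)
                ref.addQuery(f.element, 'CONTAINS', duplicateId);
            else
                ref.addQuery(f.element, duplicateId);
            ref.query();

            while (ref.next()) {
                if (f.isList) {
                    // List fields hold a comma-separated set of sys_ids.
                    var ids = ref.getValue(f.element).split(',');
                    for (var j = 0; j < ids.length; j++) {
                        if (ids[j] == duplicateId)
                            ids[j] = keeperId;
                    }
                    ref.setValue(f.element, ids.join(','));
                } else {
                    ref.setValue(f.element, keeperId);
                }
                ref.setWorkflow(false); // avoid firing business rules/notifications while correcting
                ref.update();
            }
        }

        // Step 5: deactivate the duplicate itself (or dup.deleteRecord() if deletion is required).
        dup.active = false;
        dup.setWorkflow(false);
        dup.update();
    }
}

Note that, for simplicity, the sketch does not de-duplicate list values (if both accounts sat on the same watchlist, the keeper may end up listed twice), and it repoints inherited fields through the table on which they are defined; adapt those details, and the delete-versus-deactivate decision, to your scenario.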

Optionally, if the duplicate records hold extra or more up-to-date information that the right record should have, you can take advantage of the situation to “merge” that information into the record you keep. For example, imagine that during our last synchronisation of users, the duplicate record received the latest information about John’s location. We could either delete/disable John’s duplicate record and wait until tomorrow for his surviving record to be updated or, if that information is critical, copy the latest values of non-unique fields onto the record being preserved. It is important to highlight that we must never copy unique fields: those will, obviously, differ between the records, and that does not mean they are updated versions. Take “user_name” as an example: John Smith’s regular account may have “jsmith” as its “user_name” and his duplicate account “jsmith1”, because “jsmith” was already taken. If we overrode “jsmith” with “jsmith1”, we would block John’s access, since he is not aware of any such change. A sketch of this optional merge is shown below.
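Here is a small sketch of how that optional merge could look, copying a couple of non-unique fields from the duplicate onto the keeper before deactivating it. The fields chosen (location and phone) are only examples, and keeperId/duplicateId are assumed to be known from the main script:

// Illustrative merge sketch: copy selected non-unique fields from the
// duplicate onto the record we keep. Never copy unique fields such as user_name.
var fieldsToMerge = ['location', 'phone']; // example fields; pick the ones that matter in your case

var keeper = new GlideRecord('sys_user');
if (keeper.get(keeperId)) {
    var duplicate = new GlideRecord('sys_user');
    if (duplicate.get(duplicateId)) {
        for (var i = 0; i < fieldsToMerge.length; i++) {
            var f = fieldsToMerge[i];
            if (duplicate.getValue(f)) // only copy when the duplicate actually has a value
                keeper.setValue(f, duplicate.getValue(f));
        }
        keeper.update();
    }
}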

 

d)   Testing the script

Once the script is ready, it needs to be thoroughly tested by the developers in dev first, with the duplicate data present there. If the issue did not occur in dev, export the affected data from prod to dev so you can test your script. There is a caveat with this approach, though: validate that your company policy allows it. Some of these records may contain sensitive or confidential information that must never leave your production environment. If that is the case, anonymise the data first and import it into the development environment only once nobody could recognise it any more. Also, remember to bring over the records that reference those duplicates, so you can test that they are redirected as well.
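As an illustration of the kind of anonymisation that could be applied to the affected user records before they are loaded into a sub-production instance, here is a simple sketch. The query, the fields overwritten and the replacement values are examples only; your own policy dictates what must be masked:

// Illustrative anonymisation sketch, run on a copy of the affected records.
var counter = 0;
var user = new GlideRecord('sys_user');
user.addQuery('sys_id', 'IN', affectedUserIds); // affectedUserIds: comma-separated sys_ids of the exported duplicates (assumed to exist)
user.query();
while (user.next()) {
    counter++;
    user.first_name = 'Test';
    user.last_name = 'User' + counter;
    user.email = 'test.user' + counter + '@example.com';
    user.phone = '';
    user.setWorkflow(false);
    user.update();
}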

 

After importing the data, either real or anonymised, the recommendation is to limit the script’s reach to a single record first by using “.setLimit(1)” and to check the results manually. The reason is that scripts can have collateral effects. If you damage all the data during the script execution, you will:

 

  1. Lose time waiting for the script to go through all the records that need to be corrected.
    1. With one record you may quickly realise the script is wrong, saving 99% of the execution time
  2. Have to reimport the data again
    1. Not just the duplicate records, but also the other records that reference them
  3. Potentially cause further collateral damage you may not even notice
    1. Scripts can trigger business rules, flows or even cascade effects after deletion, changing more records in your instance than you expected

Once you have executed it against a single record, and if all went well, increase the limit to 10, check the results again and keep increasing the number until you feel confident enough to move to the test environment and repeat the process. Finally, you will be ready to run it in production. A sketch of such a limited test run is shown below.
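For instance, the first test run could be restricted like this. The encoded query is just a placeholder for whatever identifies your duplicates; the point is the .setLimit() call and the manual check before widening the net:

// Illustrative test run: correct a single record first, then widen the net.
var dup = new GlideRecord('sys_user');
dup.addEncodedQuery('active=true^emailISNOTEMPTY'); // placeholder for your real duplicate query
dup.setLimit(1); // raise to 10, then higher, as each manual check passes
dup.query();
while (dup.next()) {
    gs.info('Test run would correct: ' + dup.getValue('user_name'));
    // ...repointing/deactivation logic for this record goes here...
}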

 

Be mindful that if the script is not good enough you may end up causing an even bigger issue than the one you had, so please ensure the script is flawless before running it in production.

Using ATF is a great idea, especially given that this Fix Script will remain in the system and should be retested in the future if you ever need to rerun it.

 

2.    Summarising

Duplicate records can seem easy to eradicate from your instance, especially for an inexperienced developer or administrator, but there are aspects that need to be considered if you want to do it right. Remember to patch the source or sources of the issue, prepare a script that deletes/disables the duplicate records and redirects their references, and ensure the script works in lower environments before running it in production, to avoid causing extra damage.

 

Finally, bear in mind that Configuration Items have their own approach to getting rid of duplicates, and you should explore that approach first.

 

Please like 👍 and share 🌍 this article with your colleagues if you found it useful.