Community Code Snippets: Scripted String Matching Techniques - Part 1

sabell2012 · ‎08-22-2018

NOTE: MY POSTINGS REFLECT MY OWN VIEWS AND DO NOT NECESSARILY REPRESENT THE VIEWS OF MY EMPLOYER, ACCENTURE.

DIFFICULTY LEVEL: ADVANCED
Assumes having taken the class SSNF and has good advanced level of knowledge and/or familiarity with Scripting in ServiceNow.

This particular topic was something that keeps coming up on my radar: I needed a way to match strings! I needed several approaches. I needed them to be consistent! Thus began my serious search to build a tight little library of functions that would give me something to use when needed.

What I looked at:

JavaScript .includes
JavaScript regular expressions (.test)
JavaScript .search
JavaScript .match
JavaScript .substring
Roll-my-own string match combinations (I need to find another hobby)
Several Levenshtein match techniques (in JavaScript - and also off the web) (part 2 article)
Several Jaro-Winkler weighted match techniques (in JavaScript - off the web) (part 3 article)
Smith-Waterman weighted match technique (in JavaScript - off the web) (part 4 article)

Note: All examples were run in my Personal Instance on the Vancouver version in Scripts - Background or Fix Scripts.

JavaScript Search Match Capabilities

// 1. includes (introduced in ES6) WORKED!
var sourceString = "string to search for pattern";
var pattern = "sea";
gs.info('1.' + sourceString.includes(pattern));

// 2. RegExp: test WORKED
var sourceString = "string to search for pattern";
var pattern = /sea/;  // no quotes here
gs.info('2.' + pattern.test(sourceString));

//3. sourceString.search WORKED, returns start location
var sourceString = "string to search for pattern";
var pattern = /sea/;
gs.info('3.' + sourceString.search(pattern));

// 4. string.match WORKED, retrieves an array of matches
var sourceString = "string to search for pattern";
var pattern = /sea/;
gs.info('4.' + sourceString.match(pattern));

// 5. string.indexOf WORKED!
var sourceString = "string to search for pattern";
var pattern = "sea";
gs.info('5.' + (sourceString.indexOf(pattern) !== -1));

Would result in:

*** Script: 1.true
*** Script: 2.true
*** Script: 3.10
*** Script: 4.sea
*** Script: 5.true

So these are all fine and good, but what if I wanted something more substantial that returned a likelihood of a match? In other words: a weighted match?

This is the point where instead of doing my usual due diligence and search the web, I decided to play around with rolling my own. I started my own string-match Script Include library.

Equals

// are two values the same?
function checkEquals(stringToMatch, target) {
   return stringToMatch.toLowerCase() == target.toLowerCase();
}

That was pretty straightforward, but it is a single very constrained match.

Match first and last characters

When taken into account as the next check this one might be useful to throw out any that are definitely not a match. Well, it sounded like a good idea, but I began thinking of dozens of scenarios that were legit and wouldn't pass this test.

// match first and last character (as part of a set of checks)
function matchFirstAndLast(stringToMatch, target) {
   return (stringToMatch[0] == target[0] && stringToMatch[stringToMatch.length - 1] == target[target.length - 1]);
}

Match contains the string

Well, how about checking both strings against each other? Perhaps one is contained in the other? Another good idea down in flames. Immediately I started to think of how this wouldn't work all that great. Think about it.

// is one value in the other?
function matchContains(stringToMatch, target) {
    return (target.indexOf(stringToMatch) > -1) || (stringToMatch.indexOf(target) > -1);
}

Check first and last part of a string

Okay, so that last one wasn't all that great, but maybe I could salvage a couple of these approaches and come up with something more sophisticated. Here I made use of searching forward for a partial string match, and then searching backward for the same string. The idea is to see if there is a minimal match.

The idea: Do a character-by-character match looking for contiguous characters. If you find a contiguous set begin the comparison. If they continue to match then, um, continue. If they stop matching then kick out and see what the percentage of matches were. Also, provide for a reverse matching as well - to start at the end of the strings and search backwards. This seemed all good for a method until I started to do some testing.

const EXTRACTBETWEENCNANDCOMMA = /cn=(.*?),.*/;
const REMOVECOMMASANDSPACES = /[.,\s]/g;

// check first part of string, or check last part of string 
// starting with the first/last character see if there is a partial match
function matchInTotarget(stringToMatch, target, toReverse) {
    stringToMatch = stringToMatch.toLowerCase().replace(REMOVECOMMASANDSPACES, '');
    target = target.toLowerCase().replace(REMOVECOMMASANDSPACES, '');
    
    toReverse = gs.nil(toReverse) ? false : toReverse;
	var directionTag = 'Start-to-End';
    if (toReverse) {
        stringToMatch = stringToMatch.split('').reverse().join(); // the join isn't really necessary
        target = target.split('').reverse().join();
		directionTag = 'End-to-Start';
    }
//    gs.info('---> matchInTotarget. {0}, {1}, {2}, {3}', 
//        [stringToMatch.toString(), target.toString(), target.length, stringToMatch.length]);

    // do a character by character match
    var found = 0;
    for (var i=0; i < target.length; i++) {
        if (i > stringToMatch.length) {
			// past the length of the stringToMatch
            //gs.info('i > stringToMatch.length. stopping');
            break;
        }
//        gs.info('---> {0}, {1}', [stringToMatch[i], target[i]]);
        if (target[i] == stringToMatch[i]) {
            found++;
        }
        else {
            // not contiguous
            //gs.info('not contiguous. stopping');
            break;
        }
    }
    
    percentage = parseFloat(found / target.length) * 100.00;
    gs.info('---> match result ({4}) {1}, {2}, {3}: {0}', 
        [percentage, stringToMatch.toString(), target.toString(), found, directionTag]);
}

Rotating search with partial string match

That was okay, but it just didn't cut-the-mustard! Okay, okay, not to give up yet! I decided to try morphing this into a slightly different approach. Perhaps look for part of the string to match and see if it exists in the target string. Then begin walking character by character looking for exact matching and then kicking out when they differ? That sounded good, right? Actually, it wasn't bad, but it wasn't all that great either. It only worked on a small percentage of what data I was working with (I was trying to match users coming in via a transform map from an Active Directory source - and yes there was a reason why - work it out 😊 ).

I came up with this:

const EXTRACTBETWEENCNANDCOMMA = /cn=(.*?),.*/;
const REMOVECOMMASANDSPACES = /[.,\s]/g;
const PARTIALLENGTH = 4;

// match partial string - N-gram
// partial length for initial match=4.  find source in target (char 1, char 2 etc.) first 4, second 4, 3rd 4.
// when found then do a character by character until kick out. then do percentage.
function matchPartial(stringToMatch, target) {
    stringToMatch = stringToMatch.toLowerCase().replace(REMOVECOMMASANDSPACES, '');
    target = target.toLowerCase().replace(REMOVECOMMASANDSPACES, '');

    // search through all the characters of the stringToMatch string using four characters
    // if an initial match is found then search character-by-character against the 
    // target. If nothing greater than four is found, then search the next four, etc.
    var done = false;
    var count = 0;
    // don't try to do anything past the end of the stringToMatch string with offset.
    for (var i=0; i < stringToMatch.length - PARTIALLENGTH; i++) {
        var searchString = stringToMatch.substring(i, i + PARTIALLENGTH);
        var foundLocation = target.indexOf(searchString);
//        gs.info('---> searchString: {0}, foundLocation: {1}', [searchString, foundLocation]);
        
        count = 0;
        // ok we found a match on a partial string.
        if (foundLocation > -1) {
            var offset = i + PARTIALLENGTH;
            for (var j=foundLocation + PARTIALLENGTH; j < target.length; j++) {
                // past the end of the stringToMatch. stop.
                if (offset > stringToMatch.length - 1) {
                    //gs.info('---> past end of stringToMatch: j:{0}, offset:{1}, stringToMatch.length:{2}, count:{3}. stopping.',
                    //    [j, offset, stringToMatch.length, count]);
                    done = true;
                    break;
                }
                // see if we have a contiguous match
                if (stringToMatch[offset] == target[j]) {
                    count++;
                    offset++;
                }
//                 else if (count > PARTIALLENGTH) {
//                     // not congiuous, but we might have a decent match, kick out
//                     done = true;
//                     break;
//                 }
                else {
                    // not contiguous. check the next four characters.
                    //gs.info('---> not contiguous. stopping');
                    break;
                }
            }
        }
        if (done || count > PARTIALLENGTH) {
            break;
        }
    }

    var result = 0;
    // we might have a partial match
    if (count > 0) {
        result = ((count + PARTIALLENGTH) / target.length) * 100;
        gs.info('---> partial match result for {1}, {2}: {0} - {3}', 
            [result, stringToMatch, target, count]);
    }
    else {
        gs.info('---> nothing found for partial match result for {1}, {2}: {0} - {3}', 
            [result, stringToMatch, target, count]);
    }
}

Testing Results

I developed a Fix Script that would exercise these functions and determine if they were any good.

Using the following data and calls:

// dummy data
var target = {
    email: 'buddy.bajeebers@test.com',
    source: 'CN=Buddy Bajeebers (jr),OU=On Hold Accounts,DC=corp,DC=test,DC=org',
    givenname: 'Buddy',
    surname: 'Bajeebers',
    samaccountname: 'luchador.d.bajeebers',
    name: 'buddy.bajeebers',
    userPrincipleName: 'luchador.d.bajeebers@test2.com'
};

var stringToMatch = {
    av_accentureemail: 'buddy.bajeebers@test.com',
    source: 'CN=buddy.bajeebers,OU=FTE,OU=People,DC=corp,DC=test,DC=org',
    givenname: 'Buddy',
    name: 'buddy.bajeebers',
    samaccountname: 'buddy.bajeebers',
    surname: 'Bajeebers',
    userPrincipleName: 'buddy.bajeebers@test.com'
};

// check if source is the same
var sourceEquals = checkEquals(stringToMatch.source, target.source);
gs.info('---> 1. EQUALS. source matched: ' + sourceEquals);

// check if email is the same as av_accentureemail
var emailEquals = checkEquals(stringToMatch.av_accentureemail, target.email);
gs.info('---> 2. EQUALS. email matched: ' + emailEquals);

// check if givenname is the same
var givenNameEquals = checkEquals(stringToMatch.givenname, target.givenname);
gs.info('---> 3. EQUALS. givenname matched: ' + givenNameEquals);

// check if surname is the same
var surnameEquals = checkEquals(stringToMatch.surname, target.surname);
gs.info('---> 4. EQUALS. surname matched: ' + surnameEquals);

// check if user principlename is the same
var userPrincipleNameEquals = checkEquals(stringToMatch.userPrincipleName, target.userPrincipleName);
gs.info('---> 5. EQUALS. userPrincipleName matched: ' + userPrincipleNameEquals);

// check if name is the same
var nameEquals = checkEquals(stringToMatch.name, target.name);
gs.info('---> 6. EQUALS. name matched: ' + nameEquals);

var stringToMatchSource = stringToMatch.source.toLowerCase().replace(EXTRACTBETWEENCNANDCOMMA, '$1');
var targetSource = target.source.toLowerCase().replace(EXTRACTBETWEENCNANDCOMMA, '$1');

var found = matchFirstAndLast(stringToMatchSource, targetSource);
gs.info('---> 1. FIRST and LAST. source, result for {1}, {2}: {0}', [found, stringToMatchSource, targetSource]);
found = matchFirstAndLast(stringToMatch.samaccountname, target.samaccountname);
gs.info('---> 2. FIRST and LAST. sam acct name, result for {1}, {2}: {0}', [found, stringToMatch.samaccountname, target.samaccountname]);

// check from first to last character
matchInTotarget(stringToMatchSource, targetSource, false); // first to last
// check from last to first character
matchInTotarget(stringToMatchSource, targetSource, true); // last to first

matchInTotarget(stringToMatch.samaccountname, target.samaccountname, false);
matchInTotarget(stringToMatch.samaccountname, target.samaccountname, true);

matchPartial(stringToMatchSource, targetSource);
matchPartial(stringToMatch.samaccountname, target.samaccountname);
matchPartial(stringToMatch.userPrincipleName, target.userPrincipleName);

My results:

*** Script: ---> 1. EQUALS. source matched: false
*** Script: ---> 2. EQUALS. email matched: true
*** Script: ---> 3. EQUALS. givenname matched: true
*** Script: ---> 4. EQUALS. surname matched: true
*** Script: ---> 5. EQUALS. userPrincipleName matched: false
*** Script: ---> 6. EQUALS. name matched: true
*** Script: ---> 1. FIRST and LAST. source, result for buddy.bajeebers, buddy bajeebers (jr): false
*** Script: ---> 2. FIRST and LAST. sam acct name, result for buddy.bajeebers, luchador.d.bajeebers: false
*** Script: ---> match result (Start-to-End) buddybajeebers, buddybajeebers(jr), 14.0: 77.77777777777779
*** Script: ---> match result (End-to-Start) s,r,e,b,e,e,j,a,b,y,d,d,u,b, ),r,j,(,s,r,e,b,e,e,j,a,b,y,d,d,u,b, 0: 0.0
*** Script: ---> match result (Start-to-End) buddybajeebers, luchadordbajeebers, 0: 0.0
*** Script: ---> match result (End-to-Start) s,r,e,b,e,e,j,a,b,y,d,d,u,b, s,r,e,b,e,e,j,a,b,d,r,o,d,a,h,c,u,l, 18.0: 51.42857142857142
*** Script: ---> partial match result for buddybajeebers, buddybajeebers(jr): 77.77777777777779 - 10.0
*** Script: ---> partial match result for buddybajeebers, luchadordbajeebers: 50.0 - 5.0
*** Script: ---> partial match result for buddybajeebers@testcom, luchadordbajeebers@test2com: 51.85185185185185 - 10.0

Note: I have attached the Fix Script, for testing, to this article for download. You accept the install as-is and unsupported if you decide to use it.

So the best I was able to attain was a 77.78% probability of a match.I had other exact matches and the combination of all those matching vs. not matching would probably give me a much higher probability of a match. Not bad, but it could be much much better I am thinking!

This is the point where, after beating myself vigorously in the head with my keyboard for several minutes, I broke down and began researching string matching algorithms on the web. I should have started there, and could have not try to re-invent the wheel. One of the first things I found was that I was on the way to re-inventing the wheel with that last try. There was an algorithm called: N-gram that is somewhat close to what I was doing. It too was insufficient, but I felt better about myself after finding that. It STILL would have taken me a couple of years plus a couple additional of educational degrees in math to begin arriving at what people had already done. Sigh.

So what did I find? Well, I will detail that in the next couple of articles. 😊

Enjoy!

Steven Bell.

If you find this article helps you, don't forget to log in and mark it as "Helpful"!

Originally published on: 08-22-2018 4:53 AM

I updated the code and brought the article into alignment with my new formatting standard.