Do You Suffer from Abnormal Data?

SlightlyLoony · ‎01-16-2011

Well, nearly everyone does — it's a universal affliction, like the flu, dentists, and the occasional retribution of the ninth tlatoani. You just can't get away from it.

What am I babbling about?

Do you have fields in your system that your users, external data sources, or discovery processes (yes, even ours!) just can't seem to get "right"?

It could be something as simple as user names: perhaps some people enter them with all capital letters, or your LDAP server supplies them with only the first letter capitalized (so "Smith-Fitzsimmons" becomes "Smith-fitzsimmons"). Just think of all the ways that someone can mangle a name … and more than likely, your system has examples of it.

Or perhaps your pet abnormality is with the CPU Type field. Those pesky data sources keep insisting on filling in all sorts of noise you'd rather not see — you want "Xeon", and discovery gives you "Intel Xeon L4533 step 6.7.23.423". Makes it hard to search and report on.

Others are offended by the unnatural values returned by discovery processes for a computer's RAM size. To be fair, these discovery processes are just reporting what the operating system tells them. But still, who on earth would imagine that a computer had (say) 2043 MB of RAM? Kinda hard to buy RAM in that size, you know? Of course that's really 2048 MB (or 2 GB) of RAM, but try convincing discovery of this obvious truth.

And then there's one that's high on everybody's list: company names. You get a new appreciation for the unreasoning creativity of mankind when you see how many ways one can write the name of "Dell". Who knew that inventive souls would write "Dell International", "Dell Computers", "Dell, Inc.", "Dell Ltd.", "Dell, Incorporated", "Dell Corporation", and a couple dozen others instead of those simple four letters? Not only do you end up with all those different ways to write "Dell", but you also end up with multiple company records where you really only want one.

If your system has anything like the above examples, then you're suffering from the heartbreak of Abnormal Data.

And starting with the Winter 2011 release (coming to a browser near you in less than a month!), we've got the advanced ~~homeopathic~~ automatic cure: Field Normalization! But if you've never run into anything like this before, it will likely need some 'splaining…

This post is the first in a series that will explore in depth exactly how Field Normalization works, and how you can use it. Here I'm going to introduce what it does, without diving into how. For now, just think of it as magic that you can install as a plugin.

Let's start with that CPU Type example, as it's nice and simple. If you looked at all the different values that various data sources and people have entered, you might see (in part) something like the left column in the table below. The right column shows what you might like to see instead of all that junk:

What I have	What I want
Intel Xeon L4533 step 6.7.23.423	Xeon
Xeon L2534 4.5.44.1	Xeon
Intel Atom micro I step 142	Atom
SunSparc 7a 6.239638	Sparc
Pentium III step 73	Pentium
Intel Pentium V beta 5	Pentium

In this situation, Field Normalization does what you'd hope: it automatically turns the junk on the left into the nice, neat stuff on the right. Essentially it keeps a map to show which righthand value matches which lefthand values. Don't worry about the details here (and there are quite a few of them) — the bottom line is that with a reasonable amount of configuration work and a minimal amount of ongoing work, Mr. Clean will take care of your abnormal CPU Type data, leaving it squeaky clean. Better yet, he won't just clean up new data being input — he'll also go back and clean up the mess that's there already!
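For the curious, the "map" idea can be sketched in a few lines of Python. To be clear, this is only an illustration of the concept and not how Field Normalization is actually built: the names `CPU_TYPE_MAP`, `normalize_cpu_type`, and `clean_existing` are made up for this example, and the records are plain dictionaries standing in for real database rows.

```python
# Hypothetical map: each normalized value lists the raw values that mean it.
# The entries come straight from the table above.
CPU_TYPE_MAP = {
    "Xeon": ["Intel Xeon L4533 step 6.7.23.423", "Xeon L2534 4.5.44.1"],
    "Atom": ["Intel Atom micro I step 142"],
    "Sparc": ["SunSparc 7a 6.239638"],
    "Pentium": ["Pentium III step 73", "Intel Pentium V beta 5"],
}

# Invert the map once so each raw value can be looked up directly.
RAW_TO_NORMALIZED = {
    raw: normalized
    for normalized, raws in CPU_TYPE_MAP.items()
    for raw in raws
}


def normalize_cpu_type(raw):
    """Return the normalized value, or the input unchanged if it is unmapped."""
    return RAW_TO_NORMALIZED.get(raw.strip(), raw)


def clean_existing(records):
    """Normalize the 'cpu_type' key of existing records in place.

    Returns how many records were changed.
    """
    changed = 0
    for record in records:
        normalized = normalize_cpu_type(record["cpu_type"])
        if normalized != record["cpu_type"]:
            record["cpu_type"] = normalized
            changed += 1
    return changed
```

The same lookup serves both jobs mentioned above: call `normalize_cpu_type` on new data as it arrives, and run `clean_existing` once over the data that is already there. Values that aren't in the map pass through untouched, which is one reason some ongoing work is needed to keep the map current.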

Let's look at a different kind of abnormal data: those sloppily capitalized names. You might have things like this:

What I have	What I want
smith	Smith
LUDDY	Luddy
LoonY II	Loony
Barking-badger, iii	Barking-Badger

This field is different from the first example in that it isn't practical to make a mapping between the values on the left and the values on the right, because there are too darned many possibilities. Instead, you need some kind of automatic transformation of the lefthand values into the righthand values. Field Normalization does exactly that. Again, it will not only transform newly entered data, but it will go back and clean up all that mess in your database.
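Here is a rough Python sketch of what such a transformation might look like for the names in the table above. It is an assumption-laden illustration, not the product's actual rule set: the function name `normalize_name` and the list of suffixes to drop (jr, sr, ii, iii, iv) are invented for this example, chosen only so that the four sample rows come out right.

```python
import re

# Assumption: a trailing generational suffix (optionally preceded by a comma)
# should be dropped, as in "LoonY II" and "Barking-badger, iii" above.
_SUFFIX = re.compile(r"[,\s]+(?:jr|sr|ii|iii|iv)\.?$", re.IGNORECASE)


def normalize_name(raw):
    """Strip a trailing suffix, then capitalize each hyphen-separated part."""
    name = _SUFFIX.sub("", raw.strip())
    # str.capitalize() upper-cases the first letter and lower-cases the rest.
    return "-".join(part.capitalize() for part in name.split("-"))


for sample in ["smith", "LUDDY", "LoonY II", "Barking-badger, iii"]:
    print(sample, "->", normalize_name(sample))
```

Notice that there is no list of known names anywhere in the code. The rule works on any input, which is exactly why a transformation fits this field better than a map. Real names are messier than this (think "McDonald" or "van der Berg"), so a rule this simple would need refinement before you trusted it with real data.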

I'll hurt your brain with one last thing that Field Normalization does. Remember the company name example above? It isn't enough, in that case, to make all 27 entries that mean "Dell" say "Dell" — 'cause you'd still have 27 entries for Dell. What you really need is a way to moosh all those 27 entries into a single entry. But you can't just delete 26 of the entries, because there may be references to them from any of dozens of places within the system. You need a kind of magical mooshing that makes all the references point to the one remaining Dell entry, instead of the 26 that get deleted. By now I hope you won't be surprised when you hear that Field Normalization does that.
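The mooshing can be sketched in Python too. This is a toy model under loud assumptions: companies live in a dictionary keyed by a made-up numeric id, the referencing records are dictionaries with a hypothetical `company_id` key, and the function name `merge_duplicates` is invented here. The real system has many more places that can hold references, but the order of operations is the point.

```python
def merge_duplicates(companies, referencing_rows, survivor_id, duplicate_ids):
    """Fold duplicate company records into one surviving record.

    References are repointed first and the duplicates deleted second,
    so no row is ever left pointing at a record that no longer exists.
    Returns the number of references that were repointed.
    """
    doomed = set(duplicate_ids) - {survivor_id}

    repointed = 0
    for row in referencing_rows:
        if row["company_id"] in doomed:
            row["company_id"] = survivor_id
            repointed += 1

    for company_id in doomed:
        companies.pop(company_id, None)

    return repointed


# Hypothetical sample data: three records that all mean "Dell", plus one other.
companies = {1: "Dell", 2: "Dell, Inc.", 3: "Dell Computers", 4: "Acme"}
assets = [
    {"name": "laptop-1", "company_id": 2},
    {"name": "server-7", "company_id": 3},
    {"name": "anvil-1", "company_id": 4},
]

merge_duplicates(companies, assets, survivor_id=1, duplicate_ids=[2, 3])
```

After the call, only one Dell record remains, and both the laptop and the server point at it. The Acme record and its anvil are left alone.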

There are many details and nuances to acquaint you with, and I'll have a series of posts over the next few days and weeks to do just that. As always, please feel free to ask questions (in comments to the posts, please — that way everybody gets to see the question and the answer!).

The cure for your abnormal data is almost here!
