Today's example is one that someone threw at me a couple of years ago. They had a set of data that was sort of comma-separated values. It did have several columns of data values, each separated by commas. What was odd about it was that instead of using Microsoft's quoting convention, they'd invented their own. In their coding convention, any one of five different characters could be used for quoting the value: a quote, a dollar sign, an apostrophe, a percent sign, or an at sign (any of "$'%@). Apparently the idea was to eliminate the need to escape values inside the quotes — I guess they figured it wasn't very likely that you'd have a string with all five of those characters in it!
So imagine you had some sample helpdesk ticket data, encoded with this wild scheme:
from,title
$Pelosi, Nancy$,"Please, right now, stop people from posting hateful photos of me!"
%Moore, Michael%,'@&!%$!!!'
@Plumber, Joe@,"Some Drano, applied creatively, would do wonders for Congress! Wanna help?"
What sort of a regex would it take to extract the bits we really want?
The fellow who gave me the problem thought long and hard about it, and finally came up with this ginormous brain-bender:
/^(?:"(.*?)"|'(.*)'|\$(.*)\$|%(.*)%|@(.*)@),(?:"(.*?)"|'(.*)'|\$(.*)\$|%(.*)%|@(.*)@)$/m
That's 88 characters of madness — but it worked. I'll leave it to you to puzzle out how it works. My friend challenged me to beat it. I couldn't run away from that!
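If you'd rather poke at it than puzzle over it, here is a minimal sketch that runs the long pattern in Python. Python's `re` module is my assumption (the original is written in /.../m notation), with `re.MULTILINE` standing in for the `m` flag, and the names `LONG`, `first_set`, and `parse_line` are made up for illustration. The thing to notice is that each field has five candidate capture groups, one per quote character, so the caller has to hunt for the one that actually matched.

```python
import re

# The 88-character pattern, minus the /.../m delimiters.
# Groups 1-5 belong to the first field, groups 6-10 to the second.
LONG = re.compile(
    r"""^(?:"(.*?)"|'(.*)'|\$(.*)\$|%(.*)%|@(.*)@),(?:"(.*?)"|'(.*)'|\$(.*)\$|%(.*)%|@(.*)@)$""",
    re.MULTILINE,
)

def first_set(groups):
    """Return the one group that took part in the match (the rest are None)."""
    return next(g for g in groups if g is not None)

def parse_line(line):
    """Return (from, title) for one line, or None if the line does not match."""
    m = LONG.match(line)
    if m is None:
        return None
    groups = m.groups()
    return first_set(groups[:5]), first_set(groups[5:])

print(parse_line("%Moore, Michael%,'@&!%$!!!'"))
```

Only one alternative per field can match, so exactly one of each set of five groups is filled in and the other four come back as `None`.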
Here's what I came up with:
/^([$"%'@])(.*?)\1,([$"%'@])(.*?)\3$/m
Just 38 characters, and much easier to understand if you know the slightly obscure little regex feature called backreferences. Let's pick it apart...
There are two sections that are nearly duplicates of each other. The first one of these is ([$"%'@])(.*?)\1. What's going on in there?
The first bit captures a single character that is one of our quote characters: ([$"%'@]). That's only going to match if the string has one of those characters at the beginning.
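The next bit captures any run of characters (possibly none), reluctantly: (.*?). Reluctant means it grabs as few characters as it can get away with, so it stops at the first spot where the rest of the pattern can match.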
Next is the obscure part: \1. That's our new best friend, the backreference. A backslash followed by a number means "match the same text that the numbered capture group matched"; in this case, capture group 1. That would be the capture group that captured the leading quote character. So if the leading quote character happened to be a dollar sign, this backreference will match only a dollar sign, and that's how we find the trailing dollar sign. Holy obscure obscenities, Batman!
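To see the backreference on its own, here is a small sketch of just that one section, again assuming Python's `re` module. The names `FIELD` and `unquote` are hypothetical helpers, not anything from the original problem.

```python
import re

# One quoted field: an opening quote character (group 1), a reluctant body
# (group 2), then the same character again via the backreference \1.
FIELD = re.compile(r"""([$"%'@])(.*?)\1""")

def unquote(text):
    """Return the body of a field wrapped in matching quote characters, else None."""
    m = FIELD.fullmatch(text)
    return m.group(2) if m else None

print(unquote("$Pelosi, Nancy$"))  # Pelosi, Nancy
print(unquote("'@&!%$!!!'"))       # @&!%$!!!
print(unquote("$mismatched%"))     # None
```

The last call fails because \1 insists on a second dollar sign; a percent sign will not do, even though it is also one of the five quote characters.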
Just to complete the thought, there are two nearly identical sections there. The first one has a backreference to capture group 1, for the reasons given above. The second section has a backreference to capture group 3, because its leading quote character was captured by that capture group (group 2 is already taken by the first value). If I call the first one of these sections "a", and the second "b", then this is what the whole regex looks like: /^a,b$/m. That's pretty easy to understand: look for beginning of line, one quoted string, a comma, another quoted string, and the end of line. Easy peasy lemon squeezy!
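Putting the whole thing together, here is a sketch that runs the 38-character regex over the sample ticket data. As before, Python's `re` module is an assumption, `re.MULTILINE` plays the part of the `m` flag, and `SAMPLE`, `ROW`, and `parse_rows` are illustrative names of my own.

```python
import re

# The sample helpdesk data from above, header line included.
SAMPLE = '''from,title
$Pelosi, Nancy$,"Please, right now, stop people from posting hateful photos of me!"
%Moore, Michael%,'@&!%$!!!'
@Plumber, Joe@,"Some Drano, applied creatively, would do wonders for Congress! Wanna help?"
'''

# Groups 1 and 3 capture the quote characters; groups 2 and 4 hold the values.
ROW = re.compile(r"""^([$"%'@])(.*?)\1,([$"%'@])(.*?)\3$""", re.MULTILINE)

def parse_rows(text):
    """Return a list of (from, title) pairs, one per matching line."""
    return [(m.group(2), m.group(4)) for m in ROW.finditer(text)]

rows = parse_rows(SAMPLE)
for sender, title in rows:
    print(sender, "|", title)
```

The header line is skipped because it doesn't start with a quote character, and the values we really want are always in groups 2 and 4, whichever quote character a line happens to use.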
Now you've joined the sparse ranks of those who can explain a regex backreference — and what it's good for. That and $5 will get you a fine cup o'joe at your neighborhood Barstucks...