SlightlyLoony
Tera Contributor
08-20-2010
06:33 AM
A common use for regular expressions is to extract some text from a larger piece of text, based on some delimiters that define the range of text to pull out. For instance, I might want to extract the text surrounded by parentheses, as in this sentence:
This is a test to see if (maybe) we can extract the text within some (pairs of) parentheses.
This is easy to do with regular expressions, but it involves a trick or two. Do you know how to do it?
Here's how you might try doing this (especially if you've never run into this particular challenge before):
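var text = 'This is a test to see if (maybe) we can extract the text within some (pairs of) parentheses.';
// First attempt: capture anything at all between "(" and ")"
var regex = /\((.*)\)/g;
var match;
// exec() returns null once there are no more matches, which ends the loop
while ((match = regex.exec(text)) !== null) {
    gs.log(match[1]);
}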
The regex variable holds a pattern that searches for parentheses with anything at all between them. The "\(" and "\)" specify that we're looking for literal parentheses (we have to escape them with a backslash because parentheses are special characters in regular expressions). The unescaped parentheses define our capture group, in which we hope to capture what's between the parentheses in our input string. The "g" at the end makes it a global regular expression: each call to exec() on the regex variable resumes the search where the previous match ended, and returns null when nothing more is found. So when we run it, we might expect to see this output:
maybe
pairs of
But instead we get this:
maybe) we can extract the text within some (pairs of
What's going on here?
The answer lies in the ".*" inside our capture parentheses. The dot means "match any character" (other than a line break) and the asterisk means "match any number of them" (from 0 to infinity). OK, you say, that's exactly what I meant! But what you might not know is that the asterisk is a greedy quantifier: it's going to match as many characters as possible while still letting the overall pattern succeed. So what's happening is that it matches all the characters between the first "(" and the last ")". It's a greedy little bugger.
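To see the greedy behavior directly, here is a minimal sketch that checks where the match starts and ends. It assumes a plain JavaScript environment, so it uses console.log rather than gs.log, and it drops the "g" flag because only the first match matters here. The names firstOpen and lastClose are introduced only for illustration.

```javascript
// Sketch: show how far the greedy ".*" reaches in the sample sentence.
var text = 'This is a test to see if (maybe) we can extract the text within some (pairs of) parentheses.';
var greedy = /\((.*)\)/; // no "g" flag: we only look at the first match

var match = greedy.exec(text);
var firstOpen = text.indexOf('(');      // position of the first "("
var lastClose = text.lastIndexOf(')');  // position of the last ")"

// The capture group holds everything between those two positions.
console.log(match[1]);
// -> maybe) we can extract the text within some (pairs of

// The match begins at the first "(" ...
console.log(match.index === firstOpen); // -> true
// ... and its final character is the last ")".
console.log(match.index + match[0].length - 1 === lastClose); // -> true
```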
Fortunately, regular expressions include the opposite of greedy quantifiers: reluctant quantifiers (also called lazy quantifiers). These do exactly what we need in this circumstance: they match the fewest characters they can while still making the match. Turning a greedy quantifier into a reluctant one is easy: just append a "?", as below:
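var text = 'This is a test to see if (maybe) we can extract the text within some (pairs of) parentheses.';
// The "?" after "*" makes the quantifier reluctant: stop at the nearest ")"
var regex = /\((.*?)\)/g;
var match;
// exec() returns null once there are no more matches, which ends the loop
while ((match = regex.exec(text)) !== null) {
    gs.log(match[1]);
}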
If you run that code, you'll get exactly what you wanted...
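If you'd rather collect the results than log them, the same loop can be wrapped in a small function. This is only a sketch: extractParenthesized is a hypothetical helper name, not anything built in, and it uses console.log on the assumption that you're trying it outside of a ServiceNow instance.

```javascript
// Hypothetical helper: return every parenthesized chunk of `text` as an array.
function extractParenthesized(text) {
    var regex = /\((.*?)\)/g; // reluctant quantifier plus the global flag
    var results = [];
    var match;
    // With the "g" flag, each exec() call resumes at regex.lastIndex
    // and returns null once no further match exists.
    while ((match = regex.exec(text)) !== null) {
        results.push(match[1]);
    }
    return results;
}

var sample = 'This is a test to see if (maybe) we can extract the text within some (pairs of) parentheses.';
console.log(extractParenthesized(sample).join(', '));
// -> maybe, pairs of
```

One caveat with this approach: it stops at the nearest ")", so nested parentheses such as "(a (b) c)" come back as "a (b" rather than as the full outer group.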