Regular expressions, sometimes referred to as regex, grep, or
pattern matching, can be a very powerful tool and a tremendous
time-saver with a broad range of applications. As an extended form of
find-and-replace, you can use a regular expression to do things such
as perform client-side validation of email addresses and phone
numbers, search multiple documents for strings and patterns you wish
to change or remove, or extract a list of links from source code.
Regex is supported by most languages and tools, but because there
can be varying implementations, this article will cover basic
principles that are commonly used.
Literals and Metacharacters
If you've seen a regular expression before and thought it looked
like alien space-algebra, it does, but have no fear - you'll be
fluent in alien space-algebra in no time! To make the most of the
power of regex, you need to be familiar with a few classifications
of characters. Literals are normal text characters and can
include whitespace (tabs, spaces, newlines, etc.). Unless modified
by a metacharacter, a literal will match itself on a one-for-one
basis. The power of metacharacters lies in how they are arranged
and interpreted as wildcards. A metacharacter can be escaped with a
backslash (\) to match a literal instance of itself, for instance, if
you need to find a caret (^) or a backslash. Metacharacters can also
be used in nested groups or other combinations.
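
As a rough illustration of escaping, here is a minimal sketch using Python's re module. The choice of Python and the sample string are assumptions made for this example, not something the article was tested with. It shows an unescaped period acting as a wildcard, and a backslash turning a period, a caret, or another backslash into a literal.

```python
import re

# Made-up sample text containing a caret, a backslash, and a period.
text = "Use ^ for anchors and \\ for escapes. Price: 3.50 (not 3x50)."

# An unescaped period is a wildcard, so it matches "." and also "x".
wildcard_hits = re.findall(r"3.50", text)

# Escaping the period makes it a literal, so only "3.50" matches.
literal_hits = re.findall(r"3\.50", text)

# A caret and a backslash are each escaped with a leading backslash.
caret_hits = re.findall(r"\^", text)
backslash_hits = re.findall(r"\\", text)

print(wildcard_hits)    # ['3.50', '3x50']
print(literal_hits)     # ['3.50']
print(len(caret_hits), len(backslash_hits))
```

The raw-string prefix (r"...") is a Python convenience that keeps the backslashes from being interpreted by the language before the regex engine sees them.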
Below is a list of some metacharacters and character classes for
a quick glance - each will be explained in further detail with
examples. Keep in mind that a "match" can be as simple as a single
character or as complex as a sequence of literals and metacharacters
in nested and compounded combinations.
| Metacharacter | Match |
| \ | the escape character - used to find an instance of a metacharacter like a period, brackets, etc. |
| . (period) | match any character except newline |
| x | match any instance of x |
| [^x] | match any character except x |
| [x] | match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z |
| | (pipe) | an OR operator - (x|y) will match an instance of x or y |
| () | used to group sequences of characters or matches |
| {} | used to define numeric quantifiers |
| {x} | match must occur exactly x times |
| {x,} | match must occur at least x times |
| {x,y} | match must occur at least x times, but no more than y times |
| ? | preceding match is optional or one only, same as {0,1} |
| * | find 0 or more of preceding match, same as {0,} |
| + | find 1 or more of preceding match, same as {1,} |
| ^ | match the beginning of the line |
| $ | match the end of a line |
Detailed descriptions of regex operators
Within these descriptions, x is used as a placeholder for
examples - x can be an actual x or it can be an entire sequence like
href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).
. - Matches any one character except newline and is generally
used with quantifiers, which will be explained below. For instance,
.{3} would match any three consecutive characters, whether or not
they are letters.
x - Matches any instance of x and can include specific character
sets or ranges, for instance, [wxyz] would match any instance of w,
x, y, or z, but not wz, yx, or other combinations of the given
character set, unless it was followed by a quantifier.
[^x] - Matches any character that is not x and can also be used
with a range. The caret must be the first character inside the
brackets. For example, <[^abel]+> would match one or more
characters that are not a, b, e, or l, and which are surrounded by <
and >, thus it would match <font> but not
<table>.
[x] - Matches any character in the given range. Examples of a
range would be the expression [0-9], which would find a single
digit, or [a-z], which would find a single lower case character. You
can combine ranges as well - [A-Za-z0-9] will find a single upper or
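lower case character or digit. You may also place ranges side by
side to skip values, such as [0-35-8], which would find any digit
that isn't 4 or 9. Don't separate the ranges with commas or spaces,
because inside brackets those characters are treated as literals to
match.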
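| - The OR operator can be used at the character level or
combined in sequences. (x|y) will find instances of x or y and you
aren't limited to just two objects - (w|x|y|z) is perfectly
valid. Use parentheses rather than square brackets here, because
inside brackets the pipe is just another literal character.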
() - Parentheses are used to group operators much like basic
algebra and are also used to delineate a backreference, which is the
way you can do replaces with matches. (Backreferences get their own
section below). A simple example would look something like:
www\.([a-z]+)\.com which will find
www.anycharactersathroughzhere.com.
{} - Curly brackets (or braces) are used to define numeric
quantifiers, which allow you to specify the optional, minimum, or
maximum number of occurrences in the match. x{3} would find exactly
3 occurrences of x. x{3,} matches on at least 3 occurrences
of x. x{3,5} matches at least 3 occurrences of x and no more than
5.
? - The preceding match is optional: it may occur once or not at
all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a
path to an image file ending in .jpg and could start with a ../ or
just a /. A path starting with ./ or ../../ would not match that
expression in full, although an unanchored search could still match
the part of it that begins at the last ../ or /.
* - Matches the preceding character or group 0 or more times.
Note that this is not the same as the use of the ? listed above. z*
can match no z, z, or for those readers who have already fallen
asleep, zzzzzzzzzzzzzzzzzzzzzzz.
+ - Matches the preceding character or group 1 or more times. In
comparison to the previous example, z+ would have to match at least
z or zz or zzz and so on.
^ - Used to force a match to the beginning of a line. Note that
this is not the same as a character exclusion such as [^xyz], which
would match any characters that are not x, y, or z. ^Hello would
match at the beginning of a line such as Hello Chris and
would not match Chris said Hello.
$ - Used to force a match to the end of a line. The $ goes after
the text it anchors, so end$ would match at the end of a line such
as This is the end and would not match end this article already!
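
The operator descriptions above can be tried out with a short script. This is only a sketch in Python's re module (an assumption, since the article itself is not tied to one language), and the sample strings are made up for illustration. Note that it writes the digit class as [0-35-8], the alternation as cat|dog, and the end anchor as end$, which are the forms Python expects.

```python
import re

# Any character: .{3} takes three characters at a time, letters or not.
chunks = re.findall(r".{3}", "ab 1!?")

# Negated class: characters that are not a, b, e, or l, between < and >.
tags = re.findall(r"<[^abel]+>", "<font> <table> <div>")

# Adjacent ranges inside one class: any digit except 4 and 9.
digits = re.findall(r"[0-35-8]", "0123456789")

# Alternation with | picks one alternative or the other.
pets = re.findall(r"cat|dog", "cat dog bird")

# Numeric quantifier: at least 3 and at most 5 occurrences of x.
runs = re.findall(r"x{3,5}", "xx xxx xxxxxxx")

# Optional group: fullmatch() requires the whole string to fit the pattern.
image = re.compile(r"(\.\.)?/[a-z]+\.jpg")
paths = ["/logo.jpg", "../logo.jpg", "./logo.jpg", "../../logo.jpg"]
whole = [bool(image.fullmatch(p)) for p in paths]

# Anchors: ^ pins the match to the start of a line, $ to the end.
starts = [bool(re.search(r"^Hello", s))
          for s in ("Hello Chris", "Chris said Hello")]
ends = [bool(re.search(r"end$", s))
        for s in ("This is the end", "end this article already!")]

print(chunks)   # ['ab ', '1!?']
print(tags)     # ['<font>', '<div>']
print(digits)   # ['0', '1', '2', '3', '5', '6', '7', '8']
print(pets)     # ['cat', 'dog']
print(runs)     # ['xxx', 'xxxxx']
print(whole)    # [True, True, False, False]
print(starts, ends)
```

Using fullmatch() for the image path is a design choice for this sketch: a plain search() would still find /logo.jpg inside ./logo.jpg, so the whole-string check is what makes the last two paths fail.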
The various operators and metacharacters listed above are pretty
standard across most implementations of regex. POSIX class names and
character class shorthands are shortcuts to specify character types
like digits, whitespace, and so on.
POSIX (Portable Operating System Interface) classes should be
more consistent across languages and applications but there may not
be an exact parallel between certain class shorthands and POSIX
classes, and either class type may not always be fully supported. If
they are supported, POSIX classes can be useful since they have a
little more precision when it comes to things like whitespace and
other non-alphanumeric characters.
| POSIX Class | Match |
| [:alnum:] | alphabetic and numeric characters |
| [:alpha:] | alphabetic characters |
| [:blank:] | space and tab |
| [:cntrl:] | control characters |
| [:digit:] | digits |
| [:graph:] | non-blank (not spaces and control characters) |
| [:lower:] | lowercase alphabetic characters |
| [:print:] | any printable characters |
| [:punct:] | punctuation characters |
| [:space:] | all whitespace characters (includes [:blank:], newline, carriage return) |
| [:upper:] | uppercase alphabetic characters |
| [:xdigit:] | digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F) |
| Character class | Match |
| \d | matches a digit, same as [0-9] |
| \D | matches a non-digit, same as [^0-9] |
| \s | matches a whitespace character (space, tab, newline, etc.) |
| \S | matches a non-whitespace character |
| \w | matches a word character (a letter, digit, or underscore) |
| \W | matches a non-word character |
| \b | matches a word boundary (NOTE: within a class, matches a backspace) |
| \B | matches a non-word-boundary |
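
Here is a small sketch of the shorthand classes at work, again assuming Python's re module and a made-up sample line. Only the backslash shorthands are shown because, as far as I know, Python's re module does not accept the POSIX bracket names such as [:digit:]. This is one of the support gaps mentioned above.

```python
import re

# Hypothetical sample line, invented for this example.
line = "Order 66 shipped to bay_7"

digits = re.findall(r"\d+", line)    # runs of digits
words = re.findall(r"\w+", line)     # runs of letters, digits, underscores
fields = re.split(r"\s+", "a  b\tc\nd")   # split on any run of whitespace

# \b only matches where a word character meets a non-word character,
# so "bay" inside "bay_7" or "embay" is skipped.
whole_words = re.findall(r"\bbay\b", "bay bay_7 embay bay.")

print(digits)        # ['66', '7']
print(words)         # ['Order', '66', 'shipped', 'to', 'bay_7']
print(fields)        # ['a', 'b', 'c', 'd']
print(whole_words)   # ['bay', 'bay']
```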
Think dif{2}erently
Many Macintosh applications can easily handle regular
expressions, but that's not what I mean here. The philosophy of
regex is one of surgical precision and extreme logic, and you have
to play by the rules. Like doing a complex database query, you have
to know exactly what you want and exactly how to get it or you'll
end up with either way more data than you need or not enough
information. The concepts of AND, OR, wildcards, and the liberal use
of parentheses all come into play with regex. You have to carefully
create an expression that meets your needs but is neither too
restrictive nor too inclusive or the dark side of regular
expressions will rear its ugly head.
A warning about "greediness"
With true power comes an unhealthy dose of greed. Regular
expressions are very greedy. They may seem nice and friendly, but
they'll take all they can get. What this means is that quantifiers
such as * and + will match as much as they can, rather than stopping
at the earliest possible match. A regex assumes you want the "whole
thing", which is why you need to create a surgical strike of an
expression. You can take care of a broken toe by amputating
above the knee, but then where does that leave you? (Hopping mad,
probably).
A great example of regex greediness is the expression:
<a href=".*">.*</a>
At first glance, it appears this expression will find an anchor tag
(having no extra attributes) with an href containing just about
any URL, followed by ">, then anything in the link text, then the
closing </a>. You could use this to get a list of all the
links in a web page. Sounds useful and looks mostly harmless, right?
What you end up with is something like this:
<a href="http://sample.url.here">Click this!</a>. Some text goes
<a href="../text.htm">here</a>. Maybe several paragraphs go here.
More text goes <a href="/less/is/more.htm">here</a>. Another big
block of text, text, and more text. <a href="end.htm">The End</a>
The reason you get a whole block of text mixed with links as a
single match instead of a simple list containing each link is
because the sub-expression .* is where the greed kicks in. The .*
really does mean "match anything" so it merrily goes along
until it can't match anything else, which matches up to the very
last </a> it can find and grabs everything in between along
the way. It started at the toe and went straight to the thigh,
without even thinking about slowing down at the knee.
Here's where we put a splint on the toe instead of amputating the
whole leg. Break down the parts of this expression:
<a href="[^"]+">[^<]+</a>
You start with the <a href=" and then you see [^"]+">. If
you've been following along with the rules, you know that this means
find at least one of any character except a double-quote,
then find the first instance of a double-quote, then a
>. The same principle applies to the next part -
[^<]+</a> finds at least one of any character except a
<, then matches the first literal instance of </a>. Search
with this expression and you get a nice short list of complete href
tags. Conquer the greed! A clear understanding of the rules of regex
and the various operators is paramount and it will take patience as
well as experimentation with your logic to learn to tune an
expression to yield exactly what you need.
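
To see the difference between the two expressions, here is a minimal sketch in Python's re module (the language is an assumption for this example). The sample HTML is a shortened, single-line version of the text shown earlier. The greedy pattern returns one oversized match, while the pattern built on negated classes returns one match per link.

```python
import re

# Shortened, single-line version of the sample text from this section.
html = (
    '<a href="http://sample.url.here">Click this!</a>. Some text goes '
    '<a href="../text.htm">here</a>. More text goes '
    '<a href="/less/is/more.htm">here</a>. '
    '<a href="end.htm">The End</a>'
)

# Greedy: .* runs to the last "> and the last </a> it can reach.
greedy = re.findall(r'<a href=".*">.*</a>', html)

# Precise: each negated class stops at the first quote or first <.
precise = re.findall(r'<a href="[^"]+">[^<]+</a>', html)

print(len(greedy))    # 1
print(len(precise))   # 4
for link in precise:
    print(link)
```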
Backreferences
Using a backreference is how you finally get to witness the real
power of regular expressions. Extracting a list of links from a page
of source is useful, but nowhere near as useful as being able to do
something with that data. Parentheses can be used to "remember" a
subexpression, and a backreference in the form of \digit is
how you refer to that particular group. Groups are numbered by their
opening parenthesis, counting from left to right within the
expression, so the first group has a backreference of \1, the second
has a backreference of \2, and so on. You can use the memory-like
functionality of a backreference in a replace string.
A good example of this uses the href expression from above. You
can get a list of complete hrefs from some source with the
expression <a href="[^"]+">[^<]+</a>. Let's say you
need to find all external links on a web site and remove the href
tag, but leave the link text intact, and we'll assume for this
example that none of your local links start with http://. You would
add parentheses to your expression like this:
<a href="http://[^"]+">([^<]+)</a>
You would then perform a find with this expression and simply
replace with \1. The parentheses "memorize" the link text and the \1
calls it into the replace, leaving you with just the link text, e.g.
some text about <a href="http://www.evolt.org">evolt</a>
results in some text about evolt.
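
The find-and-replace just described might look like this in Python. This is only a sketch, assuming the re module, and the local link in the sample string is made up to show that it is left alone.

```python
import re

# The evolt link comes from the example above; the local link is hypothetical.
source = ('some text about <a href="http://www.evolt.org">evolt</a> '
          'and <a href="/local.htm">a local page</a>')

# Group 1 captures the link text of external (http://) links only.
external_link = r'<a href="http://[^"]+">([^<]+)</a>'

# Replacing the whole match with \1 keeps just the captured link text.
result = re.sub(external_link, r"\1", source)

print(result)
# some text about evolt and <a href="/local.htm">a local page</a>
```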
A more interesting example might be a transposition using more
than one backreference. Pretend you have a text list of web site
users in the form of LastName, FirstName and you want a list of
names in a FirstName LastName format. The expression ([^,]+),\s(.+)
would find Spruck, Chris, since ([^,]+), matches any number of
characters that aren't a comma, followed by a comma, then \s matches
a space, then (.+) finds any number of characters again. Notice where
I placed both sets of parentheses. To change Spruck, Chris to the
preferred format, you would replace that with \2 \1 (the second
group, a literal space, then the first group), yielding Chris Spruck.
Use a literal space in the replace string, since many tools do not
interpret \s there.
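
A sketch of the transposition in Python's re module follows (an assumption, as is the second name in the list). The replacement string uses a literal space between the two backreferences. To my knowledge, recent versions of Python reject \s in a replacement string, so it is avoided here.

```python
import re

# "Doe, Jane" is a made-up entry added for illustration.
names = ["Spruck, Chris", "Doe, Jane"]

# Group 1 is the last name, group 2 is the first name.
last_first = re.compile(r"([^,]+),\s(.+)")

# Swap the groups, separated by a literal space.
swapped = [last_first.sub(r"\2 \1", name) for name in names]

print(swapped)   # ['Chris Spruck', 'Jane Doe']
```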
When you're doing replaces, it's very important that you test
your expressions on backup copies of files, or even a dummy test
file of your own creation, so if your expression is off by a
parenthesis or something else, you haven't ruined your files
permanently. Once you know your expression works on a sample, then
go ahead and work on all your files. If you do run an expression
that gives you unintended results, you can probably run another one
to correct the mishap. Don't ask how I know this.
Sometimes it may also be useful to run more than one expression
over the same set of data to make it easier to catch every last bit
that you need with a second expression. For instance, you might want
to add quotes to all your tag attributes if some are unquoted, then
run another expression that somehow operates based on the quotes.
A few practical examples
Get a list of IP addresses from a server log:
(\d{1,3}\.){3}\d{1,3} - This expression will find three instances
of a one to three digit number followed by a period, then one to
three more digits, e.g. 206.159.10.1
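
A minimal sketch of this extraction in Python's re module is shown here. The language and the two log lines are assumptions made up for the example. It uses finditer() and group(0) to get each whole match, because findall() would return only the text captured by the parenthesized group.

```python
import re

# Hypothetical log lines; only the first address comes from the article.
log = (
    '206.159.10.1 - - "GET /index.html" 200\n'
    '10.0.0.25 - - "GET /about.html" 404\n'
)

ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# group(0) is the entire match, regardless of the inner group.
ips = [m.group(0) for m in ip_pattern.finditer(log)]

print(ips)   # ['206.159.10.1', '10.0.0.25']
```

Keep in mind that the expression only checks the shape of the address, so something like 999.999.999.999 would also be found.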
Find doubled words in text such as "Rate this article high high,
please!":
\s([A-Za-z]+)\s\1 - This expression will match a space, followed
by a word of any length (which is captured by the parentheses for
later recall as a backreference), then a space again. The
backreference, \1, then picks up the second instance of the same
word. You could then replace the match with a space followed by \1,
which removes the second instance of the word while keeping the
space in front of the first.
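
Here is how the doubled-word cleanup might be sketched in Python's re module (an assumption). The replacement is a space followed by \1, so the space in front of the word is preserved. The second pattern, with a trailing \b, is my own variation rather than the article's: it avoids treating "the theory" as a doubled "the".

```python
import re

sentence = "Rate this article high high, please!"

# The match is " high high"; replacing it with " high" drops the repeat.
fixed = re.sub(r"\s([A-Za-z]+)\s\1", r" \1", sentence)

# Variation (not from the article): \b stops \1 from matching the
# beginning of a longer word.
strict = re.compile(r"\s([A-Za-z]+)\s\1\b")
untouched = strict.sub(r" \1", "in the theory")

print(fixed)       # Rate this article high, please!
print(untouched)   # in the theory
```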
Remove FONT tags from your web pages:
<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]*>([^<]+)(</FONT>|</font>)
and replace with the backreference \6 - This expression looks quite
complicated, but I wanted to show an example with some more involved
logic. A simpler example that finds the same string will follow this
explanation. <(FONT|font) accounts for an upper or lower case
tag. ([ ]([a-zA-Z]+)= matches a space followed by any attribute name
and an =. The next subexpression, ("|')[^"\']+("|'), finds the
leading double or single quote on the attribute, then any
attribute value that contains no double or single quote, i.e. Arial,
+5, #c3d4ff, etc., then the closing double or single quote. Notice
that the subexpression for the entire attribute is enclosed in
parentheses and followed by an asterisk -
([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*. This allows you to find a tag with
either no attributes or any number that may exist. [^>]*> then
matches anything else up to the first > (similar to the "greediness"
example above). The asterisk matters here, because [^>]+ would demand
at least one more character before the > and a bare <font> tag would
never match. The captured text is defined next as ([^<]+),
which will capture any text between the opening and closing font
tags, and is referred to as \6 because it's the sixth parenthetical
group in the entire expression. Then (</FONT>|</font>)
accounts for the closing font tag in either case.
<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>)
is a simpler example that accomplishes the same thing as the
expression explained above. Here the text between the tags is the
second parenthetical group, so you would replace with \2. The
difference is that it is much less picky about what is inside the
opening font tag, so if you have inconsistent tag syntax, it will
probably capture the various instances you may have. On the other
hand, if you have any extra junk characters in your search data, you
may catch things you didn't intend, which is why you should test
your expressions ahead of time.
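
Both font-tag expressions can be compared side by side with a sketch like the following, which assumes Python's re module and a made-up snippet of HTML. Two details are choices made for this sketch: the detailed pattern uses [^>]* before the closing > so that a bare <FONT> tag also matches, and the simple pattern wraps the text between the tags in parentheses so it can be recalled as \2.

```python
import re

# Hypothetical page fragment with one attributed and one bare font tag.
page = ('<p><font face="Arial" size="+1">Hello</font> '
        'and <FONT>plain</FONT></p>')

# Detailed version: the text between the tags is the sixth group.
detailed = (r"""<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*"""
            r"""[^>]*>([^<]+)(</FONT>|</font>)""")

# Simpler version: the text between the tags is the second group.
simple = r"<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>)"

cleaned_detailed = re.sub(detailed, r"\6", page)
cleaned_simple = re.sub(simple, r"\2", page)

print(cleaned_detailed)   # <p>Hello and plain</p>
print(cleaned_simple)     # <p>Hello and plain</p>
```

As the article advises, a run like this on a small test string is a cheap way to confirm the group numbering before touching real files.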
A brief history of the 31 Flavors
There are a number of applications and languages that support
regular expressions, but unfortunately, not all of them support
regex in quite the same way. Although regular expressions had their
origins in neurophysiology in the 1940s and were developed by
theoretical mathematicians in the 1950s and 1960s, the evolution and
subsequent divergence of regex implementations was due to the
independent development of various Unix tools such as grep, awk,
sed, Emacs, and others. [1]
Today, it's probably safe to say that Perl has the most robust
regex engine in common use. Other languages and applications that
have some form of regex support or pattern matching (and this by no
means is a complete list) include: JavaScript, VBScript, PHP,
Python, Tcl, Java, C, Macromedia Dreamweaver/Ultradev, ColdFusion
and ColdFusion Studio, BBEdit, NoteTabPro, TextPad, UltraEdit, the
XML Schema and XPath Recommendations, the various Unix tools used
for text processing and their clones, and just about any modern
application with a Find function.
Conclusion
Regular expressions are a powerful tool to keep in your web belt.
They can appear daunting, but by learning a few simple rules, you
can save yourself hours of doing manual find-and-replaces
the slow, boring way.
I'll close with what may be the world's first (and undoubtedly
the world's worst) regular expression joke:
What did one regex say to the other?
.+
Other Resources
[1] Mastering Regular Expressions - Friedl
http://www.regexlib.com/
www.webreference.com/js/column5/
All the regular expressions in this article were tested using
ColdFusion Studio 4.5.2, so you may encounter slight differences in
different applications or languages. Thanks to Sean Palmer for
some expression testing.
Chris' favorite regular expression is a smirk
with an optional wink. He lives in Atlanta, Georgia and dreams
of being back on the coast. He probably needs more info in his
bio. (He almost never refers to himself in the third person.)
Suggestions can be sent to sprocket@members.evolt.org.