03/22/2002
Site Development

Regular Expression Basics

By Chris Spruck (sprocket)

Regular expressions, sometimes referred to as regex, grep, or pattern matching, can be a very powerful tool and a tremendous time-saver with a broad range of applications. As an extended form of find-and-replace, a regular expression lets you do things such as perform client-side validation of email addresses and phone numbers, search multiple documents for strings and patterns you wish to change or remove, or extract a list of links from source code. Regex is supported by most languages and tools, but because implementations vary, this article will cover basic principles that are commonly used.

Literals and Metacharacters

If you've seen a regular expression before and thought it looked like alien space-algebra, it does, but have no fear - you'll be fluent in alien space-algebra in no time! To make the most of the power of regex, you need to be familiar with a few classifications of characters. Literals are normal text characters and can include whitespace (tabs, spaces, newlines, etc.). Unless modified by a metacharacter, a literal will match itself on a one-for-one basis. Metacharacters are characters with a special meaning, and their power lies in how they are arranged and interpreted as wildcards, quantifiers, and groups. A metacharacter can be escaped with a backslash (\) to match a literal instance of itself, for instance, if you need to find a caret (^) or a backslash. Metacharacters can also be combined and nested in groups.
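As a quick illustration of escaping, here is a minimal sketch using Python's re module. The choice of Python and the sample strings are assumptions made for illustration; the article itself is tool-neutral. It shows an unescaped period acting as a wildcard, and escaped metacharacters matching only themselves.

```python
import re

# An unescaped period is a wildcard, so this pattern also matches
# text that has other characters where the dots should be.
loose = re.search(r"www.evolt.org", "wwwXevoltYorg")

# Escaping each period makes it match only a literal period.
strict = re.search(r"www\.evolt\.org", "wwwXevoltYorg")
exact = re.search(r"www\.evolt\.org", "visit www.evolt.org today")

# A caret and a backslash are found the same way, by escaping them.
carets = re.findall(r"\^", "2^10 and 3^4")
backslashes = re.findall(r"\\", r"C:\temp\file.txt")

print(loose is not None, strict is None, exact.group(0))
print(carets, len(backslashes))
```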

Below is a list of some metacharacters and character classes for a quick glance - each will be explained in further detail with examples. Keep in mind that a "match" can be as simple as a single character or as complex as a sequence of literals and metacharacters in nested and compounded combinations.

Metacharacter	Match
\	the escape character - used to find an instance of a metacharacter like a period, brackets, etc.
. (period)	match any character except newline
x	match any instance of x
[^x]	match any character except x
[x]	match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z
| (pipe)	an OR operator - (x|y) will match an instance of x or y
()	used to group sequences of characters or matches
{}	used to define numeric quantifiers
{x}	match must occur exactly x times
{x,}	match must occur at least x times
{x,y}	match must occur at least x times, but no more than y times
?	preceding match is optional or one only, same as {0,1}
*	find 0 or more of preceding match, same as {0,}
+	find 1 or more of preceding match, same as {1,}
^	match the beginning of the line
$	match the end of a line

Detailed descriptions of regex operators

Within these descriptions, x is used as a placeholder for examples - x can be an actual x or it can be an entire sequence like href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).

. - Matches any one character except newline and is generally used with quantifiers, which will be explained below. For instance, .{3} would find any three consecutive characters, such as a three-letter word.

x - Matches any instance of x and can include specific character sets or ranges, for instance, [wxyz] would match any instance of w, x, y, or z, but not wz, yx, or other combinations of the given character set, unless it was followed by a quantifier.

[^x] - Matches any character that is not x and can also be used with a range. The caret must be the first character inside the brackets. For example, <[^abel]+> would match one or more characters that are not a, b, e, or l, and which are surrounded by < and >, thus it would match <font> but not <table>.

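[x] - Matches any character in the given range. Examples of a range would be the expression [0-9], which would find a single digit, or [a-z], which would find a single lower case character. You can combine ranges as well - [A-Za-z0-9] will find a single upper or lower case character or digit. Ranges are combined simply by writing them next to each other with no separator, such as [0-35-8], which would find any digit that isn't 4 or 9. Don't separate ranges with commas or spaces, as in [0-3, 5-8], because in most implementations the comma and the space are then treated as literal characters in the set.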

() - Parentheses are used to group operators much like basic algebra and are also used to delineate a backreference, which is the way you can do replaces with matches. (Backreferences get their own section below). A simple example would look something like: www\.([a-z]+)\.com which will find www.anycharactersathroughzhere.com.

{} - Curly brackets (or braces) are used to define numeric quantifiers, which allow you to specify the optional, minimum, or maximum number of occurrences in the match. x{3} would find exactly 3 occurrences of x. x{3,} matches on at least 3 occurrences of x. x{3,5} matches at least 3 occurrences of x and no more than 5.

? - The preceding match is optional: it may occur once or not at all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a path to an image file ending in .jpg and could start with a ../ or just a /. A path beginning with ./ or ../../ would not match that expression as a whole, although an unanchored search would still find the part of the path that does fit.

* - Matches the preceding character or group 0 or more times. Note that this is not the same as the use of the ? listed above. z* can match no z, z, or for those readers who have already fallen asleep, zzzzzzzzzzzzzzzzzzzzzzz.

+ - Matches the preceding character or group 1 or more times. In comparison to the previous example, z+ would have to match at least z or zz or zzz and so on.

^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any characters that are not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello.

$ - Used to force a match to the end of a line. The $ goes after the text it anchors: end$ would match at the end of a line such as This is the end and would not match end this article already!
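The operators described above can be tried out in a few lines. What follows is a minimal sketch using Python's re module, which is an assumption on my part (the article's expressions were not written for any particular language), and every sample string except the ones quoted from the descriptions is made up for illustration. fullmatch tests a whole string against an expression, while search and findall look for matches anywhere inside it.

```python
import re

# Character sets and ranges match one character at a time.
digits = re.findall(r"[0-9]", "a1b22")
not_4_or_9 = re.findall(r"[0-35-8]", "0123456789")
# A comma or space inside the brackets is just another member of the set.
stray = re.findall(r"[0-3, 5-8]", "4, 9")
# A caret directly after the opening bracket negates the set.
font_ok = re.fullmatch(r"<[^abel]+>", "<font>") is not None
table_ok = re.fullmatch(r"<[^abel]+>", "<table>") is not None

# Numeric quantifiers, tested against whole strings.
samples = ["xx", "xxx", "xxxx", "xxxxxx"]
exactly_3 = [s for s in samples if re.fullmatch(r"x{3}", s)]
at_least_3 = [s for s in samples if re.fullmatch(r"x{3,}", s)]
three_to_five = [s for s in samples if re.fullmatch(r"x{3,5}", s)]

# z* is satisfied by nothing at all; z+ needs at least one z.
star_empty = re.fullmatch(r"z*", "") is not None
plus_empty = re.fullmatch(r"z+", "") is not None

# ? makes the leading ".." optional in the image-path expression.
path = re.compile(r"((\.\.)?/[a-z]+\.jpg)")
candidates = ["../photo.jpg", "/photo.jpg", "./photo.jpg", "../../photo.jpg"]
whole = [p for p in candidates if path.fullmatch(p)]
# An unanchored search still finds the part of a longer path that fits.
partial = path.search("../../photo.jpg").group(0)

# ^ and $ pin a match to the start or end.
starts = re.search(r"^Hello", "Hello Chris") is not None
not_start = re.search(r"^Hello", "Chris said Hello") is not None
ends = re.search(r"end$", "This is the end") is not None
not_end = re.search(r"end$", "end this article already!") is not None

# In Python, ^ and $ apply to the whole string unless re.MULTILINE is set.
lines = "Hello Chris\nChris said Hello\nHello again"
per_line = re.findall(r"^Hello", lines, flags=re.MULTILINE)

print(not_4_or_9, stray, whole, partial, per_line)
```

Other tools differ on details such as whether ^ and $ work per line by default, so it is worth repeating a test like this in whatever application you actually use.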

The various operators and metacharacters listed above are pretty standard across most implementations of regex. POSIX class names and character class shorthands are shortcuts to specify character types like digits, whitespace, and so on.

POSIX (Portable Operating System Interface) classes should be more consistent across languages and applications but there may not be an exact parallel between certain class shorthands and POSIX classes, and either class type may not always be fully supported. If they are supported, POSIX classes can be useful since they have a little more precision when it comes to things like whitespace and other non-alphanumeric characters. A POSIX class is written inside a bracketed set, so one or more digits would be [[:digit:]]+.

POSIX Class	Match
[:alnum:]	alphabetic and numeric characters
[:alpha:]	alphabetic characters
[:blank:]	space and tab
[:cntrl:]	control characters
[:digit:]	digits
[:graph:]	visible characters (not spaces or control characters)
[:lower:]	lowercase alphabetic characters
[:print:]	any printable characters
[:punct:]	punctuation characters
[:space:]	all whitespace characters (includes [:blank:], newline, carriage return)
[:upper:]	uppercase alphabetic characters
[:xdigit:]	digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F)

Character class	Match
\d	matches a digit, same as [0-9]
\D	matches a non-digit, same as [^0-9]
\s	matches a whitespace character (space, tab, newline, etc.)
\S	matches a non-whitespace character
\w	matches a word character (typically letters, digits, and underscore)
\W	matches a non-word character
\b	matches a word boundary (NOTE: within a class, matches a backspace)
\B	matches a non-word-boundary
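Here is a small sketch of the shorthand classes in use, again assuming Python's re module and made-up sample strings. As far as I know, Python's standard re module does not implement the POSIX bracket classes from the first table, so only the backslash shorthands appear here.

```python
import re

# \d+ collects runs of digits.
numbers = re.findall(r"\d+", "Call 555-1234 ext. 42")

# \s+ treats any run of spaces, tabs, or newlines as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")

# \w+ collects runs of word characters; the underscore counts as one.
words = re.findall(r"\w+", "web_belt, regex!")

# \b keeps "cat" from matching inside "concat" or "cats".
cats = re.findall(r"\bcat\b", "cat concat cats cat.")

print(numbers, fields, words, cats)
```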

Think dif{2}erently

Many Macintosh applications can easily handle regular expressions, but that's not what I mean here. The philosophy of regex is one of surgical precision and extreme logic, and you have to play by the rules. Like doing a complex database query, you have to know exactly what you want and exactly how to get it or you'll end up with either way more data than you need or not enough information. The concepts of AND, OR, wildcards, and the liberal use of parentheses all come into play with regex. You have to carefully create an expression that meets your needs but is neither too restrictive nor too inclusive or the dark side of regular expressions will rear its ugly head.

A warning about "greediness"

With true power comes an unhealthy dose of greed. Regular expressions are very greedy. They may seem nice and friendly, but they'll take all they can get. What this means is that a quantifier such as * or + will try to match as much as it can, rather than stopping at the earliest possible match. It assumes you want the "whole thing", which is why you need to create a surgical strike of an expression. You can take care of a broken toe by amputating above the knee, but then where does that leave you? (Hopping mad, probably).

A great example of regex greediness is the expression:

<a href=".*">.*</a>

At first glance, it appears this expression will find an href tag (having no extra attributes) with a reference containing just about any URL, followed by ">, then anything in the link text, then the closing </a>. You could use this to get a list of all the links in a web page. Sounds useful and looks mostly harmless, right? What you end up with is something like this:

<a href="http://sample.url.here">Click this!</a>. Some text goes <a href="../text.htm">here</a>. Maybe several paragraphs go here. More text goes <a href="/less/is/more.htm">here</a>. Another big block of text, text, and more text. <a href="end.htm">The End</a>

The reason you get a whole block of text mixed with links as a single match instead of a simple list containing each link is because the sub-expression .* is where the greed kicks in. The .* really does mean "match anything" so it merrily goes along until it can't match anything else, which matches up to the very last </a> it can find and grabs everything in between along the way. It started at the toe and went straight to the thigh, without even thinking about slowing down at the knee.

Here's where we put a splint on the toe instead of amputating the whole leg. Break down the parts of this expression:

<a href="[^"]+">[^<]+</a>

You start with the <a href=" and then you see [^"]+">. If you've been following along with the rules, you know that this means find at least one of any character except a double-quote, then find the first instance of a double-quote, then a >. The same principle applies to the next part - [^<]+</a> finds at least one of any character except a <, then matches the first literal instance of </a>. Search with this expression and you get a nice short list of complete href tags. Conquer the greed! A clear understanding of the rules of regex and the various operators is paramount and it will take patience as well as experimentation with your logic to learn to tune an expression to yield exactly what you need.
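The difference between the two expressions is easy to see side by side. This is a minimal sketch assuming Python's re module, run against a shortened version of the sample text above. The third expression uses the non-greedy .*? form, which the article does not cover and which is not available in every tool; it is included only as a point of comparison.

```python
import re

html = (
    '<a href="http://sample.url.here">Click this!</a>. Some text goes '
    '<a href="../text.htm">here</a>. More text goes '
    '<a href="end.htm">The End</a>'
)

# Greedy: .* runs to the last </a>, so everything comes back as one match.
greedy = re.findall(r'<a href=".*">.*</a>', html)

# Negated classes stop at the first quote and the first <.
precise = re.findall(r'<a href="[^"]+">[^<]+</a>', html)

# Non-greedy quantifiers (an extra, not from the article) also stop early.
lazy = re.findall(r'<a href=".*?">.*?</a>', html)

print(len(greedy), len(precise), len(lazy))
for link in precise:
    print(link)
```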

Backreferences

Using a backreference is how you finally get to witness the real power of regular expressions. Extracting a list of links from a page of source is useful, but nowhere near as useful as being able to do something with that data. Parentheses can be used to "remember" a subexpression, and a backreference in the form of \digit is how you refer to that particular group. Groups are numbered by the position of their opening parenthesis, counted from left to right within the expression, so the group that opens first has a backreference of \1, the second has a backreference of \2, and so on. You can use the memory-like functionality of a backreference in a replace string.

A good example of this uses the href expression from above. You can get a list of complete hrefs from some source with the expression <a href="[^"]+">[^<]+</a>. Let's say you need to find all external links on a web site and remove the href tag, but leave the link text intact, and we'll assume for this example that none of your local links start with http://. You would add parentheses to your expression like this:

<a href="http://[^"]+">([^<]+)</a>

You would then perform a find with this expression and simply replace with \1. The parentheses "memorize" the link text and the \1 calls it into the replace, leaving you with just the link text e.g. some text about <a href="http://www.evolt.org">evolt</a> results in some text about evolt.
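As a sketch of that find-and-replace, assuming Python's re module (where re.sub takes the expression, the replace string, and the text), the external link loses its tag while a local link is left alone. The second link in the sample text is made up for illustration.

```python
import re

text = (
    'some text about <a href="http://www.evolt.org">evolt</a> '
    'and <a href="/local.htm">a local page</a>'
)

# The parentheses capture the link text; \1 puts it back on its own.
pattern = r'<a href="http://[^"]+">([^<]+)</a>'
result = re.sub(pattern, r"\1", text)

print(result)
```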

A more interesting example might be a transposition using more than one backreference. Pretend you have a text list of web site users in the form of LastName, FirstName and you want a list of names in a FirstName LastName format. The expression ([^,]+),\s(.+) would find Spruck, Chris, since ([^,]+), matches any number of characters that aren't a comma, followed by a comma, then \s matches the space, then (.+) finds any number of characters again. Notice where I placed both sets of parentheses. To change Spruck, Chris to the preferred format, you would replace that with \2 \1 (the two backreferences separated by a literal space, since shorthand classes such as \s generally have no meaning in a replace string), yielding Chris Spruck.
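Here is the transposition as a minimal sketch, assuming Python's re module. The second name is a made-up sample, and the replace string uses a literal space between the two backreferences.

```python
import re

# "Doe, Jane" is a hypothetical entry added for illustration.
names = ["Spruck, Chris", "Doe, Jane"]

# Group 1 is everything before the comma, group 2 everything after the space.
pattern = re.compile(r"([^,]+),\s(.+)")
flipped = [pattern.sub(r"\2 \1", name) for name in names]

print(flipped)
```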

When you're doing replaces, it's very important that you test your expressions on backup copies of files, or even a dummy test file of your own creation, so if your expression is off by a parenthesis or something else, you haven't ruined your files permanently. Once you know your expression works on a sample, then go ahead and work on all your files. If you do run an expression that gives you unintended results, you can probably run another one to correct the mishap. Don't ask how I know this.

Sometimes it may also be useful to run more than one expression over the same set of data to make it easier to catch every last bit that you need with a second expression. For instance, you might want to add quotes to all your tag attributes if some are unquoted, then run another expression that somehow operates based on the quotes.

A few practical examples

Get a list of IP addresses from a server log:

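(\d{1,3}\.){3}\d{1,3} - This expression will find three instances of a one to three digit number followed by a period, then one to three more digits, e.g. 206.159.10.1. It only checks the shape of the address, not whether each number is 255 or less, so it would also match something like 999.999.999.999.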
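A sketch of pulling addresses out of log lines, assuming Python's re module. The log lines are hypothetical and do not follow any particular server's format.

```python
import re

ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Hypothetical log lines, made up for illustration.
log = [
    '206.159.10.1 - - "GET /index.html" 200',
    '10.0.0.25 - - "GET /about.html" 404',
]

# group(0) is the whole match; the parenthesized group on its own
# would only hold the last "digits plus period" repetition.
found = [m.group(0) for line in log for m in ip.finditer(line)]

# The expression checks the shape of an address, not its numeric range.
shape_only = ip.fullmatch("999.999.999.999") is not None

print(found, shape_only)
```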

Find doubled words in text such as "Rate this article high high, please!":

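\s([A-Za-z]+)\s\1 - This expression will match a space, followed by a word of any length (which is remembered by the parentheses for a backreference), then a space again. The backreference, \1, then picks up the second instance of the same word. You could then replace the match with a space followed by \1, which will remove the second instance of the word while keeping the space in front of the first. Be aware that this expression can also match when the next word merely begins with the first one, as in "the theory", so review the matches before replacing.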
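The doubled-word cleanup can be sketched as follows, assuming Python's re module. The first two calls use the expression above; the last pattern is a variant of my own (not from the article) that adds \b word boundaries so that a word followed by a longer word starting with the same letters is left alone.

```python
import re

text = "Rate this article high high, please!"

# The article's expression; the replace string keeps the leading space.
fixed = re.sub(r"\s([A-Za-z]+)\s\1", r" \1", text)

# Without word boundaries, "the theory" looks like a doubled "the".
false_hit = re.sub(r"\s([A-Za-z]+)\s\1", r" \1", "in the theory")

# Variant with word boundaries (an assumption, not the article's expression).
careful = re.compile(r"\b([A-Za-z]+)\s+\1\b")
safe = careful.sub(r"\1", "in the theory")
deduped = careful.sub(r"\1", text)

print(fixed)
print(repr(false_hit), repr(safe))
```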

Remove FONT tags from your web pages:

<(FONT|font)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]*>([^<]+)(</FONT>|</font>) and replace with the backreference \6 - This expression looks quite complicated, but I wanted to show an example with some more involved logic. A simpler example that finds the same string will follow this explanation. <(FONT|font) accounts for an upper or lower case tag. ([ ]([a-zA-Z]+)= matches a space followed by any attribute name and an =. The next subexpression, ("|')[^"\']+("|'), finds the leading double or single quote on the attribute, then any attribute value that doesn't contain a double or single quote, i.e. Arial, +5, #c3d4ff, etc., then the closing double or single quote. Notice that the subexpression for the entire attribute is enclosed in parentheses and followed by an asterisk - ([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*. This allows you to find a tag with either no attributes or any number that may exist. [^>]*> then matches anything else up to the first > (similar to the "greediness" example above); it uses an asterisk rather than a plus so that a bare <font> tag with nothing before the > still matches. The backreference is defined next as ([^<]+), which will capture any text between the opening and closing font tags, and is referred to as \6 because its opening parenthesis is the sixth in the entire expression. Then (</FONT>|</font>) accounts for the closing font tag in either case.

<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>) is a simpler example that finds the same string as the expression explained above; because the text between the tags is now the second parenthetical group, you would replace with \2 instead of \6. The difference is that it is much less picky about what is inside the opening font tag, so if you have inconsistent tag syntax, it will probably capture the various instances you may have. On the other hand, if you have any extra junk characters in your search data, you may catch things you didn't intend, which is why you should test your expressions ahead of time.
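To see the simpler expression at work, here is a minimal sketch assuming Python's re module. It wraps the text between the tags in parentheses so it can be recalled as \2, and the sample markup is made up for illustration. The last call is a variant of my own that swaps the (FONT|font) alternation for a case-insensitive flag, which also catches mixed-case tags; whether an equivalent option exists depends on the tool.

```python
import re

page = '<font face="Arial" size="+5">Hello</font> and <FONT>world</FONT>'

# Group 1 is the tag name, group 2 the text, group 3 the closing tag.
simple = r"<(FONT|font)[^>]*>([^<]*)(</FONT>|</font>)"
stripped = re.sub(simple, r"\2", page)

# Variant (an assumption, not from the article): ignore case entirely.
any_case = re.sub(
    r"<font[^>]*>([^<]*)</font>",
    r"\1",
    "<Font color='red'>Hi</fOnT>",
    flags=re.IGNORECASE,
)

print(stripped)
print(any_case)
```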

A brief history of the 31 Flavors

There are a number of applications and languages that support regular expressions, but unfortunately, not all of them support regex in quite the same way. Although regular expressions had their origins in neurophysiology in the 1940s and were developed by theoretical mathematicians in the 1950s and 1960s, the evolution and subsequent divergence of regex implementations was due to the independent development of various Unix tools such as grep, awk, sed, Emacs, and others. [1]

Today, it's probably safe to say that Perl has the most robust regex engine in common use. Other languages and applications that have some form of regex support or pattern matching (and this by no means is a complete list) include: JavaScript, VBScript, PHP, Python, Tcl, Java, C, Macromedia Dreamweaver/Ultradev, ColdFusion and ColdFusion Studio, BBEdit, NoteTabPro, TextPad, UltraEdit, the XML Schema and XPath Recommendations, the various Unix tools used for text processing and their clones, and just about any modern application with a Find function.

Conclusion

Regular expressions are a powerful tool to keep in your web belt. They can appear daunting, but by learning a few simple rules, you can save yourself from hours of time doing manual find-and-replaces the slow, boring way.

I'll close with what may be the world's first (and undoubtedly the world's worst) regular expression joke:

What did one regex say to the other?

.+

Other Resources

[1] Mastering Regular Expressions - Friedl
http://www.regexlib.com/
www.webreference.com/js/column5/

All the regular expressions in this article were tested using ColdFusion Studio 4.5.2, so you may encounter slight differences in different applications or languages. Thanks to Sean Palmer for some expression testing.

Chris' favorite regular expression is a smirk with an optional wink. He lives in Atlanta, Georgia and dreams of being back on the coast. He probably needs more info in his bio. (He almost never refers to himself in the third person.) Suggestions can be sent to sprocket@members.evolt.org.

Reader Comments

sweet

taftman wrote on 03/22/2002 at 11:49 AM

Great article, thanks.
- rob

regexes and the amazing vim

jeduthun wrote on 03/22/2002 at 5:40 PM

Included in "various Unix tools used for text processing and their clones" is the text editor vi, and its modern day equivalent, vim (vi improved). I learned most of what I know about regular expressions from using vim (which comes in Linux and Windows flavors, among others). Its documentation has very thorough coverage of regexes and their usage. I have started to use vim at work for almost all my everyday editing, partially because of its strong regex support.

Anyhow, if you want to learn regexes, vim makes a good playground, and you can't beat the cost. Open up a file and then type in

:%s/searchregex/replaceregex/g

to run a search/replace on a whole file using regular expressions. (The % means 'all lines in the file' and the g means 'all instances on each line' -- both parameters are customizable.)

In vim, you can type

:help regex

for the full manual on regular expressions.

call me stupid if you like but...

notabene wrote on 03/24/2002 at 1:46 PM

... I didn't get this:

What did one regex say to the other?
.+

Don't laugh, will you? :-)

"joke" explained

sprocket wrote on 03/24/2002 at 3:06 PM

.+ would match at least any one character, so the content of the punchline is really irrelevant. The fact that the punchline can be anything, due to the expression, is the joke itself. Now that I've made it completely unfunny by explaining it, any other questions? Maybe I should just stick to articles. :-)

Chris

aaaaaaaaaaaaaaaah

notabene wrote on 03/24/2002 at 5:09 PM

He he.

He he he.

more regex jokes

jeduthun wrote on 03/25/2002 at 9:26 AM

Just so Chris can't claim authorship of the world's only regex joke...

What regex are you most likely to see at Christmas?

[^L]

Why couldn't Chris try out the regular expressions he created until he left home?

His mom wouldn't let him play with matches.

please, stop

jesteruk wrote on 03/25/2002 at 8:46 PM

They are the worst jokes I ever heard in my life. Trust me, that's an achievement there, my uncle was a vicar who liked to think of himself as a funny man *shudders at the memories*

Brilliant article mate, clear, concise, well constructed and supported. Well done, keep it up. Just leave out the jokes - please... hehe.

-J
