A.J. Jacobs wrote: “I think there’s something to the idea that the divine dwells more easily in text than in images.” With the explosion in the amount of text, sacred and secular, available to anyone with an Internet connection, the need for text-processing tools has grown. A regular expression is a shorthand text string that describes a search pattern. It can be used to find text that matches a pattern within a larger text, to replace the matching text, or to split the text into pieces at each match. The power of regular expressions for extracting specific text from documents lies in their ability to replace many lines of code with as few as one.
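The three uses just described (find, replace, split) can be sketched in a few lines. This is a minimal illustration using Python's standard `re` module; the sample sentence and the date pattern are made up for the example and do not come from any particular data set.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 shipped on 2024-02-03."

# Hypothetical pattern: four digits, two digits, two digits, joined by dashes.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Find: collect every substring that matches the pattern.
dates = date_pattern.findall(text)

# Replace: substitute each match with a placeholder.
redacted = date_pattern.sub("<date>", text)

# Split: treat each match as a separator between the remaining pieces.
pieces = date_pattern.split(text)

print(dates)     # ['2024-01-15', '2024-02-03']
print(redacted)  # Order 66 shipped on <date>; order 67 shipped on <date>.
print(pieces)    # ['Order 66 shipped on ', '; order 67 shipped on ', '.']
```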
Although there is no single official standard for regular expressions, there are several flavors. The most common is the Perl-compatible flavor, but Java, Python, Ruby and other languages have their own variants.
The power of regular expressions comes at the cost of a steep learning curve, which often frustrates beginners. Fortunately, tools exist for working with regular expressions.
One such tool is RegexBuddy, which runs on Windows XP, Vista, 7, 8, and 8.1 and under VMware, Parallels, and CrossOver Office. It works with the most common regular expression flavors, offers syntax help for building regular expressions, and tests them against sample text, highlighting the matched areas to aid troubleshooting. A replace button makes it easier to add replacement text, and a split button treats the expression as a separator. Areas where there is no match can also be highlighted and examined with a debugging feature.
Other tools include RegexPal, which supports only the JavaScript flavor, and RegexMagic, which is more prescriptive: the user describes the text to be matched, and the tool generates the regular expression.
The components of regular expressions can be classified into several types which may be used in combinations. A summary of these is shown in Table 1 below.
Character classes identify broad types of characters such as word characters, digits, white space, control characters, and the hexadecimal and octal character sets. A character class can often be negated by substituting the upper-case letter. For example, “\d” identifies digits whereas “\D” identifies non-digits. Other character classes are “\w” for word characters (letters, digits, and the underscore) and “\s” for white space. In flavors that support it, “\c” followed by a letter matches a control character in the ASCII table.
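As a small sketch of these shorthand classes, the following uses Python's `re` module on an invented sample string. Python's `re` has no “\c” escape, so control characters are left out of the example.

```python
import re

sample = "Room 42, Floor 7\tEast"

digits = re.findall(r"\d", sample)            # individual digits
non_digit_runs = re.findall(r"\D+", sample)   # runs of anything that is not a digit
words = re.findall(r"\w+", sample)            # runs of letters, digits, underscore
whitespace = re.findall(r"\s", sample)        # spaces and the tab

print(digits)          # ['4', '2', '7']
print(non_digit_runs)  # ['Room ', ', Floor ', '\tEast']
print(words)           # ['Room', '42', 'Floor', '7', 'East']
print(whitespace)      # [' ', ' ', ' ', '\t']
```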
Ranges are needed for two other character classes: hexadecimal and octal characters. A range is written in square brackets as two letters or digits joined by a dash, and it matches any value between them. Thus, for hexadecimals, “[a-fA-F0-9]” is used and, for octals, “0[0-7]”. Groups are enclosed in parentheses. A pipe can be used within a group to identify alternatives, such that (a|b) matches either “a” or “b”.
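The following sketch, again in Python, tries a hexadecimal range, an octal range, and a group with alternatives. The “+” after each bracket (one or more repetitions) and the sample strings are assumptions added so that whole numbers can be tested; they are not part of the patterns quoted above.

```python
import re

# Assumption: "+" is added so the ranges can match multi-digit values.
hex_digits = re.compile(r"[a-fA-F0-9]+")
octal_literal = re.compile(r"0[0-7]+")

is_hex_ok = hex_digits.fullmatch("1aF9") is not None       # True
is_hex_bad = hex_digits.fullmatch("1aG9") is not None      # False, "G" is out of range
is_octal_ok = octal_literal.fullmatch("0755") is not None  # True
is_octal_bad = octal_literal.fullmatch("0788") is not None # False, "8" is out of range

# A group with a pipe: "gr", then either "a" or "e", then "y".
grey = re.compile(r"gr(a|e)y")
spellings = [m.group(0) for m in grey.finditer("gray and grey")]
captured = grey.search("grey").group(1)

print(spellings)  # ['gray', 'grey']
print(captured)   # e
```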
Quantifiers specify how many times the preceding character or group must match.
- An asterisk “*” matches the preceding character zero or more times, a question mark “?” zero or one time, and a plus “+” one or more times.
- A dot “.” is not itself a quantifier but a wildcard that matches any single character (in most flavors, any character except a line break); it is often combined with a quantifier, as in “.*”.
- A number in braces gives the number of matches, such as {2} for exactly 2 matches. {2,5} signifies from 2 to 5 matches.
Anchors identify boundaries of text.
- “^”, “\A”, and “\b” match the start of a line, the start of the string, and a word boundary, respectively.
- “$” and “\Z” match the end of a line and the end of the string, respectively.
- “\B” matches a position that is not a word boundary.
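Quantifiers and anchors are easiest to compare side by side. The sketch below uses Python's `re` module with invented sample strings; note that in Python “^” and “$” only match at every line when the `re.MULTILINE` flag is set, which is why the flag appears here.

```python
import re

# Quantifiers
optional_u = [w for w in ["color", "colour", "colouur"]
              if re.fullmatch(r"colou?r", w)]             # "?": zero or one "u"
zero_or_more = re.findall(r"ab*", "a ab abbb")             # "*": zero or more "b"
one_or_more = re.findall(r"ab+", "a ab abbb")              # "+": one or more "b"
two_to_five = re.findall(r"\b\d{2,5}\b", "7 42 123456 2024")  # 2 to 5 digits

# Anchors
lines = "cat\ncatalog\nbobcat"
line_starts = re.findall(r"^cat", lines, flags=re.MULTILINE)     # start of each line
string_start = re.findall(r"\Acat", lines, flags=re.MULTILINE)   # start of string only
line_ends = re.findall(r"cat$", lines, flags=re.MULTILINE)       # end of each line
string_end = re.findall(r"cat\Z", lines, flags=re.MULTILINE)     # end of string only
whole_words = re.findall(r"\bcat\b", "cat catalog bobcat")       # "cat" as a word
inside_words = [m.start() for m in re.finditer(r"\Bcat", "cat catalog bobcat")]

print(optional_u)    # ['color', 'colour']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(two_to_five)   # ['42', '2024']
print(len(line_starts), len(string_start))  # 2 1
print(len(line_ends), len(string_end))      # 2 1
print(whole_words)   # ['cat']
print(inside_words)  # [15], the "cat" inside "bobcat"
```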
Lookarounds are related to anchors in that they define a position in the text. They check whether text can be matched without including it in the match. There are four types: positive lookahead “(?=)”, positive lookbehind “(?<=)”, negative lookahead “(?!)”, and negative lookbehind “(?<!)”.
- The lookaheads check whether the text inside the lookahead occurs to the right of the regular expression engine position.
- The lookbehinds check whether the text inside the lookbehind occurs to the left of the engine position.
- The positive types look for matches.
- The negative types look for non-matches.
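The four lookaround types can be sketched as follows in Python. The price and file-name strings are invented for illustration. In each case the text inside the lookaround is checked but is not returned as part of the match.

```python
import re

prices = "USD 100, EUR 200, USD 300"
files = "report.pdf notes.txt summary.pdf"

# Positive lookbehind: digits that are preceded by "USD ".
usd_amounts = re.findall(r"(?<=USD )\d+", prices)

# Negative lookbehind: whole numbers that are NOT preceded by "USD ".
# The \b keeps the match from starting in the middle of a number.
other_amounts = re.findall(r"(?<!USD )\b\d+", prices)

# Positive lookahead: names that are followed by ".pdf".
pdf_names = re.findall(r"\w+(?=\.pdf)", files)

# Negative lookahead: file names whose extension is NOT "pdf".
non_pdf_files = re.findall(r"\b\w+\.(?!pdf)\w+", files)

print(usd_amounts)    # ['100', '300']
print(other_amounts)  # ['200']
print(pdf_names)      # ['report', 'summary']
print(non_pdf_files)  # ['notes.txt']
```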
Escapes are a method of treating metacharacters literally, rather than as special characters. Common metacharacters are: ^ $ ( ) < [ { \ | > . * + ? . These are escaped by placing a backslash “\” in front of them.
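To illustrate the difference an escape makes, here is a short Python sketch with an invented sample string. It also uses `re.escape`, a helper in Python's standard library that adds the backslashes automatically; other flavors may or may not offer an equivalent.

```python
import re

text = "Total: 3.50 (approx.) vs 3x50"

# Unescaped dot: a wildcard, so it also matches the "x" in "3x50".
wildcard_hits = re.findall(r"3.50", text)

# Escaped dot: matches only a literal period.
literal_hits = re.findall(r"3\.50", text)

# re.escape backslash-escapes the metacharacters in a literal string.
escaped = re.escape("(approx.)")
escaped_hits = re.findall(escaped, text)

print(wildcard_hits)  # ['3.50', '3x50']
print(literal_hits)   # ['3.50']
print(escaped_hits)   # ['(approx.)']
```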
Special characters also begin with the backslash “\” and are used to represent nonprinting characters such as “\f” for form feed, “\n” for line feed, “\r” for carriage return, “\t” for horizontal tab, and “\v” for vertical tab.
String replacement is often the next step once matching text is found. The simplest method is “$n”, where n is the number of a capturing group. Other shorthands, whose availability varies by flavor, include “$`” for the text before the match, “$’” for the text after the match, “$+” for the last matched string, “$_” for the entire input string, and “$&” for the entire matched string.
Pattern modifiers provide additional power, for example “g” for a global match or “i” for case-insensitive matching.
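Replacement syntax differs between flavors. The “$”-style tokens above are used by flavors such as Perl, JavaScript, and .NET, whereas Python's `re` module writes a group reference as `\1` or `\g<1>` and the whole match as `\g<0>`. The sketch below shows the Python form on invented sample strings, as one possible illustration rather than a universal recipe.

```python
import re

# Three capturing groups: month, day, year.
date = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")

# Reorder the groups; \g<3> in Python plays the role of $3 in other flavors.
reordered = date.sub(r"\g<3>-\g<1>-\g<2>", "Due 7/4/2024")

# \g<0> refers to the entire match, much like $& in other flavors.
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

print(reordered)  # Due 2024-7-4
print(bracketed)  # a[1]b[22]
```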
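Python has no “g” modifier. As a rough equivalent, `re.search` stops at the first match while `re.findall` and `re.sub` process every match, and `count=1` limits a substitution to the first one. The “i” modifier corresponds to the `re.IGNORECASE` flag. A minimal sketch on an invented string:

```python
import re

text = "Regex, regex, REGEX"

# Case-sensitive, first match only (no "g", no "i").
first_exact = re.search(r"regex", text).group(0)

# Case-insensitive and global: every spelling is found.
all_spellings = re.findall(r"regex", text, flags=re.IGNORECASE)

# Replace only the first match, then every match.
replace_first = re.sub(r"regex", "RE", text, count=1, flags=re.IGNORECASE)
replace_all = re.sub(r"regex", "RE", text, flags=re.IGNORECASE)

print(first_exact)    # regex
print(all_spellings)  # ['Regex', 'regex', 'REGEX']
print(replace_first)  # RE, regex, REGEX
print(replace_all)    # RE, RE, RE
```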
Regular expressions are a powerful text matching tool that can be used to simplify coding and to find text which matches a pattern within a larger text. Several examples are shown in Table 2.
Reference
Regular Expressions Cookbook Second Edition, Jan Goyvaerts & Steven Levithan, O’Reilly Media, Inc., 2012.
Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at editor@ScientificComputing.com.
Table 1: Regular Expression Components
Group | Component | Definition
Character class | \d | Digits
Character class | \w | Word characters
Character class | \s | White space
Character class | \c | Control characters
Ranges | [a-z] | Range between letters
Ranges | (...) | Groups
Ranges | a|b | a or b
Quantifiers | * | Zero or more matches
Quantifiers | ? | Zero or one match
Quantifiers | + | One or more matches
Anchors | ^ | Start of line
Anchors | \A | Start of string
Anchors | \b | Word boundary
Anchors | $ | End of line
Anchors | \Z | End of string
Metacharacters | ^ $ ( ) < [ { \ | > . * + ? | Special characters needing a backslash in front to be treated literally
Special characters | \f | Form feed
Special characters | \n | Line feed
Special characters | \r | Carriage return
Special characters | \t | Horizontal tab
Special characters | \v | Vertical tab
String replacement | $n | Identifies capture group
String replacement | $` | Reference before matched string
String replacement | $’ | Reference after matched string
String replacement | $+ | Reference last matched string
String replacement | $_ | Reference entire input string
String replacement | $& | Reference entire matched string
Pattern modifiers | g | Global match
Pattern modifiers | i | Case insensitive
Table 2: Regular Expression Examples
Search goal | Regular Expression
Email addresses | ^\S+@\S+$
URL | ^(https?|ftp|file)://.+$
JPG or PNG image | ([^\s]+(?=\.(jpg|png))\.\2)
US Zip codes | ^[0-9]{5}(?:-[0-9]{4})?$
Dates | (\d{1,2}\/\d{1,2}\/\d{4})
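The Table 2 patterns can be tried out with a short Python sketch. It assumes the patterns are written with balanced parentheses and braces, as in the `PATTERNS` dictionary below; the helper name `matches` and the sample strings are hypothetical. These patterns are deliberately loose. The email pattern, for instance, accepts any run of non-space characters around an “@”.

```python
import re

# Assumed well-formed versions of the Table 2 patterns.
PATTERNS = {
    "email": r"^\S+@\S+$",
    "url": r"^(https?|ftp|file)://.+$",
    "image": r"([^\s]+(?=\.(jpg|png))\.\2)",
    "zip": r"^[0-9]{5}(?:-[0-9]{4})?$",
    "date": r"(\d{1,2}\/\d{1,2}\/\d{4})",
}

def matches(name: str, text: str) -> bool:
    """Return True if the named pattern is found anywhere in text."""
    return re.search(PATTERNS[name], text) is not None

print(matches("email", "user@example.com"))  # True
print(matches("url", "https://example.com")) # True
print(matches("image", "photo.jpg"))         # True
print(matches("image", "photo.gif"))         # False
print(matches("zip", "12345-6789"))          # True
print(matches("zip", "1234"))                # False
print(matches("date", "Due 7/4/2024"))       # True
```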