Research & Development World


The Power of Regular Expressions

By R&D Editors | January 22, 2015

Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott.

A.J. Jacobs wrote: “I think there’s something to the idea that the divine dwells more easily in text than in images.” With the explosion in the amount of text, sacred and secular, available to anyone with an Internet connection, the need for text-processing tools has grown. A regular expression is a compact text string that describes a search pattern. It can be used to find text that matches a pattern within a larger text, to replace the matching text, or to split the text into pieces at each match. The power of regular expressions for extracting specific text from documents lies in their ability to replace many lines of code with as little as one.
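The three uses named above (find, replace, split) can be sketched with Python's standard `re` module. The sample text and the patterns are assumptions chosen for illustration; they do not come from the article.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Lot A12 shipped 2015-01-22; lot B07 shipped 2015-02-03."

# Find: every ISO-style date in the text.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Replace: mask each lot identifier (one capital letter plus two digits).
masked = re.sub(r"[A-Z]\d{2}", "XXX", text)

# Split: break the text apart at each semicolon and any spaces after it.
clauses = re.split(r";\s*", text)

print(dates)    # ['2015-01-22', '2015-02-03']
print(masked)   # Lot XXX shipped 2015-01-22; lot XXX shipped 2015-02-03.
print(clauses)  # ['Lot A12 shipped 2015-01-22', 'lot B07 shipped 2015-02-03.']
```

Each call is a single line, which is the economy the paragraph describes.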

Although no single standard governs regular expressions, there are several flavors. The most common is the Perl-compatible flavor, but Java, Python, Ruby and other languages have their own largely compatible versions.

The power of regular expressions comes at the cost of a steep learning curve, which often frustrates beginners. Fortunately, there are tools for working with regular expressions.

One such tool is RegexBuddy, which runs on Windows XP, Vista, 7, 8, and 8.1, as well as under VMware, Parallels, and CrossOver Office. It works with all of the most common regular expression flavors, offers syntax help for creating regular expressions, and tests an expression against sample text, highlighting the matches to aid in troubleshooting. A replace button makes it easier to add replacement text, and a split button treats the expression as a separator. Highlighting an area where there is no match makes a debugging feature available.

Other tools include RegexPal, which supports only the JavaScript flavor, and RegexMagic, which takes a more guided approach: the user describes the text to be matched and the tool generates the regular expression.

The components of regular expressions can be classified into several types which may be used in combinations. A summary of these is shown in Table 1 below.

Character classes identify broad types of characters such as word characters, digits, white space, control characters, and hexadecimal and octal characters. Some character classes can be negated by substituting an uppercase letter. For example, “\d” identifies digits whereas “\D” identifies non-digits. Other character classes are “\w” for word characters (letters, digits, and the underscore) and “\s” for white space. “\c” followed by a letter is used to match control characters in the ASCII table.
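A minimal sketch of the shorthand classes in Python's `re` module, using a made-up sample string. Python's `re` does not support the “\c” control-character syntax, so it is left out here.

```python
import re

# Hypothetical sample string, chosen only for illustration.
sample = "Batch 42\tpassed QC_7"

digits = re.findall(r"\d", sample)        # each single digit
non_digits = re.findall(r"\D+", sample)   # runs of anything but digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)        # each whitespace character

print(digits)      # ['4', '2', '7']
print(non_digits)  # ['Batch ', '\tpassed QC_']
print(words)       # ['Batch', '42', 'passed', 'QC_7']
print(spaces)      # [' ', '\t', ' ']
```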

Ranges are needed for two other character classes: hexadecimal and octal characters. A range is written in square brackets as two letters or digits separated by a dash, and it matches any value between them, inclusive. Thus, hexadecimal digits are matched by “[a-fA-F0-9]” and octal digits by “[0-7]”. Groups are enclosed in parentheses. A pipe can be used within a group to identify alternatives, such that “(a|b)” matches either “a” or “b”.
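A short Python sketch of ranges, groups, and alternation. It assumes the hexadecimal class is meant to cover all ten digits, “[a-fA-F0-9]”, and that octal digits are matched by “[0-7]”. The test strings are made up.

```python
import re

hex_digits = re.compile(r"[a-fA-F0-9]+")   # assumed full hexadecimal range
octal_digits = re.compile(r"[0-7]+")       # assumed octal digit range
a_or_b = re.compile(r"(a|b)")              # group with two alternatives

# fullmatch succeeds only if the whole string fits the pattern.
is_hex = hex_digits.fullmatch("1aF") is not None        # True
not_hex = hex_digits.fullmatch("1aG") is not None       # False, "G" is out of range
is_octal = octal_digits.fullmatch("017") is not None    # True
not_octal = octal_digits.fullmatch("018") is not None   # False, "8" is out of range

# findall returns what the group captured at each match.
alternatives = a_or_b.findall("cab")

print(is_hex, not_hex, is_octal, not_octal)  # True False True False
print(alternatives)                          # ['a', 'b']
```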

Quantifiers are used to identify the number of matches.

  • An asterisk “*”, question mark “?”, and plus “+” match the preceding character zero or more times, zero or one time, or one or more times, respectively.
  • A number in braces signifies the number of matches, such as {2} for exactly 2 matches. {2,5} signifies from 2 to 5 matches.

A dot “.” is a wildcard which matches any single character (in most flavors, any character except a line break).

Anchors identify boundaries of text.

  • A “^”, “\A”, and “\b” match the start of a line, the start of a string, and a word boundary, respectively.
  • A “$” and “\Z” match the end of a line and the end of a string, respectively.
  • A “\B” matches a position which is not a word boundary.
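The quantifiers and anchors above can be tried out in Python as follows. This is a sketch with made-up sample strings. In Python's `re`, “^” and “$” match at line boundaries only when the `re.MULTILINE` flag is set, so that flag is passed explicitly.

```python
import re

# Quantifiers, applied to the character just before them.
star = re.findall(r"ab*", "a ab abbb")            # "b" zero or more times
plus = re.findall(r"ab+", "a ab abbb")            # "b" one or more times
optional = re.findall(r"colou?r", "color colour") # "u" zero or one time
counted = re.findall(r"\d{2,5}", "7 42 123456")   # 2 to 5 digits

# Anchors, applied to a three-line string.
lines = "cat\ncatalog\nbobcat"
line_starts = re.findall(r"^cat", lines, re.MULTILINE)   # start of each line
string_start = re.findall(r"\Acat", lines, re.MULTILINE) # start of string only
line_ends = re.findall(r"cat$", lines, re.MULTILINE)     # end of each line
string_end = re.findall(r"cat\Z", lines)                 # end of string only

# Word boundaries.
whole_word = re.findall(r"\bcat\b", "cat catalog bobcat")
inside_word = re.findall(r"\Bcat", "cat catalog bobcat")

print(star, plus, optional, counted)
# ['a', 'ab', 'abbb'] ['ab', 'abbb'] ['color', 'colour'] ['42', '12345']
print(len(line_starts), len(string_start), len(line_ends), len(string_end))
# 2 1 2 1
print(whole_word, inside_word)  # ['cat'] ['cat']
```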

Lookarounds are related to anchors in that they define a position in the text. They check whether text can be matched without actually including it in the match. There are four types: positive lookahead “(?=)”, positive lookbehind “(?<=)”, negative lookahead “(?!)”, and negative lookbehind “(?<!)”.

  • The lookaheads check whether the text inside the lookahead occurs to the right of the regular expression engine's position, moving from left to right.
  • The lookbehinds check to the left of the engine's position, moving from right to left.
  • The positive types look for matches.
  • The negative types look for non-matches.
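The four lookaround types can be compared side by side in Python. The price string and the patterns are illustrative assumptions. In each case only the digits are returned, because the text inside a lookaround is checked but not included in the match.

```python
import re

# Hypothetical sample string, chosen only for illustration.
prices = "cost: $40, tax: $3, total 43 USD"

# Positive lookahead: digits that are followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", prices)

# Positive lookbehind: digits that are preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookahead: whole numbers that are NOT followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(followed_by_usd)      # ['43']
print(after_dollar)         # ['40', '3']
print(not_followed_by_usd)  # ['40', '3']
print(not_after_dollar)     # ['43']
```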

Escapes are a method of treating metacharacters literally, rather than as special characters. Common metacharacters are: ^ $ ( ) < [ { \ | > . * + ? These are escaped by placing a backslash “\” in front of them.
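A small Python sketch of why escaping matters, using a made-up arithmetic string. Without backslashes, “.” and “+” act as a wildcard and a quantifier. With them, they match literally. `re.escape` is the standard-library helper that adds the backslashes automatically.

```python
import re

# Unescaped: "." matches any character and "+" repeats the "5".
unescaped_hit = re.fullmatch(r"3.5+2", "3x552") is not None

# Escaped: only the literal text "3.5+2" matches.
escaped_hit = re.fullmatch(r"3\.5\+2", "3.5+2") is not None
escaped_miss = re.fullmatch(r"3\.5\+2", "3x552") is not None

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("3.5+2")

print(unescaped_hit, escaped_hit, escaped_miss)  # True True False
print(auto_escaped)                              # 3\.5\+2
```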

Special characters also begin with the backslash “\” and are used to represent nonprinting characters such as “\f” for form feed, “\n” for line feed, “\r” for carriage return, “\t” for horizontal tab, and “\v” for vertical tab.

String replacement is often an important feature to use once matching text is found. The use of “$n”, where n identifies a capturing group, is an easy method to do this. Other options include shorthands such as “$`” for the text before the matched string, “$'” for the text after the matched string, “$+” for the last matched string, “$_” for the entire input string, and “$&” for the entire matched string. Support for these shorthands varies between flavors.
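Replacement syntax is one place where flavors differ. As a sketch in Python, the `re` module writes group references as `\1` or `\g<1>` rather than “$1”, and `\g<0>` stands for the entire match. Python replacement strings have no direct counterpart to “$`” and “$'”, so the example takes the surrounding text from the match object instead. The inputs are made up.

```python
import re

# Hypothetical input, chosen only for illustration.
us_date = "01/22/2015"

# Capture month, day, and year, then reorder them. Python writes the
# group references as \1, \2, \3 where other flavors use $1, $2, $3.
iso_date = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", us_date)

# \g<0> stands for the entire match, like "$&" in other flavors.
bracketed = re.sub(r"\d+", r"[\g<0>]", "lot 42 of 7")

# Text before and after the first match, taken from the match object.
m = re.search(r"\d+", "lot 42 of 7")
before, after = m.string[:m.start()], m.string[m.end():]

print(iso_date)                   # 2015-01-22
print(bracketed)                  # lot [42] of [7]
print(repr(before), repr(after))  # 'lot ' ' of 7'
```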

Pattern modifiers provide additional power by allowing functionality such as “g” for a global match or “i” for case insensitive matches.
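Modifiers are also spelled differently across flavors. In this Python sketch, case-insensitive matching uses the `re.IGNORECASE` flag or the inline `(?i)` prefix. Python has no “g” modifier: `re.findall` and `re.sub` already act on every match, while `re.search` stops at the first. The sample text is made up.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Regex, REGEX, regex"

# Default matching is case sensitive.
case_sensitive = re.findall(r"regex", text)

# Case-insensitive matching, as a flag and as an inline modifier.
case_insensitive = re.findall(r"regex", text, flags=re.IGNORECASE)
inline_flag = re.findall(r"(?i)regex", text)

# First match only, versus every match.
first_only = re.search(r"regex", text, flags=re.IGNORECASE).group()
replace_first = re.sub(r"regex", "RE", text, count=1, flags=re.IGNORECASE)
replace_all = re.sub(r"regex", "RE", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['regex']
print(case_insensitive)  # ['Regex', 'REGEX', 'regex']
print(first_only)        # Regex
print(replace_first)     # RE, REGEX, regex
print(replace_all)       # RE, RE, RE
```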

Regular expressions are a powerful text matching tool that can be used to simplify coding and to find text which matches a pattern within a larger text. Several examples are shown in Table 2.
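Patterns like those in Table 2 might be exercised in Python as sketched below. The exact pattern strings, the `patterns` dictionary, the `matches` helper, and the test inputs are all assumptions made for illustration. These patterns are deliberately loose: the email pattern, for instance, only checks for non-space text on both sides of an “@”.

```python
import re

# Assumed pattern strings, written with balanced parentheses.
patterns = {
    "email": r"^\S+@\S+$",
    "url": r"^(https?|ftp|file)://.+$",
    "image": r"([^\s]+(?=\.(jpg|png))\.\2)",
    "zip": r"^[0-9]{5}(?:-[0-9]{4})?$",
    "date": r"(\d{1,2}/\d{1,2}/\d{4})",
}

def matches(name, candidate):
    """Return True if the named pattern is found in the candidate string."""
    return re.search(patterns[name], candidate) is not None

print(matches("email", "someone@example.com"))  # True
print(matches("url", "https://example.com"))    # True
print(matches("zip", "60064-3500"))             # True
print(matches("zip", "6006"))                   # False

# Patterns without anchors can pull a value out of a longer text.
image_name = re.search(patterns["image"], "see photo.jpg here").group(1)
date_found = re.search(patterns["date"], "dated 1/22/2015.").group(1)
print(image_name, date_found)                   # photo.jpg 1/22/2015
```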

Reference

Regular Expressions Cookbook Second Edition, Jan Goyvaerts & Steven Levithan, O’Reilly Media, Inc., 2012.

Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at [email protected].

 

Table 1: Regular Expression Components

Group               Component                      Definition
------------------  -----------------------------  ------------------------------------
Character class     \d                             Digits
                    \w                             Word characters
                    \s                             White space
                    \c                             Control characters
Ranges              [a-z]                          Range between letters
                    (...)                          Groups
                    a|b                            a or b
Quantifiers         *                              Zero or more matches
                    ?                              Zero or one match
                    +                              One or more matches
Anchors             ^                              Start of line
                    \A                             Start of string
                    \b                             Word boundary
                    $                              End of line
                    \Z                             End of string
Metacharacters      ^ $ ( ) < [ { \ | > . * + ?    Special characters needing a backslash
                                                   in front to be treated literally
Special characters  \f                             Form feed
                    \n                             Line feed
                    \r                             Carriage return
                    \t                             Horizontal tab
                    \v                             Vertical tab
String replacement  $n                             Identifies capture group
                    $`                             Reference before matched string
                    $'                             Reference after matched string
                    $+                             Reference last matched string
                    $_                             Reference entire input string
                    $&                             Reference entire matched string
Pattern modifiers   g                              Global match
                    i                              Case insensitive

 

Table 2: Regular Expression Examples

Search goal         Regular expression
------------------  ------------------------------
Email addresses     ^\S+@\S+$
URL                 ^(https?|ftp|file)://.+$
JPG or PNG image    ([^\s]+(?=\.(jpg|png))\.\2)
US Zip codes        ^[0-9]{5}(?:-[0-9]{4})?$
Dates               (\d{1,2}\/\d{1,2}\/\d{4})

 

 

Copyright © 2025 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media