Regular Expressions

Here are my notes summarizing pages of manuals and web pages on regular expressions.

Topics this page:

Summary

Metacharacters

Engines

Try It!

General Rules

Examples

Error Recovery


Why Do We Care About This?

A regular expression is a "formula" for matching strings that follow some pattern in order to operate on a subject character string.

Text in HTML pages, log files, data files, and so on is parsed in order to validate its formatting, to extract substrings, or to replace content.
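The three uses named above (validate, extract, replace) can be sketched in a few lines of Python. The log line and the patterns are made up for illustration; they are assumptions, not taken from any real log format.

```python
import re

# Hypothetical log line used only for this sketch.
log_line = "2003-06-06 ERROR disk /dev/sda1 is 91% full"

# Validate: does the line start with an ISO-style date followed by a space?
is_valid = re.match(r"\d{4}-\d{2}-\d{2} ", log_line) is not None

# Extract: pull out the number in front of the percent sign.
found = re.search(r"(\d+)%", log_line)
percent = found.group(1) if found else None

# Replace: mask the device name.
masked = re.sub(r"/dev/\w+", "/dev/***", log_line)

print(is_valid, percent, masked)
```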

The Perl ("Practical Extraction and Report Language") language has become popular partly because of its extensive support for regular expressions. Perl allows you to embed regular expressions in file tests, control loops, output formats, and everything else.

http://www.wikiwand.com/en/Regular_expression

The term regular expression is often abbreviated as "regex" or "regexes" in plural.

Regular-Expressions.info

Steve Ramsay's Guide to Regular Expressions

Learning to Use Regular Expressions, by David Mertz also discusses advanced Regular Expression Extensions such as Non-greedy quantifiers, backreferences, and lookahead assertions.

Rx Cookbook at ActiveState has contributions from several people.

Regexp Power Part I (June 06, 2003) and Part II (July 01, 2003) by Simon Cozens

Steve Mansour's A Tao of Regular Expressions compares differences in expressions for various tools.

Five Habits for Successful Regular Expressions by Tony Stubblebine describes how you can test regular expressions in PHP, Perl, and Python.

Regular Expression Pocket Reference (O'Reilly, August 2003) by Tony Stubblebine provides a concise "memory jogger" that you won't be embarrassed to carry around.

Teach Yourself Regular Expressions in 10 Minutes (Sams; February 28, 2004) by Ben Forta

Beginning Regular Expressions (Wrox Press, 2005) by Andrew Watt

Try It Now

TIP: The easiest way to learn this is to take a hands-on approach and try some patterns. Test and debug regular expressions using these tools.

Download or clone RegexExplained and see it used by its author, @LeaVerou, in the
VIDEO: /Reg(exp){2}lained/: Demystifying Regular Expressions, presented live at the O'Reilly Fluent conference in May 2012.

RegexPal.com tests JavaScript regular expressions on a web page.

Use the Regex Coach to graphically experiment with (Perl-compatible) regular expressions interactively. Dr. Edmund Weitz wrote this for use on Windows and Linux systems to show how Common Lisp can be practical using the LispWorks IDE and cross-platform CAPI toolkit.

Regular Expression Tester parses within ASP.NET.

$40 RegExBuddy is a Windows program.

Engines

Beware that time and vendor competition have produced several versions of regular expressions:

The historical Simple (Basic) Regular Expression (BRE) notation is described as part of the regexp() function in the XSH specification. It is provided for backward compatibility, but may be withdrawn from a future specification set.
- The GNU operating system's regex package is available via ftp at ftp.gnu.org.
Perl, Python, Emacs, Tcl, and .NET use a backtracking regular expression matcher that incorporates a traditional Nondeterministic Finite Automaton (NFA) engine.
- RX: The Regex Debugger is written for Perl developers.
- Video on regex for Python
- Part 3

The standardized POSIX NFA is slower.

Utility programs initially developed for Unix -- awk, egrep, and lex -- use a faster, but more limited, pure regular expression Deterministic Finite Automaton (DFA) engine.

The Extended Regular Expressions (ERE) version complies with the internationalized ISO/IEC 9945-2:1993 standard. It matches based on the bit pattern used for encoding the character, not on the graphic representation of the character (which may represent more than one bit pattern).

Microsoft's .NET Framework regular expressions are said to be compatible with Perl 5 regular expressions, but include features not yet seen in other implementations, such as right-to-left matching and on-the-fly compilation.

Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation.

For example, the match routine of the C language library accepts strings that are interpreted as regular expressions.

Regex Patterns

Instead of custom-written coding (looping through each line and invoking sub-string functions), regex methods refer to a pattern of characters to vary its searching and matching.

This video shows how files containing different date formats can't be parsed using the sub-string function alone, which is a dangerously blunt tool.
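A small Python sketch of that point: fixed-offset slicing only works when every line has the same layout, while a pattern describes the shape of a date wherever it appears. The sample lines and the date pattern are assumptions made up for this illustration.

```python
import re

# Hypothetical sample lines; each writes a date differently and at a different position.
lines = ["shipped 2003-07-01", "sent 7/1/2003 by air", "shipped 01.07.03 late"]

# Sub-string approach: assumes the date always starts at column 8 and is 10 characters long.
sliced = [line[8:18] for line in lines]

# Pattern approach: digits, a separator, digits, a separator, digits.
date_shape = re.compile(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4}")
matched = [date_shape.search(line).group(0) for line in lines]

for by_slice, by_regex in zip(sliced, matched):
    print(repr(by_slice), "vs", repr(by_regex))
```

Only the first line happens to fit the slicing assumption; the pattern finds all three dates. Note that the pattern checks shape only, not whether the date is a real calendar date.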

Patterns comprise two basic character types available from a standard keyboard (not using Greek alphas, lambdas, etc. like mathematicians do):

Literal (normal) text characters such as 0 thru 9 or a thru z; and
Metacharacters, which specify filtering, enabling a powerful, flexible, and efficient method for processing text. However, their compactness makes them easier to create than to read.

JOKE: Some call regular expressions "ASCII puke" because they look like a jumble of letters and numbers.

The Kleene Star * (Wild Card) Metacharacter

The development of regular expressions is traced back to the work of Kleene (some pronounce it like "clean knee") -- Stephen Cole Kleene (1909-1994), an American mathematician and theoretical computer scientist who studied at Princeton and taught at the U. of Wisconsin-Madison.

For this reason, the "*" wildcard character used in computer searches is formally known as a "Kleene star."

Text-manipulation tools on the Unix platform, including the ed and vi text editors and the grep file search utility, made use of Kleene's notation for "the algebra of regular sets."

The star operation itself, which matches zero or more repetitions of the expression before it, is formally known as the "Kleene closure".

Basic Metacharacters

There are 12 of them.

Each entry below gives: Metacharacter, Operator Name, Matches, Example regular expression.

. period any single character except NUL. r.t would match the strings rat, rut, r t, but not root (two o's) nor the Rot in Rotten (upper case R).

* Kleene star, asterisk, wildcard zero or more occurrences of the character immediately preceding. .* means match any number of any characters.

$ dollar currency anchor end of a line. weasel$ would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels."
To match a literal dollar sign where $ is the last operator of a regular expression or immediately follows a right parenthesis, it must be preceded by a backslash \.

^ circumflex or caret anchor beginning of a string/line. ^When in would match the beginning of the string "When in the course of human events" but would not match "What and When in the" .

[ ]
[c1-c2]
[^c1-c2] square brackets any one of the characters between the brackets. r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] matches any character except 2, 6, 9, and upper case letters.

[^c1-c2] caret within square brackets the complement range -- any character except those in the range following the caret as the first character after the opening bracket. [^269A-Z] will match any character except 2, 6, 9, and upper case letters.
To match a literal ^ where it is the first operator of a regular expression or the first character inside brackets, it must be preceded by a backslash.

\ back slash This is the quoting character, use it to treat the following character as an ordinary character. For example, \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character.
Operators inside brackets do not need to be preceded by a backslash.

\< \> backslash and angle brackets the beginning (\<) or end (\>) of a word. \<the matches on "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications.

\( \) backslash and parentheses the expression between \( and \) as a group. Also, saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9.

| pipe (alternation) Or two conditions together. (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications.

+ plus sign one or more occurrences of the character or regular expression immediately preceding. 9+ matches 9, 99, or 999. NOTE: this metacharacter is not supported by all applications.

\{i\}
\{i,j\} braces a specific number of instances or instances within a range of the preceding character. A[0-9]\{3\} will match "A" followed by exactly 3 digits. That is, it will match A123 but not A1234.
[0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is supported by Robot's C-VU language but not by all applications.

? question mark Matches 0 or 1 occurrence of the character or regular expression immediately preceding. ? is equivalent to {0,1}. NOTE: this metacharacter is supported by IBM/Rational Robot's C-VU language but not by all applications. Question marks are optionally used to specify non-greedy quantifiers. For example, "/A[A-Z]*?B/" means "match an A, followed by only as many capital letters as are needed to find a B."
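Several of the examples from the table can be tried in Python, as a rough sketch. One assumption to note: Python's re module writes braces and parentheses without the backslashes that the table's basic (BRE-style) notation uses, so A[0-9]\{3\} becomes A[0-9]{3} here. The string "AXBYB" is made up to show the greedy and non-greedy difference.

```python
import re

# . matches any single character, and matching is case-sensitive.
dot = [w for w in ["rat", "rut", "r t", "root", "Rot"] if re.fullmatch(r"r.t", w)]

# [aou] matches exactly one of the listed characters.
bracket = [w for w in ["rat", "rot", "rut", "ret"] if re.fullmatch(r"r[aou]t", w)]

# $ anchors the match to the end of the string.
end_anchor = [bool(re.search(r"weasel$", s))
              for s in ["He's a weasel", "They are a bunch of weasels."]]

# {3} demands exactly three digits when the whole string must match.
counted = [bool(re.fullmatch(r"A[0-9]{3}", s)) for s in ["A123", "A1234"]]

# * is greedy and takes as much as it can; *? takes as little as it can.
greedy = re.search(r"A[A-Z]*B", "AXBYB").group(0)
lazy = re.search(r"A[A-Z]*?B", "AXBYB").group(0)

print(dot, bracket, end_anchor, counted, greedy, lazy)
```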

In addition, VU regular expressions can include ASCII control characters in the range 0 to 7F hex (0 to 127 decimal).

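VU regular expressions process only the ASCII character set and do not process Unicode (UTF-8). Many other engines, such as those in Perl, Python, Java, and .NET, do handle Unicode text.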

Backward Slash Extended MetaCharacters

One of the ways people get confused with regular expressions is the use of the backward slash \ character.

For an analogy that you may already know, in Windows command line terminals, people use dir *.txt /s to look for text files in subdirectories. The asterisk or star character is a wildcard. The /s specifies processing of sub-folders.

With regex, the same parsing would be specified by .*\.txt, with a back-slash in front of the dot for the escape character for the dot before txt since the dot has another meaning within regex expressions.

The dot character . is used in regex to represent any one character.
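A short Python sketch of why the backslash before the dot matters. The file names are hypothetical and chosen so that the escaped and unescaped patterns disagree.

```python
import re

# Hypothetical file names for illustration.
names = ["notes.txt", "notes_txt", "report.txt.bak", "a.TXT"]

# Escaped dot: a literal period must come right before "txt".
escaped = [n for n in names if re.fullmatch(r".*\.txt", n)]

# Unescaped dot: any single character may come before "txt".
unescaped = [n for n in names if re.fullmatch(r".*.txt", n)]

print(escaped)
print(unescaped)
```

The unescaped version also accepts notes_txt, because the dot happily matches the underscore.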

Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.
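As a minimal sketch of a backreference in Python, the pattern below finds a word that is immediately repeated. The sample sentence is made up; \1 means "whatever group 1 just matched".

```python
import re

# (\w+) captures a word; \1 requires the very same text again after one space.
doubled_word = re.compile(r"\b(\w+) \1\b")

text = "Paris in the the spring is is lovely"

repeats = doubled_word.findall(text)     # the words that were doubled
cleaned = doubled_word.sub(r"\1", text)  # keep a single copy of each

print(repeats)
print(cleaned)
```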

Extended

Like C and Java programs, regex patterns use \ as an escape character, both to treat special characters as plain text and to give ordinary letters a special meaning. These additional escape sequences are recognized within Ruby regex:

\A Beginning of a string

\b Word boundary

\B Non-word boundary

\d Digit, same as [0-9]

\D Non-digit

\s Whitespace [ \t\r\n\f\v]

\S Non-Whitespace

\w Word character

\W Non-Word character

\z End of a string

\Z End of string, or before a newline that ends the string
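The list above is for Ruby, but most of these escapes behave the same way in Python, which is used for this sketch. The end-of-string escapes are left out on purpose because their spelling differs between engines. The sample string is an assumption made up for illustration.

```python
import re

sample = "Order 66 shipped\tto Bay_7\n"

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # letters, digits, and underscore
non_space_runs = re.findall(r"\S+", sample)  # anything that is not whitespace

# \A anchors to the start of the string; \b marks the edge of a word.
starts_with_order = bool(re.search(r"\AOrder\b", sample))

# \B succeeds only inside a word: "hip" sits in the middle of "shipped".
inside_word = bool(re.search(r"\Bhip", sample))

print(digits, words, starts_with_order, inside_word)
```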

[10:00] To specify digits (numbers) [0-9]:

\d

[10:48] To specify letters, numbers, and underscore, use shortcut:

\w

[14:34] To match CSS color specifications containing 3 or 6 hex digits, such as #abc, #f00, #BADA55, #C0FE56:

/^#([a-f\d]{3}){1,2}$/i.test(str);

This matches letters between a-f or a digit {3} times, repeated {1,2} once or twice.
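A Python sketch of the same test, written with the group as ([a-f\d]{3}) and with re.IGNORECASE standing in for the JavaScript /i flag. The helper name is_hex_color is made up for this example.

```python
import re

# Three hex digits, once or twice, after a leading "#".
hex_color = re.compile(r"^#([a-f\d]{3}){1,2}$", re.IGNORECASE)

def is_hex_color(text: str) -> bool:
    """Return True when text looks like a 3- or 6-digit CSS hex color."""
    return hex_color.match(text) is not None

for candidate in ["#abc", "#f00", "#BADA55", "#C0FE56", "#abcd", "abc", "#ggg"]:
    print(candidate, is_hex_color(candidate))
```

Because the group repeats whole blocks of three, lengths of 4 or 5 digits are rejected.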

Double Backslash Regex in LoadRunner

The double backslash is required in C language programs invoking regex because both C and regex "consume" a backslash as an escape character.

LoadRunner has this function which creates a parameter named "selected_value":

char *str = " ... the html text here ...";

web_reg_save_param_regexp(
	"ParamName=selected_value",
	"RegExp=<select name=\"Regulatory Code_0\"[\\s\\S]*?<option .*? selected>(.*?)</option>",
	LAST );

The [\\s\\S] means match any whitespace or any non-whitespace character, that is, any character including newlines (because no Perl-like "s" modifier is available).
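Both ideas can be tried outside LoadRunner. This Python sketch uses a made-up HTML fragment and a shortened select name (both assumptions) to show that a doubled backslash in an ordinary string literal yields the same pattern as a raw string, and that [\s\S] crosses line breaks where a plain dot does not.

```python
import re

# In an ordinary string literal the language consumes one backslash,
# so "\\s" reaches the regex engine as \s. A raw string avoids the doubling.
same_pattern = "[\\s\\S]" == r"[\s\S]"

# Hypothetical HTML spread over several lines.
html = ('<select name="code">\n'
        '<option value="1">One</option>\n'
        '<option value="2" selected>Two</option>\n'
        '</select>')

# [\s\S]*? can skip over newlines to reach the selected option.
any_char = re.search(r'<select name="code"[\s\S]*?<option .*? selected>(.*?)</option>', html)

# A plain dot stops at the first newline, so this version finds nothing.
dot_only = re.search(r'<select name="code".*?<option .*? selected>(.*?)</option>', html)

print(same_pattern, any_char.group(1), dot_only)
```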

Introduced with VuGen 12 is a new function:

char *str = " ... the html text here ...";

lr_save_param_regexp(str, strlen(str),
	"RegExp=... the regex here ...",
	"ResultParam=selected_value",
	LAST);

General Rules

Winrunner TSL regular expressions have the following characteristics:

The concatenation of single-character operators matches the concatenation of the characters individually matched by each of the single-character operators.

Parentheses () can be used within a regular expression for grouping single-character operators. A group of single-character operators can be used anywhere one single-character operator can be used - for example, as the operand of the * operator.

Parentheses and the following non-ordinary operators have special meanings in regular expressions. They must be preceded by a backslash if they are to represent themselves:

Examples of Regular Expressions

This regular expression matches any day of the week:

This matches simple dates against 1 or 2 digits for the month, 1 or 2 digit for the day, and either 2 or 4 digits for the year. Matches: [4/5/91], [04/5/1991], [4/05/89] Non-Matches: [4/5/1]

((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))
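A quick Python check of this pattern against the examples listed above. It uses fullmatch so the whole string must fit; the slash needs no backslash in Python. The last test value is an assumption added to show a limit of the pattern.

```python
import re

simple_date = re.compile(r"((\d{2})|(\d))/((\d{2})|(\d))/((\d{4})|(\d{2}))")

samples = ["4/5/91", "04/5/1991", "4/05/89", "4/5/1"]
results = {s: bool(simple_date.fullmatch(s)) for s in samples}

# The pattern checks digit counts only, so an impossible date still passes.
shape_only = bool(simple_date.fullmatch("99/99/9999"))

print(results, shape_only)
```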

This matches a valid 24-hour time in the format hh:mm, so a value that fails to match is an incorrect time:

/((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9])/
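A Python sketch of using this pattern as a validator. The helper name is_valid_time is made up. The pattern has no anchors, so the sketch uses fullmatch; a plain search would find a valid-looking time inside an invalid value.

```python
import re

time_24h = re.compile(r"(?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9]")

def is_valid_time(text: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return time_24h.fullmatch(text) is not None

# Without anchoring, the search skips the leading "2" and matches the rest.
unanchored_hit = time_24h.search("24:00").group(0)

for candidate in ["9:05", "09:05", "23:59", "24:00", "12:60"]:
    print(candidate, is_valid_time(candidate))
print(unanchored_hit)
```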

Validate a number between 1 and 255, such as an IP octet:

^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$
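The same alternation can be exercised in Python. In this sketch the ^ and $ anchors are replaced by fullmatch, which has the same effect of requiring the whole string to match; the helper name is_octet is made up.

```python
import re

# One alternative per range: 1-9, 10-99, 100-199, 200-249, 250-255.
octet = re.compile(r"([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])")

def is_octet(text: str) -> bool:
    return octet.fullmatch(text) is not None

for candidate in ["1", "99", "199", "249", "255", "0", "256", "01"]:
    print(candidate, is_octet(candidate))
```

Note that 0 is rejected, as the description says, even though 0 is a legal octet in a real IP address.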

This breaks down a Uniform Resource Identifier (URI) into its component parts. (from ActiveState quoting Appendix B of IETF RFC 2396)

my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if
  $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

$1 = http:
$2 = http (the scheme)
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu (the authority)
$5 = /pub/ietf/uri/ (the path)
$6 =
$7 = (the query)
$8 = #Related
$9 = Related (the fragment)
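The same expression works unchanged in Python. This sketch picks out the five named components by group number; groups that did not take part in the match come back as None.

```python
import re

uri_parts = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = uri_parts.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")

# Groups 2, 4, 5, 7, and 9 are the scheme, authority, path, query, and fragment.
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)

print(scheme, authority, path, query, fragment)
```

For everyday code, Python's standard library function urllib.parse.urlsplit does the same job without a hand-written pattern.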

Validate an e-mail address whose domain is either a host name or a bracketed IP address in the form [255.255.255.255]. Of course, the best way to test an email address is to send e-mail to it:

^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$

Validates date in the US m/d/y format from 1/1/1600 - 12/31/9999. The days are validated for the given month and year. Leap years are validated for all 4 digits years from 1600-9999, and all 2 digits years except 00 since it could be any century (1900, 2000, 2100). Days and months must be 1 or 2 digits and may have leading zeros. Years must be 2 or 4 digit years. 4 digit years must be between 1600 and 9999. Date separator may be a slash (/), dash (-), or period (.)

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Validate passwords to be at least 4 characters, no more than 8 characters, and must include at least one upper case letter, one lower case letter, and one numeric digit.

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$
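Each (?=...) is a lookahead: it checks a condition from the start of the string without consuming any characters, so the three conditions can be satisfied in any order before .{4,8} checks the length. A Python sketch, with made-up sample passwords and a hypothetical helper name:

```python
import re

password_rule = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$")

def is_acceptable(password: str) -> bool:
    return password_rule.match(password) is not None

for candidate in ["aB3x", "3xBa", "abc1", "aB3", "aB3456789"]:
    print(candidate, is_acceptable(candidate))
```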

Validate major credit card numbers from Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16 digit formats accept optional hyphens (-) between each group of four digits.

^(?:((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$
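One detail worth checking with a pattern of this kind is how the anchors interact with the top-level alternation. In the Python sketch below, the grouped variant wraps both branches in (?:...) so that ^ and $ apply to each; in the ungrouped variant ^ binds only to the first branch and $ only to the last. Both variants and the card numbers (commonly published test numbers) are assumptions for illustration, and the check covers format only, not the checksum.

```python
import re

# Alternation wrapped in a group: both anchors apply to both branches.
grouped = re.compile(
    r"^(?:(?:4\d{3}|5[1-5]\d{2}|6011)-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$")

# Same branches without the wrapping group.
ungrouped = re.compile(
    r"^(?:4\d{3}|5[1-5]\d{2}|6011)-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13}$")

def looks_like_card(text: str) -> bool:
    return grouped.match(text) is not None

trailing_junk = "4111111111111111 and more"
print(looks_like_card(trailing_junk))          # rejected by the grouped pattern
print(bool(ungrouped.search(trailing_junk)))   # accepted by the ungrouped one
```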

Use extended grep to check for a valid MAC address, such as [01:23:45:67:89:ab], [01:23:45:67:89:AB], [fE:dC:bA:98:76:54], with colons separating octets. It will ignore strings that are too short or too long, or that contain invalid characters, such as [01:23:45:67:89:ab:cd], [01:23:45:67:89:Az], [01:23:45:56:]. It will accept mixed case hexadecimal.

^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
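The pattern is plain enough to run unchanged in Python as well as in extended grep. This sketch feeds it the valid and invalid examples listed above; the helper name is_mac is made up.

```python
import re

mac = re.compile(r"^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$")

def is_mac(text: str) -> bool:
    return mac.match(text) is not None

valid = ["01:23:45:67:89:ab", "01:23:45:67:89:AB", "fE:dC:bA:98:76:54"]
invalid = ["01:23:45:67:89:ab:cd", "01:23:45:67:89:Az", "01:23:45:56:"]

print([is_mac(s) for s in valid])
print([is_mac(s) for s in invalid])
```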

This matches the name of any state in the United States:

[ACF-IK-PR-W][a-y]{2,4}[a-y][CDIJMVY]?[a-z]{0,7}

But you probably use a drop-down list rather than making people type them out.

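This Perl script (from Craig Berry) uses a pattern to validate British Royal Mail postcodes used in the UK. Each code has 2 parts: the outward (first) part and the inward (second) part. The letters of the inward part cannot contain any character in "CIKMOV", although the script below checks only the overall shape and does not enforce that rule.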

use strict;
my @patterns = ('AN NAA', 'ANN NAA', 'AAN NAA', 'AANN NAA',
                'ANA NAA', 'AANA NAA', 'AAA NAA');
foreach (@patterns) {
  s/A/[A-Z]/g;
  s/N/\\d/g;
  s/ /\\s?/g;
}
my $re = join '|', @patterns;
while (<>) {
  print /^(?:$re)$/o ? "valid\n" : "invalid\n";
}
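The same template trick can be sketched in Python: each template letter is expanded into a character class and the results are joined with | into one pattern. This is a rough port for illustration, not the author's code; the function names are made up, and like the Perl version it checks only the shape of the code.

```python
import re

templates = ["AN NAA", "ANN NAA", "AAN NAA", "AANN NAA",
             "ANA NAA", "AANA NAA", "AAA NAA"]

def template_to_regex(template: str) -> str:
    # A = any capital letter, N = any digit, space = optional whitespace.
    return (template.replace("A", "[A-Z]")
                    .replace("N", r"\d")
                    .replace(" ", r"\s?"))

postcode = re.compile("|".join(template_to_regex(t) for t in templates))

def is_postcode_shaped(text: str) -> bool:
    # fullmatch plays the role of the ^(?: ... )$ wrapper in the Perl script.
    return postcode.fullmatch(text) is not None

for candidate in ["M1 1AA", "SW1A 2AA", "GIR 0AA", "M11AA", "12345", "m1 1aa"]:
    print(candidate, is_postcode_shaped(candidate))
```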

Alternately, the RegEx:

(AB|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GU|H|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|Y|ZE)([1-9]|[1-9][0-9]) [1-9][A-Z]{2}

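The RegEx below is labeled as verifying Canadian postal codes, but its shape (two prefix letters, three pairs of digits, and a suffix letter A to D, separated by spaces) is that of a UK National Insurance number; a Canadian postal code has the form A1A 1A1: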

[ABCEGHJKLMNOPRSTWXYZ]{1}[ABCEGHJKLMNPRSTWXYZ]{1}[ ][0-9]{2}[ ][0-9]{2}[ ][0-9]{2}[ ][ABCD]{1}

This matches any hexadecimal number of 1 to 4 digits, that is, a decimal value in the range 0 to 65535:

[a-fA-F0-9]{1,4}
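A small Python sketch confirming the range claim: every string the pattern accepts in full converts to a number no larger than 65535. The helper name and sample values are assumptions.

```python
import re

hex_word = re.compile(r"[a-fA-F0-9]{1,4}")

def is_hex_word(text: str) -> bool:
    # fullmatch keeps longer strings such as "10000" from passing on a partial match.
    return hex_word.fullmatch(text) is not None

largest = int("ffff", 16)  # the biggest value four hex digits can hold

for candidate in ["0", "ffff", "BEEF", "10000", "g1"]:
    print(candidate, is_hex_word(candidate))
print(largest)
```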

Visibone's FREE Regular Expressions detailed cheatsheet

$30 regexbuddy allows you to easily create, understand and test regex patterns for C# and VB.NET. It includes a library of expressions.


Regular Expression Recipes: A Problem-Solution Approach (Apress) by Nathan A. Good

Error Recovery with Regular Expressions

If a VU regular expression contains an error, when you run a suite, TestManager writes the message to stderr output prefixed with the following header:

sqa7vui#xxx: fatal orig type error: tname: sname, line lineno

where #xxx identifies the user ID (not present if 0), fatal signifies that error recovery is not possible (otherwise not present), orig specifies the error origination (user, system, server, or program), and type specifies the general error category (initialization, argument parsing, script initialization, or runtime). If the error occurred during execution of a script (run-time category), tname specifies the name of the script being executed when the error occurred, sname specifies the name of the VU source file that contains the VU statement causing the error, and lineno specifies the line number of this VU statement in the source file. Note that the source file information will not be available if the script's source cross-reference section has been stripped.

If a run-time error occurs due to an improper regular expression pattern in the match library function, a diagnostic message of the following form follows the header:

Regular Expression Error = errno

where errno is an error code which indicates the type of regular expression error. The following table lists the possible errno values and explains each.

errno Explanation

2 Illegal assignment form. Character after )$ must be a digit. Example: "([0-9]+)$x"

3 Illegal character inside braces. Expecting a digit. Example: "x{1,z}"

11 Exceeded maximum allowable assignments. Only $0 through $9 are valid. Example: "([0-9]+)$10"

30 Missing operand to a range operator (? {m,n} + *). Example: "?a"

31 Range operators (? {m,n} + *) must not immediately follow a left parenthesis. Example: "(?b)"

32 Two consecutive range operators (? {m,n} + *) are not allowed. Example: "[0-9]+?"

34 Range operators (? {m,n} + *) must not immediately follow an assignment operation. Example: "([0-9]+)$0{1-4}" Correction: "(([0-9]+)$0){1-4}"

36 Range level exceeds 254. Example: "[0-9]{1-255}"

39 Range nesting depth exceeded maximum of 18 during matching of subject string.

41 Pattern must have non-zero length. Example: ""

42 Call nesting depth exceeded 80 during matching of subject string.

44 Extra comma not allowed within braces. Example: "[0-9]{3,4,}"

46 Lower range parameter exceeds upper range parameter. Example: "[0-9]{4,3}"

49 '\0' not allowed within brackets, or missing right bracket. Example: "[\0]" or "[0-9"

55 Parenthesis nesting depth exceeds maximum of 18. Example: "(((((((((((((((((((x)))))))))))))))))))"

56 Unbalanced parentheses. More right parentheses than left parentheses. Example: "([0-9]+)$1)"

57 Program error. Please report.

70 Program error. Please report.

90 Unbalanced parentheses. More left parentheses than right parentheses. Example: "(([0-9]+)$1"

91 Program error. Please report.

100 Program error. Please report.

$15 Regular Expressions with .NET [PDF] by Dan Appleman (Feb. 2002: O'Reilly)

Real World Regular Expressions with Java 1.4 (Apress) by Mehran Habibi

.NET regex tools list by Mike Gunderloy at Larkware

Mastering Regular Expressions (O'Reilly, 1997) by Jeffrey E. F. Friedl is both a detailed tutorial and a detailed reference work on regular expression syntax.

Regular Expression Recipes for Windows Developers: A Problem-Solution Approach (Apress, 2005, 400 pages) by Nathan A. Good explains how to use over 100 of the most popular real-world regular expressions in a concise way.

C# Coding Example

The C# language provides a System.Text.RegularExpressions library:

	using System.Text.RegularExpressions;

This provides the Regex constructor, which instantiates a Regex object:

		var regex = new Regex( pattern );

Use the Match method defined within Regex on the subject text to generate a match object:

		var match = regex.Match( subject );

See what came back:

		Console.WriteLine( match.Success );

This code would go inside code to define a command-line program named MatchTest.exe:

CREDITS:

Dan Sullivan's 3.25 hour .NET Regex video course on Pluralsight shows how additional C# programming adds logic for handling groups, etc.
