Regular Expressions

Here are my notes summarizing pages of manuals and web pages on regular expressions.

 

Topics this page:

  • Summary
  • Metacharacters
  • General Rules
  • Examples
  • Error Recovery


    Summary

      A regular expression is a formula for matching strings that follow some pattern. Regular expressions provide a mechanism to extract a subset from a character string. For many applications that deal with strings (such as HTML processing, log file parsing, and HTTP header parsing), they are an indispensable tool.

      The extensive pattern-matching notation of regular expressions allows you to quickly parse large amounts of text to find specific character patterns to:

      • extract, edit, replace, or delete text substrings; and to
      • add the extracted strings to a collection in order to generate a report.
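
      As a rough illustration of both uses, the sketch below (Python, standard re module) extracts status codes from a few made-up log lines, tallies them into a small report, and replaces a substring. The log format, the sample lines, and the names LINE_RE and status_report are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical log lines; the format is an assumption for this sketch.
LOG_LINES = [
    'GET /index.html 200',
    'GET /missing.html 404',
    'POST /login 200',
]

# Capture the request path and the three-digit status code at the end of a line.
LINE_RE = re.compile(r'^(?:GET|POST) (\S+) (\d{3})$')

def status_report(lines):
    """Count how many lines carry each status code."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group(2)] += 1
    return dict(counts)

report = status_report(LOG_LINES)
print(report)

# The "replace" use: swap the trailing status code for a placeholder.
masked = re.sub(r'\d{3}$', 'NNN', LOG_LINES[0])
print(masked)
```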

      The match library routine of the VU scripting language, for example, accepts strings that are interpreted as regular expressions. C programs can get similar support from the POSIX regcomp() and regexec() functions.

      The Perl ("Practical Extraction and Report Language") language has become popular partly because of its extensive support for regular expressions. Perl allows you to embed regular expressions in file tests, control loops, output formats, and everything else.

      The development of regular expressions is traced back to the work of Stephen Cole Kleene (1909-1994), an American mathematician and theoretical computer scientist at Princeton and U. of Wisconsin-Madison.

      Text-manipulation tools on the Unix platform (including the ed and vi editors and the grep file search utility) made use of his notation for “the algebra of regular sets.” For this reason, the "*" repetition operator used in regular expressions is formally known as the "Kleene star." (The operation it denotes, zero or more repetitions of what precedes it, is also called the "Kleene closure.")

      Time and competition among vendors have produced several versions of regular expressions:

      • The historical Simple Basic Regular Expression (BRE) notation, described as part of the regexp() function in the XSH specification. It provides backward compatibility, but may be withdrawn from a future specification set.
      • The Extended Regular Expression (ERE) notation, which complies with the internationalized ISO/IEC 9945-2:1993 standard. It matches based on the bit pattern used for encoding the character, not on the graphic representation of the character (which may represent more than one bit pattern).
      • Microsoft's .NET Framework regular expressions, which included features that were uncommon in other implementations when they were introduced, such as right-to-left matching and on-the-fly compilation.

      Perl, Python, Emacs, Tcl, and .NET use a backtracking regular expression matcher that incorporates a traditional Nondeterministic Finite Automaton (NFA) engine.

      Awk, egrep, and lex use a faster, but more limited, pure regular expression Deterministic Finite Automaton (DFA) engine.

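      Standardized POSIX NFA engines are slower still, because they keep backtracking until they are sure they have found the longest possible match.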
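
      One visible consequence of the engine type is how alternation is resolved. The sketch below uses Python's re module, which is a backtracking matcher: the first alternative that succeeds wins, so the order of the alternatives changes the result. A leftmost-longest matcher (the POSIX rule) would report ab for both patterns. The variable names are only illustrative.

```python
import re

# Backtracking matcher: alternatives are tried left to right and the
# first one that succeeds is reported, even if a longer match exists.
first_wins = re.match(r'a|ab', 'ab').group(0)

# Putting the longer alternative first changes the answer, which would
# not happen under leftmost-longest (POSIX) semantics.
longer_first = re.match(r'ab|a', 'ab').group(0)

print(first_wins, longer_first)
```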

      The language comprises two basic character types: literal (normal) text characters and metacharacters. The special characters used as metacharacters in regular expressions enable a powerful, flexible, and efficient method for processing text.

      However, their compactness makes them easier to write than to read.

     


    Basic Metacharacters

      Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.

      Each entry below gives the metacharacter, followed by its operator name, what it matches, and an example regular expression.
      .
      period (dot): any single character except NUL (in Perl, Python, and .NET, any character except a newline by default). r.t matches the strings rat, rut, and r t, but not root (two o's) nor the Rot in Rotten (upper case R).
      *
      Kleene star, asterisk, wildcard: zero or more occurrences of the character immediately preceding. .* means match any number of any characters.
      $
      dollar sign (anchor): the end of a line. weasel$ matches the end of the string "He's a weasel" but not the string "They are a bunch of weasels."

      The $ operator is special when it is the last operator of a regular expression or immediately follows a right parenthesis. To match a literal dollar sign in those positions, it must be preceded by a backslash (\$).

      ^
      circumflex or caret (anchor): the beginning of a string/line. ^When in matches the beginning of the string "When in the course of human events" but does not match "What and When in the".
      [ ]
      [c1-c2]
      square brackets: any one of the characters between the brackets. r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter.
      [^c1-c2]
      caret within square brackets: the complement range, that is, any single character except those listed after a caret that is the first character after the opening bracket. [^269A-Z] matches any character except 2, 6, 9, and upper case letters.

      The ^ operator is special when it is the first operator of a regular expression or the first character inside brackets. To match a literal caret in those positions, it must be preceded by a backslash (\^).

      \
      backslash: the quoting character. Use it to treat the following character as an ordinary character. For example, \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character.

      Operators inside brackets do not need to be preceded by a backslash.

      \< \>
      backslash and angle brackets: the beginning (\<) or end (\>) of a word. \<the matches "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications.
      \( \)
      backslash and parentheses: the expression between \( and \) as a group. Also saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9.
      |
      pipe (alternation): ORs two conditions together. (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications.
      +
      plus sign: one or more occurrences of the character or regular expression immediately preceding. 9+ matches 9, 99, or 999. NOTE: this metacharacter is not supported by all applications.
      \{i\}
      \{i,j\}
      braces: a specific number of instances, or a number of instances within a range, of the preceding character. A[0-9]\{3\} matches "A" followed by exactly 3 digits. That is, it matches A123 but not the whole of A1234 (an unanchored search still finds A123 inside it).
      [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is supported by Robot's C-VU language but not by all applications.
      ?
      question mark: zero or one occurrence of the character or regular expression immediately preceding. ? is equivalent to {0,1}. NOTE: this metacharacter is supported by IBM/Rational Robot's C-VU language but not by all applications. Question marks are also used to specify non-greedy quantifiers. For example, "/A[A-Z]*?B/" means "match an A, followed by only as many capital letters as are needed to find a B."

      In addition, VU regular expressions can include ASCII control characters in the range 0 to 7F hex (0 to 127 decimal).
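
      The examples in the table can be tried out with a short script. The sketch below uses Python's re module, whose syntax differs from the notation above in a few places: groups and braces are written without backslashes, and \b stands in for \< and \>. The helper name found and the sample strings for the backreference and non-greedy cases are assumptions for illustration.

```python
import re

def found(pattern, text):
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# Dot, anchors, and bracket lists from the table.
dot_rut = found(r'r.t', 'rut')
dot_root = found(r'r.t', 'root')
end_anchor = found(r'weasel$', "He's a weasel")
start_anchor = found(r'^When in', 'What and When in the')
bracket_ret = found(r'r[aou]t', 'ret')

# Word start: \b plays the role of \< here.
word_the = found(r'\bthe', 'for the wise')
inner_the = found(r'\bthe', 'otherwise')

# \1 repeats whatever group 1 matched: here, a doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'it was was late').group(1)

# Greedy versus non-greedy quantifiers on the same input.
greedy = re.search(r'A[A-Z]*B', 'AXBYB').group(0)
lazy = re.search(r'A[A-Z]*?B', 'AXBYB').group(0)

print(dot_rut, dot_root, end_anchor, start_anchor, bracket_ret)
print(word_the, inner_the, doubled, greedy, lazy)
```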


    Extended Characters

      These additional escape sequences are recognized in Ruby regular expressions:

      \A Beginning of a string
      \b Word boundary
      \B Non-word boundary
      \d Digit, same as [0-9]
      \D Non-digit
      \s Whitespace
      \S Non-whitespace
      \w Word character
      \W Non-word character
      \z End of a string
      \Z End of string, or just before a trailing newline
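
      Most of these sequences behave the same way in other Perl-style engines. The sketch below tries a few of them in Python's re module rather than Ruby, with one difference worth noting: in Python, \Z means the absolute end of the string (Ruby's \z), while $ tolerates one trailing newline. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'Order 66 shipped\n'

digits = re.findall(r'\d+', text)      # runs of digits
words = re.findall(r'\w+', text)       # runs of word characters
chunks = re.findall(r'\S+', text)      # runs of non-whitespace

# \A anchors at the very start of the string; \b marks a word boundary.
starts_with_order = re.search(r'\AOrder\b', text) is not None

# Python's \Z is the absolute end of the string, so the trailing newline
# prevents a match; $ also matches just before one trailing newline.
strict_end = re.search(r'shipped\Z', text) is not None
lenient_end = re.search(r'shipped$', text) is not None

print(digits, words, chunks)
print(starts_with_order, strict_end, lenient_end)
```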


    General Rules

      WinRunner TSL regular expressions have the following characteristics:

      • The concatenation of single-character operators matches the concatenation of the characters individually matched by each of the single-character operators.
      • Parentheses () can be used within a regular expression for grouping single-character operators. A group of single-character operators can be used anywhere one single-character operator can be used - for example, as the operand of the * operator.
      • Parentheses and the other non-ordinary operators have special meanings in regular expressions. They must be preceded by a backslash if they are to represent themselves.
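
      A minimal sketch of grouping and escaping, written for Python's re module rather than TSL (the two dialects differ in detail, so treat this as an analogy). It shows a parenthesized group used as the operand of *, a backslash making parentheses represent themselves, and re.escape, which adds the backslashes automatically. The sample strings are assumptions.

```python
import re

# A parenthesized group can be the operand of *, just like a single character.
grouped = re.fullmatch(r'(ab)*c', 'ababc') is not None

# Without the group, * applies only to the b.
ungrouped = re.fullmatch(r'ab*c', 'ababc') is not None

# A backslash makes an operator represent itself.
literal_parens = re.search(r'\(x\)', 'f(x)') is not None

# re.escape quotes every special character in text that comes from elsewhere.
price = '$9.99 (sale)'
escaped = re.escape(price)
matches_itself = re.fullmatch(escaped, price) is not None

print(grouped, ungrouped, literal_parens, matches_itself)
```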

    Examples of Regular Expressions

      This regular expression matches any day of the week:

      ((Mon)|(Tues)|(Wednes)|(Thurs)|(Fri)|(Satur)|(Sun))day

      This matches simple dates with 1 or 2 digits for the month, 1 or 2 digits for the day, and either 2 or 4 digits for the year. Matches: [4/5/91], [04/5/1991], [4/05/89] Non-Matches: [4/5/1]

      ((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))
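
      The listed matches and non-match can be checked with a few lines of Python. The sketch below assumes the whole string must be a date, so it uses fullmatch; the pattern itself is unanchored. The names DATE_RE and is_simple_date are made up for this example.

```python
import re

# Same pattern as above; fullmatch supplies the anchoring.
DATE_RE = re.compile(r'((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))')

def is_simple_date(text):
    """True if the whole string has the month/day/year shape described above."""
    return DATE_RE.fullmatch(text) is not None

for sample in ['4/5/91', '04/5/1991', '4/05/89', '4/5/1']:
    print(sample, is_simple_date(sample))
```

      Note that the pattern checks only the shape of the date: a string such as 99/99/99 is accepted as well.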

      This identifies valid 24-hour time in the format hh:mm:

      /((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9])/

      Validate a number between 1 and 255, such as an IP octet:

      ^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$
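
      Because the range is small, the pattern can be checked exhaustively. The Python sketch below tests every integer from 0 to 999 and collects the ones the pattern accepts; OCTET_RE and in_1_to_255 are illustrative names.

```python
import re

OCTET_RE = re.compile(r'^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$')

def in_1_to_255(text):
    """True if text is a decimal number from 1 to 255 without leading zeros."""
    return OCTET_RE.match(text) is not None

# Brute-force check: which numbers from 0 to 999 are accepted?
accepted = [n for n in range(1000) if in_1_to_255(str(n))]
print(accepted[0], accepted[-1], len(accepted))
```

      Zero and numbers with leading zeros (such as 01) are rejected, so a full IP address validator would need an extra alternative for 0.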

      This breaks down a Uniform Resource Identifier (URI) into its component parts. (from ActiveState quoting Appendix B of IETF RFC 2396)

      my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
      print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if
        $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};
      

        $1 = http:
        $2 = http (the scheme)
        $3 = //www.ics.uci.edu
        $4 = www.ics.uci.edu (the authority)
        $5 = /pub/ietf/uri/ (the path)
        $6 = <undefined>
        $7 = <undefined> (the query)
        $8 = #Related
        $9 = Related (the fragment)
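
      The same pattern works unchanged in other Perl-style engines. Below is a minimal Python sketch that names the five useful groups; the function name split_uri and the dictionary keys are assumptions made for this example. Groups that do not take part in the match come back as None.

```python
import re

# The pattern from Appendix B of RFC 2396, unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Break a URI into its component parts using the numbered groups."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

parts = split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related')
print(parts)
```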

      Validate an e-mail address whose domain part is either a domain name or a bracketed IP address in the form [255.255.255.255]. Of course, the best way to test an e-mail address is to send e-mail to it:

      ^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$

      Validates a date in the US m/d/y format from 1/1/1600 to 12/31/9999. The days are validated for the given month and year. Leap years are validated for all 4-digit years from 1600 to 9999, and for all 2-digit years except 00, since it could be in any century (1900, 2000, 2100). Days and months must be 1 or 2 digits and may have leading zeros. Years must be 2 or 4 digits, and 4-digit years must be between 1600 and 9999. The date separator may be a slash (/), dash (-), or period (.), used consistently within one date.

      ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

      Validate passwords to be at least 4 characters, no more than 8 characters, and must include at least one upper case letter, one lower case letter, and one numeric digit.

      ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$
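
      Each (?=...) is a lookahead: it checks that something appears later in the string without consuming any characters, so the three conditions can be tested independently before .{4,8} checks the length. A minimal Python sketch follows; PASSWORD_RE and is_valid_password are illustrative names.

```python
import re

PASSWORD_RE = re.compile(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$')

def is_valid_password(text):
    """True if text is 4 to 8 characters with a digit, a lower and an upper case letter."""
    return PASSWORD_RE.match(text) is not None

for sample in ['aB3d', 'abc1', 'aB3', 'aB3defghi', 'ABCD1234']:
    print(sample, is_valid_password(sample))
```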

      Validate major credit card numbers from Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16 digit formats accept optional hyphens (-) between each group of four digits.

      ^(?:((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$

      This uses extended grep (egrep) syntax to match a valid MAC address, such as [01:23:45:67:89:ab], [01:23:45:67:89:AB], [fE:dC:bA:98:76:54], with colons separating octets. It rejects strings that are too short or too long, or that contain invalid characters, such as [01:23:45:67:89:ab:cd], [01:23:45:67:89:Az], [01:23:45:56:]. It accepts mixed case hexadecimal.

      ^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
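
      The sample addresses listed above can be run through the pattern directly. This sketch uses Python's re module instead of grep; the pattern is unchanged, and the names MAC_RE and is_mac are assumptions for illustration.

```python
import re

MAC_RE = re.compile(r'^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$')

def is_mac(text):
    """True if text is six colon-separated pairs of hexadecimal digits."""
    return MAC_RE.match(text) is not None

GOOD = ['01:23:45:67:89:ab', '01:23:45:67:89:AB', 'fE:dC:bA:98:76:54']
BAD = ['01:23:45:67:89:ab:cd', '01:23:45:67:89:Az', '01:23:45:56:']

print([is_mac(s) for s in GOOD])
print([is_mac(s) for s in BAD])
```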

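      This is intended to match the name of a state in the United States. It is a loose character-class filter rather than a list of names, so it also accepts many other capitalized words and misses some real names (for example, it does not match New York):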

      [ACF-IK-PR-W][a-y]{2,4}[a-y][CDIJMVY]?[a-z]{0,7}

      But you would probably use a drop-down list rather than making people type the names out.

      This Perl script (from Craig Berry) uses a pattern to validate the shape of British Royal Mail postcodes used in the UK. Each code has 2 parts, outward and inward. The letters of the inward (second) part cannot be any of "CIKMOV", a restriction this script does not check.

      use strict;
      my @patterns = ('AN NAA', 'ANN NAA', 'AAN NAA', 'AANN NAA',
                      'ANA NAA', 'AANA NAA', 'AAA NAA');
      foreach (@patterns) {
        s/A/[A-Z]/g;
        s/N/\\d/g;
        s/ /\\s?/g;
      }
      my $re = join '|', @patterns;
      while (<>) {
        print /^(?:$re)$/o ? "valid\n" : "invalid\n";
      }
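
      The trick in the script is to write the allowed shapes as readable templates and expand them into one alternation. Below is a rough Python port of the same idea, given as a sketch: like the Perl version it checks only the shape, and the names template_to_regex and is_postcode_shaped are made up for this example.

```python
import re

TEMPLATES = ['AN NAA', 'ANN NAA', 'AAN NAA', 'AANN NAA',
             'ANA NAA', 'AANA NAA', 'AAA NAA']

# A stands for a capital letter, N for a digit, a space for optional whitespace.
PARTS = {'A': '[A-Z]', 'N': r'\d', ' ': r'\s?'}

def template_to_regex(template):
    """Expand one template such as 'AN NAA' into a regular expression."""
    return ''.join(PARTS[ch] for ch in template)

# Join all expanded templates into a single anchored alternation.
POSTCODE_RE = re.compile(
    '^(?:' + '|'.join(template_to_regex(t) for t in TEMPLATES) + ')$')

def is_postcode_shaped(text):
    """True if text has one of the postcode shapes listed in TEMPLATES."""
    return POSTCODE_RE.match(text) is not None

for sample in ['SW1A 1AA', 'M1 1AA', 'M11AA', '12345']:
    print(sample, 'valid' if is_postcode_shaped(sample) else 'invalid')
```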
      

      This matches any hexadecimal number of 1 to 4 digits, that is, a decimal value in the range 0 to 65535:

      [a-fA-F0-9]{1,4}

     


    Error Recovery with Regular Expressions

      If a VU regular expression contains an error, TestManager writes a message to stderr when you run a suite, prefixed with the following header:

      sqa7vui#xxx: fatal orig type error: tname: sname, line lineno

      where #xxx identifies the user ID (not present if 0), fatal signifies that error recovery is not possible (otherwise not present), orig specifies the error origination (user, system, server, or program), and type specifies the general error category (initialization, argument parsing, script initialization, or runtime). If the error occurred during execution of a script (run-time category), tname specifies the name of the script being executed when the error occurred, sname specifies the name of the VU source file that contains the VU statement causing the error, and lineno specifies the line number of this VU statement in the source file. Note that the source file information will not be available if the script's source cross-reference section has been stripped.

      If a run-time error occurs due to an improper regular expression pattern in the match library function, a diagnostic message of the following form follows the header:

      Regular Expression Error = errno

      where errno is an error code which indicates the type of regular expression error. The following table lists the possible errno values and explains each.

      errno Explanation

      2 Illegal assignment form. Character after )$ must be a digit. Example: "([0-9]+)$x"

      3 Illegal character inside braces. Expecting a digit. Example: "x{1,z}"

      11 Exceeded maximum allowable assignments. Only $0 through $9 are valid. Example: "([0-9]+)$10"

      30 Missing operand to a range operator (? {m,n} + *). Example: "?a"

      31 Range operators (? {m,n} + *) must not immediately follow a left parenthesis. Example: "(?b)"

      32 Two consecutive range operators (? {m,n} + *) are not allowed. Example: "[0-9]+?"

      34 Range operators (? {m,n} + *) must not immediately follow an assignment operation. Example: "([0-9]+)$0{1-4}" Correction: "(([0-9]+)$0){1-4}"

      36 Range level exceeds 254. Example: "[0-9]{1-255}"

      39 Range nesting depth exceeded maximum of 18 during matching of subject string.

      41 Pattern must have non-zero length. Example: ""

      42 Call nesting depth exceeded 80 during matching of subject string.

      44 Extra comma not allowed within braces. Example: "[0-9]{3,4,}"

      46 Lower range parameter exceeds upper range parameter. Example: "[0-9]{4,3}"

      49 '\0' not allowed within brackets, or missing right bracket. Example: "[\0]" or "[0-9"

      55 Parenthesis nesting depth exceeds maximum of 18. Example: "(((((((((((((((((((x)))))))))))))))))))"

      56 Unbalanced parentheses. More right parentheses than left parentheses. Example: "([0-9]+)$1)"

      57 Program error. Please report.

      70 Program error. Please report.

      90 Unbalanced parentheses. More left parentheses than right parentheses. Example: "(([0-9]+)$1"

      91 Program error. Please report.

      100 Program error. Please report.
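
      Other regular expression libraries report malformed patterns in a similar spirit, though with their own messages rather than these errno values. As a loose analogy only (this is Python's re module, not VU), the sketch below compiles a few patterns resembling the examples in the table and returns the library's error text. The helper name compile_error is an assumption; note that some patterns VU rejects, such as "[0-9]+?", are valid in Python.

```python
import re

def compile_error(pattern):
    """Return the error message for an invalid pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Patterns modeled on the table: reversed range, unbalanced parenthesis,
# missing right bracket, missing operand, and one valid pattern.
SAMPLES = ['[0-9]{4,3}', '(([0-9]+)', '[0-9', '?a', '[0-9]+']

for pattern in SAMPLES:
    print(repr(pattern), '->', compile_error(pattern))
```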

     


    Portions ©Copyright 1996-2007 Wilson Mar. All rights reserved.
