Using Regular Expressions for Input Validation

Areas: ASP, ASP.NET (C#), ColdFusion, Java Servlets, JSP, Perl

Form input often has to follow a specific format, so part of processing a submission is checking that each value conforms to that format. Regular expressions are a powerful and robust way of matching text against a pattern. A typical Regular expression is a combination of symbols, metacharacters and quantifiers that appear in a certain order to form a pattern. In and of themselves, Regular expressions can look cryptic, but their concise form achieves functionality that would otherwise require comparatively large amounts of code.

In CodeCharge Studio, textual input values can be validated using Regular expressions. This is done by specifying the Regular expression in the Input Validation property of the field. When the user submits the form, the submitted values are checked against the Regular expression; if they don't match the pattern, form processing is terminated and an error message is displayed. The Input Validation property comes with two default Regular expressions, one for validating five-digit zip codes and one for email addresses. You can, however, construct your own Regular expression and enter it into the Input Validation property.
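The check-then-report flow described above can be sketched in Python. This is only an illustration under stated assumptions, not the code CodeCharge Studio generates: `validate_field`, `ZIP_PATTERN` and the error text are hypothetical names chosen for this example.

```python
import re

# Hypothetical stand-in for the value of a field's Input Validation property.
ZIP_PATTERN = r"^\d{5}$"

def validate_field(value, pattern, label):
    """Return an error message if value does not match pattern, else None."""
    if re.search(pattern, value) is None:
        return f"The value in field {label} is not valid."
    return None

# Simulate two submissions of the same field: one valid, one too short.
errors = [
    message
    for message in (
        validate_field("90210", ZIP_PATTERN, "Zip"),
        validate_field("9021", ZIP_PATTERN, "Zip"),
    )
    if message is not None
]
print(errors)  # ['The value in field Zip is not valid.']
```

A non-empty `errors` list corresponds to the case where form processing stops and the messages are shown to the user.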

Two of the symbols that are commonly used in Regular expressions are ^ and $:

^  Indicates the start of a string. ^ABC - matches any string that starts with 'ABC'
$  Indicates the end of a string. XYZ$ - matches any string that ends with 'XYZ'
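As a quick check of the two anchors, here is a minimal Python sketch. The helper name `matches` is an assumption made for this example; `re.search` is used because it looks for the pattern anywhere in the string, which makes the effect of the anchors visible.

```python
import re

def matches(pattern, text):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"^ABC", "ABCDEF"))  # True: the string starts with 'ABC'
print(matches(r"^ABC", "xABC"))    # False: 'ABC' is not at the start
print(matches(r"XYZ$", "UVWXYZ"))  # True: the string ends with 'XYZ'
print(matches(r"XYZ$", "XYZ1"))    # False: 'XYZ' is not at the end

# With both anchors the pattern has to account for the whole string,
# which is usually what input validation needs.
print(matches(r"^ABC$", "ABC"))    # True
print(matches(r"^ABC$", "ABCD"))   # False
```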

Along with the above symbols, a Regular expression usually contains a number of quantifiers that denote the number of times a character can occur:

*  Represents 0 or more occurrences. abc* - matches a string that has 'ab' followed by 0 or more occurrences of 'c' e.g. 'ab', 'abc', 'abcccc'
+  Represents 1 or more occurrences. abc+ - matches a string that has 'ab' followed by 1 or more occurrences of 'c' e.g. 'abc', 'abcc'
?  Represents 0 or 1 occurrences. abc? - matches a string that has 'ab' and is optionally followed by one 'c' e.g. 'ab', 'abc'
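The three quantifiers can be compared side by side with a short Python sketch. The helper `full` is a name assumed for this example; it uses `re.fullmatch` so that each sample has to match the pattern from start to end. Note that the quantifier applies only to the character immediately before it, here the 'c'.

```python
import re

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["ab", "abc", "abcc", "abcccc"]

star = [s for s in samples if full(r"abc*", s)]  # 0 or more 'c'
plus = [s for s in samples if full(r"abc+", s)]  # 1 or more 'c'
opt = [s for s in samples if full(r"abc?", s)]   # 0 or 1 'c'

print(star)  # ['ab', 'abc', 'abcc', 'abcccc']
print(plus)  # ['abc', 'abcc', 'abcccc']
print(opt)   # ['ab', 'abc']
```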

Curly braces {} are used to specify bounds that indicate the ranges in the number of occurrences:

{n}    Matches exactly n times. abc{2} - matches a string that has 'ab' followed by 2 occurrences of 'c' i.e. 'abcc'
{n,}   Matches n or more times. abc{2,} - matches a string that has 'ab' followed by 2 or more occurrences of 'c' e.g. 'abcc', 'abcccccc'
{n,m}  Matches between n and m times. abc{2,4} - matches a string that has 'ab' followed by 2 to 4 occurrences of 'c' e.g. 'abcc', 'abccc', 'abcccc'

The quantifiers *, + and ? could also be expressed in terms of bounds i.e. {0,}, {1,} and {0,1} respectively. Which version to use is simply a matter of choice.
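The bounds, and their equivalence to *, + and ?, can be verified with a small Python sketch. The helper `matching` and the sample list are assumptions made for this example.

```python
import re

samples = ["ab", "abc", "abcc", "abccc", "abcccc", "abccccc"]

def matching(pattern):
    """Return the samples that match pattern from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_two = matching(r"abc{2}")
two_or_more = matching(r"abc{2,}")
two_to_four = matching(r"abc{2,4}")

print(exactly_two)  # ['abcc']
print(two_or_more)  # ['abcc', 'abccc', 'abcccc', 'abccccc']
print(two_to_four)  # ['abcc', 'abccc', 'abcccc']

# Each shorthand quantifier selects the same samples as its bounded form.
print(matching(r"abc*") == matching(r"abc{0,}"))   # True
print(matching(r"abc+") == matching(r"abc{1,}"))   # True
print(matching(r"abc?") == matching(r"abc{0,1}"))  # True
```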

When a sequence of characters needs to be treated as a single entity, a pair of parentheses is used:

( )  Groups together a sequence of characters. a(bc)* - matches any string that has 'a' followed by 0 or more occurrences of 'bc' e.g. 'a', 'abcbcbc'

(ab){2}c - matches any string that has 2 occurrences of 'ab' followed by 'c' e.g. 'ababc'
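The difference that parentheses make to a quantifier can be seen in a brief Python sketch. The variable names are assumptions made for this example.

```python
import re

# With a group, * repeats the whole sequence 'bc'.
grouped = re.compile(r"a(bc)*")
# Without a group, * repeats only the 'c'.
ungrouped = re.compile(r"abc*")

print(bool(grouped.fullmatch("a")))        # True: zero occurrences of 'bc'
print(bool(grouped.fullmatch("abcbcbc")))  # True: three occurrences of 'bc'
print(bool(grouped.fullmatch("abcb")))     # False: trailing 'b' is left over
print(bool(ungrouped.fullmatch("abcbc")))  # False: only 'c' may repeat

# A bound applied to a group counts repetitions of the group.
twice_ab = re.compile(r"(ab){2}c")
print(bool(twice_ab.search("ababc")))      # True
print(bool(twice_ab.search("abc")))        # False: only one 'ab'
```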

Square brackets are used to represent the characters that are acceptable in a single position of a string:

[ ] Characters that could appear in a single position of a string a[0-9]b - matches any string that has a digit between 'a' and 'b' e.g. 'a1b', 'a4b'

^[a-e] - matches any string that begins with one of the lowercase letters 'a' through 'e' e.g. 'ab', 'effe'
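A short Python sketch of the two bracket examples follows. The helper name `matches` is an assumption made for this example. A bracket expression stands for exactly one character, which is why 'a12b' is rejected.

```python
import re

def matches(pattern, text):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"a[0-9]b", "a1b"))   # True
print(matches(r"a[0-9]b", "a4b"))   # True
print(matches(r"a[0-9]b", "axb"))   # False: 'x' is not in the range 0-9
print(matches(r"a[0-9]b", "a12b"))  # False: brackets cover one position only

print(matches(r"^[a-e]", "ab"))     # True
print(matches(r"^[a-e]", "effe"))   # True
print(matches(r"^[a-e]", "fab"))    # False: 'f' is outside a-e
print(matches(r"^[a-e]", "Ab"))     # False: the range is lowercase only
```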

The symbol | is used to indicate a logical OR condition where either of two choices could make a match:

|  Represents an OR condition. abc|xyz - matches any string that has either 'abc' or 'xyz'

(ab){2}|(ac){5} - matches any string that has 2 occurrences of 'ab' or 5 occurrences of 'ac'
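One point worth checking when combining | with anchors is that the OR applies to everything on each side of it, anchors included. The Python sketch below illustrates this; the helper name `matches` is an assumption made for this example.

```python
import re

def matches(pattern, text):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"abc|xyz", "12abc"))  # True: contains 'abc'
print(matches(r"abc|xyz", "xyz34"))  # True: contains 'xyz'
print(matches(r"abc|xyz", "abxy"))   # False: contains neither

print(matches(r"(ab){2}|(ac){5}", "abab"))        # True
print(matches(r"(ab){2}|(ac){5}", "acacacacac"))  # True
print(matches(r"(ab){2}|(ac){5}", "acac"))        # False: only 2 of 'ac'

# ^abc|xyz$ means "starts with abc" OR "ends with xyz".
print(matches(r"^abc|xyz$", "abcdef"))    # True
# Parentheses limit the OR so that both anchors apply to each choice.
print(matches(r"^(abc|xyz)$", "abcdef"))  # False
print(matches(r"^(abc|xyz)$", "xyz"))     # True
```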

Along with the above quantifiers, Regular expressions also use metacharacters to represent certain types of characters. The most general metacharacter is the period (.) which is used to represent any single character.

.  Represents any single character. a.b - matches a string that has an 'a' followed by any one character and then a 'b' e.g. 'a_b', 'a@b'

^.{5}$ - matches any string with exactly 5 characters e.g. 'abd12', 'as&@a'
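The period can be tried out with the Python sketch below. Two details shown here are specific to Python's `re` module and may differ in other engines: by default the period does not match a newline unless the `re.DOTALL` flag is given, and a literal period has to be written as `\.`. The helper name `matches` is an assumption made for this example.

```python
import re

def matches(pattern, text, flags=0):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text, flags) is not None

print(matches(r"a.b", "a_b"))  # True
print(matches(r"a.b", "a@b"))  # True
print(matches(r"a.b", "ab"))   # False: one character is required in between

print(matches(r"^.{5}$", "abd12"))   # True: exactly 5 characters
print(matches(r"^.{5}$", "as&@a"))   # True
print(matches(r"^.{5}$", "abcdef"))  # False: 6 characters

# Python-specific behaviour: '.' skips newlines unless re.DOTALL is set.
print(matches(r"a.b", "a\nb"))             # False
print(matches(r"a.b", "a\nb", re.DOTALL))  # True

# Escaping the period makes it match a literal '.' only.
print(matches(r"a\.b", "a.b"))  # True
print(matches(r"a\.b", "a_b"))  # False
```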

The \d metacharacter is used to represent any digit, i.e. 0-9.

\d Represents a digit. a[\d]c - matches any string that has 'a' followed by a digit then 'c' e.g. 'a0c', 'a9c'

\d$ - matches any string that ends with a digit e.g. 'abc1', '1'
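The \d examples can be checked with the Python sketch below. One caveat applies to Python 3 specifically and may not hold for other engines: on text strings \d also accepts digits from other scripts, and the `re.ASCII` flag restricts it to 0-9. The helper name `matches` is an assumption made for this example.

```python
import re

def matches(pattern, text, flags=0):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text, flags) is not None

# Inside or outside brackets, \d stands for one digit.
print(matches(r"a[\d]c", "a0c"))  # True
print(matches(r"a\dc", "a9c"))    # True
print(matches(r"a\dc", "abc"))    # False: 'b' is not a digit

print(matches(r"\d$", "abc1"))    # True: ends with a digit
print(matches(r"\d$", "1"))       # True
print(matches(r"\d$", "1abc"))    # False: ends with a letter

# Python 3 caveat: U+0663 is the Arabic-Indic digit three.
print(matches(r"^\d$", "\u0663"))            # True
print(matches(r"^\d$", "\u0663", re.ASCII))  # False: only 0-9 allowed
```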

The \w metacharacter is used to represent any "word" character (alphanumeric plus "_").

\w Represents a word character. ^\d\w - matches any string that begins with a digit followed by a word character e.g. '1_', '9z'

a\w|b. - matches any string that has 'a' followed by a word character or has 'b' followed by any character. e.g. 'af', 'b ', 'b#'
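The \w examples can be checked with the Python sketch below. As a Python 3 detail that may differ in other engines, \w on text strings also accepts accented and non-Latin letters; the `re.ASCII` flag limits it to letters a-z and A-Z, digits and the underscore. The helper name `matches` is an assumption made for this example.

```python
import re

def matches(pattern, text, flags=0):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text, flags) is not None

print(matches(r"^\d\w", "1_"))  # True: digit, then underscore
print(matches(r"^\d\w", "9z"))  # True: digit, then letter
print(matches(r"^\d\w", "1-"))  # False: '-' is not a word character
print(matches(r"^\d\w", "z9"))  # False: does not begin with a digit

print(matches(r"a\w|b.", "af"))  # True: 'a' then a word character
print(matches(r"a\w|b.", "b "))  # True: 'b' then any character
print(matches(r"a\w|b.", "b#"))  # True
print(matches(r"a\w|b.", "a "))  # False: space is not a word character

# Python 3 caveat: U+00E9 is 'e' with an acute accent.
print(matches(r"^\w$", "\u00e9"))            # True
print(matches(r"^\w$", "\u00e9", re.ASCII))  # False
```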

The \s metacharacter is used to represent a whitespace character while \t represents a tab.

\s  Represents a whitespace character. \d\s\d - matches any string that has two digits separated by a whitespace character e.g. '01 234', 'asc8 9xyz'
\t  Represents a tab character. ^\d{3}\t\d{5}$ - matches a string that has 3 digits followed by a tab and then 5 more digits e.g. '000    12345' (where the gap is a single tab)
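The difference between \s and \t can be seen in the Python sketch below: a tab counts as whitespace, but a space does not count as a tab. The helper name `matches` is an assumption made for this example, and the tab in the sample strings is written as the escape `\t` so that it is visible.

```python
import re

def matches(pattern, text):
    """True if pattern is found somewhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"\d\s\d", "01 234"))     # True: '1', space, '2'
print(matches(r"\d\s\d", "asc8 9xyz"))  # True: '8', space, '9'
print(matches(r"\d\s\d", "8\t9"))       # True: a tab is whitespace too
print(matches(r"\d\s\d", "1234"))       # False: no whitespace between digits

print(matches(r"^\d{3}\t\d{5}$", "000\t12345"))    # True
print(matches(r"^\d{3}\t\d{5}$", "000 12345"))     # False: space, not a tab
print(matches(r"^\d{3}\t\d{5}$", "000    12345"))  # False: four spaces
```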

With the symbols, quantifiers and metacharacters above, you can construct complex Regular expressions to validate various types of values. The table below presents some examples of Regular expressions:

Value                    Regular Expression    Example
Email                                          somewhere@servername.com
USA Zip Code                                   12345 or 12345-6789
Phone Home                                     (123) 456-7892 or 123-456-7892
Social Security Number                         000-00-0000
Password                                       password1 (6-15 characters, must begin with a letter and end with a digit)
Internet URL                                   http://www.codecharge.com
                                               http://www.codecharge.com/download/
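To show how the building blocks combine, the Python sketch below checks the example values from the table against one possible pattern per row. These patterns are assumptions written for this sketch, not the expressions supplied with CodeCharge Studio, and they are deliberately simple: the email and URL patterns in particular accept only common forms. The names `PATTERNS`, `EXAMPLES` and `is_valid` are also hypothetical.

```python
import re

# Illustrative patterns only (assumed for this sketch).
PATTERNS = {
    "Email": r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$",
    "USA Zip Code": r"^\d{5}(-\d{4})?$",
    "Phone": r"^(\(\d{3}\) |\d{3}-)\d{3}-\d{4}$",
    "Social Security Number": r"^\d{3}-\d{2}-\d{4}$",
    # 6-15 characters: a letter, 4-13 word characters, then a digit.
    "Password": r"^[A-Za-z]\w{4,13}\d$",
    "Internet URL": r"^https?://[\w-]+(\.[\w-]+)+(/[\w./-]*)?$",
}

# Example values taken from the table.
EXAMPLES = {
    "Email": ["somewhere@servername.com"],
    "USA Zip Code": ["12345", "12345-6789"],
    "Phone": ["(123) 456-7892", "123-456-7892"],
    "Social Security Number": ["000-00-0000"],
    "Password": ["password1"],
    "Internet URL": [
        "http://www.codecharge.com",
        "http://www.codecharge.com/download/",
    ],
}

def is_valid(kind, value):
    """True if value matches the illustrative pattern for kind."""
    return re.search(PATTERNS[kind], value) is not None

for kind, values in EXAMPLES.items():
    for value in values:
        print(f"{kind}: {value!r} -> {is_valid(kind, value)}")
```

Every example value from the table is accepted by its pattern, while values such as '1234' for the zip code or 'pass1' for the password are rejected.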

If you use JSP or Java Servlets, you have to download and install the Jakarta ORO package in order to be able to process Regular expressions.

Conclusion:
This article presents an introduction to Regular expressions. A thorough treatment of Regular expressions would require a book dedicated to the subject, and many components are not addressed here. The interested reader should consult literature that deals exclusively with Regular expressions. However, the components addressed here should form a solid basis for creating the sort of Regular expressions needed for input validation in CodeCharge Studio.



Last updated: 12/13/2002 12:12:37 AM