Very often, the input to
forms must follow a specific format. As such, part of the submission
process involves checking the input to ensure that it conforms to that
format. Regular expressions are a powerful and robust way of matching
text against a pattern. A typical Regular expression is composed of a combination
of symbols, metacharacters and quantifiers appearing in a certain order to form
a pattern. In and of themselves, Regular expressions appear cryptic, but
their concise form achieves functionality that would
otherwise require comparatively large amounts of code.
In CodeCharge Studio, textual input
values can be validated using Regular expressions. This is
done by specifying the Regular expression in the Input Validation
property of the field. When the user submits the form, the submitted values
are checked against the Regular expression; if they don't match the pattern,
form processing is terminated and an error message is displayed. The Input Validation
property comes with two default Regular expressions for validating 5-digit zip
codes and email addresses. You can, however, construct your own Regular expression
and enter it into the Input Validation property.
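As a rough illustration of the check described above, the sketch below tests a submitted value against a pattern and returns an error message when the value does not match. It is written in Python with the standard `re` module purely for illustration. `validate_field` and `ZIP_PATTERN` are hypothetical names, and the pattern is an assumed 5-digit zip code expression, not necessarily the one that ships with CodeCharge Studio. The symbols it uses are explained in the rest of this article.

```python
import re

# Assumed pattern for a 5-digit zip code (illustrative only; the default
# expression shipped with CodeCharge Studio is not shown in this article).
ZIP_PATTERN = r"^[0-9]{5}$"


def validate_field(value, pattern):
    """Return an error message if value does not match pattern, else None."""
    if re.search(pattern, value) is None:
        return "The value '%s' is not in the expected format." % value
    return None


print(validate_field("90210", ZIP_PATTERN))  # None, so processing continues
print(validate_field("9021", ZIP_PATTERN))   # an error message, so processing stops
```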
Two of the symbols that
are commonly used in Regular expressions are ^ and $:
^ | Indicates the start of a string. | ^ABC - matches any string that starts with 'ABC' |
$ | Indicates the end of a string. | XYZ$ - matches any string that ends with 'XYZ' |
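The effect of the two anchors can be tried out with a few lines of Python. This is only a sketch using the standard `re` module, and the helper name `matches` is made up for the example.

```python
import re


def matches(pattern, text):
    # re.search scans the whole string, so only the anchors pin the position
    return re.search(pattern, text) is not None


print(matches(r"^ABC", "ABCDEF"))  # True: starts with 'ABC'
print(matches(r"^ABC", "xxABC"))   # False: 'ABC' is not at the start
print(matches(r"XYZ$", "uvwXYZ"))  # True: ends with 'XYZ'
print(matches(r"XYZ$", "XYZuvw"))  # False: 'XYZ' is not at the end
```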
Along with the above symbols,
a Regular expression usually contains a number of quantifiers that denote the number of times
a character can occur:
* | Represents 0 or more occurrences | abc* - matches a string that has 'ab' followed by 0 or multiple occurrences of 'c' e.g. 'ab', 'abc', 'abcccc' |
+ | Represents 1 or more occurrences | abc+ - matches a string that has 'ab' followed by 1 or more occurrences of 'c' e.g. 'abc', 'abcc' |
? | Represents 1 or 0 occurrences | abc? - matches a string that has 'ab' and is optionally followed by one 'c' e.g. 'ab', 'abc' |
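To see how the three quantifiers differ, the following sketch runs each pattern over the same candidate strings. It assumes Python's `re` module and uses `re.fullmatch`, which requires the entire candidate to fit the pattern. The names `candidates` and `accepted` are illustrative.

```python
import re

candidates = ["ab", "abc", "abcccc"]


def accepted(pattern):
    # Keep only the candidates that the pattern matches in full
    return [s for s in candidates if re.fullmatch(pattern, s)]


print(accepted(r"abc*"))  # ['ab', 'abc', 'abcccc']
print(accepted(r"abc+"))  # ['abc', 'abcccc']
print(accepted(r"abc?"))  # ['ab', 'abc']
```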
Curly braces {} are used
to specify bounds on the number of occurrences:
{n} | Matches exactly n times | abc{2} - matches a string that has 'ab' followed by 2 occurrences of 'c' i.e. 'abcc' |
{n,} | Matches n or more times | abc{2,} - matches a string that has 'ab' followed by 2 or more occurrences of 'c' e.g. 'abcc', 'abcccccc' |
{n,m} | Matches between n and m times | abc{2,4} - matches a string that has 'ab' followed by 2 to 4 occurrences of 'c' e.g. 'abcc', 'abccc', 'abcccc' |
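The bounds can be checked by generating 'ab' followed by an increasing number of 'c' characters and recording which counts each pattern accepts. The sketch assumes Python's `re` module, and `accepted_counts` is a hypothetical helper.

```python
import re


def accepted_counts(pattern):
    # Try 'ab' followed by 0 to 5 copies of 'c' and keep the counts that match
    return [n for n in range(6) if re.fullmatch(pattern, "ab" + "c" * n)]


print(accepted_counts(r"abc{2}"))    # [2]
print(accepted_counts(r"abc{2,}"))   # [2, 3, 4, 5]
print(accepted_counts(r"abc{2,4}"))  # [2, 3, 4]
```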
The quantifiers *, + and
? could also be expressed in terms of bounds i.e. {0,}, {1,} and {0,1} respectively.
Which version to use is simply a matter of choice.
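The equivalence between the two notations can be spot-checked by comparing results over a handful of sample strings. This is a sketch in Python, not a proof, and the sample list is arbitrary.

```python
import re

pairs = [(r"abc*", r"abc{0,}"), (r"abc+", r"abc{1,}"), (r"abc?", r"abc{0,1}")]
samples = ["a", "ab", "abc", "abcc", "abccc", "abd"]


def same_results(short, bounded):
    # True when both spellings accept and reject exactly the same samples
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(bounded, s))
        for s in samples
    )


for short, bounded in pairs:
    print(short, bounded, same_results(short, bounded))  # True for each pair
```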
When a sequence of characters
needs to be treated as a single entity, a pair of parentheses is used:
( ) | Groups together a sequence of characters. | a(bc)* - matches any string that starts with 'a' followed by 0 or multiple occurrences of 'bc' e.g. 'a', 'abcbcbc'
(ab){2}c - matches any string that has 2 occurrences of 'ab' followed by 'c' e.g. 'ababc' |
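Grouping changes what a quantifier applies to: without parentheses it covers only the preceding character, with them it covers the whole group. A small Python sketch (the helper `full` is an illustrative name) makes the difference visible.

```python
import re


def full(pattern, text):
    # True when the entire text fits the pattern
    return re.fullmatch(pattern, text) is not None


print(full(r"a(bc)*", "abcbcbc"))  # True: '*' repeats the group 'bc'
print(full(r"a(bc)*", "a"))        # True: zero repetitions are allowed
print(full(r"abc*", "abcbcbc"))    # False: '*' repeats only the 'c'
print(full(r"(ab){2}c", "ababc"))  # True: 'ab' twice, then 'c'
print(full(r"(ab){2}c", "abc"))    # False: 'ab' occurs only once
```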
Square brackets are used
to represent the characters that are acceptable in a single position of a string:
[ ] | Characters that could appear in a single position of a string | a[0-9]b - matches any string that has a digit between 'a' and 'b' e.g. 'a1b', 'a4b'
^[a-e] - matches any string that begins with a lowercase letter from 'a' through 'e' e.g. 'ab', 'effe' |
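The point that square brackets describe a single position is easy to overlook. The Python sketch below, with a hypothetical `contains` helper, shows that a bracket expression consumes exactly one character.

```python
import re


def contains(pattern, text):
    # re.search reports a match anywhere in the text
    return re.search(pattern, text) is not None


print(contains(r"a[0-9]b", "a1b"))   # True
print(contains(r"a[0-9]b", "axb"))   # False: 'x' is not in the range 0-9
print(contains(r"a[0-9]b", "a12b"))  # False: the brackets cover one position only
print(contains(r"^[a-e]", "effe"))   # True
print(contains(r"^[a-e]", "Effe"))   # False: 'E' is uppercase
```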
The symbol | is used to
indicate a logical OR condition where either of two choices could make a match:
| | Represents an OR condition. | abc|xyz - matches any string that has either 'abc' or 'xyz'
(ab){2}|(ac){5} - matches any string that has 2 occurrences of 'ab' or 5 occurrences of 'ac' |
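The two OR examples can be exercised as follows. The sketch assumes Python's `re` module, and the variable names are chosen only for readability.

```python
import re

either_word = re.compile(r"abc|xyz")
either_group = re.compile(r"(ab){2}|(ac){5}")

print(bool(either_word.search("123abc")))       # True: contains 'abc'
print(bool(either_word.search("wxyz")))         # True: contains 'xyz'
print(bool(either_word.search("abxy")))         # False: neither alternative is present
print(bool(either_group.search("abab")))        # True: 2 occurrences of 'ab'
print(bool(either_group.search("acacacacac")))  # True: 5 occurrences of 'ac'
print(bool(either_group.search("abacac")))      # False: neither count is reached
```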
Along with the above quantifiers,
Regular expressions also use metacharacters to represent certain types of characters.
The most general metacharacter is the period (.) which is used to represent
any single character.
. | Represents any single character. | a.b - matches a string that has an 'a' followed by one character and a 'b' e.g. 'a_b', 'a@b'
^.{5}$ - matches any string with exactly 5 characters e.g. 'abd12', 'as&@a' |
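A short Python sketch of the period in action is given below. The names `one_between` and `exactly_five` are illustrative. Note that in Python the period does not match a newline character by default, which may differ from other engines.

```python
import re

one_between = re.compile(r"a.b")
exactly_five = re.compile(r"^.{5}$")

print(bool(one_between.search("a_b")))      # True
print(bool(one_between.search("ab")))       # False: nothing sits between 'a' and 'b'
print(bool(exactly_five.search("as&@a")))   # True
print(bool(exactly_five.search("abcd")))    # False: only 4 characters
print(bool(exactly_five.search("abcdef")))  # False: 6 characters
```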
The \d metacharacter is
used to represent any digit, i.e. 0-9.
\d | Represents a digit. | a[\d]c - matches any string that has 'a' followed by a digit then 'c' e.g. 'a0c', 'a9c'
\d$ - matches any string that ends with a digit e.g. 'abc1', '1' |
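The sketch below tries both \d examples in Python. One caveat applies to this choice of language only: for text strings, Python's \d also accepts digits from other scripts unless the `re.ASCII` flag is given, so the flag is needed to get strictly 0-9 as described above. Other engines may behave differently.

```python
import re

digit_between = re.compile(r"a[\d]c")
ends_with_digit = re.compile(r"\d$")

print(bool(digit_between.search("a0c")))      # True
print(bool(digit_between.search("abc")))      # False: 'b' is not a digit
print(bool(ends_with_digit.search("abc1")))   # True
print(bool(ends_with_digit.search("1abc")))   # False: the last character is 'c'

# Python-specific: restrict \d to 0-9 with re.ASCII
ascii_only = re.compile(r"\d$", re.ASCII)
print(bool(ends_with_digit.search("\u0663")))  # True: an Arabic-Indic digit
print(bool(ascii_only.search("\u0663")))       # False once re.ASCII is set
```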
The \w metacharacter is
used to represent any "word" character (alphanumeric plus "_").
\w | Represents a word character. | ^\d\w - matches any string that begins with a digit followed by a word character e.g. '1_', '9z'
a\w|b. - matches any string that has 'a' followed by a word character or has 'b' followed by any character e.g. 'af', 'b ', 'b#' |
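Both \w examples can be tried with the following Python sketch. The compiled pattern names are illustrative.

```python
import re

digit_then_word = re.compile(r"^\d\w")
a_word_or_b_any = re.compile(r"a\w|b.")

print(bool(digit_then_word.search("1_")))  # True: '_' counts as a word character
print(bool(digit_then_word.search("9z")))  # True
print(bool(digit_then_word.search("1-")))  # False: '-' is not a word character
print(bool(a_word_or_b_any.search("af")))  # True: 'a' followed by a word character
print(bool(a_word_or_b_any.search("b#")))  # True: 'b' followed by any character
print(bool(a_word_or_b_any.search("a#")))  # False: neither alternative applies
```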
The \s metacharacter is
used to represent a whitespace character while \t represents a tab.
\s | Represents a whitespace character. | \d\s\d - matches any string that has two digits separated by a whitespace character e.g. '01 234', 'asc8 9xyz' |
\t | Represents a tab character. | ^\d{3}\t\d{5}$ - matches a string that has 3 digits followed by a tab then five other digits e.g. '000' followed by a tab and '12345' |
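Because a tab and a space look alike on screen, it helps to test the difference explicitly. In the Python sketch below, `"\t"` inside the test strings is a literal tab, and the pattern names are illustrative.

```python
import re

digits_apart = re.compile(r"\d\s\d")
tab_separated = re.compile(r"^\d{3}\t\d{5}$")

print(bool(digits_apart.search("asc8 9xyz")))    # True: a space separates 8 and 9
print(bool(digits_apart.search("8\t9")))         # True: a tab is whitespace too
print(bool(digits_apart.search("89")))           # False: nothing separates the digits
print(bool(tab_separated.search("000\t12345")))  # True
print(bool(tab_separated.search("000 12345")))   # False: a space is not a tab
```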
With the symbols, quantifiers
and metacharacters above, you can construct complex Regular expressions to validate
various types of values. The table below presents some examples of Regular expressions:
If you use JSP or Java Servlets,
you must download and install the Jakarta
ORO package to be able to process Regular expressions.
Conclusion:
This article presents an introduction to Regular expressions. A thorough treatment
of Regular expressions would require a book dedicated to the subject. There
are many more components which are not addressed here. The interested reader
should consult literature that deals exclusively with Regular expressions. However,
the components addressed here should form a solid basis for creating the sort
of Regular expressions needed for input validation in CodeCharge Studio.