RegExp

Normally, when you search for a sub-string in a string, the match must be exact. If we search for the sub-string “abc”, the string being searched must contain exactly these letters in the same sequence for a match to be found. We can extend this to a case-insensitive search, where the sub-string “abc” also finds strings like “Abc” and “ABC”. That is, the case is ignored but the sequence of letters must still be exactly the same. Sometimes even a case-insensitive search is not enough. For example, if we want to search for a numeric digit, we basically end up searching for each of the ten digits independently. This is where regular expressions come to our help.

Regular expressions are text patterns used for string matching. They are strings that contain a mix of plain text and special characters that indicate what kind of matching to do. Here is a very brief tutorial on using regular expressions before we move on to the code that handles them.

Suppose we are looking for a numeric digit; the regular expression we would search for is “[0-9]”. The brackets indicate that the character being compared should match any one of the characters enclosed within the brackets. The dash (-) between 0 and 9 indicates a range from 0 to 9. Therefore, this regular expression matches any character between 0 and 9, that is, any digit. To search for a special character literally, we must put a backslash before it. For example, the single-character regular expression “\*” matches a single asterisk. In the table below, the special characters are briefly described.
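As a minimal sketch of these two ideas, the snippet below uses Python's re module as a stand-in for the regular expression code discussed in this article. That choice is an assumption made purely for illustration, although bracket ranges and backslash escapes behave this way in most regular expression engines.

```python
import re

text = "room 42"

# "[0-9]" matches any single digit; the search stops at the first one.
first_digit = re.search(r"[0-9]", text)
print(first_digit.group(), first_digit.start())   # 4 5

# Without the range we would have to look for each digit separately.
positions = [text.find(d) for d in "0123456789" if d in text]
print(min(positions))                             # 5

# "\*" matches a literal asterisk instead of acting as a repeat operator.
star = re.search(r"\*", "2 * 3")
print(star.start())                               # 2
```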

Examples

The caret (^) at the beginning of the expression matches the beginning of the string. The expression “^A” will match an A only at the beginning of the string.

The caret (^) immediately following the left bracket ([) has a different meaning. It excludes the remaining characters within the brackets from matching the target character. The expression “[^0-9]” indicates that the target character should not be a digit.

The dollar sign ($) will match the end of the string. The expression “abc$” will match the sub-string “abc” only if it is at the end of the string.

The alternation character (|) allows the expression on either side of it to match the target string. The expression “a|b” will match a as well as b.

The dot (.) will match any character.

The asterisk (*) indicates that the character to the left of the asterisk in the expression should match 0 or more times.

The plus (+) is similar to the asterisk, but there should be at least one match of the character to the left of the + sign in the expression.

The question mark (?) matches the character to its left 0 or 1 times.

Parentheses () affect the order of pattern evaluation and also serve as a tagged expression that can be used when replacing the matched sub-string with another expression.

Brackets ([ and ]) enclosing a set of characters indicate that any of the enclosed characters may match the target character.
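The special characters above can be tried out with a short sketch. It again assumes Python's re module as a stand-in engine, and the helper name matches is hypothetical. The engine described in this article may differ in details, but these basic operators are common to most implementations.

```python
import re

def matches(pattern, text):
    """Return the first matched sub-string, or None if nothing matches."""
    found = re.search(pattern, text)
    return found.group() if found else None

print(matches(r"^A", "ABC"))        # A      (A at the beginning)
print(matches(r"^A", "BAC"))        # None   (A is not at the beginning)
print(matches(r"[^0-9]", "12a3"))   # a      (first character that is not a digit)
print(matches(r"abc$", "xyzabc"))   # abc    (abc at the end)
print(matches(r"abc$", "abcxyz"))   # None
print(matches(r"a|b", "xbz"))       # b      (either side may match)
print(matches(r"a.c", "xa-cz"))     # a-c    (dot matches any character)
print(matches(r"ab*c", "ac"))       # ac     (zero occurrences of b)
print(matches(r"ab+c", "ac"))       # None   (at least one b is required)
print(matches(r"ab+c", "abbbc"))    # abbbc
print(matches(r"ab?c", "abc"))      # abc    (zero or one b)
print(matches(r"ab?c", "abbc"))     # None
print(matches(r"(ab)+", "ababab"))  # ababab (parentheses group "ab")
print(matches(r"[xyz]", "apply"))   # y      (any one of x, y, z)
```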

Parentheses, besides affecting the evaluation order of the regular expression, also serve as tagged expressions, which are something like a temporary memory. This memory can then be used when we want to replace the found expression with a new expression. The replace expression can contain an & character, which represents the sub-string that was found. So, if the sub-string that matched the regular expression is “abcd”, then a replace expression of “xyz&xyz” will change it to “xyzabcdxyz”. The replace expression can also be written as “xyz\0xyz”. The “\0” indicates a tagged expression representing the entire sub-string that was matched. Similarly, we can have other tagged expressions represented by “\1”, “\2” etc. Note that although tagged expression 0 is always defined, tagged expressions 1, 2 etc. are only defined if the regular expression used in the search had enough sets of parentheses.
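The same idea can be sketched in Python's re module, used here only as an assumed stand-in. Python does not accept the “&” or “\0” forms in a replace expression and writes the whole match as \g<0> instead, but numbered tags such as \1 and \2 work as described. How the code in this article reacts to an undefined tag is not shown here; Python happens to raise an error.

```python
import re

# The whole match: Python's \g<0> plays the role of "&" and "\0".
whole = re.sub(r"abcd", r"xyz\g<0>xyz", "abcd")
print(whole)      # xyzabcdxyz

# \1 and \2 refer to the first and second sets of parentheses.
swapped = re.sub(r"(ab)(cd)", r"\2\1", "abcd")
print(swapped)    # cdab

# A pattern with one set of parentheses defines only \1,
# so asking for \2 is rejected by Python.
try:
    re.sub(r"(ab)cd", r"\2", "abcd")
    undefined_tag_rejected = False
except re.error:
    undefined_tag_rejected = True
print(undefined_tag_rejected)   # True
```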

Examples
String    Search       Replace    Result
Mr.       (Mr)(\.)     \1s\2      Mrs.
abc       (a)b(c)      &-\1-\2    abc-a-c
bcd       (a|b)c*d     &-\1       bcd-b
abcde     (.*)c(.*)    &-\1-\2    abcde-ab-de
cde       (ab|cd)e     &-\1       cde-cd
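To check examples like these by hand, here is a minimal sketch that expands “&” and “\0” to “\9” in a replace expression. It is not the implementation from this article: the names expand and search_replace are hypothetical, and Python's re module is assumed as the matching engine. The third and fifth patterns are taken to be “(a|b)c*d” and “(ab|cd)e”, an assumption under which the listed results come out.

```python
import re

def expand(template, match):
    """Expand '&' and '\\0'..'\\9' in template using the groups of match."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "&":
            # '&' stands for the whole matched sub-string.
            out.append(match.group(0))
        elif ch == "\\" and i + 1 < len(template) and template[i + 1] in "0123456789":
            # '\N' stands for tagged expression N; a tag that the pattern
            # does not define raises IndexError in this sketch.
            out.append(match.group(int(template[i + 1])) or "")
            i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def search_replace(text, pattern, template):
    """Replace the first match of pattern in text with the expanded template."""
    return re.sub(pattern, lambda m: expand(template, m), text, count=1)

rows = [
    ("Mr.",   r"(Mr)(\.)",  r"\1s\2",   "Mrs."),
    ("abc",   r"(a)b(c)",   r"&-\1-\2", "abc-a-c"),
    ("bcd",   r"(a|b)c*d",  r"&-\1",    "bcd-b"),
    ("abcde", r"(.*)c(.*)", r"&-\1-\2", "abcde-ab-de"),
    ("cde",   r"(ab|cd)e",  r"&-\1",    "cde-cd"),
]
for text, pattern, template, expected in rows:
    print(text, "->", search_replace(text, pattern, template))
```

A function is passed to re.sub instead of a template string so that Python's own replacement syntax never sees the “&” and “\0” forms.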