December 1999

De-mystifying Regular Expressions

By François Desrochers

Regular expressions made it to MPE/iX with the advent of Posix. Until now, however, they have been mostly restricted to the Posix shell and related tools. That is, until Robelle's Qedit opened regular expressions to MPE users.

The syntax of regular expressions is almost like a mini programming language. Simple expressions can be understood very easily. More complex expressions have to be evaluated carefully to get the exact meaning.

The existing MPE
wildcards and Qedit's pattern matching are fairly powerful too.
However, one thing they lack is fine granularity. Wildcard characters
usually represent a range of characters at a particular position. For
example, the question mark (?) represents one alphanumeric character.
What if you were looking for only lowercase vowels? What if you wanted
to see strings that contain any letter except vowels? You can easily
do all of these tasks with regular expressions.

Like wildcards and pattern matching, regular expressions have a small number of characters with special meanings. These are called metacharacters:

dot - any character

All other
characters (except for the escape character, which is a backslash)
are used as literals. You can combine metacharacters with literals to
describe precisely the strings you are looking for. Of course,
regular expressions can be wordy and rather long and complex.

As an example, let's say you want to find all the lines that start with a five-letter word such as Green, Greer, or Great. First, the word has to be at the start of a line, so the regular expression has to start with a caret (^). Next we want to see the two letters Gr with the G in uppercase. That would be simply Gr. For the third and
fourth characters we want only lowercase vowels. That's easy:
use two character classes in square brackets,
[aeiouy][aeiouy], where each matches one character. In our made-up
example, the last character in the word must not be lowercase d, f,
k, l, m, p or s. We also include a space in this class so that a
four-letter word followed by a space is not matched. Again, that's
easy: just put a caret at the start of
the character class to negate it: [^dfklmps ]. To make sure we
do not get words with more than five letters, we end the regular
expression with a space. The final regular expression is
"^Gr[aeiouy][aeiouy][^dfklmps ] " (quoted here to show the trailing
space). And the line-mode Qedit command to
match this regular expression is

How do you find lines that end with a number? We know there has to be at least one digit. We use a character class containing only the digits from 0 to 9 (for a range of contiguous characters we can use a dash, as in [0-9], instead of enumerating them). We then use the plus sign (+) to indicate we accept one or more digits. A trailing dollar sign ($) indicates we want lines in which the number is the last piece of information. So the regular
expression is [0-9]+$ and you could match this in Qedit for Windows
by selecting the Regular Expression option in the String Search
dialog box.

Parentheses are used to group parts of a complex expression or to isolate others. Each set of parentheses makes up a subexpression, which does not change the way string searching is done. In fact, if you are only searching, subexpressions do not have any effect. Where they are useful is in replacing parts of the matched strings. During a search, each subexpression is numbered from 1 to 9. You can then use that number in the replacement string as a back reference to the subexpression.

For example, we might want to find pairs of last and first names separated by a colon, as in Green:Bob and Panagopoulos:Aristotle. One regular expression to find those pairs is (.*):(.*), where .* means any character repeated any number of times, and : must match exactly one colon. By putting the .* portions in parentheses, we create two subexpressions. Subexpression 1 is for the last name before the colon and subexpression 2 is for the first name after the colon. To convert
Green:Bob into "Dear Bob Green,", the replacement string would be "Dear
\2 \1,". Back referencing is done with a backslash followed by the
subexpression number (e.g., \1 and \2). Thus the replacement switches
the subexpressions. You can use the same back reference as many times
as you want in the replacement string. The whole expression that is
matched is then replaced with the result of the replacement string if
you do the following line-mode Qedit command:

Regular
expressions help you to select the information that you want and to
rearrange it the way you want, taking you one step higher in
flexibility and power. Think of them as pattern matching on steroids!
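
If you want to experiment with the three expressions from this article outside of Qedit, the sketch below runs them in Python. This is only an illustration under stated assumptions: Python's re module stands in for Qedit's regular expression engine, and the function names (starts_with_word, trailing_number, greeting) are made up for this example and are not Qedit commands. Only the syntax described in the article is used: the caret, the dollar sign, the plus sign, character classes, parentheses and back references.

```python
import re

# Assumption: Python's re module is used here in place of Qedit.
# The helper names below are hypothetical and exist only for this sketch.

# "Gr", two lowercase vowels, one character that is not d, f, k, l, m, p,
# s or a space, then a space, all anchored at the start of the line.
FIVE_LETTER_WORD = re.compile(r"^Gr[aeiouy][aeiouy][^dfklmps ] ")

# One or more digits immediately before the end of the line.
TRAILING_NUMBER = re.compile(r"[0-9]+$")

# Last name, a colon, first name. The parentheses create
# subexpressions 1 and 2.
NAME_PAIR = re.compile(r"(.*):(.*)")


def starts_with_word(line):
    """Return True if the line starts with a word such as Green or Great."""
    return FIVE_LETTER_WORD.search(line) is not None


def trailing_number(line):
    """Return the digits that end the line, or None if there are none."""
    match = TRAILING_NUMBER.search(line)
    return match.group(0) if match else None


def greeting(pair):
    """Swap the two subexpressions using the back references \\2 and \\1."""
    return NAME_PAIR.sub(r"Dear \2 \1,", pair)


if __name__ == "__main__":
    for line in ["Green is a colour", "Greed is not matched",
                 "Greenland is too long", "Grey is too short"]:
        print(line, "->", starts_with_word(line))

    for line in ["Total due 42", "42 items shipped"]:
        print(line, "->", trailing_number(line))

    print(greeting("Green:Bob"))  # Dear Bob Green,
    print(greeting("Panagopoulos:Aristotle"))
```

Note how "Grey is too short" is rejected because the space sits inside the negated character class, and "Greenland is too long" is rejected because of the space that ends the expression.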
Qedit version 4.8 has regular expressions, in both Line mode and
Qedit for Windows.