De-mystifying Regular Expressions

December 1999

By François Desrochers
Robelle Consulting

You might never have heard the term regular expressions before, but that does not mean you have not used them. Regular expressions are actually widespread, with many tools having some kind of implementation. If you have worked on Unix, Mac OS, or even DOS, you have probably used a regular expression without knowing it. If you have run grep or egrep, or searched for a string while browsing a file with the more command, you have already created your first regular expressions.

Regular expressions made it to MPE/iX with the advent of Posix. Until now, however, they have been mostly restricted to the Posix shell and related tools. That is, until Robelle’s Qedit opened regular expressions to MPE users.

The syntax of regular expressions is almost like a mini programming language. Simple expressions can be understood very easily. More complex expressions have to be evaluated carefully to get the exact meaning.

The existing MPE wildcards and Qedit’s pattern matching are fairly powerful too. However, one thing they lack is fine granularity. Wildcard characters usually represent a range of characters at a particular position. For example, the question mark (?) represents one alphanumeric character. What if you were looking for only lowercase vowels? What if you wanted to see strings that contain any letter except vowels? You can easily do all of these tasks with regular expressions.

Metacharacters

Like wildcards and pattern matching, there are a small number of characters that have special meanings in regular expressions. These are called metacharacters:

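• dot (.) - any single character
• caret (^) - start of line
• dollar sign ($) - end of line
• square brackets, [ ] - character class, matching one character from the set listed
• parentheses, ( ) - subexpressions
• asterisk (*) - zero or more occurrences of the preceding item
• plus sign (+) - one or more occurrences of the preceding item
• question mark (?) - zero or one occurrence of the preceding item, making it optional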

All other characters (except for the escape character, which is a backslash) are used as literals. You can combine metacharacters with literals to describe precisely the strings you are looking for. Of course, regular expressions can be wordy and rather long and complex.
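To see the metacharacters at work, here is a small sketch in Python, whose re module is used only as a convenient stand-in. Qedit's own engine may differ in details, and the helper name found and the sample strings are made up for illustration.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, text, what the metacharacter is doing)
examples = [
    (r"b.t",     "bat",               "dot stands for the single character a"),
    (r"b.t",     "bt",                "dot needs one character and none is there"),
    (r"^Qedit",  "Qedit for Windows", "caret anchors the match at the start of the line"),
    (r"^Qedit",  "Robelle Qedit",     "Qedit is present but not at the start"),
    (r"users$",  "MPE users",         "dollar sign anchors the match at the end of the line"),
    (r"gr[ae]y", "grey",              "character class accepts a or e in that position"),
    (r"ab*c",    "ac",                "asterisk allows zero occurrences of b"),
    (r"ab+c",    "ac",                "plus sign requires at least one b"),
    (r"colou?r", "color",             "question mark makes the u optional"),
    (r"3\.14",   "3514",              "backslash turns the dot into a literal dot"),
]

for pattern, text, note in examples:
    print(f"{pattern!r:12} {text!r:22} {found(pattern, text)!s:6} {note}")
```

Each row prints the pattern, the text, whether a match was found, and a short note on why.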

Be specific...

As an example, let’s say you want to find all the lines that start with a five-letter word such as Green, Greer, or Great. First, the word has to be at the start of a line. So the regular expression has to start with a caret (^). Next we want to see the two letters Gr with the G in uppercase. That would be simply Gr.

For the third and fourth characters we want only lowercase vowels. That’s easy — use two character classes in square brackets: [aeiouy][aeiouy], where each matches one character.

In our made-up example, the last character in the word must not be lowercase d, f, k, l, m, p or s. We also list a space, so that a four-letter word followed by a space cannot slip through. Again, that’s easy: put a caret at the start of the character class to negate it, as in [^dfklmps ].

To make sure we do not get words with more than five letters, we end the regular expression with a space. The final regular expression is ^Gr[aeiouy][aeiouy][^dfklmps ] followed by a single space. And the line-mode Qedit command to match this regular expression (note the space just before the closing quote) is

/list “^Gr[aeiouy][aeiouy][^dfklmps ] ” (regexp)
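The same search can be tried outside Qedit. The sketch below uses Python's re module as a stand-in (Qedit's engine may behave differently in places) and assumes the pattern ends with one space, as the text describes. The sample lines are invented for illustration.

```python
import re

# Pattern from the text: start of line, "Gr", two lowercase vowels,
# one character that is not d, f, k, l, m, p, s or a space, then a space.
PATTERN = re.compile(r"^Gr[aeiouy][aeiouy][^dfklmps ] ")

lines = [
    "Green is a colour",      # matches
    "Greer wrote back",       # matches
    "Great news today",       # matches
    "Greed is bad",           # fifth letter d is excluded
    "Greek food",             # fifth letter k is excluded
    "Greenland is cold",      # no space after the fifth letter
    "Gray skies",             # fifth character is a space, which is excluded
    "The Green house",        # Green is not at the start of the line
    "green tea",              # lowercase g
]

matches = [line for line in lines if PATTERN.search(line)]
for line in matches:
    print(line)
```

Only the first three lines are printed. Note that a line consisting of the word Green alone, with nothing after it, would not match this pattern, because the trailing space is required.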

...Or be expansive

How would you find lines that end with a number? We know there has to be at least one digit. We use a character class containing only the digits from 0 to 9 (for a range of contiguous characters we can use a dash, as in [0-9], instead of enumerating them). We then use the plus sign to indicate that we accept one or more digits. A trailing dollar sign indicates that we want lines in which the number is the last piece of information.

So the regular expression is [0-9]+$ and you could match this in Qedit for Windows by selecting the Regular Expression option in the String Search dialog box.
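For readers without Qedit at hand, the pattern can be sketched in Python, again as a stand-in whose engine may differ from Qedit's. The helper name trailing_number and the sample lines are assumptions made for this illustration.

```python
import re
from typing import Optional

# One or more digits, immediately followed by the end of the line.
TRAILING_NUMBER = re.compile(r"[0-9]+$")

def trailing_number(line: str) -> Optional[str]:
    """Return the number that ends the line, or None if there is none."""
    match = TRAILING_NUMBER.search(line)
    return match.group() if match else None

lines = [
    "Invoice total 1500",   # ends with 1500
    "Order 66 shipped",     # has digits, but not at the end
    "Room 12B",             # a letter follows the digits
    "No digits here",       # no digits at all
    "1999",                 # the whole line is the number
]

for line in lines:
    print(f"{line!r:24} {trailing_number(line)}")
```

Only the first and last lines yield a number; the other three give None because the dollar sign cannot be satisfied right after the digits.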

Move parts around

Parentheses are used to group parts of a complex expression or to isolate others. Each set of parentheses makes up a subexpression, which does not change the way string searching is done. In fact, if you are only searching, subexpressions do not have any effect. Where they are useful is in replacing parts of the matched strings. During a search, each subexpression is assigned a number from 1 to 9. You can then use that number in the replacement string as a back reference to the subexpression.

For example, we might want to find pairs of last and first names separated by a colon, as in Green:Bob and Panagopoulos:Aristotle. One regular expression to find those pairs is (.*):(.*), where .* means any character repeated any number of times, and : must match exactly one colon. By putting the .* portions in parentheses, we create two subexpressions. Subexpression 1 is for the last name before the colon and subexpression 2 is for the first name after the colon.

To convert Green:Bob into “Dear Bob Green,” the replacement string would be “Dear \2 \1,” with the trailing comma included. Back referencing is done with a backslash followed by the subexpression number (e.g., \1 and \2). Thus the replacement switches the subexpressions. You can use the same back reference as many times as you want in the replacement string. The whole expression that is matched is then replaced with the result of the replacement string if you do the following line-mode Qedit command:

/change “(.*):(.*)”(regexp), “Dear \2 \1,”, all
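The same swap can be sketched in Python, used here as a stand-in for Qedit (the two engines may differ in details). Python numbers its groups from 1, which agrees with the \2 \1 order in the command above. The function name salutation is made up for this illustration.

```python
import re

# Two subexpressions separated by a literal colon.
NAME_PAIR = re.compile(r"(.*):(.*)")

def salutation(line: str) -> str:
    """Rewrite 'Last:First' as 'Dear First Last,' using back references."""
    # \2 is the text after the colon, \1 the text before it.
    return NAME_PAIR.sub(r"Dear \2 \1,", line)

for line in ["Green:Bob", "Panagopoulos:Aristotle", "no colon on this line"]:
    print(salutation(line))
```

The first two lines come out as Dear Bob Green, and Dear Aristotle Panagopoulos, while the third is left unchanged because the pattern requires a colon. If a line holds more than one colon, Python's .* takes as much as it can, so the split happens at the last colon.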

Develop new search and replace powers!

Regular expressions help you to select the information that you want and to rearrange it the way you want, taking you one step higher in flexibility and power. Think of them as pattern matching on steroids! Qedit version 4.8 has regular expressions, in both Line mode and Qedit for Windows.

François Desrochers is a member of the Robelle Consulting R&D team.