Regular Expression Capabilities
The following is a complete reference to the regular expression syntax that regldg understands.
Individual characters
You can enter individual characters in several ways. regldg operates on characters in the ASCII and extended ASCII range, values 0 through 255.
Regular Expression | Meaning | Example | Produces |
p (any printable character) | p (that printable character) | p | p |
\a | Bell character (ASCII 7) | \a | [BEL] |
\b | Backspace character (ASCII 8) | \b | [BS] |
\t | Horizontal tab character (ASCII 9) | \t | [HT] |
\n | Newline character (ASCII 10) | \n | [NL] |
\v | Vertical tab character (ASCII 11) | \v | [VT] |
\f | Form feed character (ASCII 12) | \f | [FF] |
\r | Carriage return character (ASCII 13) | \r | [CR] |
\e | Escape character (ASCII 27) | \e | [ESC] |
\zNNN | A character specified by the ASCII code NNN (decimal). NNN can be 1, 2, or 3 digits, less than 256. | \z49 | 1 |
\z{NNN} | A character specified by the ASCII code NNN (decimal). NNN can be 1, 2, or 3 digits, less than 256. The { and } help to avoid confusion. See note below. | \z{119} | w |
\oNNN | A character specified by the ASCII code NNN (octal). NNN can be 1, 2, or 3 digits, less than 400 (octal). | \o072 | : |
\o{NNN} | A character specified by the ASCII code NNN (octal). NNN can be 1, 2, or 3 digits, less than 400 (octal). The { and } help to avoid confusion. See note below. | \o{12} | [NL] |
\xNN | A character specified by the ASCII code NN (hexadecimal). NN can be 1 or 2 digits, no greater than FF (hexadecimal). | \x5D | ] |
\x{NN} | A character specified by the ASCII code NN (hexadecimal). NN can be 1 or 2 digits, no greater than FF (hexadecimal). The { and } help to avoid confusion. See note below. | \x{26} | & |
Possible confusion with numerically-specified characters
Numerically-specified characters (using the constructs \zNNN, \oNNN, and \xNN) are a possible source of confusion.
Consider the regular expression \z1234. Does it mean \z1 followed by 234, \z12 followed by 34, or \z123 followed by 4? regldg interprets
it as the last case, because it keeps consuming digits for a numerically-specified character until the limit for its type is reached. Here, a decimal
numerically-specified character can use up to three digits, and since three were available, it used all three. To avoid confusion, use
the { and } characters to tell regldg exactly which digits belong to the numerically-specified character.
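The greedy rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour documented here, not regldg's actual code: the name `parse_numeric_escape` is hypothetical, and rejecting codes above 255 with an error is an assumption about how out-of-range values are handled.

```python
import re

# Assumed limits, read off the table of escapes above:
# kind -> (numeric base, allowed digits, maximum digit count)
ESCAPE_KINDS = {
    "z": (10, "0-9", 3),
    "o": (8, "0-7", 3),
    "x": (16, "0-9A-Fa-f", 2),
}

def parse_numeric_escape(text):
    """Split text starting with a \\z, \\o, or \\x escape into (character, rest)."""
    if len(text) < 2 or text[0] != "\\" or text[1] not in ESCAPE_KINDS:
        raise ValueError("not a numeric escape: %r" % text)
    base, digits, max_len = ESCAPE_KINDS[text[1]]
    run = "[%s]{1,%d}" % (digits, max_len)
    # Try the braced form first; otherwise take the longest digit run allowed.
    match = re.match(r"..\{(%s)\}" % run, text) or re.match(r"..(%s)" % run, text)
    if match is None:
        raise ValueError("malformed numeric escape: %r" % text)
    code = int(match.group(1), base)
    if code > 255:
        # Assumption: out-of-range codes are simply rejected in this sketch.
        raise ValueError("character code out of range: %d" % code)
    return chr(code), text[match.end():]
```

With this sketch, `parse_numeric_escape(r"\z1234")` consumes three digits and returns the character with code 123 plus the leftover text `4`, while `parse_numeric_escape(r"\z{12}34")` returns the character with code 12 plus the leftover text `34`.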
Meta-characters
Certain characters have two meanings in regular expressions. On their own, they do not stand for the character they look like. See
the later sections on this page for each meta-character's special meaning. To use a meta-character's literal
meaning, put a \ before it (this is called "escaping" it). The meta-characters are as
follows:
Meta-characters which must be escaped |
\ | | | * | ? |
+ | . | ( | ) |
[ | ] | { | } |
An example regular expression is 1+1. This does not produce the literal text 1+1, because the + is a
quantifier (see the section Quantifiers below). To produce 1+1, you must escape the + in the regular
expression, giving the regular expression 1\+1.
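If you build regular expressions from arbitrary text, the escaping rule can be applied mechanically. The following Python sketch is an illustration only; `escape_literal` is a hypothetical helper, and its set of characters is copied from the table of meta-characters above.

```python
# Meta-characters listed in the table above: \ | * ? + . ( ) [ ] { }
META_CHARACTERS = set("\\|*?+.()[]{}")

def escape_literal(text):
    """Prefix every meta-character in text with a backslash."""
    return "".join("\\" + ch if ch in META_CHARACTERS else ch for ch in text)

print(escape_literal("1+1"))  # prints 1\+1
```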
Meta-character classes
regldg understands the basic meta-character classes in Perl. In regldg, however, meta-character classes are
subject to the constraints of the character universe and the strictness of checking the character universe.
For more information about the character universe, see character universes.
Meta-character class | Characters included | Description |
. | Any character in the current character universe (including \n) |
\d | 0123456789 | Digits |
\D | Any character in the current character universe, excluding the members of \d |
\s | [SPACE][HT][VT][NL][FF] | Whitespaces |
\S | Any character in the current character universe, excluding the members of \s |
\w | ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_ | Alphanumerics and _ |
\W | Any character in the current character universe, excluding the members of \w |
\u{1} | ABCDEFGHIJKLMNOPQRSTUVWXYZ | Uppercase letters |
\u{2} | abcdefghijklmnopqrstuvwxyz | Lowercase letters |
\u{4} | 0123456789 | Digits |
\u{8} | !@#$%^&* | Shift-with-numbers |
\u{16} | ;`:'[SPACE],".?_ | Punctuation |
\u{32} | (){}[] | Closures |
\u{64} | ~\/| | Others |
\u{128} | +-=<> | Math |
\u{NNN} | NNN is a number in decimal between 0 and 255, representing the sum of the pre-defined universe character class numbers.
The resulting character class will be the union of all the included pre-defined character universes.
Example: \u{233}
233 = 128 + 64 + 32 + 8 + 1
So, \u{233} will be the union of \u{1}, \u{8}, \u{32}, \u{64}, and \u{128}
|
\U{NNN} | NNN is a number in decimal between 0 and 255, representing the sum of the pre-defined universe character class numbers.
The resulting character class will be any character in the current universe, excluding the members of the union of all the included pre-defined universe character classes.
Example: \U{189}
189 = 128 + 32 + 16 + 8 + 4 + 1
So, \U{189} will be any character in the current character universe, excluding the members of the union of \u{1}, \u{4}, \u{8}, \u{16}, \u{32}, and \u{128}
|
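The NNN value in \u{NNN} and \U{NNN} works as a bit mask over the eight pre-defined classes. A short Python sketch of the decomposition follows. The function names are hypothetical, the class contents are copied from the table above, and the `universe` argument stands in for whatever the current character universe is.

```python
# Pre-defined universe character classes, keyed by their class number.
UNIVERSE_CLASSES = {
    1: "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    2: "abcdefghijklmnopqrstuvwxyz",
    4: "0123456789",
    8: "!@#$%^&*",
    16: ";`:' ,\".?_",
    32: "(){}[]",
    64: "~\\/|",
    128: "+-=<>",
}

def class_numbers(nnn):
    """Return the class numbers whose sum is nnn."""
    return [bit for bit in UNIVERSE_CLASSES if nnn & bit]

def universe_class(nnn):
    """Characters of \\u{nnn}: the union of the selected classes."""
    return "".join(UNIVERSE_CLASSES[bit] for bit in class_numbers(nnn))

def negated_class(nnn, universe):
    """Characters of \\U{nnn}: everything in universe not in \\u{nnn}."""
    excluded = set(universe_class(nnn))
    return "".join(ch for ch in universe if ch not in excluded)

print(class_numbers(233))  # [1, 8, 32, 64, 128]
print(class_numbers(189))  # [1, 4, 8, 16, 32, 128]
```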
Groupings
Groupings, nested groupings, and backreferences to groupings are supported. Grouping characters together helps clarify
alternations, and allows earlier patterns to be repeated (using backreferences
and quantifiers) within a single word of output.
> regldg -m 35 "(firstpart)anotherpart(second(third)part)"
firstpartanotherpartsecondthirdpart
Alternations
Alternations allow you to use "this" or "that".
> regldg "ab|cd"
ab
cd
Alternations are often used with groupings when parts of the regular expression should not take part in the
"this" or "that" choice.
> regldg "fla(t|pper)"
flat
flapper
regldg also supports alternations with more than two branches, so that a single position can be "this" or "that"
or any of several further choices.
> regldg "(spl|th|fl|r)at"
splat
that
flat
rat
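The alternation examples above amount to substituting each branch into the surrounding text. The Python sketch below shows that idea for the simplest case only: one un-nested group whose branches contain no escaped | characters. `expand_alternation` is a hypothetical helper and does not reflect how regldg itself parses or expands a regex.

```python
def expand_alternation(prefix, group, suffix):
    """Substitute each branch of a single alternation between prefix and suffix."""
    return [prefix + branch + suffix for branch in group.split("|")]

print(expand_alternation("fla", "t|pper", ""))      # ['flat', 'flapper']
print(expand_alternation("", "spl|th|fl|r", "at"))  # ['splat', 'that', 'flat', 'rat']
```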
Backreferences
Backreferences are placeholders used to repeat an earlier grouping in the same pattern. Groupings are numbered by their opening
( and can be referred to only after they have been closed with a ).
> regldg -us 19 -m 46 "(Pat|Grandma) went to school today in \1's car\."
Pat went to school today in Pat's car.
Grandma went to school today in Grandma's car.
> regldg -m 9 -us 19 "(a(b)c) \1 \2"
abc abc b
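Python's `re` module numbers groups by their opening parenthesis in the same way, so it can be used to cross-check words against a pattern that uses the \1 style of backreference. This is a sketch for checking results, not part of regldg. `groups_in` is a hypothetical helper, and the \!{1} syntax described on this page is specific to regldg and is not understood by Python.

```python
import re

def groups_in(pattern, word):
    """Return the captured groups if word matches pattern entirely, else None."""
    match = re.fullmatch(pattern, word)
    return None if match is None else match.groups()

# Group 1 opens first and spans "abc"; group 2 opens second and spans "b".
print(groups_in(r"(a(b)c) \1 \2", "abc abc b"))  # ('abc', 'b')

# A backreference repeats the same text, so mixing names does not match.
school = r"(Pat|Grandma) went to school today in \1's car\."
print(groups_in(school, "Pat went to school today in Grandma's car."))  # None
```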
regldg includes an alternative method to use backreferences. Instead of \1 to mean
a backreference to grouping 1, you can use \!{1}. This will completely avoid the ambiguity
of whether it is a backreference or an octally-specified character. (This is, of course, as long as
you know this syntax. Otherwise, you might be completely confused as to what it is!) In action:
> regldg -m 9 -us 19 "(a(b)c) \\!{1} \\!{2}"
abc abc b
Note the doubled backslashes: they were required to enter this regex in tcsh. To avoid this problem,
you can use the command line option --file=- and enter the regex directly into the program instead.
Character classes
Character classes represent all possible characters for a single location.
> regldg "[ab][cd]"
ac
bc
ad
bd
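A sequence of character classes produces every combination of one character per position. The Python sketch below reproduces the sample output above using `itertools.product`. `expand_classes` is a hypothetical helper. Letting the first position vary fastest is done only to mirror the order of the sample output and is not a claim about regldg's internal algorithm.

```python
from itertools import product

def expand_classes(positions):
    """Return all words that take one character from each position, in order."""
    words = []
    # Reversing makes the first position vary fastest, as in the sample output.
    for combo in product(*reversed(positions)):
        words.append("".join(reversed(combo)))
    return words

print(expand_classes(["ab", "cd"]))  # ['ac', 'bc', 'ad', 'bd']
```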
Some meta-characters don't need to be escaped while in character classes. These are (, *,
+, ?, {, [, |, ), and }. The \ and . characters definitely
need to be escaped in a character class. The range character - and
the end-character-class character ] must be escaped unless they are the only character in the character class.
> regldg -uc 0 "[(*+?{[|)}\\\-\]\.]"
( * + ? { [ | ) } \ - ] .
regldg also supports negated character classes, that is, character classes starting with the ^
character. A negated character class represents all characters in the current character universe, except those
explicitly written in the negated character class.
> regldg -us 2 "[^abcde]"
f g h i j k l m n o p q ...
[-] and []] are both handled correctly: [-] is a character class containing only
a - character, and []] is a character class containing only a ] character. Both
are actually silly, because a one-element character class could instead be just that character. In any other
character class, the - and ] characters are meta-characters, and need to be escaped.
Quantifiers
Quantifiers allow you to write a character, character class, meta-character class or group once, and have it occur
a specified (possibly variable) number of times.
Quantifier | Meaning |
* | The previous character, character class, meta-character class or group occurs between 0 and unlimited times (inclusive). (Unlimited is limited by the maximum word length of the program.) |
+ | The previous character, character class, meta-character class or group occurs between 1 and unlimited times (inclusive). (Unlimited is limited by the maximum word length of the program.) |
? | The previous character, character class, meta-character class or group occurs 0 or 1 times. |
{2} | The previous character, character class, meta-character class or group occurs exactly 2 times. |
{1,3} | The previous character, character class, meta-character class or group occurs between 1 and 3 times (inclusive). |
{4,} | The previous character, character class, meta-character class or group occurs between 4 and unlimited times (inclusive). (Unlimited is limited by the maximum word length of the program.) |
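The table above can be read as a mapping from each quantifier to a minimum and maximum repeat count. The Python sketch below makes that mapping explicit for a fixed piece of text. Both function names are hypothetical. Capping "unlimited" at the number of copies that fit in the maximum word length is an assumption drawn from the notes in the table, and whether regldg emits the empty word for a count of 0 is not stated here.

```python
import re

def quantifier_bounds(quantifier):
    """Return (minimum, maximum) repeat counts; maximum is None for unlimited."""
    simple = {"*": (0, None), "+": (1, None), "?": (0, 1)}
    if quantifier in simple:
        return simple[quantifier]
    match = re.fullmatch(r"\{(\d+)(,(\d*))?\}", quantifier)
    if match is None:
        raise ValueError("not a quantifier: %r" % quantifier)
    low = int(match.group(1))
    if match.group(2) is None:   # {n}
        return low, low
    if match.group(3) == "":     # {n,}
        return low, None
    return low, int(match.group(3))  # {n,m}

def repetitions(unit, quantifier, max_word_length):
    """Repeat a non-empty fixed text as the quantifier allows (assumed cap)."""
    low, high = quantifier_bounds(quantifier)
    cap = max_word_length // len(unit)
    high = cap if high is None else min(high, cap)
    return [unit * count for count in range(low, high + 1)]

print(repetitions("a", "{4,}", 6))     # ['aaaa', 'aaaaa', 'aaaaaa']
print(repetitions("ab", "{1,3}", 10))  # ['ab', 'abab', 'ababab']
```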
It is assumed that you have a good understanding of how these are used. One important case, however, is a quantifier
acting on a group that contains an alternation. Within a single word of output, each repetition can independently use
any branch of the alternation. The regular expression (a|b){2} will produce aa and
bb, and also ab and ba. Shown in an explicit example:
> regldg "(ab|cd){2}"
abab
cdab
abcd
cdcd
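Because each repetition of the group chooses its branch independently, a quantified alternation behaves like a Cartesian product of the branches with themselves. The Python sketch below illustrates this with `itertools.product`. `repeat_group` is a hypothetical helper, and the first repetition is made to vary fastest only so that the result lines up with the sample output above.

```python
from itertools import product

def repeat_group(branches, count):
    """Return every word made of `count` independently chosen branches."""
    words = []
    for combo in product(branches, repeat=count):
        # Reversing makes the first repetition vary fastest.
        words.append("".join(reversed(combo)))
    return words

print(repeat_group(["ab", "cd"], 2))  # ['abab', 'cdab', 'abcd', 'cdcd']
```

For a quantifier with a range, such as {1,3}, the same helper could be called once per allowed count and the results concatenated.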