The character universe's role in regldg

Current version: 1.0.0

Character universe introduction

Character universe specification
Character universe checking

Character universe specification

The character universe is central to understanding how and why regldg does what it does with character and meta-character classes. The character universe is the group of characters that regldg is allowed to use to make the words of the output dictionary. Let's look at an example. Given the regular expression . (a single dot), what do you expect the output of regldg to be? It is clear that each word output will be only one character long, but which characters will those be? Could they be only uppercase letters? Could they include lowercase letters too? Numbers? Other symbols on the keyboard? Or could they be all the possible characters from the ASCII and extended ASCII character sets (0-255)? The set of possible characters is called the character universe, and you can decide what it should be for each run of regldg.

There are two command line options to set the character universe. The first selects from a number of common, pre-defined character universe sets. These are:

1. Uppercase letters A-Z
2. Lowercase letters a-z
4. Digits 0-9
8. Shift-digits !@#$%^&*
16. Punctuation ",'.:;?_` and [space]
32. Parentheses, brackets, and braces ()[]{}
64. Other stuff ~/|\
128. Math +-=

To use any one of these pre-defined character universe sets, specify the -us NNN or --universe-set=NNN option on the command line, where NNN is the number of the universe set.

You've probably noticed that the number of each pre-defined character universe set is a power of 2. This is so that you can combine universe sets simply by adding their numbers. If you want the character universe to have letters (upper- and lowercase), numbers, and punctuation, you can specify universe set number 23 (1 + 2 + 4 + 16).
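
The addition of universe set numbers can be read as a bitmask. The following Python sketch is only an illustration of that arithmetic, not regldg's own code: the table UNIVERSE_SETS and the function universe_from_number are hypothetical names, and the character strings are copied from the list above.

```python
import string

# Pre-defined universe sets from the list above, keyed by their
# power-of-two number. The names here are illustrative only.
UNIVERSE_SETS = {
    1: string.ascii_uppercase,   # A-Z
    2: string.ascii_lowercase,   # a-z
    4: string.digits,            # 0-9
    8: "!@#$%^&*",               # shift-digits
    16: "\",'.:;?_` ",           # punctuation and space
    32: "()[]{}",                # parentheses, brackets, braces
    64: "~/|\\",                 # other stuff
    128: "+-=",                  # math
}

def universe_from_number(n):
    """Return the characters selected by a combined universe set number."""
    chars = []
    for bit, members in UNIVERSE_SETS.items():
        if n & bit:              # this set's bit is present in n
            chars.extend(members)
    return "".join(chars)

# 23 = 1 + 2 + 4 + 16: letters, digits, and punctuation
print(universe_from_number(23))
```

Because every set number is a distinct power of 2, each sum maps back to exactly one combination of sets.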

The second way to specify the character universe is explicitly in a character class. You can put whatever characters you'd like in it using the formats shown in regldg's regular expression capabilities. You can specify the character class on the command line with option -u [UNIVERSE] or --universe=[UNIVERSE]. Be sure to start the character class with [ and end with ]. It will be parsed exactly like a character class, so if you make it a negated character class, it will be negated from the default character universe (universe set 7), or, if you already specified a different universe on the command line with -u or -us, it will be negated from that universe.

Here's an example: use regldg to generate all possible combinations of two-letter words using only the characters A, B, and C.

> regldg "--universe=[ABC]" ".{2}"
AA
BA
CA
AB
BB
CB
AC
BC
CC
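
The order of the output above (the first character changes fastest) can be reproduced with a short Python sketch. This only mimics the listing shown here; it is an assumption about the ordering based on this one example, and words_fixed_length is a hypothetical helper name, not part of regldg.

```python
from itertools import product

def words_fixed_length(universe, length):
    """Yield every word of the given length over the universe.

    itertools.product varies the last position fastest, so each tuple
    is reversed to make the first position vary fastest, as in the
    listing above.
    """
    for combo in product(universe, repeat=length):
        yield "".join(reversed(combo))

for word in words_fixed_length("ABC", 2):
    print(word)
```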

To show that regldg is not afraid to be complex, let's do the same thing using the negated character class method. First, we set the character universe to uppercase letters. Then, we take out D-Z (leaving A-C). Finally, using the remaining characters, we output all possible two-letter words.

> regldg --universe-set=1 "--universe=[^D-\z{90}]" ".{2}"
AA
BA
CA
AB
BB
CB
AC
BC
CC
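
Negating a character class against the current universe amounts to a set difference. The Python sketch below illustrates this for the example above, assuming that \z{90} denotes the character with ASCII value 90 (Z); negate_within is a hypothetical name used only for this illustration.

```python
import string

def negate_within(universe, excluded):
    """Return the universe with the excluded characters removed,
    keeping the universe's original order."""
    excluded = set(excluded)
    return "".join(ch for ch in universe if ch not in excluded)

# Universe set 1 is the uppercase letters A-Z.
uppercase = string.ascii_uppercase

# The range D through ASCII value 90 (assumed to be what \z{90} means).
d_to_z = [chr(code) for code in range(ord("D"), 90 + 1)]

print(negate_within(uppercase, d_to_z))
```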

Character universe checking

While parsing a regular expression, there are two areas where the character universe must be controlled. regldg allows the control in both of these areas to be strict (on) or lax (off) for each run, determined by a single command line option.

The first area where controlling the character universe is important is in the explicit entry of characters. If the character universe is set to the uppercase letters A-Z, and a regular expression contains a space, should it result in an error (strict), or should it be allowed in only that place (lax)? This can be controlled by using the command line option -uc N or --universe-checking=N. Setting this value to 1 will enable strict checking of explicitly entered characters.

The second area where controlling the character universe is important is character and meta-character classes. If the character universe is set to the digits 0-4 only, and a regular expression contains a \d meta-character, should the resulting character class contain only the digits 0-4 (strict), or should it contain all the digits according to the full specification of \d (lax)? This behavior can also be controlled by using the command line option -uc N or --universe-checking=N. Setting this value to 2 will enable strict checking of the contents of character and meta-character classes.

To enable both types of strictness, 1 + 2 = 3, so set -uc 3 or --universe-checking=3. To disable both types of strictness, use -uc 0 or --universe-checking=0.
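
The two kinds of strictness can be pictured as two bits of one value. The following Python sketch is an illustration under that reading, not regldg's implementation: the constants and the functions check_literal and resolve_class are hypothetical names.

```python
STRICT_EXPLICIT = 1  # value 1: check explicitly entered characters
STRICT_CLASSES = 2   # value 2: thin character and meta-character classes

def check_literal(ch, universe, checking):
    """Return ch, or raise an error if strict checking rejects it."""
    if checking & STRICT_EXPLICIT and ch not in universe:
        raise ValueError(f"character {ch!r} is not in the universe")
    return ch

def resolve_class(members, universe, checking):
    """Return the characters a class may produce for a checking value."""
    if checking & STRICT_CLASSES:
        # Strict: intersect the class with the universe.
        return [ch for ch in members if ch in universe]
    # Lax: the class keeps its full specification.
    return list(members)

universe = set("01234")
digits = "0123456789"            # the full specification of \d
print(resolve_class(digits, universe, 2))  # strict: only 0-4
print(resolve_class(digits, universe, 0))  # lax: all ten digits
```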

Let's see these in action with an example. If you are using a character universe of only the letters A-E, and generating all possible words, but you want each word of output to start with Z, you'd like to be able to use regldg -u "[ABCDE]" "Z.*". Here we go:

> regldg "--universe=[ABCDE]" "Z.*"
regldg: (Error) Z.*
regldg: (Error) ^
regldg: (Error) parse_regex_pass_char: specified character is not in the universe!

It didn't work! regldg got angry because we tried to use Z, but Z isn't in the character universe [ABCDE]. OK, so let's turn off universe checking and see what happens.

> regldg "--universe=[ABCDE]" --universe-checking=0 "Z.*"
Z
Z
Z
Z
Z
Z
Z
Z
Z
Z
Z
Z
Z

Z

Z
...

Hmmm. Unexpected results! And wonderful system bells! Let's turn on --readable-output to see what happened.

> regldg "--universe=[ABCDE]" --universe-checking=0 --readable-output "Z.*"
Z
Z{000}
Z{001}
Z{002}
Z{003}
Z{004}
Z{005}
Z{006}
Z{007}
Z{008}
Z{009}
Z{010}
Z{011}
Z{012}
Z{013}
...

It did allow us to start words with Z, which wasn't in the character universe, but why are we getting ASCII characters starting from 0? The problem is the . meta-character. As explained above, regldg allows you to make meta-characters retain all their characters (lax checking), or have the classes they represent thinned according to the current character universe (strict checking). (Technically: the character or meta-character class can be intersected with the character universe.) In the above example, the . meta-character was allowed to represent all ASCII values 0-255, so we didn't get only the expected ZA, ZB, ZC, ZD, and ZE. Since we want . to represent only those characters in the current character universe, we should turn on this type of strict character universe checking by setting the --universe-checking option to 2. That value leaves the checking of explicitly entered characters off, so the Z is still allowed. So, using this information:

> regldg "--universe=[ABCDE]" --universe-checking=2 "Z.*"
Z
ZA
ZB
ZC
ZD
ZE
ZAA
ZBA
ZCA
ZDA
ZEA
ZAB
ZBB
ZCB
ZDB
ZEB
ZAC
ZBC
ZCC
ZDC
ZEC
ZAD
...

All set! That is what we wanted.
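
For readers who want to check the listing above by hand, this Python sketch enumerates a literal prefix followed by any number of universe characters, shortest words first and with the first variable position changing fastest. It only mirrors the ordering visible in the output above; literal_then_star is a hypothetical name, and the generator is unbounded, so it is cut off with islice.

```python
from itertools import count, islice, product

def literal_then_star(prefix, universe):
    """Yield prefix followed by 0, 1, 2, ... universe characters.

    Within one length, the position right after the prefix varies
    fastest, matching the listing above.
    """
    for length in count(0):
        for combo in product(universe, repeat=length):
            yield prefix + "".join(reversed(combo))

# Take only the first few words; the sequence itself never ends.
first_words = list(islice(literal_then_star("Z", "ABCDE"), 8))
print(first_words)
```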