Compile Regexes (By Substitution)

Snails, 48 bytes

0 -> )0(\0!(l.)(~

1 -> )0(\1!(l.)(~

( -> )0({{(

) -> )0}}(~

| -> )0}|{(

* -> )0),(~

If we had to search for partial matches rather than matching only the full input, then it would be very easy. 0 would become \0, 1 would become \1, * would become ,, and the others would map to themselves. Instead there are a lot of shenanigans to prevent matches from starting somewhere other than the beginning or ending somewhere other than the end. !(l.) is an assertion that will fail if the beginning of the match is not at the beginning of the input. ~ matches a cell outside of the input, so it is added to all the characters that are allowed to be at the end of the regex. If there is another regex character following, it is canceled out by a numerical quantifier 0 which requires it to be matched 0 times, essentially commenting it out. To allow * (,) to work correctly despite the dummy out-of-bounds test being in the way, the language's bracket matching rules are heavily used. From the documentation:

Matching pairs of parentheses () or curly braces {} will behave as expected (like parentheses in regex), but it is also possible to leave out one half of a pair and have it inferred, according to the following rules. ) or } groups everything to the left until the nearest unclosed group opening instruction of the same type(( or { respectively), or the beginning of the pattern if none exists. It closes any unclosed opening instructions of the opposite type in the middle of this range. An otherwise unmatched ( or { is closed by the end of the pattern.

Clear as mud, right?


CJam, 151 bytes

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+ea`,m*\"T

The lines correspond to the characters 01(|)* (in that order). Try it online!

This uses no built-in regular expression or other types of pattern matching. In fact, CJam has neither of these features. Instead, it starts from the regular expression it represents and builds all possible strings it could match, to finally check if the user input is one of them.

Test runs

The following uses a program that reads a regular expression from STDIN, replaces each of its characters by the proper snippet, and finally evaluates the generated code to see if it matches the input specified in the command line argument.

$ cat regex.cjam
l"01(|)*""

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+ea`,m*\"T

"N%ers~
$ cjam regex.cjam '' <<< '(|)'
1
$ cjam regex.cjam '0' <<< '(|)'
0
$ cjam regex.cjam '' <<< '0(|)'
0
$ cjam regex.cjam '0' <<< '0(|)'
1
$ cjam regex.cjam '' <<< '(0|11)*'
1
$ cjam regex.cjam '0' <<< '(0|11)*'
1
$ cjam regex.cjam '11' <<< '(0|11)*'
1
$ cjam regex.cjam '011011000' <<< '(0|11)*'
1
$ cjam regex.cjam '1010' <<< '(0|11)*'
0

Unfortunately, this isn't particularly fast. It will choke rather quickly if there are more than 9 characters in the input or more than a single Kleene star in the regex.

At the cost of 5 extra bytes – for a total of 156 bytes – we can generate shorter strings to match the potential input against and deduplicate them. This doesn't change how the code works; it just makes it more efficient.

$ cat regex-fast.cjam 
l"01(|)*""

{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM0sa`T
{]Na/Saf+{:m*:sSf-~}%}:J{+:MU{]W=~Jea&,}|}:TM1sa`T
M{{+:M];eas!}:T}|U):UM'[T
MN`T
U(:UM'JT
M\"S+eas,)m*:sSf-L|\"T

"N%ers~
$ cjam regex-fast.cjam '0101001010' <<< '(01|10)*'
0
$ cjam regex-fast.cjam '011001101001' <<< '(01|10)*'
1
$ cjam regex-fast.cjam '0' <<< '(0*1)*'
0
$ time cjam regex-fast.cjam '101001' <<< '(0*1)*'
1