"vertical" regex matching in an ASCII "image"

Answer to question 1

To answer the first question one could use:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line
^                        # beginning of line
(?:
    .                    # any character except \n
    (?=                  # lookahead
        .*+\n            # go to next line
        ( \1?+ . )       # add a character to the 1st capturing group
        .*+\n            # next line
        ( \2?+ . )       # add a character to the 2nd capturing group
    )
)*?                      # repeat as few times as needed
X .*+\n                  # X on the first line and advance to next line
\1?+                     # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+\n                  # X on the 2nd line and advance to next line
\2?+                     # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X                        # X on the 3rd line

Online demo

This expression works in Perl, PCRE, Java and should work in .NET.

The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").

\1?+ means if \1 matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) \1 ). Which means match \1 if \1 is defined.

polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)

Answer to question 2

Plain matching

When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:

It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).

Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.

(Semi?)proof

Assume that no variable length lookbehinds are available.

You have to in some way count the number of characters on a line before an X.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.

Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.

Length/indirect solution

On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).

This solution is based on/inspired by m.buettner's nice "partial PCRE solution".

One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.

^
(?:
    (?:                   # match .+? characters
        .
        (?=               # counting the same number on the following two lines
            .*+\n
            ( \1?+ . )
            .*+\n
            ( \2?+ . )
        )
    )+?
    (?<= X )              # till the above consumes an X
    (?=                   # that matches the following conditions
        .*+\n
        \1?+
        (?<= X )
        .*+\n
        \2?+
        (?<= X )
    )
    (?=                   # count the number of matches
        .*+\n
        ( \3?+ . )        # the number of matches = length of $3
    )
)*                        # repeat as long as there are matches on this line
.*\n?                     # remove the rest of the line

Which in Perl could be written as:

$in =~ s/regex/$3/gmx;
$count = length $in;

Online demo

This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.

Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.

Test cases

Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).

Test #0:
--------------------
X
X
X

result: 1 (X)


Test #1:
--------------------
..X....
..X....
..X....

result: 1 (.)


Test #2:
--------------------
..X.X..
..X.X..
....X..

result: 1 (.)


Test #3:
--------------------
..X....
..X....
...X...

result: 0 ()


Test #4:
--------------------
..X....
...X...
..X....

result: 0 ()


Test #5:
--------------------
....X..
.X..X..
.X.....

result: 0 ()


Test #6:
--------------------
.X..X..
.X.X...
.X.X...

result: 1 (.)


Test #7:
--------------------
.X..X..
.X..X..
.X..X..

result: 2 (.X)


Test #8:
--------------------
XXX
XXX
XXX

result: 3 (XXX)


Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.

result: 5 (XXXXX)


Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.

result: 8 (3458.XXX)

Edit

The following solutions have two grave problems:

  1. They can't match multiple XXX sequences starting on the same line, as the pos advances too much.
  2. The second solution is incorrect: it matches consecutive lines where two X are above each other. There don't neccessarily have to be three in a row.

Therefore, all upvotes (and the bounty) should go to either of m.buettner's comprehensive .NET answer or the fascinating PCRE answer from Qtax himself.


Original Answer

This is an answer using embedding of Perl code into regexes. Because a Perl regex can use code to assert arbitrary conditions inside regexes or emit partial regexes, they are not limited to matching regular languages or context-free languages, but can match some parts of languages higher up in the Chomsky hierarchy.

The language you want to match can be described in regex terms as

^ .{n} X .*\n
  .{n} X .*\n
  .{n} X

where n is a number. This is about as complex as matching the anbncn language which is the canonical example for a context-sensitive language.

We can match the first line easily, and use some Perl code to emit the regex for the other lines:

    /^ (.*?) X
       (?: .*\n (??{"." x length($1)}) X){2}
    /mx

That was short! What does it do?

  • ^ (.*?) X anchores at the start of a line, matches as few non-newline characters as possible and then the X. We remember the line up to the X as capture group $1.

  • We repeat a group two times which matches the rest of the line, a newline, and then injects a regex that matches a string of the same length as $1. After that, there must be an X.

This regex will now match every string that has three X on top of each other.

If we want to extract all such sequences, we'll have to be nifty. Because sequences may overlap, e.g.

.X
XX
XX
X.

the position where the next match starts must not proceed past the first X. We can do this via a lookbehind and lookahead. Perl only supports constant-length lookbehind, but has the \K escape which provides similar semantics. Thus

/^ (.*?) \K X
   (?=( (?: .*\n (??{"."x length($1)}) X ){2} ))
/gmx

will match every sequence of three vertical Xes. Testing time:

$ perl -E'my$_=join"",<>; say "===\n$1X$2" while /^(.*?)\KX(?=((?:.*\n(??{"."x length($1)})X){2}))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXX
===
X.X...X..X.....
X....XXXXXX.....
X
===
X....XXXXXX.....
X..XXX...........
.....X
===
..............X
..X...........X....
..X...........X

Note: this relies on experimental regex features that are available from at least Perl 5, v10 onward. The code was tested with a v16 perl.


Solution without embedded code

Let us look at the lines

...X...\n
...X..\n

We want to assert that the leading ... part of each line is of same length. We can do so by recursion with base case X.*\n:

(X.*\n|.(?-1).)X

If we anchor that at the start of a line, we can match two vertical Xes. To match more than two lines, we have to do the recursion in a lookahead and then advance the match position to the next line, and repeat. For this, we simply match .*\n.

This results in the following regex which can match a string with three vertical Xes:

/ ^
  (?:
    (?=( X.*\n | .(?-1). ) X)
    .*\n # go to next line
  ){2}
/mx

But this isn't good enough, as we want to match all such sequences. To do this, we essentially put the whole regex into a lookahead. The regex engine makes sure to advance the position every time to produce a new match.

/ ^
  (?=
    (
      (?:
          (?= (X.*\n | .(?-1). ) X)
          .*\n # go to next line
      ){2}
      .* # include next line in $1
    )
  )
/mx

Testing time:

$ perl -E'my$_=join"",<>; say "===\n$1" while /^(?=((?:(?=(X.*\n|.(?-1).)X).*\n){2}.*))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
===
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
===
X....XXXXXX.....
X..XXX...........
.....X..........
===
..............X
..X...........X....
..X...........X....X...

So this works as well as the solution with embedded code, that is, it matches each group of lines with vertical Xes, not each group of Xes. (Actually, this solution seems more fragile to me than embedded code)