Regular expression matching a multiline block of text

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print "Title:",title
...   print "Sequence:",sequence
...   print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
  • Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
  • [A-Z]+\n means "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.