How to match a paragraph using regex

You can split on double-newline like this:

paragraphs = re.split(r"\n\n", DATA)

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
   print match.start(), match.end()

# Prints:
# 0 214
# 215 298
# 299 589

Using split is one way, you can do so with regular expression also like this:

paragraphs = re.search('(.+?\n\n|.+?$)',TEXT,re.DOTALL)

The .+? is a lazy match, it will match the shortest substring that makes the whole regex matched. Otherwise, it will just match the whole string.

So basically here we want to find a sequence of characters (.+?) which ends by a blank line (\n\n) or the end of string ($). The re.DOTALL flag makes the dot to match newline also (we also want to match a paragraph consisting of three lines without blank lines within)

How to match a paragraph using regex

Tags:

Python

Regex

Paragraph

Related

Recent Posts