How to match a paragraph using regex
You can split on double-newline like this:
paragraphs = re.split(r"\n\n", DATA)
Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:
for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())
# Prints:
# 0 214
# 215 298
# 299 589
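As a rough, self-contained illustration of the finditer approach above, here is a sketch run on a short made-up string. The DATA value and the spans name are assumptions for demonstration only, not the original input that produced the offsets shown.

```python
import re

# Hypothetical sample text: three paragraphs separated by blank lines.
DATA = "First line.\nSecond line.\n\nAnother paragraph.\n\nLast one."

# Each match is a run of non-newline characters, each optionally followed
# by a single newline, so a blank line (two newlines in a row) ends the run.
pattern = r'(?s)((?:[^\n][\n]?)+)'
spans = [(m.start(), m.end()) for m in re.finditer(pattern, DATA)]

for start, end in spans:
    # rstrip drops the single trailing newline that the match includes.
    print(start, end, repr(DATA[start:end].rstrip("\n")))
```

Note that each match except the last includes the one newline that ends the paragraph's final line, so strip it if you only want the paragraph text.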
Using split is one way; you can also do it by matching with a regular expression, like this:
paragraphs = re.findall('(.+?\n\n|.+?$)', TEXT, re.DOTALL)
The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. Without the ?, the greedy .+ would consume as much as possible and, with re.DOTALL, run across paragraph boundaries.
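To see the difference between lazy and greedy matching, here is a small sketch on a made-up string. The TEXT value and the names lazy and greedy are assumptions for illustration.

```python
import re

# Hypothetical sample text with three short paragraphs.
TEXT = "one\n\ntwo\n\nthree"

# Lazy: stops at the first blank line it can reach.
lazy = re.search(r'.+?\n\n', TEXT, re.DOTALL).group()

# Greedy: grabs as much as possible, then backtracks to the last blank line.
greedy = re.search(r'.+\n\n', TEXT, re.DOTALL).group()

print(repr(lazy))    # 'one\n\n'
print(repr(greedy))  # 'one\n\ntwo\n\n'
```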
So basically here we want to find a sequence of characters (.+?) which ends with a blank line (\n\n) or the end of the string ($).
The re.DOTALL flag makes the dot match newlines too. We need that because a paragraph can span several lines (for example, three lines with no blank line between them) and should still be matched as one unit.
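Putting the pieces together, here is a minimal sketch that collects every paragraph with re.findall (re.search would stop at the first match). The TEXT value and the cleaned name are made up for illustration.

```python
import re

# Hypothetical sample text: a three-line paragraph followed by two one-liners.
TEXT = "line 1\nline 2\nline 3\n\nsecond paragraph\n\nthird paragraph"

# With one capturing group, findall returns a list of the captured strings.
paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)

# Each match keeps its trailing blank line, so strip it if it is not wanted.
cleaned = [p.strip() for p in paragraphs]

for p in cleaned:
    print(repr(p))
```

The first paragraph comes back as a single item even though it spans three lines, which is what re.DOTALL buys here.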