Multiple negative lookbehind assertions in python regex?
Use nltk or similar tools as suggested by @root.
To answer your regex question:
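import re
import sys

# Split on ". " followed by a capital letter, unless the period follows
# "Abs" or "S", or the next word is one of the listed month names.
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))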
Input
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']
First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z will not work (and remember that there may be uppercase letters other than A-Z).
Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first pattern matches. If it is preceded by Abs, it is not preceded by S, so the second pattern matches. Either way, one of those patterns will match, since Abs and S are mutually exclusive.
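A small sketch of that effect, assuming the failing attempt joined the two lookbehinds with | (the exact original pattern is not shown here, so the `either` pattern below is a hypothetical reconstruction, and the sample text is made up):

```python
import re

# Made-up sample text for illustration.
text = "Abs. Forth. S. Fifth. Sixth. Seventh"

# Hypothetical reconstruction of the failing attempt: the two lookbehinds
# are alternatives, so it is enough for ONE of them to succeed.
either = re.compile(r"(?:(?<!Abs)|(?<!S))\. [A-Z]")

# Chained lookbehinds: BOTH must succeed at the same position.
both = re.compile(r"(?<!Abs)(?<!S)\. [A-Z]")

either_hits = [m.start() for m in either.finditer(text)]
both_hits = [m.start() for m in both.finditer(text)]

print(either_hits)  # every ". X" position, including those after Abs and S
print(both_hits)    # only the positions not preceded by Abs or S
```

With the alternation, the periods after Abs and S are still reported, because whichever lookbehind does not apply succeeds trivially.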
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |. Without it, the expression says "not preceded by Abs and not preceded by S". If both are true, the pattern matcher continues scanning the string and finds your match.
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative lookahead patterns.
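As a rough check of the pattern above, the following sketch runs it over a made-up sample string and compares it with a variant that chains three separate lookaheads. The sample text and the variable names are assumptions for illustration only.

```python
import re

# Pattern from the answer; group 1 captures the period plus whitespace.
pattern = re.compile(
    r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]")

# Variant with the lookahead split into three chained assertions.
chained = re.compile(
    r"(?<!Abs)(?<!S)(\.\s+)(?!January)(?!February)(?!March)[A-Z]")

# Made-up sample text for illustration.
sample = "First. Second. January. Third. Abs. Fourth. S. Fifth."

# Index of each period that the pattern treats as a sentence boundary.
boundaries = [m.start(1) for m in pattern.finditer(sample)]
boundaries_chained = [m.start(1) for m in chained.finditer(sample)]

print(boundaries)
print(boundaries == boundaries_chained)
```

Unlike lookbehinds in Python's re module, a lookahead may contain alternatives of different lengths, so either spelling should behave the same way here.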
Use the NLTK Punkt tokenizer. It's probably more robust than using a regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple negative lookbehinds of different lengths is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example, and 3example, but not 1example, 12example, or 123example.
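A minimal sketch to check this with Python's standard re module. The candidate list and variable names are only for illustration. The second half assumes, as far as I know is the case for the standard re module, that a single lookbehind with alternatives of different lengths is rejected, which is why chaining is needed in the first place.

```python
import re

# Chained negative lookbehinds, each with its own fixed width.
pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

candidates = ["example", "2example", "3example",
              "1example", "12example", "123example"]
matched = [s for s in candidates if pattern.search(s)]
print(matched)

# Putting the alternatives into ONE lookbehind makes its width variable;
# the standard re module is expected to refuse to compile that.
try:
    re.compile(r"(?<!1|12|123)example")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```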