Match names, dialogues, and actions from transcript using regex
You can do this with re.findall
:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
You will have to figure out how to remove the square braces yourself, that cannot be done with regex while still attempting to match everything.
Regex Breakdown
\b # Word boundary
(\S+) # First capture group, string of characters not having a space
: # Colon
( # Second capture group
[^ # Match anything that is not...
: # a colon
\[\] # or square braces
]+? # Non-greedy match
)
\n? # Optional newline
( # Third capture group
\[ # Literal opening brace
[^:]+? # Similar to above - exclude colon from match
\]
\n? # Optional newlines
)? # Third capture group is optional
(?= # Lookahead for...
\b # a word boundary, followed by
\S+ # one or more non-space chars, and
: # a colon
| # Or,
$ # EOL
)
Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby
def isName(word):
# Names end with ':'
return word.endswith(":")
text_split = [
" ".join(list(g)).rstrip(":")
for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split
into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions is admittedly an option here, but I'm purposely avoiding that in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
return word.endswith("]")
def processWords(words):
if "[" not in words:
return [words, None]
else:
return [
" ".join(g).replace("]", ".")
for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
]
print(
[(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using the *
to unpack the result of processWords
into the tuple
is strictly a python 3 feature.