How to extract specific content in a pandas dataframe with a regex?
You can try str.extract followed by strip, but str.split is the better choice, because movie names can contain numbers too. Another option is to remove the content in parentheses with a regex via str.replace, then strip leading and trailing whitespace:
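# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)
# extract keeps only letters and spaces, so it also drops numbers that belong to the title
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesized part; newer pandas needs regex=True to treat the pattern as a regex
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)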
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
You should wrap the part of the text you want to capture in a group, ( ), like below:
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']
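To see what the capture group does in isolation, here is a minimal, self-contained sketch. The two-row DataFrame is made-up sample data modeled on the titles in this thread, and the variable name titles is only for illustration. Only the text matched inside the unescaped parentheses is returned; the escaped \( is matched but not captured.

```python
import pandas as pd

# Hypothetical sample data, modeled on the titles used in this thread.
df = pd.DataFrame({'movie_title': ['Toy Story 2 (1995)', 'GoldenEye (1995)']})

# (.+?) is the capture group; the trailing " \(" must match a literal
# space and opening parenthesis, but it is not part of the result.
# expand=False returns a Series instead of a one-column DataFrame.
titles = df['movie_title'].str.extract(r'(.+?) \(', expand=False)

print(titles.tolist())
```

Because the group is lazy (.+?), it stops at the first " (" it finds, so digits that belong to the title are kept.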
pandas.core.strings.StringMethods.extract
StringMethods.extract(pat, flags=0, **kwargs)
Find groups in each string using passed regular expression
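As a rough illustration of "find groups", the sketch below uses two named groups so that extract returns one column per group. The sample data and the group names title and year are assumptions made up for this example, not something from the original question.

```python
import pandas as pd

# Hypothetical sample data, modeled on the titles used in this thread.
df = pd.DataFrame({'movie_title': ['Toy Story 2 (1995)', 'GoldenEye (1995)']})

# Each group becomes a column; named groups (?P<name>...) set the column names.
# title: everything before the parenthesized year (lazy match)
# year:  the four digits inside the parentheses
parts = df['movie_title'].str.extract(r'^(?P<title>.+?)\s*\((?P<year>\d{4})\)$')

print(parts)
```

Note that the extracted year is still a string; convert it with astype(int) if you need numbers.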
I wanted to extract the text after the symbol "@" and before the symbol "." (period). I tried this, and it more or less worked, but the result still includes the "@" symbol, which I do not want:
df['col'].astype(str).str.extract('(@.+.+)