Expression to remove URL links from Twitter tweet
The following regex captures two groups: the first contains everything in the tweet up to the URL, and the second contains everything that comes after the URL (empty in the example you posted above):
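import re
text = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
# \S* consumes the URL itself; the greedy (.*) then captures whatever follows it
clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', text)
if clean_tweet:
    print(clean_tweet.group(1))  # everything before the URL
    print(clean_tweet.group(2))  # everything after the URL (empty here)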
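The match-based approach above only separates the text around the first URL. As a minimal sketch for tweets with several links, re.split can return every text segment between URLs. The sample tweet and the variable names here are made up for illustration:

```python
import re

# Hypothetical sample tweet containing two URLs.
tweet = 'Intro http://t.co/abc middle https://t.co/xyz end'

# re.split drops every match of the pattern and returns the text between matches.
segments = re.split(r'https?://\S+', tweet)
print(segments)
```

The segments keep the spaces that surrounded each URL, so they may need stripping before being joined back together.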
You can use:
import re
text = 'Amazing save #FACup #zeebox https://stackoverflow.com/tiUya56M Ok'
text = re.sub(r'https?:\/\/\S*', '', text, flags=re.MULTILINE)
# output: 'Amazing save #FACup #zeebox  Ok' (the spaces on both sides of the URL remain)
'r': Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
'?': Causes the resulting RE to match 0 or 1 repetitions of the preceding RE, so https? matches either 'http' or 'https'.
'https?:\/\/': Matches "http://" or "https://" anywhere in the string.
'\S': Matches any single character that is not whitespace.
'*': Matches zero or more occurrences of the preceding RE.
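Removing a URL from the middle of a tweet leaves behind the spaces that were on either side of it. A small sketch of one way to tidy that up follows; the helper name strip_urls is an assumption for illustration, not part of any library:

```python
import re

def strip_urls(text):
    """Remove http/https URLs, then collapse the whitespace left behind."""
    no_urls = re.sub(r'https?://\S*', '', text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r'\s+', ' ', no_urls).strip()

cleaned = strip_urls('Amazing save #FACup #zeebox https://stackoverflow.com/tiUya56M Ok')
print(cleaned)  # Amazing save #FACup #zeebox Ok
```

Note that collapsing whitespace also flattens any line breaks in the tweet, which may or may not be what you want.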
You could try the re.sub call below to remove the URL from your string, assuming the URL is the only thing after the first colon:
>>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
>>> m = re.sub(r':.*$', ":", str)
>>> m
'This is a tweet with a url:'
The pattern matches from the first : to the end of the string, and the : in the replacement string puts the colon back at the end.
This returns everything up to and including the first : symbol:
>>> m = re.search(r'^.*?:', str).group()
>>> m
'This is a tweet with a url:'
Do this:
result = re.sub(r"http\S+", "", subject)
'http': matches the literal characters "http".
'\S+': matches one or more non-whitespace characters (through the end of the URL).
The match is replaced with the empty string.
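One thing to watch for: http\S+ matches any word that starts with "http", not only URLs. The sketch below compares it with a stricter pattern that also requires "://"; the sample sentence is hypothetical:

```python
import re

# Hypothetical text where "httpd" is an ordinary word, not a link.
subject = 'restart httpd and see http://t.co/abc'

loose = re.sub(r'http\S+', '', subject)        # also removes "httpd"
strict = re.sub(r'https?://\S+', '', subject)  # removes only the real URL

print(repr(loose))
print(repr(strict))
```

For typical tweets the two patterns behave the same, so the shorter one is often good enough.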