Find http:// and or www. and strip from domain. leaving domain.com
It might be overkill for this specific situation, but i'd generally use urlparse.urlsplit
(Python 2) or urllib.parse.urlsplit
(Python 3).
from urllib.parse import urlsplit # Python 3
from urlparse import urlsplit # Python 2
import re
url = 'www.python.org'
# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid
if not re.match(r'http(s?)\:', url):
url = 'http://' + url
# url is now 'http://www.python.org'
parsed = urlsplit(url)
# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is None, since (strictly speaking) the path was not defined
host = parsed.netloc # www.python.org
# Removing www.
# This is a bad idea, because www.python.org could
# resolve to something different than python.org
if host.startswith('www.'):
host = host[4:]
You can do without regexes here.
with open("file_path","r") as f:
lines = f.read()
lines = lines.replace("http://","")
lines = lines.replace("www.", "") # May replace some false positives ('www.com')
urls = [url.split('/')[0] for url in lines.split()]
print '\n'.join(urls)
Example file input:
http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com
Output:
foo.com
foobar.com
bar.com
foobar.com
Edit:
There could be a tricky url like foobarwww.com, and the above approach would strip the www. We will have to then revert back to using regexes.
Replace the line lines = lines.replace("www.", "")
with lines = re.sub(r'(www.)(?!com)',r'',lines)
. Of course, every possible TLD should be used for the not-match pattern.