Python urlparse -- extract domain name without subdomain
This is an update, based on the bounty request for an updated answer
Start by using the tld package. A description of the package:
Extracts the top level domain (TLD) from the URL given. List of TLD names is taken from Mozilla http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat?raw=1
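from tld import get_tld
from tld.utils import update_tld_names

# Sync the local TLD list with the latest version from Mozilla.
update_tld_names()

# Note: newer releases of tld may return only the suffix (e.g. co.uk) from
# get_tld and offer get_fld for the registered domain; check your version.
print(get_tld("http://www.google.co.uk"))
print(get_tld("http://zap.co.it"))
print(get_tld("http://google.com"))
print(get_tld("http://mail.google.com"))
print(get_tld("http://mail.google.co.uk"))
print(get_tld("http://google.co.uk"))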
This outputs
google.co.uk
zap.co.it
google.com
google.com
google.co.uk
google.co.uk
Notice that it correctly handles country-level TLDs by leaving co.uk and co.it, but properly removes the www and mail subdomains for both .com and .co.uk.
The update_tld_names() call at the beginning of the script is used to update/sync the TLD names with the most recent version from Mozilla.
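To see why a suffix list is needed at all, here is a minimal sketch of the naive approach using only the standard library's urlparse. The helper name naive_domain is hypothetical and is not part of tld; it simply keeps the last two labels of the hostname, which works for .com but breaks on country-level suffixes.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Keep only the last two dot-separated labels of the hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(naive_domain("http://mail.google.com"))   # google.com
print(naive_domain("http://www.google.co.uk"))  # co.uk, not google.co.uk
```

The second result is the case the suffix list exists to fix: without knowing that co.uk is itself a public suffix, there is no way to tell how many labels to keep.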
You probably want to check out tldextract, a library designed to do this kind of thing.
It uses the Public Suffix List to get a decent split based on known public suffixes (gTLDs as well as country-level suffixes such as co.uk). Do note that this is just a lookup list, nothing special, so it can get out of date (although hopefully it's curated so as not to).
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
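Conceptually, that split comes down to finding the longest known suffix at the end of the hostname and keeping one more label in front of it. The following is a rough sketch of that idea under stated assumptions, not tldextract's actual implementation: SUFFIXES is a tiny hand-picked set chosen to match the examples in this thread, the function name registered_domain is hypothetical, and the wildcard and exception rules of the real Public Suffix List are ignored.

```python
from urllib.parse import urlparse

# Assumption: a tiny illustrative subset. The real Public Suffix List has
# thousands of entries plus wildcard and exception rules not handled here.
SUFFIXES = {"com", "uk", "co.uk", "it", "co.it"}


def registered_domain(url, suffixes=SUFFIXES):
    """Return the suffix plus one label to its left, or None if no match."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the whole host is a bare suffix such as co.uk.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


print(registered_domain("http://forums.news.cnn.com/"))  # cnn.com
print(registered_domain("http://mail.google.co.uk"))     # google.co.uk
```

Checking the longest candidate first is what keeps co.uk from being mistaken for a registered domain under uk. For real use, tldextract takes care of the full list for you.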
So in your case:
>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'