How can I get the base of a URL in Python?
The best way to do this is to use urllib.parse.
From the docs:
The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.
You'd want to do something like this using urlsplit and urlunsplit:
from urllib.parse import urlsplit, urlunsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
# You now have:
# split_url.scheme "http"
# split_url.netloc "127.0.0.1"
# split_url.path "/asdf/login.php"
# split_url.query "q=abc"
# split_url.fragment "stackoverflow"
# Use all the path except everything after the last '/'
clean_path = "".join(split_url.path.rpartition("/")[:-1])
# "/asdf/"
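# urlunsplit joins a (scheme, netloc, path, query, fragment) tuple back into a URL;
# swap in the trimmed path and drop the query and fragment to get the base
clean_url = urlunsplit(split_url._replace(path=clean_path, query="", fragment=""))
# "http://127.0.0.1/asdf/"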
# A more advanced example
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
# You now have *in addition* to the above:
# advanced_split_url.username "foo"
# advanced_split_url.password "bar"
# advanced_split_url.hostname "127.0.0.1"
# advanced_split_url.port 5000 (an int, not a string)
Well, for one, you could just use os.path.dirname:
>>> import os
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'
It's not explicitly for URLs, but it happens to work on them (even on Windows). It just doesn't leave the trailing slash, which you can add back yourself.
You may also want to look at urllib.parse.urlparse for more fine-grained parsing. If the URL has a query string or fragment, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing the query and fragment.
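The parse, trim, and recombine steps could look roughly like this. It's only a sketch: the helper name trim_last_path_segment is made up for illustration, and it assumes you want to keep the trailing slash on the trimmed path.

```python
from urllib.parse import urlparse, urlunparse


def trim_last_path_segment(url):
    """Drop everything after the last '/' in the path; keep query and fragment."""
    parts = urlparse(url)
    # rpartition returns (head, sep, tail); head + sep keeps the trailing slash
    head, sep, _tail = parts.path.rpartition("/")
    # ParseResult is a namedtuple, so _replace swaps only the path
    return urlunparse(parts._replace(path=head + sep))


print(trim_last_path_segment("http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"))
# http://127.0.0.1/asdf/?q=abc#stackoverflow
```

If you don't want the query and fragment at all, pass query="" and fragment="" to _replace as well.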
Lastly, if you just want to split off the component after the last slash, you can do an rsplit with a maxsplit of 1 and keep the first component:
>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
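One caveat with plain string splitting is that it cuts at the last slash anywhere in the string, including inside a query string. A small comparison, using a made-up URL whose query contains a slash, shows why parsing first is safer in that case:

```python
from urllib.parse import urlsplit

# Hypothetical URL with a '/' inside the query string
url = "http://127.0.0.1/asdf/login.php?next=/home"

# Plain string splitting cuts at the last '/', which here sits inside the query
naive = url.rsplit("/", 1)[0]

# Splitting only the parsed path component avoids that
parts = urlsplit(url)
parsed = f"{parts.scheme}://{parts.netloc}{parts.path.rsplit('/', 1)[0]}"

print(naive)   # http://127.0.0.1/asdf/login.php?next=
print(parsed)  # http://127.0.0.1/asdf
```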