Canonicalize / normalize a URL?
How about this:
In [1]: from urllib.parse import urljoin
In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'
Inspired by answers to this question. It doesn't normalize ports, but it should be simple to whip up a function that does.
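A minimal sketch of such a function, assuming only http and https matter and that their default ports are 80 and 443. The names DEFAULT_PORTS and strip_default_port are made up for illustration:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: only these two schemes need handling; extend the mapping as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}

def strip_default_port(url):
    """Return url with an explicit port removed when it is the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None and DEFAULT_PORTS.get(parts.scheme) == parts.port:
        # The port is always the text after the last ':' in the netloc,
        # including for user:pass@host:80 and [::1]:80.
        parts = parts._replace(netloc=parts.netloc.rsplit(":", 1)[0])
    return urlunsplit(parts)

print(strip_default_port(urljoin("http://example.com:80/a/b/c/../", ".")))
# http://example.com/a/b/
```

Keep in mind that urljoin(url, '.') resolves to the directory of the URL, so it discards a trailing file name and any query string.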
There is now a library dedicated to this exact problem: url-normalize.
According to its docs, it does more than just normalise the path:
URI Normalization function:
- Take care of IDN domains.
- Always provide the URI scheme in lowercase characters.
- Always provide the host, if any, in lowercase characters.
- Only perform percent-encoding where it is essential.
- Always use uppercase A-through-F characters when percent-encoding.
- Prevent dot-segments appearing in non-relative URI paths.
- For schemes that define a default authority, use an empty authority if the default is desired.
- For schemes that define an empty path to be equivalent to a path of "/", use "/".
- For schemes that define a port, use an empty port if the default is desired.
- All portions of the URI must be UTF-8 encoded NFC from Unicode strings.
Here is an example:
from url_normalize import url_normalize
url = 'http://google.com:80/a/../'
print(url_normalize(url))
Which gives:
http://google.com/
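To give a feel for what a couple of the listed rules involve (lowercase scheme and host, uppercase hex digits in percent-escapes), here is a rough standard-library sketch. It is not how url-normalize is implemented, and normalize_case is a hypothetical helper name:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches an existing percent-escape such as %2f or %E9.
_PCT = re.compile(r"%[0-9a-fA-F]{2}")

def _upper_escapes(text):
    """Uppercase the hex digits of every percent-escape in text."""
    return _PCT.sub(lambda m: m.group(0).upper(), text)

def normalize_case(url):
    parts = urlsplit(url)
    # Lowercase only host[:port]; any userinfo before '@' is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((
        parts.scheme.lower(),
        netloc,
        _upper_escapes(parts.path),
        _upper_escapes(parts.query),
        _upper_escapes(parts.fragment),
    ))

print(normalize_case("HTTP://User@Example.COM/a%2fb?x=%e9"))
# http://User@example.com/a%2Fb?x=%E9
```

This leaves IDN handling, dot-segments and default ports untouched, which is where the library earns its keep.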
This is what I use and it's worked so far. You can get urlnorm from pip.
Notice that I sort the query parameters. I've found this to be essential.
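# Python 2 imports; on Python 3 these names all live in urllib.parse.
from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    # Keep only the part of the path before the first literal space.
    path = split[2].split(' ')[0]
    # Drop leading '/..' segments that would climb above the root.
    while path.startswith('/..'):
        path = path[3:]
    # Drop trailing encoded spaces.
    while path.endswith('%20'):
        path = path[:-3]
    # Sort the query parameters so their order does not matter.
    qs = urlencode(sorted(parse_qsl(split.query)))
    # The fragment is discarded on purpose.
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))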
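For anyone on Python 3 who only wants the query-sorting step, here is a standard-library sketch that does not depend on urlnorm. sort_query is a name I made up, and keeping blank values is my assumption rather than something the code above does:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def sort_query(url):
    """Return url with its query parameters sorted and its fragment dropped."""
    parts = urlsplit(url)
    # keep_blank_values=True so that 'a=' is not silently lost (an assumption).
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))

print(sort_query("http://example.com/p?b=2&a=1#frag"))
# http://example.com/p?a=1&b=2
```

Two URLs that differ only in parameter order then compare equal, which is the point of sorting.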
Old (deprecated) answer
The [no longer maintained] urltools module normalizes multiple slashes, . and .. components without messing up the double slash in http://.
Once you do pip install urltools (this does not work anymore as the author renamed the repo), the usage is as follows:
import urltools
print(urltools.normalize('http://example.com:80/a////b/../c'))
# prints: http://example.com/a/c
While the module is not pip-installable anymore, it is a single file, so you can re-use bits of it.
Updated answer for Python3
For Python 3, consider using urljoin from the urllib.parse module.
from urllib.parse import urljoin
urljoin('https://stackoverflow.com/questions/10584861/', '../dinsdale')
# Out[17]: 'https://stackoverflow.com/questions/dinsdale'