Canonicalize / normalize a URL?

How about this:

In [1]: from urllib.parse import urljoin

In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'

Inspired by answers to this question. It doesn't normalize ports, but it should be simple to whip up a function that does.

There is now a library dedicated this exact problem url-normalize

It does more than just normalising the path as per the docs:

URI Normalization function:

Take care of IDN domains.

Always provide the URI scheme in lowercase characters.

Always provide the host, if any, in lowercase characters.

Only perform percent-encoding where it is essential.

Always use uppercase A-through-F characters when percent-encoding.

Prevent dot-segments appearing in non-relative URI paths.

For schemes that define a default authority, use an empty authority if the default is desired.

For schemes that define an empty path to be equivalent to a path of "/", use "/".

For schemes that define a port, use an empty port if the default is desired

All portions of the URI must be utf-8 encoded NFC from Unicode strings

Here is an example:

from url_normalize import url_normalize

url = 'http://google.com:80/a/../'
print(url_normalize(url))

Which gives:

http://google.com/

This is what I use and it's worked so far. You can get urlnorm from pip.

Notice that I sort the query parameters. I've found this to be essential.

from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    path = split[2].split(' ')[0]

    while path.startswith('/..'):
        path = path[3:]

    while path.endswith('%20'):
        path = path[:-3]

    qs = urlencode(sorted(parse_qsl(split.query)))
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))

Old (deprecated) answer

The [no longer maintained] urltools module normalizes multiple slashes, . and .. components without messing up the double slash in http://.

Once you do ~~pip install urltools~~ (this does not work anymore as the author renamed the repo) the usage is as follows:

print urltools.normalize('http://example.com:80/a////b/../c')
>>> 'http://example.com/a/c'

While the module is not pip-installable anymore it is a single file so you can re-use bits of it.

Updated answer for Python3

For Python3 consider using urljoin from the urllib.urlparse module.

from urllib.parse import urljoin

urljoin('https://stackoverflow.com/questions/10584861/', '../dinsdale')
# Out[17]: 'https://stackoverflow.com/questions/dinsdale'

Canonicalize / normalize a URL?

Old (deprecated) answer

Updated answer for Python3

Tags:

Url

Python 3.X

Normalization

Related

Recent Posts