Function in Python to clean up and normalize a URL

Take a look at urlparse.urlparse(). I've had good success with it.


note: This answer is from 2011 and is specific to Python2. In Python3 the urlparse module has been named to urllib.parse. The corresponding Python3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html


It's done in scrapy:

http://nullege.com/codes/search/scrapy.utils.url.canonicalize_url

Canonicalize the given url by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent encode paths and query arguments. non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
  • normalize all spaces (in query arguments) '+' (plus symbol)
  • normalize percent encodings case (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)

Tags:

Python

Url