URL parsing in Python - normalizing double-slash in paths

If you only want the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query, if there is one, at index 1.

A '?' may appear again inside the query itself, so split only on the first one.
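A minimal sketch of that approach, written for Python 3. The helper name strip_query is made up for illustration and is not part of any library:

```python
def strip_query(url):
    """Hypothetical helper: return url with everything from the first '?' removed."""
    # maxsplit=1 splits on the first '?' only, so index 0 is always the part before the query
    return url.split('?', 1)[0]

print(strip_query('http://www.example.com//path?foo=bar'))  # http://www.example.com//path
print(strip_query('http://www.example.com/path?a=1?b=2'))   # http://www.example.com/path
print(strip_query('http://www.example.com/path'))           # http://www.example.com/path
```

Note that this leaves the double slash in the path untouched and also drops any fragment that follows the query.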

On its own, the path (//path) is not valid: it confuses the function, which interprets it as a hostname.

http://tools.ietf.org/html/rfc3986.html#section-3.3

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
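To see this in practice, here is a small sketch. It assumes Python 3, where the urlparse module lives at urllib.parse; the variable names are only for illustration:

```python
from urllib.parse import urlparse, urljoin

test_url = 'http://www.example.com//path?foo=bar'

# With a scheme and authority in front, '//path' is simply the path.
full = urlparse(test_url)
print(full.netloc, full.path)                # www.example.com //path

# Parsed on its own, the leading '//' introduces an authority,
# so 'path' ends up as the host and the path is empty.
alone = urlparse(full.path)
print(repr(alone.netloc), repr(alone.path))  # 'path' ''

# urljoin therefore replaces the host rather than the path.
print(urljoin(test_url, full.path))          # http://path
```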

I don't particularly like either of these solutions, but they work:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you are doing, you could do the joining manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values (scheme, netloc, path) from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]

# Rest (params, query, fragment) are blank
for i in range(3, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
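The snippets above use the Python 2 urlparse module. On Python 3 the same idea could look roughly like the sketch below, which combines the slash collapsing with dropping the query. The function name normalize is hypothetical, and urlsplit/urlunsplit are used in place of urlparse/urlunparse since the params field is not needed here:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Hypothetical helper: collapse repeated slashes in the path, drop query and fragment."""
    parts = urlsplit(url)
    # replace any run of two or more '/' in the path with a single '/'
    path = re.sub(r'/{2,}', '/', parts.path)
    # rebuild from scheme, netloc and cleaned path; query and fragment are left blank
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))

print(normalize('http://www.example.com//path?foo=bar'))
# http://www.example.com/path
```

Rebuilding with urlunsplit avoids urljoin entirely, so the path is never re-parsed as a possible hostname.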

It is mentioned in the official urlparse docs that:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you can do:

urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path.replace('//', '/'))

Output: 'http://www.example.com/path'
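One caveat worth checking: str.replace makes a single pass, so a run of three or more slashes is not fully collapsed and the result can still start with '//'. A quick sketch on Python 3 (urllib.parse instead of urlparse), using a made-up URL with three slashes:

```python
import re
from urllib.parse import urljoin, urlparse

test_url = 'http://www.example.com///path?foo=bar'
path = urlparse(test_url).path             # '///path'

once = path.replace('//', '/')             # '//path', still starts with two slashes
collapsed = re.sub(r'/{2,}', '/', path)    # '/path'

print(urljoin(test_url, once))       # http://path
print(urljoin(test_url, collapsed))  # http://www.example.com/path
```

If the input may contain longer runs of slashes, the regular expression is the safer choice.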
