Combining base URL with resultant href in Scrapy

This happens because your base URL is missing the scheme (e.g. http://).

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or, even simpler: urlparse.urljoin(response.url, i[1:]), since urljoin takes the scheme and host from response.url and nothing has to be hard-coded. (On Python 3, urljoin lives in urllib.parse.)
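To see why the scheme matters, here is a small sketch using Python 3's urllib.parse.urljoin (the Python 2 urlparse.urljoin is the same function under its old module name). The href and the page URL are made up for illustration, and the href is passed unchanged rather than sliced.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

href = "/page/2.html"  # hypothetical href extracted from a page

# Without a scheme, the base is parsed as a plain path, not as a host,
# so the host does not end up in the joined result.
no_scheme = urljoin("www.domain.com/", href)

# With a scheme, the host is recognised and kept.
with_scheme = urljoin("http://www.domain.com/", href)

# Joining against the page URL also works, because it already has a scheme.
page_url = "http://www.domain.com/catalog/index.html"  # stands in for response.url
from_page = urljoin(page_url, href)

print(no_scheme)
print(with_scheme)   # http://www.domain.com/page/2.html
print(from_page)     # http://www.domain.com/page/2.html
```

Note that a href without a leading slash is resolved relative to the directory of the page URL, so whether you need to slice the href depends on what your hrefs look like.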

The simplest way to follow a link in Scrapy is response.follow(), which resolves relative URLs against the current response for you.

Quote from docs:

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.

You can also pass a selector for an <a> element directly as the argument.

An alternative solution, if you don't want to use urlparse:

response.urljoin(i[1:])

This goes a step further: Scrapy works out the base URL from the response itself, so you don't have to hard-code http://www.example.com when joining.

This keeps your code reusable if you later change the domain you are crawling.
