Combining base url with resultant href in scrapy
This happens because your base URL is missing the scheme (e.g. http://).
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or, even easier: urlparse.urljoin(response.url, i[1:])
since urlparse.urljoin takes the scheme and domain from the full URL you give it as the base.
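To make the effect of the missing scheme concrete, here is a minimal sketch using only the standard library. It assumes Python 3, where urljoin lives in urllib.parse (the urlparse module used above is its Python 2 home). The domain, page URL, and href are made-up values for illustration.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical values, for illustration only.
base_with_scheme = "http://www.domain.com/"
base_without_scheme = "www.domain.com/"
href = "/path/page.html"

# Without a scheme the parser finds no host at all, only a relative path,
# so the joined result cannot be an absolute URL.
no_host = urlparse(base_without_scheme).netloc
broken = urljoin(base_without_scheme, href[1:])

# With the scheme present, the join yields an absolute URL.
absolute = urljoin(base_with_scheme, href[1:])

# urljoin resolves root-relative hrefs on its own, so with the site root
# as the base the [1:] slice makes no difference.
unsliced = urljoin(base_with_scheme, href)

# The URL of the current page (hypothetical here) also works as the base.
from_page = urljoin("http://www.domain.com/list/index.html", href)

print(absolute)
```

One thing to watch: when the base is a deeper page URL rather than the site root, slicing off the leading slash makes the href resolve relative to that page's directory, which may not be what you want.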
The best way to follow a link in Scrapy is to use response.follow(). Scrapy will handle the rest.
Quote from docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass an <a> element (its selector) directly as the argument.
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes a step further: Scrapy works out the base URL for joining from the response itself, so you don't have to hard-code the obvious http://www.example.com.
This keeps your code reusable if you later change the domain you are crawling.