Scrapy: Save response.body as html file?
The actual problem is that response.body gives you bytes, not a string. You need to convert it to a string before writing it, and there are several ways to do that. You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
You can also use
self.html_file.write(response.text)
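To see the difference outside Scrapy, here is a minimal sketch using only the standard library. The body variable is a hypothetical stand-in for response.body, and the temporary file stands in for self.html_file; neither comes from the original question. It shows that a file opened in text mode rejects bytes and accepts the decoded string.

```python
import os
import tempfile

# Hypothetical stand-in for response.body (always bytes in Scrapy).
body = "<html><body>héllo</body></html>".encode("utf-8")

# Temporary location standing in for the spider's output file.
path = os.path.join(tempfile.mkdtemp(), "index.html")

# Writing bytes to a text-mode file raises TypeError.
try:
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(body)
    wrote_bytes = True
except TypeError:
    wrote_bytes = False

# Decoding first gives a str, which a text-mode file accepts.
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(body.decode("utf-8"))

# Read the file back to confirm what was saved.
with open(path, encoding="utf-8") as html_file:
    saved = html_file.read()
```

Another option, if you want the bytes saved exactly as received, is to open the file in binary mode ('wb') and write response.body unchanged.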
The correct way is to use response.text, and not response.body.decode("utf-8"). To quote the documentation:
Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
and
text: Response body, as unicode. The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
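The reason a hard-coded "utf-8" is fragile can be shown without Scrapy. In this sketch, body and encoding are hypothetical stand-ins for response.body and response.encoding, with the page assumed to be served as Latin-1. Decoding with the wrong codec fails, while decoding with the declared one recovers the text.

```python
# Hypothetical stand-ins for response.body and response.encoding.
body = "café".encode("latin-1")
encoding = "latin-1"

# Hard-coding UTF-8 fails: the byte 0xE9 is not valid UTF-8 on its own.
try:
    body.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the declared encoding recovers the original text.
text = body.decode(encoding)
```

This is the decoding step that response.text performs for you, using the encoding Scrapy determined for the response.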
Taking the answers above into consideration, and making the code as Pythonic as possible by using the with statement, the example should be rewritten like this:
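import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider do its own setup before adding attributes.
        super().__init__(*args, **kwargs)
        # html_path and header_path are assumed to be defined elsewhere.
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # response.text is a str, so the file is opened in text mode.
        # An explicit encoding avoids platform-dependent defaults.
        with open(self.path_to_html, 'w', encoding='utf-8') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }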
Note that html_file will then only be accessible from within the parse method.
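If other callbacks also need to write pages to disk, one option is to move the file handling into a small helper instead of sharing an open file object. This is only a sketch: save_html is a hypothetical name, not part of Scrapy, and it expects a plain str such as response.text.

```python
import os
import tempfile


def save_html(path, text):
    """Write text to path as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as html_file:
        return html_file.write(text)


# Usage with a hypothetical page body; in a spider this would be response.text.
demo_path = os.path.join(tempfile.mkdtemp(), "index.html")
written = save_html(demo_path, "<html></html>")
```

Each call opens and closes its own file, so any callback can use the helper without keeping a file handle on the spider.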