Scrapy: Save response.body as html file?

The actual problem is that response.body is bytes, not a string, so you need to decode it before writing it to a file opened in text mode. There are several ways to convert bytes to a string. You can use

  self.html_file.write(response.body.decode("utf-8"))

instead of

  self.html_file.write(response.body)

You can also use

  self.html_file.write(response.text)
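To see why the conversion matters, here is a minimal sketch that needs no Scrapy at all. The `body` value is only a stand-in for response.body, and the temporary path is an assumption made for the demonstration. It shows that a file opened in text mode rejects bytes but accepts the decoded string.

```python
import os
import tempfile

# Stand-in for response.body: raw bytes as they would arrive from the server.
body = "<html><body>héllo</body></html>".encode("utf-8")

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "index.html")

with open(path, "w", encoding="utf-8") as html_file:
    try:
        html_file.write(body)  # bytes into a text-mode file
        wrote_bytes = True
    except TypeError:
        # Text-mode files only accept str, so the bytes are rejected.
        wrote_bytes = False
    # Decoding first turns the bytes into a str that can be written.
    html_file.write(body.decode("utf-8"))

with open(path, encoding="utf-8") as html_file:
    saved = html_file.read()
```

Opening the file in binary mode ('wb') would be the other way around this: then the bytes can be written as they are, with no decoding step.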

The preferred way is to use response.text rather than response.body.decode("utf-8"), because response.text decodes with the response's own encoding instead of assuming UTF-8. To quote the documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
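The difference between a hard-coded "utf-8" and the response's own encoding can be illustrated without Scrapy. In this sketch, `body` and `declared_encoding` are hypothetical stand-ins for response.body and response.encoding on a page served as Latin-1; they are assumptions for illustration, not values from a real response.

```python
# Stand-in for response.body from a page served as ISO-8859-1 (Latin-1).
body = "café".encode("latin-1")

# Stand-in for response.encoding, which Scrapy works out for each response.
declared_encoding = "latin-1"

try:
    body.decode("utf-8")  # hard-coded guess that does not match the page
    utf8_ok = True
except UnicodeDecodeError:
    # The byte 0xE9 is not valid UTF-8 on its own, so decoding fails.
    utf8_ok = False

# Decoding with the declared encoding recovers the original text,
# which is what response.text is documented to do.
text = body.decode(declared_encoding)
```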

Taking the answers above into account, and making the code more Pythonic by using the with statement, the example can be rewritten as:

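import scrapy

# Placeholder output directories (not defined in the original snippet); adjust as needed.
html_path = './'
header_path = './'


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first.
        super().__init__(*args, **kwargs)
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # response.text is already a str, so text mode ('w') works.
        # An explicit encoding avoids depending on the platform default.
        with open(self.path_to_html, 'w', encoding='utf-8') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }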

Note that html_file is then only accessible inside the parse method.
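If several callbacks need to save pages, one option is to move the file handling into a small function instead of sharing an open file object. The sketch below is an assumption rather than part of the answers above: `save_page` is a hypothetical helper, and it writes the bytes in binary mode, so response.body could be passed to it without decoding.

```python
import tempfile
from pathlib import Path


def save_page(path, body: bytes) -> Path:
    """Write raw bytes to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)  # binary write: no encoding step involved
    return path


# Usage with a stand-in for response.body and a throwaway directory.
out_dir = Path(tempfile.mkdtemp())
saved_path = save_page(out_dir / "pages" / "index.html", b"<html></html>")
```

Inside a spider, any callback could then call save_page(self.path_to_html, response.body). Saving the raw bytes keeps the file identical to what the server sent, while response.text remains the better choice when the content is processed as a string.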
