Scrapy text encoding

In Scrapy 1.2.0 a new setting, FEED_EXPORT_ENCODING, was introduced. By setting it to utf-8, non-ASCII characters in the JSON output will not be escaped.

That is, add this to your settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
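
With that setting in place, the built-in feed exporter writes readable UTF-8 directly, e.g. (the spider name vrisko is only a guess based on the item names in this question):

scrapy crawl vrisko -o results.json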

Scrapy returns strings as unicode, not ASCII. To encode all strings to utf-8, you can write:

vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]

But I think you expect a different result. Your code returns one item with all the search results. To yield a separate item for each result:

# inside your spider's parse() method
hxs = HtmlXPathSelector(response)
# pair each name with the address at the same position on the results page
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit
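
For completeness, the VriskoItem used above would be declared along these lines in your project's items.py (a minimal sketch: only the field names come from the snippets above, the rest is assumed):

from scrapy.item import Item, Field

class VriskoItem(Item):
    eponimia = Field()
    address = Field()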

Update

The JSON exporter writes unicode symbols escaped (e.g. \u03a4) by default, because not all streams can handle unicode. There is an option to write them as unicode, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
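
To see the difference, compare the two calls below (a standalone sketch; the Greek value is just an example):

import json

data = {'name': u'Αθήνα'}
print(json.dumps(data))                      # {"name": "\u0391\u03b8\u03ae\u03bd\u03b1"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "Αθήνα"}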

So if you want the exported items to be written in utf-8 encoding, e.g. so you can read them in a text editor, you can write a custom item pipeline.

pipelines.py:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open gives a file object that encodes the unicode lines we write as utf-8
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters unescaped in the output
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close_spider (not spider_closed) is the hook Scrapy calls on item
        # pipelines when the spider finishes
        self.file.close()

Don't forget to add this pipeline to settings.py:

ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
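
In newer Scrapy versions (0.25 and later) ITEM_PIPELINES is a dict that maps the pipeline class path to an order number between 0 and 1000, so there the line becomes:

ITEM_PIPELINES = {'vrisko.pipelines.JsonWithEncodingPipeline': 300}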

You can customize the pipeline to write data in a more human-readable format, e.g. generate a formatted report. JsonWithEncodingPipeline is just a basic example.
