Why are my input/output processors in Scrapy not working?
The documentation says: "However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata."
I suspect the documentation is misleading or out of date here: according to the source code, the input_processor field attribute is read only inside an ItemLoader instance, which means that you need to use an Item Loader anyway.
You can use the built-in one and leave your DmozItem definition as is:
    import scrapy
    from scrapy.loader import ItemLoader


    class DmozSpider(scrapy.Spider):
        # ...

        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                # The loader reads the processors declared in DmozItem's Field metadata
                loader = ItemLoader(DmozItem(), selector=sel)
                loader.add_xpath('title', 'a/text()')
                loader.add_xpath('link', 'a/@href')
                loader.add_xpath('desc', 'text()')
                yield loader.load_item()
This way the input_processor and output_processor Item Field arguments are taken into account and the processors are applied.
Or you can define the processors inside a custom Item Loader instead of the Item class:
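    import scrapy
    # On older Scrapy releases, import these from scrapy.loader.processors instead.
    from itemloaders.processors import Join, MapCompose
    from scrapy.loader import ItemLoader


    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()


    class MyItemLoader(ItemLoader):
        desc_in = MapCompose(
            lambda x: ' '.join(x.split()),  # collapse runs of whitespace
            lambda x: x.upper(),            # upper-case the text
        )
        desc_out = Join()  # join the processed values with a space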
And use it to load items in your spider:
    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = MyItemLoader(DmozItem(), selector=sel)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()
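To see what the desc_in and desc_out processors of MyItemLoader do to the extracted strings, here is a rough plain-Python approximation. The helpers map_compose and join are hypothetical stand-ins, not Scrapy's implementation: they ignore loader context and the flattening of iterable results, and the sample input is invented.

```python
def map_compose(*functions):
    """Rough stand-in for MapCompose: apply each function to every value in turn."""
    def process(values):
        for func in functions:
            results = []
            for value in values:
                result = func(value)
                if result is not None:  # None results are dropped
                    results.append(result)
            values = results
        return values
    return process


def join(values, separator=' '):
    """Rough stand-in for Join(): concatenate the values with a separator."""
    return separator.join(values)


desc_in = map_compose(
    lambda x: ' '.join(x.split()),  # collapse runs of whitespace
    lambda x: x.upper(),            # upper-case the text
)

# Made-up example of what 'text()' might return for one <li>
raw = ['\n  Some   description \t text ', ' second   part ']
cleaned = desc_in(raw)   # input processor runs on each extracted value
result = join(cleaned)   # output processor runs once, when the item is loaded
print(result)            # SOME DESCRIPTION TEXT SECOND PART
```

The input processor runs every time values are added for a field, and the output processor runs once over the collected values when load_item() is called.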