How can I use multiple requests and pass items between them in Scrapy (Python)?
All of the answers provided do have their pros and cons. I'm just adding an extra one to demonstrate how this has been simplified due to changes in the codebase (both Python & Scrapy). We no longer need to use meta and can instead use cb_kwargs (i.e. keyword arguments to pass to the callback function).
So instead of doing this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp",
                    callback=self.parseDescription2, meta={'item': item})]
...
We can do this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    yield response.follow("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1,
                          cb_kwargs={"item": Item()})

def parseDescription1(self, response, item):
    item['desc1'] = "More data from this new response"
    yield response.follow("http://www.example.com/lin2.cpp",
                          callback=self.parseDescription2,
                          cb_kwargs={'item': item})
...
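For completeness, the end of such a chain would simply yield the finished item from the last callback. A minimal sketch of that final step (parseDescription2 and the desc2 field are assumptions taken from the question's code, not from the answer above):

def parseDescription2(self, response, item):
    # Final step of the chain: fill the remaining field and hand the
    # completed item over to the item pipelines.
    item['desc2'] = "Data from the final response"
    yield item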
And if, for some reason, you have multiple links that you want to process with the same function, you can swap
yield response.follow(a_single_url,
                      callback=some_function,
                      cb_kwargs={"data": to_pass_to_callback})

with

yield from response.follow_all([many, urls, to, parse],
                               callback=some_function,
                               cb_kwargs={"data": to_pass_to_callback})
The accepted answer returns a total of three items [with desc(i) set for i=1,2,3].

If you want to return a single item, Dave McLain's answer does work; however, it requires parseDescription1, parseDescription2, and parseDescription3 to all succeed and run without errors in order to return the item.

For my use case, some of the sub-requests MAY return HTTP 403/404 errors at random, so I lost some of the items even though I could have scraped them partially.
Workaround
Thus, I currently employ the following workaround: instead of only passing the item around in the request.meta dict, pass around a call stack that knows which request to call next. The dispatcher calls the next target on the stack (as long as the stack isn't empty) and yields the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method upon errors and simply continue with the next stack item.
def callnext(self, response):
    ''' Call the next target for the item loader, or yield the item if the stack is empty. '''

    # Get the meta dict from the request. This also works when this method
    # is used as an errback, where it receives a Failure (which exposes the
    # request via .request but has no .meta shortcut of its own).
    meta = response.request.meta

    # Targets remaining on the stack? Execute the next one.
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):
    # Recover the item (loader)
    l = response.meta['loader']

    # Use it just as before
    l.add_css(...)

    # Build the call stack and store it in meta so callnext() can find it
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3},
    ]
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):
    # Recover the item (loader)
    l = response.meta['loader']

    # Use it just as before
    l.add_css(...)

    return self.callnext(response)

def parseDescription3(self, response):
    # ...
    return self.callnext(response)
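The workaround above assumes that the very first request already carries the ItemLoader in its meta dict; that initial setup isn't shown in the answer. A minimal sketch of what it could look like (MyItem, the start URL, and the elided selectors are assumptions):

# assumes: from scrapy import Request
#          from scrapy.loader import ItemLoader
def parse(self, response):
    # Seed the loader that the whole chain keeps filling in.
    l = ItemLoader(item=MyItem(), response=response)
    l.add_css(...)

    request = Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1,
                      errback=self.callnext)
    request.meta['loader'] = l
    # Start with an empty stack: if even this first request fails,
    # callnext() will just yield the partially filled item.
    request.meta['callstack'] = []
    yield request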
Warning
This solution is still sequential (the requests in the chain are issued one after another, not in parallel), and it will still fail if an exception is raised inside one of the callbacks.
For more information, check the blog post I wrote about that solution.
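If exceptions inside the callbacks are also a concern, one possible extension (not part of the original workaround) is to catch them in the callback and continue the chain anyway, so the partially filled item still gets yielded:

def parseDescription2(self, response):
    l = response.meta['loader']
    try:
        # Any extraction error here would otherwise abort the chain.
        l.add_css(...)
    except Exception:
        # Log it and carry on with whatever has been collected so far.
        self.logger.exception("parseDescription2 failed; continuing the chain")
    return self.callnext(response)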
No problem. The following is a corrected version of your code:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    item = Item()

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
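All of the snippets in this thread assume an Item class with desc1, desc2, and desc3 fields, which is never defined. A minimal sketch of such a definition:

import scrapy

class Item(scrapy.Item):
    # Fields filled in by the different callbacks.
    desc1 = scrapy.Field()
    desc2 = scrapy.Field()
    desc3 = scrapy.Field()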
In order to guarantee the ordering of the requests/callbacks, and that only one item is ultimately returned, you need to chain your requests using a form like this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return [item]
Each callback function returns an iterable of items or requests; requests are scheduled and items are run through your item pipeline.
If you return an item from each of the callbacks, you'll end up with 4 items in various states of completeness in your pipeline, but if each callback returns the next request instead, then you can guarantee the order of the requests and that you will have exactly one item at the end of execution.
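To make the difference visible, a minimal (hypothetical) logging pipeline like the one below would receive several partially filled items under the first scheme and exactly one complete item under the chained scheme (enable it via ITEM_PIPELINES in settings.py):

class LoggingPipeline:
    def process_item(self, item, spider):
        # Log whatever reaches the pipeline, then pass it along unchanged.
        spider.logger.info("Pipeline received: %r", dict(item))
        return item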