Scrapy on a schedule
Two things are worth noting first. There is usually only one Twisted reactor running, and it cannot be restarted (as you've discovered). Second, blocking calls such as `time.sleep(n)` should be avoided and replaced with asynchronous alternatives, e.g. `twisted.internet.task.deferLater(reactor, n, ...)`.
To use Scrapy effectively from within a Twisted project, use the `scrapy.crawler.CrawlerRunner` core API rather than `scrapy.crawler.CrawlerProcess`. The main difference between the two is that `CrawlerProcess` runs Twisted's reactor for you (which makes it difficult to restart the reactor), whereas `CrawlerRunner` relies on the developer to start the reactor. Here's what your code could look like with `CrawlerRunner`:
from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(QuotesSpider)
    # addCallback passes the crawl result as the first positional argument,
    # so wrap reactor.callLater in a lambda that discards it
    # (task.deferLater can be used to schedule the next run as well)
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run() # you have to run the reactor yourself