Speed up web scraper
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the `HTTPCACHE_ENABLED` setting, since it caches requests and avoids fetching them a second time. That helps on subsequent crawls and even allows offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
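Enabling the cache is a matter of a few lines in the project's `settings.py`; a minimal sketch (the values are illustrative, not recommendations):

```python
# settings.py -- minimal HTTP-cache sketch (example values, tune to taste)
HTTPCACHE_ENABLED = True          # cache every request/response pair on disk
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"       # stored under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = []  # cache responses with any status code
```

With this in place, a re-run of the same crawl is served from disk instead of the network.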
Here's a collection of things to try:
- use the latest Scrapy version (if you aren't already)
- check whether any non-standard middlewares are in use
- try increasing the `CONCURRENT_REQUESTS_PER_DOMAIN` and `CONCURRENT_REQUESTS` settings (docs)
- turn off logging with `LOG_ENABLED = False` (docs)
- try `yield`ing an item in a loop instead of collecting items into the `items` list and returning them
- use a local DNS cache (see this thread)
- check whether this site uses a download threshold and limits your download speed (see this thread)
- log CPU and memory usage during the spider run and see if there are any problems there
- try running the same spider under the scrapyd service
- see if grequests + lxml perform better (ask if you need any help implementing this solution)
- try running Scrapy on PyPy, see Running Scrapy on PyPy
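The concurrency and logging tweaks from the list are also one-liners in `settings.py`; a sketch with illustrative starting values (the Scrapy defaults are 16 and 8 respectively, so anything above that is an experiment, not a recommendation):

```python
# settings.py -- throughput-related knobs (example values, tune per target site)
CONCURRENT_REQUESTS = 32             # total parallel requests (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # parallel requests per domain (default: 8)
LOG_ENABLED = False                  # skip per-request log output entirely
```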
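To illustrate the `yield` point: a callback that yields items one at a time lets Scrapy hand each item to the pipeline as soon as it is scraped, instead of holding the whole batch in memory until the page is done. A rough sketch with hypothetical callbacks (`rows` stands in for whatever `response.css(...)` would return in a real spider):

```python
# Sketch of the refactor: build-a-list-and-return vs. yield-as-you-go.
# parse_collect / parse_yield are hypothetical names for the example.

def parse_collect(rows):
    # Before: every item stays in memory until the whole page is processed.
    items = []
    for row in rows:
        items.append({"title": row})
    return items

def parse_yield(rows):
    # After: each item is emitted as soon as it is ready, so Scrapy can
    # interleave item processing with the remaining downloads.
    for row in rows:
        yield {"title": row}
```

Both produce the same items; the generator version just keeps memory flat and starts the item pipeline earlier.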
Hope that helps.