How to handle a 429 Too Many Requests response in Scrapy?
You can modify the retry middleware to pause when it gets a 429. Put the code below in middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

import time


class TooManyRequestsRetryMiddleware(RetryMiddleware):

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            self.crawler.engine.pause()
            time.sleep(60)  # If the rate limit is renewed in a minute, put 60 seconds, and so on.
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Add 429 to the retry codes in settings.py:
RETRY_HTTP_CODES = [429]
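Note that assigning RETRY_HTTP_CODES replaces Scrapy's default list rather than appending to it, so a bare [429] stops the usual server errors from being retried. If you still want those, list them explicitly; a sketch (the exact defaults vary slightly by Scrapy version):

```python
# settings.py -- keep the usual retryable statuses and add 429
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```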
Then activate it in settings.py. Don't forget to deactivate the default retry middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
}
Wow, your scraper is going really fast, over 30,000 requests in 30 minutes. That's more than 10 requests per second.
Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.
Also, this might even be too fast for Privoxy and Tor, so these might also be candidates for those 429 replies.
Solutions:
Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you make at most 1 request per second. Then increase these values step by step and see what happens. It might sound paradoxical, but you might get more items and more 200 responses by going slower.
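As a sketch, a conservative starting point in settings.py might look like this (the exact numbers are starting points to tune, not recommendations from the original answer; AutoThrottle is Scrapy's built-in adaptive alternative):

```python
# settings.py -- conservative starting point: roughly 1 request/second
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1  # seconds of delay between requests to the same site

# Alternatively, let Scrapy adapt the delay to the server's responsiveness:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
```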
If you are scraping a big site, try rotating proxies. The Tor network might be a bit heavy-handed for this in my experience, so you might try a proxy service like Umair is suggesting.
Building upon Aminah Nuraini's answer, you can use Twisted's Deferreds to avoid breaking asynchrony by calling time.sleep():
from twisted.internet import reactor, defer
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 60  # Delay in seconds.
    MAX_DELAY = 600  # Sometimes, RETRY-AFTER has absurd values.

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
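The Retry-After clamping logic above can be exercised on its own, without Scrapy. A minimal sketch (parse_retry_after is a name I'm introducing for illustration; it is not part of the middleware):

```python
DEFAULT_DELAY = 60   # fallback when Retry-After is missing or malformed
MAX_DELAY = 600      # upper bound on how long we are willing to wait


def parse_retry_after(header_value):
    """Mirror the middleware's delay computation for a raw header value."""
    try:
        # Scrapy header values are bytes; int() accepts bytes in Python 3.
        retry_after = int(header_value)
    except (ValueError, TypeError):
        # Header absent (None) or not a plain integer (e.g. an HTTP date).
        return DEFAULT_DELAY
    return min(MAX_DELAY, retry_after)


print(parse_retry_after(b"30"))     # → 30, within bounds
print(parse_retry_after(None))      # → 60, header absent
print(parse_retry_after(b"86400"))  # → 600, absurd value clamped
```

Note that an HTTP-date form of Retry-After (e.g. "Wed, 21 Oct 2015 07:28:00 GMT") falls through to the default delay here, just as in the middleware above.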
The line await async_sleep(delay) blocks the execution of process_response until delay seconds have passed, but Scrapy is free to do other stuff in the meantime. This async/await coroutine syntax was introduced in Python 3.5, and support for it was added in Scrapy 2.0.
It's still necessary to modify settings.py as in the original answer.