How "download_slot" works within scrapy
Let's start with the Scrapy architecture. When you create a scrapy.Request, the Scrapy engine passes the request to the downloader to fetch the content. The downloader puts incoming requests into slots, which you can imagine as independent queues of requests. The queues are then polled and each individual request gets processed (the content gets downloaded).
Now, here's the crucial part. To determine which slot to put an incoming request into, the downloader checks request.meta for the download_slot key. If it's present, it puts the request into the slot with that name (and creates the slot if it doesn't exist yet). If the download_slot key is not present, it puts the request into the slot for the domain (more accurately, the hostname) that the request's URL points to.
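For illustration, here is a minimal sketch of routing requests into a custom slot by setting that meta key yourself; the spider name, slot name and URLs are made up:

import scrapy


class SlotDemoSpider(scrapy.Spider):
    name = "slot_demo"

    def start_requests(self):
        # Explicit download_slot: this request goes into a slot named
        # "author-foo", regardless of the hostname in its URL.
        yield scrapy.Request(
            "https://example.com/authors/foo",
            callback=self.parse,
            meta={"download_slot": "author-foo"},
        )
        # No download_slot key: the downloader falls back to the hostname,
        # so this request lands in the "example.com" slot.
        yield scrapy.Request("https://example.com/authors/bar", callback=self.parse)

    def parse(self, response):
        # The downloader records the chosen key back into meta in the Scrapy
        # versions I have looked at; if yours doesn't, this simply logs None.
        self.logger.info("slot used: %s", response.meta.get("download_slot"))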
This explains why your script runs faster. You create multiple downloader slots because they are based on the author's name. If you did not, the requests would all be put into the same slot based on the domain (which is always stackoverflow.com). Thus, you effectively increase the parallelism of downloading content.
This explanation is a little bit simplified but it should give you a picture of what's going on. You can check the code yourself.
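In pseudocode, the decision described above boils down to roughly the following; this is a simplified paraphrase of the downloader's logic, not the actual source:

from urllib.parse import urlparse


def get_slot_key(request):
    """Simplified paraphrase of how the downloader picks a slot name."""
    if "download_slot" in request.meta:
        # An explicit slot always wins.
        return request.meta["download_slot"]
    # Otherwise fall back to the hostname of the request URL.
    return urlparse(request.url).hostname or ""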
For example, suppose there is a target website that allows processing only 1 request per 20 seconds, and we need to parse/process 3000 webpages of product data from it.
A common spider with the DOWNLOAD_DELAY setting set to 20 will finish the work in ~17 hours (3000 pages * 20 seconds of download delay = 60,000 seconds).
If your aim is to increase scraping speed without getting banned by the website and you have, for example, 20 valid proxies, you can uniformly allocate request URLs across all your proxies using the proxy and download_slot meta keys and significantly reduce the application's completion time: with 20 slots working in parallel, the same 3000 pages take roughly 3000 / 20 * 20 seconds ≈ 50 minutes.
from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy


class ProxySpider(scrapy.Spider):
    name = 'proxy'
    start_urls = ['https://example.com/products/1', 'https://example.com/products/2', '....']  # list with 3000 product urls
    proxies = [',,,']  # list with 20 proxies

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # Spread the urls uniformly over the proxies and reuse the proxy
            # address as the download_slot name, so each proxy gets its own
            # slot (and therefore its own 20-second DOWNLOAD_DELAY).
            chosen_proxy = self.proxies[index % len(self.proxies)]
            yield Request(url, callback=self.parse,
                          meta={"proxy": chosen_proxy, "download_slot": chosen_proxy})

    def parse(self, response):
        ....
        yield item
        # Follow-up requests should reuse the same proxy/slot pair:
        # yield Request(details_url,
        #               callback=self.parse_additional_details,
        #               meta={"download_slot": response.request.meta["download_slot"],
        #                     "proxy": response.request.meta["proxy"]})


if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 'DOWNLOAD_DELAY': 20, "COOKIES_ENABLED": False
    })
    process.crawl(ProxySpider)
    process.start()
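A note on why this works: DOWNLOAD_DELAY (and the per-slot concurrency limits) are enforced per download slot, so each of the 20 proxy-named slots waits its own 20 seconds independently instead of all 3000 requests queueing behind a single example.com slot. If you want to confirm that the slots are actually being created, one quick way (purely a debugging hack that pokes at downloader internals, so treat it as version-dependent) is to log them from inside the spider's parse method:

    def parse(self, response):
        # downloader.slots is an internal dict keyed by slot name; inspecting
        # it is only for debugging and may change between Scrapy versions.
        self.logger.info("active slots: %s", list(self.crawler.engine.downloader.slots))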