Unable to use proxies one by one until there is a valid response

you need write a downloader middleware, to install a process_exception hook, scrapy calls this hook when exception raised. in the hook, you could return a new Request object, with dont_filter=True flag, to let scrapy reschedule the request until it succeeds.

in the meanwhile, you could verify response extensively in process_response hook, check the status code, response content etc., and reschedule request as necessary.

in order to change proxy easily, you should use built-in HttpProxyMiddleware, instead of tinker with environ:

request.meta['proxy'] = proxy_address

take a look at this project as an example.

As we know http response needs to pass all middlewares in order to reach spider methods.

It means that only requests with valid proxies can proceed to spider callback functions.

In order to use valid proxies we need to check ALL proxies first and after that choose only from valid proxies.

When our previously chosen proxy doesn't work anymore - we mark this proxy as not valid and choose new one from remaining valid proxies in spider errback.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http.request import Request

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"
    current_proxy = ""
    proxies = {}

    def start_requests(self):
        yield Request(self.proxy_link,callback=self.parse_proxies)

    def parse_proxies(self,response):

        for row in response.css("table#proxylisttable tbody tr"):
             if "yes" in row.extract():
                 td = row.css("td::text").extract()
                 self.proxies["http://{}".format(td[0]+":"+td[1])]={"valid":False}

        for proxy in self.proxies.keys():
             yield Request(self.check_url,callback=self.parse,errback=self.errback_httpbin,
                           meta={"proxy":proxy,
                                 "download_slot":proxy},
                           dont_filter=True)

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            #As script reaches this parse method we can mark current proxy as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            print(response.meta.get("proxy"))
            if not self.current_proxy:
                #Scraper reaches this code line on first valid response
                self.current_proxy = response.request.meta["proxy"]
                #yield Request(next_url, callback=self.parse_next,
                #              meta={"proxy":self.current_proxy,
                #                    "download_slot":self.current_proxy})

    def errback_httpbin(self, failure):
        if "proxy" in failure.request.meta.keys():
            proxy = failure.request.meta["proxy"]
            if proxy == self.current_proxy:
                #If current proxy after our usage becomes not valid
                #Mark it as not valid
                self.proxies[proxy]["valid"] = False
                for ip_port in self.proxies.keys():
                    #And choose valid proxy from self.proxies
                    if self.proxies[ip_port]["valid"]:
                        failure.request.meta["proxy"] = ip_port
                        failure.request.meta["download_slot"] = ip_port
                        self.current_proxy = ip_port
                        return failure.request
        print("Failure: "+str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'COOKIES_ENABLED': False,
        'DOWNLOAD_TIMEOUT' : 10,
        'DOWNLOAD_DELAY' : 3,
    })
    c.crawl(ProxySpider)
    c.start()

Unable to use proxies one by one until there is a valid response

Tags:

Python

Web Scraping

Proxy

Scrapy

Scrapy Spider

Related

Recent Posts