Scrapy and proxies
From the Scrapy FAQ:
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to use an HTTPS proxy and visit HTTPS sites, set the environment variable https_proxy instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
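If you launch the crawl from a script rather than a shell, the same variable can be set from Python before the crawler starts, since HttpProxyMiddleware reads the proxy environment variables when it is initialized. A minimal sketch, assuming a hypothetical proxy address and project layout:

import os

# Must be set before the crawler (and HttpProxyMiddleware) starts up;
# the proxy address is a hypothetical example.
os.environ['http_proxy'] = 'http://proxy.example.com:8080'

from scrapy.crawler import CrawlerProcess
from myproject.spiders.my_spider import MySpider  # hypothetical import path

process = CrawlerProcess()
process.crawl(MySpider)
process.start()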
Single Proxy
Enable HttpProxyMiddleware in your settings.py, like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
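Note that HttpProxyMiddleware is part of Scrapy's default downloader middlewares, so in most projects it is already active without this step; the explicit entry above only forces a very early order. If you do list it explicitly, a sketch keeping it near its default priority (750 in recent Scrapy versions):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}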
Pass the proxy to the request via request.meta:

request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"
yield request
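The proxy value is a URL; if the proxy requires authentication, the credentials can be embedded in it and HttpProxyMiddleware will add the corresponding Proxy-Authorization header. A minimal sketch, assuming a hypothetical proxy host and credentials:

request = Request(url="http://example.com")
# Hypothetical authenticated proxy; user:pass are picked up by HttpProxyMiddleware
request.meta['proxy'] = "http://user:pass@proxy.example.com:8080"
yield request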
Multiple Proxies
You can also choose a proxy address at random if you have an address pool, like this:
import random

from scrapy import Request, Spider  # BaseSpider in very old Scrapy versions


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Pool of proxy URLs, e.g. 'http://host1:port', 'http://host2:port'
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        # ... parse code ...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # Pick a proxy at random for this request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
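As written, the random proxy is only applied to the follow-up requests yielded from parse; the initial requests for start_urls go out without a proxy. One way to cover them as well is to route the start requests through the same helper. A minimal sketch, assuming the spider defines start_urls:

    def start_requests(self):
        # Route the initial requests through the proxy-picking helper too
        for url in self.start_urls:
            yield self.get_request(url)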