Scrapy with Privoxy and Tor: how to renew IP
But Tor connects with the same IP every time
That is a documented Tor feature:
An important thing to note is that a new circuit does not necessarily mean a new IP address. Paths are randomly selected based on heuristics like speed and stability. There are only so many large exits in the Tor network, so it's not uncommon to reuse an exit you have had previously.
That's why the code below can end up reusing the same IP address.
from stem import Signal
from stem.control import Controller

# Ask Tor for a new circuit through its control port.
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='tor_password')
    controller.signal(Signal.NEWNYM)
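Because a new circuit may exit through the same relay, one way to handle this is to compare the exit IP before and after the signal and retry when it has not changed. The following is a minimal sketch of that idea, not code from stem or TorIpChanger: `renew_ip`, `get_current_ip`, and `send_newnym` are hypothetical names, and the two callables stand in for your own IP lookup through the proxy and for the `NEWNYM` call above. The 10-second default pause is an assumption about how long Tor needs before a new circuit is usable.

```python
import time


def renew_ip(get_current_ip, send_newnym, max_attempts=5,
             wait_seconds=10.0, sleep=time.sleep):
    """Request new Tor circuits until the exit IP differs from the old one.

    get_current_ip and send_newnym are hypothetical callables supplied by
    the caller: one returns the current exit IP as a string, the other
    sends Signal.NEWNYM through a stem Controller.
    """
    old_ip = get_current_ip()
    for _ in range(max_attempts):
        send_newnym()
        # Give Tor time to build the new circuit (assumed delay).
        sleep(wait_seconds)
        new_ip = get_current_ip()
        if new_ip != old_ip:
            return new_ip
    raise RuntimeError(
        "exit IP is still %s after %d attempts" % (old_ip, max_attempts)
    )
```

Capping the number of attempts keeps a crawl from stalling indefinitely when Tor keeps picking the same exit.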
TorIpChanger (https://github.com/DusanMadar/TorIpChanger) helps you manage this behavior. Disclaimer: I wrote TorIpChanger.
I've also put together a guide on how to use Python with Tor and Privoxy: https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b.
Here's an example of how you can use `TorIpChanger` (`pip install toripchanger`) in your `ProxyMiddleware`.
from toripchanger import TorIpChanger

# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Get a new Tor IP before every request.
        ip_changer.get_new_ip()
        # Route the request through the local Privoxy instance.
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
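To make the `reuse_threshold` comment concrete, here is a rough sketch of the bookkeeping such a threshold implies: remember the last N IPs and reject a candidate that is still inside that window. This is an illustration only, not TorIpChanger's actual implementation; `RecentIps` and its methods are hypothetical names.

```python
from collections import deque


class RecentIps(object):
    """Remember the most recently used IPs (hypothetical helper)."""

    def __init__(self, reuse_threshold=10):
        # A deque with maxlen drops the oldest IP once the window is full.
        self._used = deque(maxlen=reuse_threshold)

    def is_usable(self, ip):
        """Return True if ip is not among the last reuse_threshold IPs."""
        return ip not in self._used

    def mark_used(self, ip):
        """Record ip as the most recently used one."""
        self._used.append(ip)
```

With a threshold of 2, for example, an IP becomes usable again once two other IPs have been marked as used after it.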
Or, if you want to switch to a different IP after every 10 requests, you can do something like the following.
from toripchanger import TorIpChanger

# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        self._requests_count += 1
        # Rotate once 10 requests have gone out on the current IP.
        if self._requests_count > 10:
            # The current request is the first one on the new IP.
            self._requests_count = 1
            ip_changer.get_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
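For either version to take effect, Scrapy has to be told about the middleware. A possible `settings.py` entry is sketched below. The module path `myproject.middlewares` is a placeholder for wherever your `ProxyMiddleware` lives, and the priority 740 is just one choice below 750, which to my knowledge is the default priority of Scrapy's built-in `HttpProxyMiddleware`.

```python
# settings.py (sketch): register the custom middleware.
# "myproject.middlewares" is a placeholder module path.
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run earlier in process_request, so the proxy is set
    # before the built-in HttpProxyMiddleware sees the request.
    "myproject.middlewares.ProxyMiddleware": 740,
}
```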
This blog post might help you a bit as it deals with the same issue.
EDIT: Depending on the concrete requirement (a new IP for each request or after N requests), put the appropriate call to `_set_new_ip` in the `process_request` method of the middleware. Note, however, that a call to `_set_new_ip` does not always guarantee a new IP (there's a link to the FAQ with an explanation).
EDIT2: The module with the `ProxyMiddleware` class would look like this:
from stem import Signal
from stem.control import Controller


def _set_new_ip():
    # Ask Tor for a new circuit through its control port.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        # Route the request through the local Privoxy instance.
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
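Calling `_set_new_ip()` on every request sends `NEWNYM` more often than Tor is likely to honor, since Tor rate-limits that signal. A rough sketch of a throttle that skips the call when the previous one was too recent is shown below. `Throttle` is a hypothetical helper, and the 10-second interval is an assumption; stem's `Controller.get_newnym_wait()` can report the actual remaining wait.

```python
import time


class Throttle(object):
    """Run an action at most once per min_interval seconds (hypothetical)."""

    def __init__(self, action, min_interval=10.0, clock=time.monotonic):
        self._action = action
        self._min_interval = min_interval
        self._clock = clock
        self._last_run = None

    def __call__(self):
        """Run the action if enough time has passed; return True if it ran."""
        now = self._clock()
        too_soon = (
            self._last_run is not None
            and now - self._last_run < self._min_interval
        )
        if too_soon:
            return False
        self._last_run = now
        self._action()
        return True
```

In the middleware you could create `set_new_ip = Throttle(_set_new_ip)` once at module level and call `set_new_ip()` inside `process_request` instead of calling `_set_new_ip()` directly.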