Python selenium multiprocessing
The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers, thus leaving the possibility of Chrome processes hanging around. I would make the following changes:
- Instead, use a class `Driver` that creates the driver instance and stores it in thread-local storage, but that also has a destructor to `quit` the driver when the thread-local storage is deleted:
```python
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')
```
`create_driver` now becomes:
```python
threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
```
- Finally, after you have no further use for the `ThreadPool` instance but before it is terminated, add the following lines to delete the thread-local storage and force the `Driver` instances' destructors to be called (hopefully); a sketch of where this fits in the calling code follows the snippet:
```python
del threadLocal
import gc
gc.collect()  # a little extra insurance
```
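To show where those lines go, here is a rough sketch of the calling code. The `get_title` task function, the URL list, and the pool size of 4 are placeholders added for illustration; only the cleanup ordering (delete the thread-local storage, collect, then shut down the pool) reflects the advice above.

```python
import gc
from multiprocessing.pool import ThreadPool

# ... Driver class and create_driver() as defined above ...

def get_title(url):
    # placeholder task: each worker thread reuses its own driver
    driver = create_driver()
    driver.get(url)
    return driver.title

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs
    pool = ThreadPool(4)  # 4 worker threads, chosen arbitrarily
    titles = pool.map(get_title, urls)
    # No further use for the pool: drop the thread-local storage so the
    # Driver destructors run and every Chrome instance is quit ...
    del threadLocal
    gc.collect()  # a little extra insurance
    # ... and only then terminate the pool.
    pool.close()
    pool.join()
```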
How can I reduce the execution time when Selenium is run using multiprocessing?
A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:
```python
(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    (...)

(...)
```
On my system this reduces the time from 1m7s to just 24.895s, cutting the runtime by roughly 60%. To test it yourself, download the full script.
Note: `ThreadPool` uses threads, which are constrained by the Python GIL. That is fine as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a `multiprocessing.Pool` instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same; a rough sketch of the swap follows below.
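A minimal sketch of that swap, assuming the `get_title` function from the snippet above; the URL list and pool sizes are illustrative placeholders, not part of the original answer:

```python
from multiprocessing.pool import ThreadPool  # threads: all share one GIL
from multiprocessing import Pool             # processes: one GIL each

urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

if __name__ == '__main__':
    # Mostly I/O-bound work: threads are fine despite the GIL.
    with ThreadPool(4) as pool:
        titles = pool.map(get_title, urls)

    # Heavy CPU-bound post-processing: switch to processes.
    # Each process gets its own threadLocal and therefore its own driver.
    with Pool(4) as pool:
        titles = pool.map(get_title, urls)
```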
My question: how can I reduce the execution time?
Selenium seems like the wrong tool for web scraping, though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.
For scraping tasks without much interaction, I have had good results with the open-source Scrapy Python package for large-scale scraping tasks. It handles concurrent requests out of the box, it is easy to write new scripts and store the data in files or a database, and it is really fast.
Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this; see the documentation on selectors).
```python
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield {'link': title.attrib['href']}
```
To run it, put this into `blogspider.py` and run:

```
$ scrapy runspider blogspider.py
```
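If you also want the scraped links written straight to a file (one way of "storing the data in files" mentioned above), Scrapy's feed exports can do that from the command line; the output filename here is just an example:

```
$ scrapy runspider blogspider.py -o links.json
```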
See the Scrapy website for a complete tutorial.
Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I have not used it myself so far, so I can't speak to it beyond the fact that it looks well integrated with how Scrapy works.
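For completeness, a minimal sketch of what a scrapy-splash version might look like, based on the scrapy-splash README; the spider name, the `wait` argument, and the assumption that Splash is running with the scrapy-splash settings/middlewares configured are all illustrative, not taken from the answer above:

```python
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class JsBlogSpider(scrapy.Spider):
    name = 'jsblogspider'
    # assumes SPLASH_URL and the scrapy-splash middlewares are set in settings.py

    def start_requests(self):
        # render the page through Splash so JavaScript-generated content is present
        yield SplashRequest('https://stackoverflow.com/questions/tagged/web-scraping',
                            self.parse, args={'wait': 0.5})

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield {'link': title.attrib['href']}
```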