Python selenium multiprocessing
The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers, thus leaving the possibility of Chrome processes hanging around. I would make the following changes:
- Instead, use a class `Driver` that creates the driver instance and stores it in thread-local storage, but that also has a destructor to `quit` the driver when the thread-local storage is deleted:
```python
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')
```
`create_driver` now becomes:
```python
threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
```
- Finally, after you have no further use for the `ThreadPool` instance but before it is terminated, add the following lines to delete the thread-local storage and force the `Driver` instances' destructors to be called (hopefully); a sketch of where this fits in the calling code follows the snippet:
```python
del threadLocal
import gc
gc.collect()  # a little extra insurance
```
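To show where those lines go, here is a rough sketch of the calling code. The `get_title` task function, the URL list, and the pool size of 4 are placeholders added for illustration; only the cleanup ordering (delete the thread-local storage, collect, then shut down the pool) reflects the advice above.

```python
import gc
from multiprocessing.pool import ThreadPool

# ... Driver class and create_driver() as defined above ...

def get_title(url):
    # placeholder task: each worker thread reuses its own driver
    driver = create_driver()
    driver.get(url)
    return driver.title

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs
    pool = ThreadPool(4)  # 4 worker threads, chosen arbitrarily
    titles = pool.map(get_title, urls)
    # No further use for the pool: drop the thread-local storage so the
    # Driver destructors run and every Chrome instance is quit ...
    del threadLocal
    gc.collect()  # a little extra insurance
    # ... and only then terminate the pool.
    pool.close()
    pool.join()
```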
How can I reduce the execution time when Selenium is run using multiprocessing?
A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:
```python
(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    (...)

(...)
```
On my system this reduces the time from 1m7s to just 24.895s, cutting the runtime by roughly 60%. To test it yourself, download the full script.
Note: `ThreadPool` uses threads, which are constrained by the Python GIL. That is fine as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a `multiprocessing.Pool` instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same; a rough sketch of the swap follows below.
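A minimal sketch of that swap, assuming the `get_title` function from the snippet above; the URL list and pool sizes are illustrative placeholders, not part of the original answer:

```python
from multiprocessing.pool import ThreadPool  # threads: all share one GIL
from multiprocessing import Pool             # processes: one GIL each

urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

if __name__ == '__main__':
    # Mostly I/O-bound work: threads are fine despite the GIL.
    with ThreadPool(4) as pool:
        titles = pool.map(get_title, urls)

    # Heavy CPU-bound post-processing: switch to processes.
    # Each process gets its own threadLocal and therefore its own driver.
    with Pool(4) as pool:
        titles = pool.map(get_title, urls)
```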
My question: how can I reduce the execution time?
Selenium seems like the wrong tool for web scraping, though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.
For scraping tasks without much interaction, I have had good results with the open-source Scrapy Python package for large-scale scraping tasks. It handles concurrent requests out of the box, it is easy to write new scripts and store the data in files or a database, and it is really fast.
Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this; see the documentation on selectors).
```python
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield {'link': title.attrib['href']}
```
To run it, put this into `blogspider.py` and run:

```
$ scrapy runspider blogspider.py
```
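If you also want the scraped links written straight to a file (one way of "storing the data in files" mentioned above), Scrapy's feed exports can do that from the command line; the output filename here is just an example:

```
$ scrapy runspider blogspider.py -o links.json
```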
See the Scrapy website for a complete tutorial.
Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I have not used it myself so far, so I can't speak to it beyond the fact that it looks well integrated with how Scrapy works.
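For completeness, a minimal sketch of what a scrapy-splash version might look like, based on the scrapy-splash README; the spider name, the `wait` argument, and the assumption that Splash is running with the scrapy-splash settings/middlewares configured are all illustrative, not taken from the answer above:

```python
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class JsBlogSpider(scrapy.Spider):
    name = 'jsblogspider'
    # assumes SPLASH_URL and the scrapy-splash middlewares are set in settings.py

    def start_requests(self):
        # render the page through Splash so JavaScript-generated content is present
        yield SplashRequest('https://stackoverflow.com/questions/tagged/web-scraping',
                            self.parse, args={'wait': 0.5})

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield {'link': title.attrib['href']}
```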