Python data scraping with Scrapy

Basically, you have plenty of tools to choose from:

  • scrapy
  • beautifulsoup
  • lxml
  • mechanize
  • requests (and grequests)
  • selenium

These tools have different purposes but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page (clicking buttons, filling in forms), things get more complicated:

  • sometimes it's easy to simulate filling in and submitting a form by issuing the underlying form request directly from scrapy
  • sometimes you have to use other tools to help scrapy, such as mechanize or selenium
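
To make the first case concrete, here is a minimal sketch of what "issuing the underlying form request" amounts to. It uses only the standard library rather than scrapy, so it shows the idea and not scrapy's own mechanism. The HTML snippet, the field names, and the example.com URL are made up for illustration, and HiddenInputCollector and build_form_post are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_post(page_html, action_url, extra_fields):
    """Build (but do not send) the POST request a browser would issue for the form."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    # Hidden fields (for example anti-forgery tokens) are sent back unchanged;
    # extra_fields are the values a user would have typed or selected.
    data = {**collector.fields, **extra_fields}
    body = urlencode(data).encode("ascii")
    return Request(action_url, data=body, method="POST")


# Made-up page and URL, for illustration only.
SAMPLE_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="query">
</form>
"""

request = build_form_post(SAMPLE_HTML, "http://example.com/search", {"query": "deeds"})
sent_fields = parse_qs(request.data.decode("ascii"))
```

Inside a spider, scrapy's FormRequest plays the role of build_form_post: you pass it the form fields and a callback, and scrapy sends the POST for you.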

If you make your question more specific, it'll help to understand what kind of tools you should use or choose from.

Take a look at an example of an interesting scrapy and selenium mix. Here, selenium's task is to click the button and provide data for the scrapy items:

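import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class ElyseAvenueItem(scrapy.Item):
    name = scrapy.Field()


class ElyseAvenueSpider(scrapy.Spider):
    name = "elyse"
    allowed_domains = [""]  # domain omitted here
    start_urls = [""]  # start URL omitted here

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same page in a real browser so selenium can interact with it
        self.driver.get(response.url)

        # find_elements returns an empty list (instead of raising) when nothing matches
        buttons = self.driver.find_elements(By.XPATH, "//input[contains(@class,'btn go-btn')]")
        if buttons:
            buttons[0].click()
            # Crude wait for the page to update after the click;
            # the 10-second value is an assumption
            time.sleep(10)

        plans = self.driver.find_elements(By.CLASS_NAME, "plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element(By.CLASS_NAME, 'primary').text
            yield item

    def closed(self, reason):
        # Called by scrapy when the spider finishes; shut the browser down
        self.driver.quit()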



Here's an example of how to use scrapy in your case:

import scrapy


class AcrisItem(scrapy.Item):
    borough = scrapy.Field()
    block = scrapy.Field()
    doc_type_name = scrapy.Field()


class AcrisSpider(scrapy.Spider):
    name = "acris"
    allowed_domains = [""]
    start_urls = ['']

    def parse(self, response):
        # Every <option> of the document type drop-down is one search to run
        document_classes = response.xpath('//select[@name="combox_doc_doctype"]/option')

        # Anti-forgery token that has to be posted back with the form
        form_token = response.xpath('//input[@name="__RequestVerificationToken"]/@value').get()
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.xpath('.//@value').get()
                doc_type_name = document_class.xpath('.//text()').get()
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': '', }
                # Submit the search form directly; the form action URL is omitted here
                yield scrapy.FormRequest(url="",
                                         formdata=formdata,
                                         callback=self.parse_page,
                                         meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        rows = response.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.xpath('.//td[2]/div/font/text()').get()
            block = row.xpath('.//td[3]/div/font/text()').get()

            # Skip rows that do not carry both values
            if borough and block:
                item['borough'] = borough
                item['block'] = block
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it to a file and run it via scrapy runspider <file> -o output.json; output.json will then contain:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}

Hope that helps.

If you simply want to submit the form and extract data from the resulting page, I'd go for:

  • requests to send the post request
  • beautiful soup to extract chosen data from the result page
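
A minimal sketch of that combination follows. The URL, the form field, and the table layout in SAMPLE_RESULT are invented for illustration, and fetch_result_page and extract_rows are hypothetical helper names; a real page will need its own selectors.

```python
import requests
from bs4 import BeautifulSoup


def fetch_result_page(url, form_fields):
    """POST the form fields and return the response body as text."""
    response = requests.post(url, data=form_fields, timeout=30)
    response.raise_for_status()
    return response.text


def extract_rows(result_html):
    """Return the cell texts of every table row that has <td> cells."""
    soup = BeautifulSoup(result_html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows only contain <th>, so they are skipped
            rows.append(cells)
    return rows


# Invented result page, standing in for what the server would return.
SAMPLE_RESULT = """
<table>
  <tr><th>Borough</th><th>Block</th></tr>
  <tr><td>BRONX</td><td>2345</td></tr>
  <tr><td>QUEENS</td><td>101</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_RESULT)

# Against a real site (placeholder URL and field name):
# html = fetch_result_page("http://example.com/search", {"query": "deeds"})
# rows = extract_rows(html)
```

Keeping the fetch and the parsing in separate functions lets you check the extraction against a saved copy of the page before sending any real requests.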

Scrapy's added value lies in its ability to follow links and crawl a website; I don't think it is the right tool for the job if you know precisely what you are searching for.

I would personally use mechanize, as I do not have any experience with scrapy. However, a library named scrapy, purpose-built for screen scraping, should be up to the task. I would just have a go with both of them and see which does the job best or most easily.