Scrapy, scraping data inside JavaScript

(I posted this to the scrapy-users mailing list, but at Paul's suggestion I'm posting it here as well, since it complements the other answer with the shell-command interaction.)

Generally, websites that use a third-party service to render a data visualization (map, table, etc.) have to send the data to it somehow, and in most cases that data is accessible from the browser.

In this case, inspecting the requests made by the browser shows that the data is loaded by a POST request to https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php

So, basically, you have all the data you want right there, in a nice JSON format ready for consumption.
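If you want to confirm that before involving Scrapy at all, the same POST can be reproduced with any HTTP client. Here is a quick sketch using the requests library (the payload values are the ones the browser sends, shown further below):

import requests

url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
payload = {'action': 'ws_search_store_location', 'store_name': '0', 'store_area': '0', 'store_type': '0'}

# send the same form data the page's JavaScript sends and decode the JSON reply
resp = requests.post(url, data=payload)
print(resp.json()['stores']['listing'][0]['name'])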

Scrapy provides the shell command, which is very convenient for tinkering with the website before writing the spider:

$ scrapy shell https://www.mcdonalds.com.sg/locate-us/
2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
...

In [1]: from scrapy.http import FormRequest

In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}

In [4]: req = FormRequest(url, formdata=payload)

In [5]: fetch(req)
2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
...

In [6]: import json

In [7]: data = json.loads(response.body)

In [8]: len(data['stores']['listing'])
Out[8]: 127

In [9]: data['stores']['listing'][0]
Out[9]: 
{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
 u'city': u'Singapore',
 u'id': 78,
 u'lat': u'1.440409',
 u'lon': u'103.801489',
 u'name': u"McDonald's Admiralty",
 u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
 u'phone': u'68940513',
 u'region': u'north',
 u'type': [u'24hrs', u'dessert_kiosk'],
 u'zip': u'731678'}

In short: in your spider you return the FormRequest(...) above; then, in the callback, you load the JSON object from response.body and, for each store in the list data['stores']['listing'], create an item with the values you want.

Something like this:

import json

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

# McDonaldsItem normally lives in your project's items.py; a minimal
# definition is sketched below. Adjust the import path to your project.
from myproject.items import McDonaldsItem


class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse(self, response):
        # This receives the response from the start url. We don't extract anything
        # from it; it only serves as the entry point to send the POST request.
        url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
        payload = {'action': 'ws_search_store_location', 'store_name': '0', 'store_area': '0', 'store_type': '0'}
        return FormRequest(url, formdata=payload, callback=self.parse_stores)

    def parse_stores(self, response):
        # The response body is the JSON document explored in the shell session above.
        data = json.loads(response.body)
        for store in data['stores']['listing']:
            yield McDonaldsItem(name=store['name'], address=store['address'])
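McDonaldsItem isn't shown above; it is a plain Scrapy item that would normally be defined in your project's items.py. A minimal definition covering just the fields the spider fills in could look like this (add whatever other fields you want to keep):

from scrapy.item import Item, Field


class McDonaldsItem(Item):
    # only the fields used in parse_stores; extend as needed
    name = Field()
    address = Field()

With that in place you can run the spider from the project directory with something like scrapy crawl mcdonalds -o stores.json -t json and get every store dumped to a file.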

When you open https://www.mcdonalds.com.sg/locate-us/ in your browser of choice, open the developer tools (Chrome and Firefox both have them) and look at the "Network" tab.

You can further filter for "XHR" (XMLHttpRequest) events, and you'll see a POST request to https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php with this body:

action=ws_search_store_location&store_name=0&store_area=0&store_type=0

The response to that POST request is a JSON object with all the information you want, which you can load in your callback:

import json
import pprint
...
class MySpider(BaseSpider):
...
    def parse_json(self, response):
        js = json.loads(response.body)
        pprint.pprint(js)

This would output something like:

{u'flagicon': u'https://www.mcdonalds.com.sg/wp-content/themes/mcd/images/storeflag.png',
 u'stores': {u'listing': [{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
                           u'city': u'Singapore',
                           u'id': 78,
                           u'lat': u'1.440409',
                           u'lon': u'103.801489',
                           u'name': u"McDonald's Admiralty",
                           u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
                           u'phone': u'68940513',
                           u'region': u'north',
                           u'type': [u'24hrs', u'dessert_kiosk'],
                           u'zip': u'731678'},
                          {u'address': u'383 Bukit Timah Road<br/>#01-09B<br/>Alocassia Apartments<br/>Singapore 259727',
                           u'city': u'Singapore',
                           u'id': 97,
                           u'lat': u'1.319752',
                           u'lon': u'103.827398',
                           u'name': u"McDonald's Alocassia",
                           u'op_hours': u'Daily: 0630-0100',
                           u'phone': u'68874961',
                           u'region': u'central',
                           u'type': [u'24hrs_weekend',
                                     u'drive_thru',
                                     u'mccafe'],
                           u'zip': u'259727'},

                        ...
                          {u'address': u'60 Yishuan Avenue 4 <br/>#01-11<br/><br/>Singapore 769027',
                           u'city': u'Singapore',
                           u'id': 1036,
                           u'lat': u'1.423924',
                           u'lon': u'103.840628',
                           u'name': u"McDonald's Yishun Safra",
                           u'op_hours': u'24 hours',
                           u'phone': u'67585632',
                           u'region': u'north',
                           u'type': [u'24hrs',
                                     u'drive_thru',
                                     u'live_screening',
                                     u'mccafe',
                                     u'bday_party'],
                           u'zip': u'769027'}],
             u'region': u'all'}}

I'll leave you to extract the fields you want.
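One piece not shown above: parse_json is just the callback, so the spider still has to issue the POST request somewhere, typically from parse, much as in the first snippet. A sketch, assuming the same payload (it needs from scrapy.http import FormRequest):

    def parse(self, response):
        # reproduce the XHR the browser makes and hand the reply to parse_json
        payload = {'action': 'ws_search_store_location', 'store_name': '0', 'store_area': '0', 'store_type': '0'}
        return FormRequest('https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php',
                           formdata=payload, callback=self.parse_json)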

In the FormRequest() you send with Scrapy, you probably need to add an "X-Requested-With: XMLHttpRequest" header (your browser sends it; you can see it among the request headers in the inspect tool).
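If the server does turn out to require it, the header can be passed straight into the request; for example, building on the parse sketch above:

        return FormRequest('https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php',
                           formdata=payload,
                           headers={'X-Requested-With': 'XMLHttpRequest'},
                           callback=self.parse_json)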