How to scrape data from a Google Map (in Flash)?
A quick look at the page in Firebug's network calls shows where they are pulling the data from. It seems to be a couple of XML files, namely:
http://graphics8.nytimes.com/packages/xml/map_feed_victims.txt?c=2182
and
http://graphics8.nytimes.com/packages/xml/map_feed_incidents.txt?c=2182
+1 to @ericoneal's answer, but for the sake of noting an alternative approach, you could also download and install Fiddler. Fiddler routes your port-80 traffic through a proxy and gives you an interface for poking around in the HTTP responses to your web requests.
I'll describe the usage. In the screenshot, I just launched Fiddler, then opened your link in IE. All the data starts streaming in without my doing anything else. Once it had settled, I clicked on one of the responses at left (map_feed_incidents.txt, as noted by Eric), then selected Inspectors at top-right. The pane at bottom-right provides several inspection formats. I tried a few, and the screenshot shows the TextView.
At a glance, the content appears to be line-break and tab-delimited (it's definitely not real XML). The top line specifies the file format, and every other line is an incident record. Here's the top line and first record from the _incidents file (scroll right and note the id field):
LAT:DOUBLE LONG:DOUBLE incident_date:STRING incident_time:STRING boro:STRING num_victims:INTEGER primary_motive:STRING id:INTEGER weapon:STRING light_dark:STRING year:INTEGER
40.665626 -73.909699 01/01/08 02:09 Brooklyn 1 7 gun D 2008
The lat/long is obvious. The other two files (_victims and _perpetrators) use the same approach. Here's the top line and first record from the _perps table:
incident_id:INTEGER sex:STRING race:STRING age:INTEGER
7 M B 20
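For what it's worth, here is a minimal Python sketch of how a file in this format could be parsed. It assumes the fields really are tab-separated and that the only type tags are the three seen in the headers above (DOUBLE, INTEGER, STRING). The function name parse_feed and the choice to turn empty fields into None are my own, not anything taken from the NYT feed.

```python
import csv
import io

# Type tags seen in the header lines, mapped to Python converters.
CONVERTERS = {"DOUBLE": float, "INTEGER": int, "STRING": str}


def parse_feed(text, delimiter="\t"):
    """Parse a feed whose first line holds NAME:TYPE columns.

    Returns a list of dicts, one per record line.
    Assumption: fields are tab-separated, as they appear in Fiddler.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # Split each header field such as "age:INTEGER" into (name, type tag).
    columns = []
    for field in header:
        name, _, type_tag = field.partition(":")
        columns.append((name, type_tag))

    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = {}
        for (name, type_tag), raw in zip(columns, row):
            convert = CONVERTERS.get(type_tag, str)
            # Assumption: an empty field means "no value", so store None.
            record[name] = convert(raw) if raw != "" else None
        records.append(record)
    return records


# The header and first record of the _perps file, joined with tabs.
sample = "incident_id:INTEGER\tsex:STRING\trace:STRING\tage:INTEGER\n7\tM\tB\t20\n"
print(parse_feed(sample))
# prints [{'incident_id': 7, 'sex': 'M', 'race': 'B', 'age': 20}]
```

The same function should work on the _incidents and _victims files, since the column names and types are read from each file's own top line.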
The presence of incident_id is useful. Both _victims and _perps have this column, and it relates their data back to the geo-tagged _incidents table using that table's id column.
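That relationship could be used roughly like this once the files are loaded as lists of dicts. This is only a sketch: attach_by_incident is a made-up helper name, and the sample rows are trimmed-down versions of the records quoted above, assuming the 7 in the _incidents row is its id.

```python
from collections import defaultdict


def attach_by_incident(incidents, related, key="people"):
    """Attach rows from _victims or _perps to their incident.

    Rows are matched on related["incident_id"] == incident["id"].
    Returns new incident dicts with the matches stored under `key`.
    """
    grouped = defaultdict(list)
    for row in related:
        grouped[row["incident_id"]].append(row)
    # Incidents without any related rows get an empty list.
    return [
        dict(incident, **{key: grouped.get(incident["id"], [])})
        for incident in incidents
    ]


# Hypothetical, trimmed-down rows based on the samples in this answer.
incidents = [{"id": 7, "LAT": 40.665626, "LONG": -73.909699, "boro": "Brooklyn"}]
perps = [{"incident_id": 7, "sex": "M", "race": "B", "age": 20}]

for incident in attach_by_incident(incidents, perps, key="perps"):
    print(incident["boro"], incident["LAT"], incident["LONG"], incident["perps"])
```

Grouping the related rows first means each file is only walked once, which matters little here but keeps the join simple.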
As an aside, I have to agree with George and wonder why they included the victim's name. That seems like a major ethical oversight. While it's meaningless as a mapped attribute, I would not be surprised to see the perpetrator's name. But the victim's? At first I thought this might be an unused element in the data payload, but it's really in the map?! That's a very questionable decision, and it leads me to believe nobody is using that map. Otherwise I think some criticism would have emerged from the general public.
I don't know if you can get the exact same data from the NYC Open data repository, but here is a link.
A slightly different approach could be to try to gather the data using the New York Times API: http://prototype.nytimes.com/gst/apitool/index.html