Extract information from HTML using CSS selectors?

Warning

This answer pertains to the original release of jSoupLink. The interface changed completely in a later version. Please see the GitHub page for the current interface.

=================================

As much as I would like to see a solution to this problem written purely in Mathematica, that is very unlikely given the scope of the problem. Instead, I would like to share a way to solve it using J/Link, in the hope that it may help someone.

J/Link, for those who don't know, is a package that comes with Mathematica. It allows you to execute Java code from within Mathematica, which means you can use any Java library to solve your problems without leaving the notebook interface. For this particular problem I will use jSoup, which is an HTML parser just like the ones mentioned in the question.

Downloading and installing the package

You can download the latest version as a zip file from here.

It is important that the files are kept in the correct folder, otherwise Mathematica will not be able to locate the Java files. Therefore, to install the package start by evaluating

FileNameJoin[{$UserBaseDirectory, "Applications"}]

in Mathematica, and unzip the downloaded zip file into the folder that it returns. Then load the package with Needs["jSoupLink`"].
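As a quick sanity check after unzipping, a sketch like the following can confirm that Mathematica sees the package. It assumes that the zip file unpacks into a folder named jSoupLink (an assumption; adjust the name if yours differs) and uses only the built-in functions FileNameJoin, DirectoryQ and FindFile.

```mathematica
(* Folder where user-installed packages live *)
appDir = FileNameJoin[{$UserBaseDirectory, "Applications"}];

(* Assumption: the zip file unpacked into a folder named "jSoupLink" *)
DirectoryQ[FileNameJoin[{appDir, "jSoupLink"}]]

(* FindFile returns the file that Needs would load, or $Failed if none is found *)
FindFile["jSoupLink`"]
```

If FindFile returns $Failed, the files are most likely in the wrong folder or nested one level too deep.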

Usage

The package contains three functions: ParseHTML, ParseHTMLString and ParseHTMLFragment. Their usage messages describe them briefly; once the package is loaded, you can view a usage message with, for example,

?jSoupLink`ParseHTML

Typically you will use ParseHTML to download HTML source code from a website and then select a few elements. From these elements you will then extract some data. The general syntax is like this:

jSoupLink`ParseHTML[
website address,
CSS selector,
data elements to extract
]

website address is any URL, for example http://mathematica.stackexchange.com. CSS selector is essentially any valid CSS3 selector; there is a list of CSS3 selectors in jSoup's documentation. data elements to extract can be almost anything contained in the elements that you have selected. Most commonly you will want to extract attributes, such as src if you have selected img elements or href if you have selected links (a elements). A few keywords are not attributes: text selects the text contained in a selected element (some text in <p>some text</p>), and html selects the HTML contained in a selected element. You can glean the complete list from the package source code, and look the keywords up in jSoup's documentation if you are not sure what they are.

Examples

Selecting images from Wikipedia

urls = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/Sweden", (* URL *)
   "table.infobox img", (* CSS selector *)
   "src" (* Attribute to retrieve *)
   ];
Partition[Import /@ urls, 2] // Grid

Example images from Wikipedia

Select headlines (both text and URL) from NYT

headlines = Rest@jSoupLink`ParseHTML[
    "http://www.nytimes.com/pages/politics/index.html",
    "h2 a, h3 a",
    {"text", "href"}
    ];
Take[headlines, 5] // TableForm

NYT headlines

Build a database with information about Swedish municipalities, using data on Wikipedia

headers = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable th",
   "text"
   ];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)
headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers; (* Remove spaces *)
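
To see what the three clean-up steps do, here is a sketch that applies them to a single made-up header string (the string is hypothetical, not taken from the Wikipedia table). The point of the clean-up is to turn each header into a name without spaces or units, so that it can later be used as a key such as #County.

```mathematica
(* Hypothetical header, for illustration only *)
header = "population density (per km2)";

(* Step 1: drop the parenthesised unit *)
header = StringReplace[header, "(" ~~ __ ~~ ")" -> ""];

(* Step 2: upper-case the character that follows each word boundary *)
header = StringReplace[header, WordBoundary ~~ x_ :> ToUpperCase[x]];

(* Step 3: remove all remaining spaces *)
header = StringReplace[header, " " -> ""]
```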

municipalities = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable td",
   "text"
   ];
municipalities = Partition[municipalities, 9]; (* Group the flat list of cells into rows of 9 columns *)

ds = Dataset@Composition[
     Map[AssociationThread],
     Map[(headers -> #) &]
     ][municipalities];
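
The Composition above first pairs the header list with each row (headers -> row) and then turns each pair into an association with AssociationThread. A minimal sketch with made-up data shows the same construction on a small scale; the header names and cell values below are hypothetical stand-ins for the scraped ones.

```mathematica
(* Hypothetical stand-ins for the scraped headers and table cells *)
toyHeaders = {"Municipality", "County"};
toyCells = {"A Municipality", "X County", "B Municipality", "Y County"};

(* One row per group of Length[toyHeaders] cells *)
toyRows = Partition[toyCells, Length[toyHeaders]];

(* Each row becomes an association keyed by the headers *)
toyDs = Dataset[AssociationThread[toyHeaders -> #] & /@ toyRows];

(* Extract one column by its key *)
toyDs[All, "County"] // Normal
```

Partitioning by Length[toyHeaders] instead of a fixed number ties the row width to the header list, which may or may not suit the real table, depending on how its header cells are laid out.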

Now, to select all municipalities that belong to Västra Götaland County, you just have to evaluate

ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal

{"Ale Municipality", "Alingsås Municipality", "Bengtsfors Municipality", "Bollebygd Municipality", ...


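If you are looking for a simple quick-fix solution, you can use Node.js with jQuery:

  1. Download and install Node.js from http://nodejs.org/
  2. Run npm install jsdom jquery (the script below requires both modules)
  3. Example: http://pastebin.com/raw.php?i=E2t9hSYu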

Code

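(function () {
  // jsdom.env is the callback-based API of older jsdom releases
  var env = require('jsdom').env,
      // first argument can be html string, filename, or url
      html = '<html><body><h1>Hello World!</h1><p class="hello">Heya Big World!</p></body></html>';

  env(html, function (errors, window) {
    // log any errors reported by jsdom
    console.log(errors);

    // bind jQuery to the jsdom window, then query it with a CSS selector
    var $ = require('jquery')(window);
    console.log($('.hello').text());
  });
}());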

Then you can call the script using something like

Import["!node script.js", "Text"]

The script should write its data to the console (standard output), which Import then returns as text. On Windows there is a slight bug, described here:

Run Command Not Executing Node
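
If the Import call returns nothing, or you also need the exit code and error output, one alternative sketch uses the built-in RunProcess (available in Mathematica 10 and later). It assumes that node is on the system path and that script.js is in the current working directory; both are assumptions about your setup.

```mathematica
(* Run the script and capture exit code, standard output and standard error *)
result = RunProcess[{"node", "script.js"}];

If[result["ExitCode"] === 0,
 (* One list element per line printed by console.log *)
 StringSplit[StringTrim[result["StandardOutput"]], {"\r\n", "\n"}],
 (* Otherwise show what Node wrote to standard error *)
 result["StandardError"]
 ]
```

Passing the command as a list of strings avoids shell quoting, at the cost of requiring the executable to be found on the path.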
