Extract information from HTML using CSS selectors?
Warning
This answer pertains to the original release of jsoupLink. The interface changed completely in a later version. Please see the Github page for the current interface.
=================================
As much as I would like to see a solution to this problem written in Mathematica, this is very unlikely given the scope of the problem. I would like to share a way to solve this using JLink, in the hope that it may help someone.
JLink, for those who don't know, is a package that comes with Mathematica. It allows you to execute Java code from within Mathematica. This means you can use any Java library out there to solve your problems without leaving the notebook interface. For this particular problem I will use jSoup, which is a parser just like the ones mentioned in the question.
Downloading and installing the package
You can download the latest version as a zip file from here.
It is important that the files are kept in the correct folder, otherwise Mathematica will not be able to locate the Java files. Therefore, to install the package start by evaluating
FileNameJoin[{$UserBaseDirectory, "Applications"}]
in Mathematica and unzip the zip file you downloaded into this folder. Then use Needs["`jSoupLink`"]
to load the package.
Usage
The package contains three functions: ParseHTML
, ParseHTMLString
and ParseHTMLFragment
. Some information about these is contained in their usage messages, which, if you have loaded the package, you can view using for example
?jSoupLink`ParseHTML
Typically you will use ParseHTML
to download HTML source code from a website and then select a few elements. From these elements you will then extract some data. The general syntax is like this:
jSoupLink`ParseHTML[
website address,
CSS selector,
data elements to extract
]
website address
is any URL, for example http://mathematica.stackexchange.com
. CSS selector
is basically any valid CSS3 selector. There is a list of CSS3 selector in jSoup's documentation. Data elements to extract
can be almost anything contained by the elements that you've selected. Most commonly you'll want to extract attributes such as src
if you've selected img
elements or href
if you've selected links (a
elements). There are a few keywords that aren't attributes such as text
to select the text contained by a selected element (some text in <p>some text</p>
) or html
to select the HTML contained by a selected element. You can glean the complete list from the package source code, and look them up in jSoup's documentation if you're not sure what they are.
Examples
Selecting images from Wikipedia
urls = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/Sweden", (* URL *)
"table.infobox img", (* CSS selector *)
"src" (* Attribute to retrieve *)
];
Partition[Import /@ urls, 2] // Grid
Select headlines (both text and URL) from NYT
headlines = Rest@jSoupLink`ParseHTML[
"http://www.nytimes.com/pages/politics/index.html",
"h2 a, h3 a",
{"text", "href"}
];
Take[headlines, 5] // TableForm
Build a database with information about Swedish municipalities, using data on Wikipedia
headers = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
"table.wikitable.sortable th",
"text"
];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)
headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers;(* Remove spaces *)
municipalities = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
"table.wikitable.sortable td",
"text"
];
municipalities = Partition[municipalities, 9];
ds = Dataset@Composition[
Map[AssociationThread],
Map[(headers -> #) &]
][municipalities];
Now if you want to select all municipalities that belong to the county Västra Götaland you just have to type
ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal
{"Ale Municipality", "Alingsås Municipality", "Bengtsfors \ Municipality", "Bollebygd Municipality", ...
If you are looking for a simple quick fix solution.
- Download http://nodejs.org/
- run
npm install jquery
- Example http://pastebin.com/raw.php?i=E2t9hSYu
Code
(function () {
var env = require('jsdom').env,
// first argument can be html string, filename, or url
html = '<html><body><h1>Hello World!</h1><p class="hello">Heya Big World!</body></html>';
env(html, function (errors, window) {
console.log(errors);
var $ = require('jquery')(window);
console.log($('.hello').text());
});
}());
Then you can call the script using something like
Import["!node script.js", "Text"]
The data should be outputted to the console. There is a slight bug listed here on Windows
Run Command Not Executing Node