How to extract information from a Wikipedia infobox?

The accepted answer is correct on all points, especially its subtext that parsing wikitext is horrible.

If, however, getting your data from Wikidata doesn't quite work for you, because (just hypothetically) you're the person trying to move data from WP to WD, I believe the format you are looking for is the parsetree. Here is what it looks like:

<...lots of other stuff omitted>
<template lineStart= "1">
   <title>Datatable TableRow</title>
   <part>
      <name>Picture         </name>
      <equals>=</equals>
      <value> Picture 2013-07-26.jpg</value>
   </part>
   <part>
      <name>Inscription    </name>
      <equals>=</equals>
      <value> This is an Inscription on visible on the image</value>
   </part>
   <part>
      <name>NS           </name>
      <equals>=</equals>
      <value> 54.0902049</value>
   </part>
   <part>
      <name>EW           </name>
      <equals>=</equals>
      <value> 12.1364164</value>
   </part>
   <part>
      <name>Region       </name>
      <equals>=</equals>
      <value> DE-MV</value>
   </part>
   <part>
      <name>Name         </name>
      <equals>=</equals>
      <value> Person, Anna</value>
   </part>
   <part>
      <name>Location          </name>
      <equals>=</equals>
      <value> Lange Stra\u00dfe&amp;nbsp;14&lt;br /&gt;&lt;small&gt;ex: Lange Stra\u00dfe&amp;nbsp;89&lt;/small&gt;</value>
   </part>
   <part>
      <name>Date </name>
      <equals>=</equals>
      <value> </value>
   </part>
</template>

Here's a URI to such a request with the MediaWiki API Sandbox. Note the list of properties that includes parsetree. I've included some other properties (including categories) just in case; you probably want to trim the list to what you actually need, to save your time and others' servers.
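For what it's worth, here is a minimal sketch of how you might request only the parsetree and then pull the name/value pairs out of the XML shown above. It assumes English Wikipedia's api.php endpoint; the function names and the trimmed SAMPLE string are made up for illustration, and with formatversion=2 the XML string should sit under parse.parsetree in the JSON response (treat that as an assumption and check your actual response).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def parsetree_url(page_title):
    """Build an action=parse URL that asks for the parse tree and nothing else."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "parsetree",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def template_params(parsetree_xml, template_title):
    """Return one {name: value} dict per <template> whose title matches."""
    root = ET.fromstring(parsetree_xml)
    found = []
    for template in root.iter("template"):
        title = "".join(template.find("title").itertext()).strip()
        if title != template_title:
            continue
        params = {}
        for part in template.findall("part"):
            name_el = part.find("name")
            # Positional parameters carry an index attribute instead of text.
            name = "".join(name_el.itertext()).strip() or name_el.get("index", "")
            # itertext() flattens nested templates inside a value to plain text.
            params[name] = "".join(part.find("value").itertext()).strip()
        found.append(params)
    return found


# Hand-trimmed sample in the shape of the parsetree shown above.
SAMPLE = (
    '<root><template lineStart="1"><title>Datatable TableRow</title>'
    "<part><name>NS           </name><equals>=</equals><value> 54.0902049</value></part>"
    "<part><name>Region       </name><equals>=</equals><value> DE-MV</value></part>"
    "</template></root>"
)

rows = template_params(SAMPLE, "Datatable TableRow")
print(parsetree_url("Some page"))
print(rows)
```

The parsing half works on any parsetree string, so you can save one response to disk and develop against that instead of hitting the servers repeatedly.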

The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the HTML article code of the article, then use a DOM parser to extract table.infobox tr[3] td / use a regex.

This is actually a really bad idea most of the time. Wikipedia's HTML code is not particularly parsing-friendly (especially infoboxes which are a system of hand-written templates), the exact structure changes from infobox to infobox, and the structure of an infobox might change over time. You might also miss out on some features that would be otherwise available, such as internationalization.

The other wrong way: trying to parse wikitext

At a glance, the wikitext of some articles looks like it's a pretty straightforward representation of the infobox:

{{ Infobox Foo
| param1 = bar
| param2 = 123
...

In reality, that's not the case. Templates are "recursive" so you might run into stuff like param1 = {{convert|10|km|mi}}; template parameters might contain complex wikitext or HTML markup; some parameters might be missing from the article wikitext and fetched by the template from a subpage or other data repository. Just finding out where a parameter starts and ends might not be a simple business if it contains other templates which have their own parameters.
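To make the last point concrete, here is a small sketch comparing a naive split on "|" with a splitter that tracks nesting depth. Both function names are made up for this illustration, and even the second one is far from a real parser: it ignores nowiki tags, HTML comments, triple-brace arguments and plenty more.

```python
def split_params_naive(body):
    """Split on every '|'. Breaks as soon as a value contains a template."""
    return [piece.strip() for piece in body.split("|")]


def split_params_nested(body):
    """Split on '|' only outside of {{...}} and [[...]] pairs."""
    pieces, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A top-level pipe ends the current parameter.
            pieces.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    pieces.append("".join(current).strip())
    return pieces


# The text between the outer {{ and }} of an infobox call.
body = "Infobox Foo | param1 = {{convert|10|km|mi}} | param2 = 123"

naive = split_params_naive(body)
nested = split_params_nested(body)
print(len(naive), naive)
print(len(nested), nested)
```

The naive version cuts the convert template into pieces, while the depth-aware one keeps param1 intact. That is roughly the minimum bookkeeping needed before the remaining edge cases even come into view.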

The ideal way: using a structured data source

There are various projects to provide the information contained in Wikipedia infoboxes in a structured form; the two large ones are Wikidata and DBpedia.

Wikidata is a project to build a knowledge base containing structured data; it is maintained by the same global movement that built Wikipedia, so information is in the process of being moved over. This is a manual process, so not all information in Wikipedia is available via Wikidata; on the other hand, there is a lot of information that's in Wikidata but not in Wikipedia. You can find the Wikidata page of an article and see what information it contains by following the Wikidata item link in the left-hand toolbar on the article page; programmatically, you can access Wikidata information using the wbgetentities API module (sandbox, explanation of concepts), e.g. https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein. There is also a SPARQL endpoint, database dumps, and clients in PHP, Java and Python.
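As a rough sketch of the wbgetentities route: the code below builds the request URL and walks the claims in a response. The helper names are hypothetical, and SAMPLE_RESPONSE is a hand-written, heavily trimmed imitation of the response shape (entities, then claims, then a list of statements each with a mainsnak), not real API output.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(title, site="enwiki", props="claims"):
    """Build a wbgetentities URL that looks an item up by its Wikipedia title."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": props,  # ask only for what you need
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def claim_values(response, property_id):
    """Collect the raw datavalue payloads of one property across all entities."""
    values = []
    for entity in response.get("entities", {}).values():
        for statement in entity.get("claims", {}).get(property_id, []):
            snak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue.
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
    return values


# Hand-written sample in the shape of a wbgetentities response (trimmed).
SAMPLE_RESPONSE = {
    "entities": {
        "Q937": {
            "claims": {
                "P569": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P569",
                            "datavalue": {
                                "type": "time",
                                "value": {"time": "+1879-03-14T00:00:00Z"},
                            },
                        }
                    }
                ]
            }
        }
    }
}

url = wbgetentities_url("Albert_Einstein")
births = claim_values(SAMPLE_RESPONSE, "P569")
print(url)
print(births)
```

Values come back typed (times, quantities, item references and so on), so real code still has to decide how to render each datavalue type.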

DBpedia is a project to harvest Wikipedia infobox information by automated means and publish it in a structured form. You can find the DBpedia page for a Wikipedia article by going to http://dbpedia.org/page/<Wikipedia article name>, e.g. http://dbpedia.org/page/Albert_Einstein. It has many data formats, dumps, a SPARQL endpoint and various other things.
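If you go through the SPARQL endpoint, something along these lines might be a starting point. The helper names are invented for this sketch, the query is deliberately generic (every property/value pair of one resource), and SAMPLE_RESULT is a hand-written example in the standard SPARQL JSON results layout rather than real DBpedia output.

```python
from urllib.parse import urlencode

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def resource_query(article_name, limit=100):
    """SPARQL listing property/value pairs of one DBpedia resource.

    Sketch only: article_name is pasted into the IRI as-is, so it must
    already be a valid, trusted resource name such as Albert_Einstein.
    """
    return (
        "SELECT ?property ?value WHERE { "
        f"<http://dbpedia.org/resource/{article_name}> ?property ?value "
        f"}} LIMIT {int(limit)}"
    )


def sparql_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


def bindings_to_pairs(result):
    """Flatten SPARQL JSON results into (property, value) tuples."""
    return [
        (row["property"]["value"], row["value"]["value"])
        for row in result["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results format.
SAMPLE_RESULT = {
    "head": {"vars": ["property", "value"]},
    "results": {
        "bindings": [
            {
                "property": {"type": "uri", "value": "http://dbpedia.org/ontology/birthDate"},
                "value": {"type": "literal", "value": "1879-03-14"},
            }
        ]
    },
}

query = resource_query("Albert_Einstein")
pairs = bindings_to_pairs(SAMPLE_RESULT)
print(sparql_url(query))
print(pairs)
```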

The wrong ways done right

If the information you need is not available via Wikidata or DBpedia, there are still semi-structured ways of extracting data from infoboxes. For HTML-based extraction you can use Wikipedia's REST content API (e.g. https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein) which returns a richer, more semantic HTML than the one used on normal article pages, and preserves in it some information about template structure.
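The template information in that HTML lives, as far as I understand the Parsoid output, in a data-mw attribute on elements marked typeof="mw:Transclusion". Here is a stdlib-only sketch of reading it; the class name is made up, the sample HTML is synthetic, and the exact JSON layout (parts, template, target.wt, params.*.wt) should be treated as an assumption to verify against a real response.

```python
import html
import json
from html.parser import HTMLParser


class TransclusionCollector(HTMLParser):
    """Collect (template name, params) pairs from data-mw attributes."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        typeof = attributes.get("typeof") or ""
        if "mw:Transclusion" not in typeof.split() or not attributes.get("data-mw"):
            return
        data = json.loads(attributes["data-mw"])
        for part in data.get("parts", []):
            # parts can also hold plain strings between templates.
            if not isinstance(part, dict) or "template" not in part:
                continue
            template = part["template"]
            name = template.get("target", {}).get("wt", "").strip()
            params = {
                key: value.get("wt", "").strip()
                for key, value in template.get("params", {}).items()
            }
            self.templates.append((name, params))


# Synthetic sample mimicking the assumed data-mw layout.
data_mw = {
    "parts": [
        {
            "template": {
                "target": {"wt": "Infobox Foo\n", "href": "./Template:Infobox_Foo"},
                "params": {"param1": {"wt": "bar"}, "param2": {"wt": "123"}},
                "i": 0,
            }
        }
    ]
}
SAMPLE_HTML = '<table class="infobox" typeof="mw:Transclusion" data-mw="{}"></table>'.format(
    html.escape(json.dumps(data_mw))
)

collector = TransclusionCollector()
collector.feed(SAMPLE_HTML)
print(collector.templates)
```

The parameter values are still raw wikitext, so nested templates inside a value remain unparsed at this level.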

Alternatively, you might start from wikitext and parse it into a syntax tree using the simpler, client-side mwparserfromhell Python module (docs) or the more powerful parsoid-jsapi which interacts with the Wikipedia REST content service.

A higher-level Python library which tries to extract infobox contents from wikitext is wptools.
