Extracting information from PDFs of research papers

We ran a contest to solve this problem at Dev8D in London in February 2010, and a nice little GPL-licensed tool was created as a result. We haven't integrated it into our systems yet, but it's out there for anyone to use.

https://code.google.com/p/pdfssa4met/


I'm only allowed one link per posting so this is it: pdfinfo Linux manual page

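This might get the title and authors: pdfinfo prints the metadata stored in the PDF, so it only helps when the file's Title and Author fields were actually filled in. At the bottom of the manual page there's a link to www.foolabs.com/xpdf, where you can find the open source code for the program as well as binaries for various platforms.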
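If you want to script this, here is a rough Python sketch of calling pdfinfo and picking out the title and author. It assumes the pdfinfo binary is installed and on your PATH, and that its output is the usual "Key: value" lines; the function names parse_pdfinfo and pdf_title_and_author are just ones I made up for illustration.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's "Key:   value" lines into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, since values (dates) contain colons.
        key, sep, value = line.partition(":")
        if sep:  # ignore any line that has no colon at all
            fields[key.strip()] = value.strip()
    return fields


def pdf_title_and_author(path: str):
    """Run pdfinfo on a file (assumes the pdfinfo binary is on PATH)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    info = parse_pdfinfo(result.stdout)
    # Either field may be missing or junk if nobody set it in the PDF.
    return info.get("Title"), info.get("Author")
```

Calling pdf_title_and_author("paper.pdf") returns a (title, author) pair, with None for any field that pdfinfo did not report. In practice a lot of research PDFs have empty or useless metadata, so treat the result as a first guess rather than the answer.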

To pull out bibliographic references, look at cb2Bib:

cb2Bib is a free, open source, and multiplatform application for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.

You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.