Is it possible to extract table information using Apache Tika?
Tika doesn't parse table information. The confusing part is that it converts table tags to <p>, which means the structure is lost. This is the case up to the current version, 1.14. It may be remedied in the future, but so far there are no plans to work in that direction.
You can refer to the JIRA issue that discusses this shortcoming in Tika. After the JIRA was raised, the wiki was also updated to reflect this limitation. [Disclaimer: I raised the JIRA.]
Now the solution part: in my experience, Aspose.Pdf for Java does a brilliant job of converting PDF to HTML, but it is a licensed product. You can check the quality with the free trial version. Code and example links.
I use a combination of Tika (tika-app-1.19.jar) and Aspose (aspose-pdf-18.9.1.jar).
I first modify the PDF using Aspose so that there is a pipe ('|') at the end of each table column, then read it into Tika and convert it to text:
InputStream is = part.getInputStream(); // input stream of the PDF or PDF part
// Use Aspose to add pipes ("|")
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document pdfDocument = new Document(is); // load the existing PDF file
PageCollection pageCollection = pdfDocument.getPages();
int iNumPages = pageCollection.size();
for (int i = 1; i <= iNumPages; i++) // Aspose page indexes start at 1
{
    Page page = pageCollection.get_Item(i);
    TableAbsorber absorber = new TableAbsorber(); // create a TableAbsorber to find tables
    absorber.visit(page); // visit the current page with the absorber
    IGenericList<AbsorbedTable> listTables = absorber.getTableList();
    for (AbsorbedTable absorbedTable : listTables)
    {
        IGenericList<AbsorbedRow> listRows = absorbedTable.getRowList();
        for (AbsorbedRow absorbedRow : listRows)
        {
            IGenericList<AbsorbedCell> listCells = absorbedRow.getCellList();
            for (AbsorbedCell absorbedCell : listCells)
            {
                TextFragmentCollection collectionTextFrag = absorbedCell.getTextFragments();
                Rectangle rectangle = absorbedCell.getRectangle();
                // Add a pipe ("|") at the upper-right corner of the cell to mark where the column ends
                TextBuilder textBuilder = new TextBuilder(page);
                TextFragment textFragment = new TextFragment("|");
                double x = rectangle.getURX();
                double y = rectangle.getURY();
                textFragment.setPosition(new Position(x, y));
                textBuilder.appendText(textFragment);
            }
        }
    }
}
pdfDocument.save(outputStream);
is = new ByteArrayInputStream(outputStream.toByteArray()); // input stream of the modified PDF with pipes ("|") included
Now the PDF input stream above, with pipes ("|") at the table cell ends, can be pulled into Tika and converted to text:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
//InputStream stream = new ByteArrayInputStream(sIS.getBytes(StandardCharsets.UTF_8));
pdfParser.parse(is, handler, metadata, context);
String sPdfData = handler.toString();
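Once Tika has returned the text, the pipes can be used to split each line back into cells. Below is a minimal sketch using only the JDK, assuming each table row comes out as a single line with a "|" after every cell. `PipeRowSplitter` and `splitRows` are hypothetical names, and the sample text in `main` is made up for illustration; it is not real Tika output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns pipe-marked text into rows of cells.
class PipeRowSplitter {

    // Returns one list of trimmed cells per line that contains a pipe.
    // Lines without a pipe are treated as non-table text and skipped (an assumption).
    static List<List<String>> splitRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.contains("|")) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            // String.split drops trailing empty strings, so the pipe after
            // the last cell does not produce an extra empty cell.
            for (String cell : line.split("\\|")) {
                cells.add(cell.trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample; in the answer above this would be sPdfData.
        String sample = "Invoice 42\nItem | Qty | Price |\nWidget | 2 | 9.99 |\n";
        for (List<String> row : splitRows(sample)) {
            System.out.println(row);
        }
    }
}
```

In practice the row layout depends on how well `setSortByPosition(true)` orders the text, so cells that wrap over several lines would need extra handling.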
Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. What Tika does with the documents is output them as "SAX based XHTML events" [1].
So basically we can write a custom SAX implementation to parse the file.
The structured text output will be of the form (metadata omitted):
<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>
In our SAX implementation we can treat the first part of each paragraph as the key (for my problem I already know the keys and am looking for the values, so extracting a value is a substring operation).
Override public void characters(char[] ch, int start, int length) with that logic.
Please note that in my case the structure of the content is fixed and I know the keys that are coming in, so it was easy to do it this way. This is not a generic solution.
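The approach above can be sketched as follows. This is a minimal illustration rather than my exact code: `KeyValueHandler` and `extract` are hypothetical names, and the demo parses the sample XHTML with the JDK's SAX parser instead of attaching the handler to Tika. Text is buffered in `characters` and matched in `endElement`, because SAX may deliver one paragraph in several `characters` calls.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "<p>Key Value</p>" paragraphs for known keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            paragraph.setLength(0); // start collecting a new paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        paragraph.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"p".equals(qName)) {
            return;
        }
        String text = paragraph.toString().trim();
        for (String key : knownKeys) {
            // The value is whatever follows the known key (a plain substring).
            // Note: a key that is a prefix of another key would need stricter matching.
            if (text.startsWith(key)) {
                values.put(key, text.substring(key.length()).trim());
                break;
            }
        }
    }

    // Parses an XHTML string and returns the values found for the given keys.
    static Map<String, String> extract(String xhtml, List<String> keys) {
        KeyValueHandler handler = new KeyValueHandler(keys);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.values;
    }

    public static void main(String[] args) {
        String xhtml = "<body><div class=\"page\"><p/>\n<p>Key1 Value1 </p>\n"
                + "<p>Key2 Value2 </p>\n<p>Key3 Value3</p>\n<p/>\n</div>\n</body>";
        System.out.println(extract(xhtml, Arrays.asList("Key1", "Key2", "Key3")));
    }
}
```

Since the class extends `DefaultHandler`, an instance could in principle be passed to a Tika parser in place of `BodyContentHandler`, but I have only shown it against the sample structure here.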