Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET
You can also check out IFilters. There are a number of resources if you search for "asp.net ifilters":
- http://www.codeproject.com/KB/cs/IFilter.aspx
- http://en.wikipedia.org/wiki/IFilters
- http://www.ifilter.org/
- https://stackoverflow.com/questions/1535992/ifilter-or-sdk-for-many-file-types
Of course, there is added hassle if you are distributing this to client systems: you either need to ship the IFilters with your application and install them on each machine, or your users will be unable to extract text from any file type they have no IFilter installed for.
This is one of the reasons I was dissatisfied with Lucene for a project I was working on. Xapian is a competing product; it is orders of magnitude faster than Lucene in some cases and has other compelling features (well, they were compelling to me at the time). The big issue? It's written in C++, so you have to interop to it.
That covers indexing and retrieval. The actual parsing of the text is where Lucene really falls down: you have to do it yourself. Xapian has an Omega component that manages calling third-party tools to extract the text, and in my limited testing it worked pretty darn well. I did not take the project beyond a proof of concept, but I did write up my experience compiling it for 64-bit. Of course, this was almost a year ago, so things might have changed.
If you dig into the Omega documentation, you can see the tools it uses to parse each document type:
- PDF (.pdf) if pdftotext is available (comes with xpdf)
- PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available
- OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available
- OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available
- MS Word documents (.doc, .dot) if antiword is available
- MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
- MS PowerPoint documents (.ppt, .pps) if catppt is available (comes with catdoc)
- MS Office 2007 documents (.docx, .dotx, .xlsx, .xlst, .pptx, .potx, .ppsx) if unzip is available
- WordPerfect documents (.wpd) if wpd2text is available (comes with libwpd)
- MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
- Compressed AbiWord documents (.zabw) if gzip is available
- Rich Text Format documents (.rtf) if unrtf is available
- Perl POD documentation (.pl, .pm, .pod) if pod2text is available
- TeX DVI files (.dvi) if catdvi is available
- DjVu files (.djv, .djvu) if djvutxt is available
- XPS files (.xps) if unzip is available
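If you end up doing the parsing yourself, the same "use the tool if it is installed" idea is easy to sketch. The following is a rough Python illustration of the approach, not Omega's actual code: the `EXTRACTORS` table, `pick_command` and `extract_text` are names I made up, only a handful of the formats above are covered, and I am assuming each tool writes the extracted text to stdout when called this way, so check the flags against the versions you have installed.

```python
import shutil
import subprocess
from pathlib import Path

FILE = "{}"  # placeholder that is swapped for the document path

# Hypothetical table: file extension -> command line for an external converter.
# Assumption: each command prints the extracted text to stdout.
EXTRACTORS = {
    ".pdf": ["pdftotext", FILE, "-"],  # "-" asks pdftotext to write to stdout
    ".doc": ["antiword", FILE],
    ".xls": ["xls2csv", FILE],
    ".ppt": ["catppt", FILE],
    ".rtf": ["unrtf", "--text", FILE],
    ".djvu": ["djvutxt", FILE],
}


def pick_command(path, which=shutil.which):
    """Return the command for this file, or None if the extension is not
    in the table or the required tool is not on PATH."""
    template = EXTRACTORS.get(Path(path).suffix.lower())
    if template is None or which(template[0]) is None:
        return None
    return [str(path) if arg == FILE else arg for arg in template]


def extract_text(path):
    """Run the matching converter and return its output as text,
    or None when no converter is available for this file."""
    cmd = pick_command(path)
    if cmd is None:
        return None
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The string returned by `extract_text` is what you would then hand to your indexer as the document body. Files that return `None` can be skipped or indexed by filename only.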