How to decode a PDF stream?

  1. "Two xref tables and two %%EOF"?

    This alone is not an indication of a malicious PDF file. There can be two or even more instances of each if the file was generated via the "incremental update" feature. (Digitally signed PDF files are typically like that, and so is every file that was changed in Acrobat and saved with the 'Save' button/menu instead of the 'Save as...' button/menu.)

  2. "How to decode a compressed PDF stream from a specific object"?

    Have a look at Didier Stevens' Python script pdf-parser.py. With this command line tool, you can dump the decoded stream of any PDF object into a file. Example command to dump the stream of PDF object number 13:

    pdf-parser.py -o 13 -f -d obj13.dump my.pdf
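
If you are curious what such a dump involves, here is a minimal Python sketch of the same idea for the simplest case. It is not how pdf-parser.py is implemented. The helper names `raw_stream` and `inflate` are made up for this illustration, and it assumes the object is stored directly in the file (not inside an object stream), uses only /FlateDecode, and that the bytes "endobj" do not occur inside the compressed data. In an incrementally updated file it returns the first (oldest) copy of the object.

```python
import re
import zlib

def raw_stream(pdf: bytes, obj_num: int, gen: int = 0) -> bytes:
    """Return the still-encoded bytes between 'stream' and 'endstream'."""
    # Find "13 0 obj"; the lookbehind stops "3 0 obj" from matching "13 0 obj".
    header = re.search(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, gen), pdf)
    if header is None:
        raise KeyError(f"object {obj_num} {gen} not found")
    # Assumption: the first "endobj" after the header closes this object.
    end = pdf.find(b"endobj", header.end())
    body = pdf[header.end():end if end != -1 else len(pdf)]
    match = re.search(rb"stream\r?\n(.*)endstream", body, re.S)
    if match is None:
        raise ValueError(f"object {obj_num} {gen} has no stream")
    return match.group(1)

def inflate(raw: bytes) -> bytes:
    """Undo /FlateDecode; bytes after the zlib data (the EOL) are ignored."""
    return zlib.decompressobj().decompress(raw)

# Hypothetical one-object file body, built here only so the sketch runs.
payload = zlib.compress(b"BT (Hello) Tj ET")
sample = (
    b"%PDF-1.4\n13 0 obj\n<< /Length " + str(len(payload)).encode()
    + b" /Filter /FlateDecode >>\nstream\n" + payload
    + b"\nendstream\nendobj\n"
)

print(inflate(raw_stream(sample, 13)))
```

For a real file you would read it with `open("my.pdf", "rb").read()` and pass the bytes in place of `sample`. Anything beyond this simple case is where a proper tool earns its keep.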
    

You can use RUPS to analyze the PDF and either export a stream or just view it already decoded. As for %%EOF, each append (incremental update) made to the PDF adds another one, so a file can contain several.


A %%EOF comment should be present at the end of the file; other comments (any line beginning with %) may appear at any point in the file. So yes, two %%EOF comments are perfectly valid. This is documented in the PDF Reference. Check example 3.11 on page 112 of the 1.7 PDF Reference Manual for a documented example in the specification that has the structure you describe. It is a PDF file which has been incrementally updated.
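
As a rough illustration of the point about incremental updates, the Python sketch below counts the markers that each saved revision appends. The function name `count_update_markers` and the sample bytes are invented for this example, and the sample is only marker-shaped, not a valid PDF. Treat the counts as a hint: the same byte sequences could in principle also occur inside stream data.

```python
import re

def count_update_markers(data: bytes) -> dict:
    """Count the trailer markers that every saved revision appends."""
    return {
        "startxref": len(re.findall(rb"startxref", data)),
        "eof": len(re.findall(rb"%%EOF", data)),
    }

# Hypothetical, heavily abbreviated file body: an original revision
# followed by one incremental update.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 >>\nstartxref\n45\n%%EOF\n"
    b"1 0 obj\n<< /Type /Catalog /Lang (en) >>\nendobj\n"
    b"xref\n1 1\ntrailer\n<< /Size 2 /Prev 45 >>\nstartxref\n160\n%%EOF\n"
)

print(count_update_markers(sample))
```

For the sample both counts come out as 2, which is what one original save plus one incremental update looks like.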

Note that more recent versions of PDF (1.5 and later) can use cross-reference streams instead of xref tables, and these are themselves usually compressed.

The easiest way to decode a PDF file is to use a tool intended for the job. For example, with MuPDF, "mutool clean -d <input pdf file> <output PDF file>" will decompress (-d) all the compressed streams in a PDF file and write the output to a new PDF file.

Otherwise you will need to use something like zlib for Flate decompression. zlib does not handle LZW, so you will need your own decoder (or another library) for that, as well as for RunLength, ASCIIHex and ASCII85, I think. Not to mention JBIG2, JPEG and JPEG2000 if you want the images decoded too.
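
To give an idea of the do-it-yourself route, here is a hedged Python sketch of the simpler filters using only the standard library. The function names and the `DECODERS` table are my own. It ignores /DecodeParms (predictors), LZW and all the image codecs, so it is a starting point rather than a complete decoder.

```python
import base64
import binascii
import zlib

def flate_decode(data: bytes) -> bytes:
    # decompressobj tolerates a trailing EOL after the zlib data.
    return zlib.decompressobj().decompress(data)

def ascii_hex_decode(data: bytes) -> bytes:
    digits = b"".join(data.split())      # whitespace is ignored
    digits = digits.split(b">", 1)[0]    # '>' marks end of data
    if len(digits) % 2:
        digits += b"0"                   # an odd final digit is padded with 0
    return binascii.unhexlify(digits)

def ascii85_decode(data: bytes) -> bytes:
    data = data.strip()
    # PDF streams end with "~>" but usually omit the leading "<~".
    if not data.startswith(b"<~"):
        data = b"<~" + data
    return base64.a85decode(data, adobe=True)

def run_length_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        if length == 128:                # end-of-data marker
            break
        if length < 128:                 # copy the next length+1 bytes as-is
            out += data[i + 1:i + 2 + length]
            i += length + 2
        else:                            # repeat the next byte 257-length times
            out += data[i + 1:i + 2] * (257 - length)
            i += 2
    return bytes(out)

DECODERS = {
    "FlateDecode": flate_decode,
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "RunLengthDecode": run_length_decode,
}

def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply the filters in the order they are listed in /Filter."""
    for name in filters:
        data = DECODERS[name](data)
    return data

# Round trip: Flate-compress, then ASCII85-encode, then undo both.
encoded = base64.a85encode(zlib.compress(b"BT (Hello) Tj ET"), adobe=True)
print(decode_stream(encoded, ["ASCII85Decode", "FlateDecode"]))
```

The filter names in `filters` are the ones you would read from the stream dictionary's /Filter entry, without the leading slash.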