How to know if a PDF file is compressed or not and to (un)compress it
In short:
To know if it's compressed already:
strings your.pdf | grep /Filter
To (un)compress a PDF, use QPDF
qpdf --stream-data=compress your.pdf compressed.pdf
qpdf --stream-data=uncompress compressed.pdf uncompressed.pdf
Explanation:
The /Filter key inside a PDF file indicates the compression (encoding) method used for a stream; in the file itself it is followed by a name such as /FlateDecode or /DCTDecode. Some of the methods are:
CCITT G3/G4 – used for monochrome images
JPEG – a lossy algorithm that is used for images
JPEG2000 – a more modern alternative to JPEG, which is also used for compressing images
Flate – used for compressing text as well as images
JBIG2 – an alternative to CCITT compression for monochrome images
LZW – used for compressing text as well as images but getting replaced by Flate
RLE – used for monochrome images
ZIP – used for grayscale or color images
(copied from here).
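To see which of these methods a particular file actually uses, the filter names can be tallied directly. The following is only a heuristic sketch: the function name list_filters is made up for illustration, it assumes a grep that supports -a and -o (GNU and BSD grep do), and it only sees names stored as plain text, so anything inside compressed object streams is missed. In the file the filters appear under their PDF names, for example /FlateDecode for Flate, /DCTDecode for JPEG and /JPXDecode for JPEG2000.

```shell
# list_filters FILE  (hypothetical helper, not part of any tool)
# Print how often each "/...Decode" name occurs in FILE, most frequent first.
# Heuristic only: binary stream data can, in principle, contain a matching byte run.
list_filters() {
    # LC_ALL=C: treat the file as raw bytes; -a: scan binary data as text;
    # -o: print every match on its own line.
    # At least one character is required before "Decode", so the unrelated
    # /Decode and /DecodeParms keys are not counted.
    LC_ALL=C grep -a -o '/[A-Za-z0-9][A-Za-z0-9]*Decode' "$1" | sort | uniq -c | sort -rn
}
```

Calling list_filters your.pdf prints one line per filter name, each preceded by its count.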
However, because PDF files have a complex internal structure, some parts (or "streams") of a PDF will usually be compressed already (and will show up when grepping for /Filter) while other parts will not be, so there is no simple yes/no answer to the question of whether a PDF is compressed.
One way to work around this is to add the -c option to grep, which prints the number of matching lines instead of the lines themselves, so you get a rough idea of how much of the file is compressed. For example, if
strings "large.pdf" | grep -c /Filter
returns less than 10, the file is largely uncompressed.
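Since grep -c counts matching lines rather than individual matches, a slightly more careful count can use -o instead. The sketch below is a rough estimate, not a definitive test: the function names are hypothetical, it assumes a grep with -a, -o and -F, and it compares the number of /Filter keys with the number of endstream markers to suggest what share of the streams carry a filter.

```shell
# count_matches STRING FILE: print how many times the fixed string STRING occurs in FILE.
count_matches() {
    # -a: scan binary data as text; -o: one output line per match; -F: no regex.
    LC_ALL=C grep -a -o -F -e "$1" "$2" | wc -l | tr -d ' '
}

# filter_ratio FILE: compare the number of /Filter keys with the number of streams.
filter_ratio() {
    filters=$(count_matches '/Filter' "$1")
    streams=$(count_matches 'endstream' "$1")
    echo "$filters of $streams streams have a /Filter entry"
}
```

Running filter_ratio large.pdf then gives a count relative to the number of streams, which is easier to interpret than a bare number.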
Another property that affects the size of a PDF is whether it has been optimized for quick access; "optimized" PDFs are larger. To quote Wikipedia:
There are two layouts to the PDF files—non-linear (not "optimized") and linear ("optimized"). Non-linear PDF files consume less disk space than their linear counterparts, though they are slower to access because portions of the data required to assemble pages of the document are scattered throughout the PDF file. Linear PDF files (also called "optimized" or "web optimized" PDF files) are constructed in a manner that enables them to be read in a Web browser plugin without waiting for the entire file to download, since they are written to disk in a linear (as in page order) fashion. PDF files may be optimized using Adobe Acrobat software or QPDF.
You can check whether a PDF is optimized with pdfinfo your.pdf, which reports it on the "Optimized:" line of its output.
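If pdfinfo is not at hand, a quick heuristic is to look for the /Linearized key, which a linearized PDF carries in a dictionary near the very start of the file. This is a sketch under assumptions: the function name is_linearized is made up, it relies on head -c being available, and a file that was modified after linearization may still contain the key without being properly linearized any more.

```shell
# is_linearized FILE: exit status 0 if FILE looks linearized, non-zero otherwise.
is_linearized() {
    # Only the first 1024 bytes are inspected; -a lets grep scan binary data,
    # -q suppresses output so that only the exit status is used.
    head -c 1024 "$1" | LC_ALL=C grep -a -q '/Linearized'
}
```

For example: if is_linearized your.pdf; then echo "optimized"; else echo "not optimized"; fi. For an authoritative answer, qpdf --check-linearization your.pdf performs a real check, and qpdf --linearize in.pdf out.pdf produces a linearized copy.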
pdftk is another tool that can perform operations on PDF files, including compression and decompression:
$ pdftk test.pdf output compressed_test.pdf compress
$ pdftk compressed_test.pdf output uncompressed_test.pdf uncompress
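Whichever tool is used, it is worth checking what the operation actually did to the file size. A minimal sketch, with hypothetical helper names and plain integer arithmetic (it assumes the original file is not empty):

```shell
# file_size FILE: print the size of FILE in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# size_change BEFORE AFTER: print AFTER's size as a percentage of BEFORE's size.
size_change() {
    before=$(file_size "$1")
    after=$(file_size "$2")
    # Integer division: the percentage is truncated, not rounded.
    echo "$after bytes, $(( after * 100 / before ))% of the original $before bytes"
}
```

For example, size_change test.pdf compressed_test.pdf shows how much the compress step saved; a value close to 100% suggests the file was already mostly compressed.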