How can I extract images and their metadata from PDFs?
Images do not contain metadata and are stored as raw data which needs to be assemebled into images. I wrote 2 blog posts explaining how image data is stored in a PDF file at https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ and https://blog.idrsolutions.com/2010/09/understanding-the-pdf-file-format-images/
I don't agree to the others and have a POC for your question: You can extract the XMP Metadata of images using pdfbox in the following way:
public void getXMPInformation() { // Open PDF document PDDocument document = null; try { document = PDDocument.load(PATH_TO_YOUR_DOCUMENT); } catch (IOException e) { e.printStackTrace(); } // Get all pages and loop through them List pages = document.getDocumentCatalog().getAllPages(); Iterator iter = pages.iterator(); while( iter.hasNext() ) { PDPage page = (PDPage)iter.next(); PDResources resources = page.getResources(); Map images = null; // Get all Images on page try { images = resources.getImages(); } catch (IOException e) { e.printStackTrace(); } if( images != null ) { // Check all images for metadata Iterator imageIter = images.keySet().iterator(); while( imageIter.hasNext() ) { String key = (String)imageIter.next(); PDXObjectImage image = (PDXObjectImage)images.get( key ); PDMetadata metadata = image.getMetadata(); System.out.println("Found a image: Analyzing for Metadata"); if (metadata == null) { System.out.println("No Metadata found for this image."); } else { InputStream xmlInputStream = null; try { xmlInputStream = metadata.createInputStream(); } catch (IOException e) { e.printStackTrace(); } try { System.out.println("--------------------------------------------------------------------------------"); String mystring = convertStreamToString(xmlInputStream); System.out.println(mystring); } catch (IOException e) { e.printStackTrace(); } } // Export the images String name = getUniqueFileName( key, image.getSuffix() ); System.out.println( "Writing image:" + name ); try { image.write2file( name ); } catch (IOException e) { // TODO Auto-generated catch block //e.printStackTrace(); } System.out.println("--------------------------------------------------------------------------------"); } } } }
And the "Helper methods":
public String convertStreamToString(InputStream is) throws IOException {
/*
* To convert the InputStream to String we use the BufferedReader.readLine()
* method. We iterate until the BufferedReader return null which means
* there's no more data to read. Each line will appended to a StringBuilder
* and returned as String.
*/
if (is != null) {
StringBuilder sb = new StringBuilder();
String line;
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
while ((line = reader.readLine()) != null) {
sb.append(line).append("\n");
}
} finally {
is.close();
}
return sb.toString();
} else {
return "";
}
}
private String getUniqueFileName( String prefix, String suffix ) {
/*
* imagecounter is a global variable that counts from 0 to the number of
* extracted images
*/
String uniqueName = null;
File f = null;
while( f == null || f.exists() ) {
uniqueName = prefix + "-" + imageCounter;
f = new File( uniqueName + "." + suffix );
}
imageCounter++;
return uniqueName;
}
Note: This is a quick and dirty proof of concept and not a well-styled code.
The Images must have XMP-Metadata when placed in InDesign before building the PDF document. The XMP-Metdadata can be set by using Photoshop for example. Please be aware, that p.e. not all IPTC/Exif/... Information is converted into the XMP-Metadata. Only a small number of fields are converted.
I'm using this method on JPG and PNG images, placed in PDFs build with InDesign. It works well and I can get all image-informations after the production-steps from the ready PDFs (picture coating).