Friday, May 13, 2011

Extracting images from a pdf file

This week I went to the local photo shop in our town to have some documents scanned. I handed them to the salesperson, who told me to come back in a couple of hours. When I returned, I received my scanned documents. At home I found out that he had given me a CD with a couple of PDF files (my fault, I forgot to tell him that I needed images). Here is a trick that extracts the images from a PDF document:

$ ls
$ pdfimages MinoltaSc11051113080_1_1.pdf -j MinoltaSc11051113080_1_1
$ ls
MinoltaSc11051113080_1_1-000.jpg  MinoltaSc11051113080_1_1.pdf
$ file *
MinoltaSc11051113080_1_1-000.jpg:   JPEG image data, JFIF standard 1.01
MinoltaSc11051113080_1_1.pdf:       PDF document, version 1.3
$ xv MinoltaSc11051113080_1_1-000.jpg

That's it. For this job pdfimages is the tool of choice. The -j option tells pdfimages to save images that are stored in the PDF as JPEG (DCT) data as .jpg files; without it, every image is written as PPM (or PBM for monochrome images). The last argument is the prefix for the output file names and can be any name that makes sense to you. pdfimages has a few more options, for example -f and -l to limit the page range and -opw or -upw to supply the owner or user password of a protected file, e.g.:

$ pdfimages test.pdf -j test
Error: Incorrect password
$ pdfimages test.pdf -opw <password> -j test
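
If several protected files share one password, a small wrapper saves retyping it. The following is only a sketch: the function name extract_images and the environment variable PDF_OWNER_PW are made up for this example and are not part of pdfimages. It passes -opw only when the variable is set.

```shell
# Sketch: call pdfimages with or without an owner password.
# extract_images and PDF_OWNER_PW are hypothetical names for this example.
extract_images() {
    pdf=$1     # input PDF file
    root=$2    # prefix for the extracted image files
    if [ -n "${PDF_OWNER_PW:-}" ]; then
        # A password was provided: hand it to pdfimages via -opw.
        pdfimages -opw "$PDF_OWNER_PW" -j "$pdf" "$root"
    else
        # No password set: plain extraction.
        pdfimages -j "$pdf" "$root"
    fi
}
```

With this in place, something like PDF_OWNER_PW=<password> extract_images test.pdf test should behave like the command above. Keep in mind that the password is still visible in the process list while pdfimages runs.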

When a PDF file contains more than one image, pdfimages numbers the output files automatically by appending a three-digit counter to the prefix, starting at -000 as in the first example:

$ pdfimages test.pdf -j test
$ ls test*
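
Since the CD held a couple of PDF files, the same command can be wrapped in a loop. This is a minimal sketch that assumes all PDFs sit in the current directory and end in .pdf; image_root and extract_all are helper names invented for this example. Each file name, minus its .pdf suffix, is used as the image prefix.

```shell
#!/bin/sh
# Sketch: extract the images from every PDF in the current directory.
# image_root and extract_all are hypothetical helpers, not part of pdfimages.

# Derive the output prefix by stripping a trailing .pdf from the file name.
image_root() {
    printf '%s\n' "${1%.pdf}"
}

extract_all() {
    for pdf in ./*.pdf; do
        # If no PDF exists, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        root=$(image_root "$pdf")
        # Report failures (e.g. a password-protected file) and carry on.
        pdfimages -j "$pdf" "$root" || echo "failed: $pdf" >&2
    done
}

# Only run the loop when pdfimages is actually installed.
if command -v pdfimages >/dev/null 2>&1; then
    extract_all
fi
```

Because every prefix is taken from its own PDF, the numbered images of different documents do not overwrite each other.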