cat brain.log | less

Getting it down on `paper`

PDF to PostScript

Tired of the ghostscript errors? Try poppler (and poppler-utils).

Here’s my problem. I have these PDFs and they are huge: around 500MB each. Each PDF contains exactly one page, and that page is a scanned image. I want to convert each PDF to a PNG. When I use ImageMagick’s `convert`, I get this error (followed by a bunch of rubbish):

$ convert 01.pdf 01.png
   **** Warning:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.

– Enter poppler –

Install poppler-utils (for pdftops)
$ sudo yum install poppler-utils

What do I get when I install poppler-utils? This package contains pdftops (PDF to PostScript converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), pdftohtml (PDF to HTML converter), pdftotext (PDF to text converter), and pdffonts (PDF font analyzer).

Let’s look at basic metadata

$ pdfinfo 01.pdf
Producer:       Adobe Photoshop for Macintosh
CreationDate:   Fri Mar 12 17:03:08 2010
ModDate:        Fri Mar 12 23:12:35 2010
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      4114.32 x 3270.72 pts
File size:      450186770 bytes
Optimized:      yes
PDF version:    1.7

Now let’s look at the document-level metadata, including the true pixel resolution.
Redundant? Yes, I noticed this too — but look, there’s more!

$ pdfinfo -meta 01.pdf | more
Producer:       Adobe Photoshop for Macintosh
CreationDate:   Fri Mar 12 17:03:08 2010
ModDate:        Fri Mar 12 23:12:35 2010
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      4114.32 x 3270.72 pts
File size:      450186770 bytes
Optimized:      yes
PDF version:    1.7
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996
, 2008/05/07-21:37:19        ">
 <rdf:RDF xmlns:rdf="">
 <rdf:Description rdf:about=""
 ... edited for brevity ...

Look! We found the image dimensions: 17143px by 13628px.
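
Those pixel dimensions, together with the page size that `pdfinfo` reports in points, give the resolution of the scan, because one point is 1/72 inch. Here is a minimal sketch of that arithmetic. The `scan_dpi` name is made up for illustration, and the only tool assumed is `awk`:

```shell
# Hypothetical helper: effective DPI from a pixel count and a length in points.
# One point is 1/72 inch, so dpi = pixels / (points / 72).
scan_dpi() {
    awk -v px="$1" -v pts="$2" 'BEGIN { printf "%.0f\n", px * 72 / pts }'
}

scan_dpi 17143 4114.32   # width
scan_dpi 13628 3270.72   # height
```

Both calls print 300 for this file, so the scan was saved at 300 DPI.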

Ok, so why is it that when I run `pdftops` to convert from PDF to PS, followed by ImageMagick’s `convert` from PS to PNG, I get an image that’s only 4115×3271 pixels? I want the full resolution!
Warning, this snippet is incorrect

$ pdftops 01.pdf
$ convert 01.ps 01.png
$ identify 01.png
01.png PNG 4115x3271 4115x3271+0+0 DirectClass 24mb
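
The 4115x3271 result is simply the page size in points, rounded up. ImageMagick rasterizes vector input such as PostScript at 72 DPI unless told otherwise, so each point becomes one pixel. A rough sketch of that arithmetic follows. `px_at_density` is a hypothetical name, and the round-up rule is inferred from the output above, not taken from any documentation:

```shell
# Hypothetical helper: pixel count for a length in points at a given density.
px_at_density() {
    awk -v pts="$1" -v dpi="$2" 'BEGIN {
        px = pts * dpi / 72
        r = int(px)
        if (px - r > 0.000001) r++   # round up, ignoring floating-point noise
        print r
    }'
}

px_at_density 4114.32 72    # 4115, matching the identify output
px_at_density 4114.32 300   # 17143, the full scan width
```

ImageMagick’s `-density` option, given before the input file (for example `convert -density 300 01.ps 01.png`), asks for a higher rasterization resolution. That re-renders the page, though; it does not extract the original scan.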

I must be missing something… and so I was. I overlooked `pdfimages`.

Proper conversion found here

$ pdfimages -j 01.pdf img
$ ls
01.pdf img-000.ppm
$ identify img-000.ppm
img-000.ppm PNM 17143x13628 17143x13628+0+0 DirectClass 6.7e+02mb 4.110u 0:05

Lookie there: full resolution… in a .PPM file? Who cares, as long as ImageMagick can handle it!

From PPM to PNG (like I wanted)

$ convert img-000.ppm img-000.png
$ identify img-000.png
img-000.png PNG 17143x13628 17143x13628+0+0 DirectClass 3.5e+02mb 12.540u 0:14
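
Since the goal was to convert every PDF, the two steps can be wrapped in a loop. This is only a sketch: `pdf_scans_to_png` is a made-up name, it assumes `pdfimages` and `convert` are on the PATH, and it uses each PDF’s base name as the image root instead of `img` so that outputs from different files don’t collide.

```shell
# Hypothetical helper: extract the embedded image(s) from each PDF given as an
# argument, then convert every extracted image to PNG.
pdf_scans_to_png() {
    for pdf in "$@"; do
        base="${pdf%.pdf}"
        # Writes <base>-000.ppm (or .jpg with -j for JPEG-encoded images), etc.
        pdfimages -j "$pdf" "$base" || return 1
        for img in "$base"-[0-9][0-9][0-9].*; do
            [ -e "$img" ] || continue          # glob matched nothing
            case "$img" in *.png) continue ;; esac   # skip PNGs from earlier runs
            convert "$img" "${img%.*}.png"
        done
    done
}
```

Called as `pdf_scans_to_png *.pdf`, this would leave `01-000.png` next to `01.pdf`, and so on for the other files.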

And there you have it – Stupid PDF converted to awesome full resolution PNG.

Thank you, come again.



