PDF to PostScript
Tired of the Ghostscript errors? Try poppler (and poppler-utils).
Here’s my problem. I have these PDFs and they are huge: around 500MB each. Each PDF contains exactly one page, and that page is a scanned image. I want to convert each PDF to a PNG. When I use ImageMagick’s `convert`, I get this error (followed by a bunch of rubbish):
$ convert 01.pdf 01.ps
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
– Enter poppler –
Install poppler-utils (for pdftops)
$ sudo yum install poppler-utils
What do I get when I install poppler-utils? This package contains pdftops (PDF to PostScript converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), pdftohtml (PDF to HTML converter), pdftotext (PDF to text converter), and pdffonts (PDF font analyzer).
(Package description from packages.debian.org)
Let’s look at basic metadata
$ pdfinfo 01.pdf
Producer: Adobe Photoshop for Macintosh
CreationDate: Fri Mar 12 17:03:08 2010
ModDate: Fri Mar 12 23:12:35 2010
Tagged: no
Pages: 1
Encrypted: no
Page size: 4114.32 x 3270.72 pts
File size: 450186770 bytes
Optimized: yes
PDF version: 1.7
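If you want these fields in a script rather than on screen, here is a minimal Python sketch. It assumes `pdfinfo` prints one `Key: value` pair per line, as in the output above. The names `parse_pdfinfo` and `pdfinfo` are made up for this example and are not part of poppler.

```python
import subprocess

def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps like 17:03:08 survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def pdfinfo(path):
    """Run pdfinfo (from poppler-utils) on a file and parse its output."""
    result = subprocess.run(["pdfinfo", path], check=True,
                            capture_output=True, text=True)
    return parse_pdfinfo(result.stdout)

# Parsing a few lines copied from the output above (no PDF needed):
sample = """CreationDate: Fri Mar 12 17:03:08 2010
Pages: 1
Page size: 4114.32 x 3270.72 pts
File size: 450186770 bytes
"""
info = parse_pdfinfo(sample)
# "4114.32 x 3270.72 pts" -> take the 1st and 3rd tokens
width_pt, height_pt = (float(v) for v in info["Page size"].split()[0:3:2])
print(info["Pages"], width_pt, height_pt)  # 1 4114.32 3270.72
```

This is only meant for plain `pdfinfo` output. The XML that `-meta` appends would need separate handling.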
Now look at the document-level metadata, which includes the true pixel dimensions.
Redundant? Yes, I noticed this too, but look, there’s more!
$ pdfinfo -meta 01.pdf | more
Producer: Adobe Photoshop for Macintosh
CreationDate: Fri Mar 12 17:03:08 2010
ModDate: Fri Mar 12 23:12:35 2010
Tagged: no
Pages: 1
Encrypted: no
Page size: 4114.32 x 3270.72 pts
File size: 450186770 bytes
Optimized: yes
PDF version: 1.7
Metadata:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996 , 2008/05/07-21:37:19 ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:tiff="http://ns.adobe.com/tiff/1.0/">
   <tiff:ImageWidth>17143</tiff:ImageWidth>
   <tiff:ImageLength>13628</tiff:ImageLength>
   <tiff:Compression>5</tiff:Compression>
   <tiff:PhotometricInterpretation>2</tiff:PhotometricInterpretation>
   <tiff:Orientation>1</tiff:Orientation>
   <tiff:SamplesPerPixel>3</tiff:SamplesPerPixel>
   <tiff:PlanarConfiguration>1</tiff:PlanarConfiguration>
   <tiff:XResolution>3000000/10000</tiff:XResolution>
   <tiff:YResolution>3000000/10000</tiff:YResolution>
   <tiff:ResolutionUnit>2</tiff:ResolutionUnit>
   <tiff:NativeDigest>256,257,258,259,262,274,277,284,530,531,282,283,296,301,318,319,529,532,306,270,271,272,305,315,33432;B4FF87F3DB26D93C7CFD5C3C5C66BE9E</tiff:NativeDigest>
   <tiff:BitsPerSample>
    <rdf:Seq>
     <rdf:li>8</rdf:li>
     <rdf:li>8</rdf:li>
     <rdf:li>8</rdf:li>
    </rdf:Seq>
   </tiff:BitsPerSample>
  </rdf:Description>
... edited for brevity ...
Look! We found the image dimensions: 17143px by 13628px.
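To pull those numbers out without reading the XML by eye, something like the following Python sketch should do. It assumes the XMP packet stores the values as plain `<tiff:...>` elements, as in the output above. Other PDFs may write them as attributes instead, and then this regex approach would miss them. The helper names are invented for illustration.

```python
import re

def tiff_field(xmp, name):
    """Return the text inside <tiff:NAME>...</tiff:NAME>, or None if absent."""
    pattern = r"<tiff:{0}>\s*([^<]+?)\s*</tiff:{0}>".format(re.escape(name))
    match = re.search(pattern, xmp)
    return match.group(1) if match else None

def image_geometry(xmp):
    """Return (width_px, height_px, x_resolution) from the tiff: fields."""
    width = int(tiff_field(xmp, "ImageWidth"))
    height = int(tiff_field(xmp, "ImageLength"))
    # XResolution is stored as a fraction, e.g. "3000000/10000"
    numerator, denominator = tiff_field(xmp, "XResolution").split("/")
    return width, height, int(numerator) / int(denominator)

# A few elements copied from the pdfinfo -meta output above
sample = """
<tiff:ImageWidth>17143</tiff:ImageWidth>
<tiff:ImageLength>13628</tiff:ImageLength>
<tiff:XResolution>3000000/10000</tiff:XResolution>
<tiff:ResolutionUnit>2</tiff:ResolutionUnit>
"""
print(image_geometry(sample))  # (17143, 13628, 300.0)
```

In the TIFF tag set, `ResolutionUnit` 2 means inches, so this scan is 300 pixels per inch.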
Ok, so why is it that when I run `pdftops` to convert from PDF to PS, followed by ImageMagick’s `convert` from PS to PNG, I get an image that’s only 4115×3271 pixels? I want the full resolution!
Warning: this snippet gives the wrong result (a low-resolution image)
$ pdftops 01.pdf 01.ps
$ convert 01.ps 01.png
$ identify 01.png
01.png PNG 4115x3271 4115x3271+0+0 DirectClass 24mb
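A likely explanation for those numbers: the page size is measured in points (72 per inch), and ImageMagick rasterizes PostScript at 72 DPI unless you give it a `-density`, so one point becomes one pixel. The arithmetic below is a small sketch of that reasoning using the 300 pixels-per-inch value from the XMP metadata. Rounding up is my assumption, chosen because it matches what `identify` reports.

```python
import math

POINTS_PER_INCH = 72  # PostScript/PDF points per inch

def page_points(pixels, dpi):
    """Physical page size in points for an image of `pixels` at `dpi`."""
    return pixels * POINTS_PER_INCH / dpi

def rendered_pixels(points, density=72):
    """Pixels produced when a page of `points` is rasterized at `density` DPI.

    Rounds up to a whole pixel (an assumption matching the observed output);
    the inner round() only guards against floating-point noise.
    """
    exact = points * density / POINTS_PER_INCH
    return math.ceil(round(exact, 6))

width_pt = page_points(17143, 300)
height_pt = page_points(13628, 300)
print(round(width_pt, 2), round(height_pt, 2))                # 4114.32 3270.72
print(rendered_pixels(width_pt), rendered_pixels(height_pt))  # 4115 3271
```

So the page size reported by `pdfinfo` and the small PNG are both the same scan measured at 72 units per inch instead of 300.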
I must be missing something. … And so I was: I overlooked `pdfimages`.
Proper conversion found here
$ pdfimages -j 01.pdf img
$ ls
01.pdf img-000.ppm
$ identify img-000.ppm
img-000.ppm PNM 17143x13628 17143x13628+0+0 DirectClass 6.7e+02mb 4.110u 0:05
Lookie there: full resolution… in a .PPM file? Who cares, as long as ImageMagick can handle it!
From PPM to PNG (like I wanted)
$ convert img-000.ppm img-000.png
$ identify img-000.png
img-000.png PNG 17143x13628 17143x13628+0+0 DirectClass 3.5e+02mb 12.540u 0:14
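Since I have a whole pile of these PDFs, here is a rough Python sketch that runs the same two steps per file. It assumes `pdfimages` and ImageMagick’s `convert` are on the PATH (newer ImageMagick installs may call it `magick` instead), and that extracted files are named `<prefix>-NNN.<ext>` as in the listing above. The function name `pdf_to_png` and the `run` parameter are invented for this example.

```python
import subprocess
from pathlib import Path

def pdf_to_png(pdf, outdir, run=subprocess.run):
    """Extract the embedded images from one PDF and convert each to PNG.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real tools installed.
    """
    pdf, outdir = Path(pdf), Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    prefix = outdir / pdf.stem  # e.g. out/01 -> out/01-000.ppm

    # Step 1: same as `pdfimages -j 01.pdf img`, with a per-PDF prefix
    run(["pdfimages", "-j", str(pdf), str(prefix)], check=True)

    # Step 2: same as `convert img-000.ppm img-000.png`, once per image
    pngs = []
    for raw in sorted(outdir.glob(pdf.stem + "-*")):
        if raw.suffix == ".png":
            continue  # skip PNGs left over from an earlier run
        png = raw.with_suffix(".png")
        run(["convert", str(raw), str(png)], check=True)
        pngs.append(png)
    return pngs

# Usage (hypothetical paths):
#   for pdf in sorted(Path("scans").glob("*.pdf")):
#       print(pdf_to_png(pdf, "png_out"))
```

The intermediate .ppm files are left in place. At several hundred megabytes each, you will probably want to delete them once the PNGs look right.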
And there you have it: a stupid PDF converted to an awesome full-resolution PNG.
Thank you, come again.
Reference: http://stefaanlippens.net/pdf2ps_vs_pdftops
