Improving PyPDF2 with PDFtk

PyPDF2 (forked from pyPdf) is wonderful. I use it a fair bit in my job, mainly for chopping up PDFs and re-assembling the pages in a different order. It does sometimes have difficulty, though, with non-standard PDFs that seem fine in other programs, which can be frustrating.

The error that I’ve been battling with today, on some PDFs provided by a client, was:

PyPDF2.utils.PdfReadError: EOF marker not found

I managed to find a workaround using PDFtk to fix the PDF in memory at the first sign of any trouble. It works well so far, so in case anyone else is having similar issues I thought I’d write it up.

So here’s how I was opening PDF files before.

from PyPDF2 import PdfFileReader
from cStringIO import StringIO

input_path = 'c:/test_in.pdf'

with open(input_path, 'rb') as input_file:
    input_buffer = StringIO(input_file.read())

input_pdf = PdfFileReader(input_buffer)

At that point you’re free to do whatever you want with input_pdf, provided of course that it loaded without issue. I load the file into a StringIO object first for speed: the program this comes from does lots of things with the file, and working from memory made things much faster.

So to work around the EOF problem I added a decompress_pdf function that gets called if PyPDF2 fails to parse the PDF. It takes the data from the StringIO and pipes it to the stdin of a PDFtk process, which simply runs PDFtk’s uncompress operation on it. The repaired PDF is read back from stdout and returned as a new StringIO, and things hopefully carry on as if nothing happened.

from PyPDF2 import PdfFileReader, utils
from cStringIO import StringIO
import subprocess

input_path = 'c:/test_in.pdf'

def decompress_pdf(temp_buffer):
    temp_buffer.seek(0)  # Make sure we're at the start of the file.

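    # A StringIO has no file descriptor, so it cannot be handed to Popen as
    # stdin directly; feed its contents through a pipe instead.
    process = subprocess.Popen(['pdftk.exe',
                                '-',  # Read from stdin.
                                'output',
                                '-',  # Write to stdout.
                                'uncompress'],
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate(temp_buffer.read())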

    return StringIO(stdout)

with open(input_path, 'rb') as input_file:
    input_buffer = StringIO(input_file.read())

try:
    input_pdf = PdfFileReader(input_buffer)
except utils.PdfReadError:
    # Pass the in-memory buffer; input_file is already closed at this point.
    input_pdf = PdfFileReader(decompress_pdf(input_buffer))

The problem I was seeing seemed to be caused by invalid characters appearing after the %%EOF marker in the PDF. PDFtk seems better at coping with this and spits out a valid PDF when the uncompress operation is used.
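To make that diagnosis a bit more concrete, here is a minimal sketch of what "junk after %%EOF" looks like at the byte level and how it could be trimmed in pure Python. This is not what PDFtk does internally, and I haven’t checked that it rescues the client’s files; the trim_after_eof name and the sample bytes are made up for illustration, and it assumes the junk itself doesn’t contain another %%EOF.

```python
EOF_MARKER = b'%%EOF'


def trim_after_eof(pdf_bytes):
    """Return pdf_bytes cut off right after the last %%EOF marker.

    Hypothetical helper: it only drops trailing bytes and does not repair
    any other damage in the file.
    """
    position = pdf_bytes.rfind(EOF_MARKER)
    if position == -1:
        raise ValueError('no %%EOF marker found')
    # Keep everything up to and including the marker, then end the line.
    return pdf_bytes[:position + len(EOF_MARKER)] + b'\n'


# Made-up sample: a stand-in for a PDF with stray bytes after the marker.
sample = b'%PDF-1.4\n(body)\n%%EOF\x00\x00garbage'
cleaned = trim_after_eof(sample)
```

If something like this turned out to be enough, the cleaned bytes could be wrapped in a StringIO and handed to PdfFileReader without calling an external program, but PDFtk rewrites the whole file, so it may well fix things this does not.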

Of course, more error detection would be good in case parsing still fails, but this worked for me today and made me happy.
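As a rough idea of what that extra error detection might look like, here is a sketch that checks the exit code and the output of the external process instead of trusting whatever comes back on stdout. The RepairError, pipe_through and PDFTK_UNCOMPRESS names are mine and purely illustrative; the PDFtk arguments are the same ones used above, and I’m assuming PDFtk exits with a non-zero code when it fails.

```python
import subprocess


class RepairError(Exception):
    """Hypothetical exception for a failed external repair step."""


# Same arguments as in decompress_pdf: read stdin, write stdout, uncompress.
PDFTK_UNCOMPRESS = ['pdftk', '-', 'output', '-', 'uncompress']


def pipe_through(command, data):
    """Send data to command on stdin and return what it writes to stdout.

    Raises RepairError if the command exits with a non-zero code or
    produces no output at all.
    """
    process = subprocess.Popen(command,
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate(data)

    if process.returncode != 0:
        message = stderr.decode('utf-8', 'replace').strip()
        raise RepairError('%r exited with code %d: %s'
                          % (command[0], process.returncode, message))
    if not stdout:
        raise RepairError('%r produced no output' % (command[0],))
    return stdout
```

With this in place, decompress_pdf could return StringIO(pipe_through(PDFTK_UNCOMPRESS, temp_buffer.read())), and the second PdfFileReader call could sit in its own try block so that a file PDFtk can’t rescue gets reported rather than crashing the whole run.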
