Process PDFs fast with PyPDF2 and Pdfminer3k

Published on March 11th, 2018

Getting text out of PDF files can be an absolute pain. Fortunately, the quality of scans is getting better, which makes parsing them a little easier. In this post we show you two Python packages for working with PDF files. Neither of them is perfect, but you can get decent results with both. In our experience, PyPDF2 is faster and gives better output than pdfminer3k. However, pdfminer3k seems to be better at reading some PDF files in which PyPDF2 doesn’t recognize any text at all. The results depend heavily on the PDF files you are trying to parse, so you might want to try both packages. Both can be installed with ‘pip’:

pip install pdfminer3k
pip install PyPDF2
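
Because results vary so much from file to file, one option is to try one package first and fall back to the other when it returns no text. The sketch below is only an illustration of that idea: `extract_with_fallback` is a hypothetical helper, not part of either package, and it takes the extractors as plain functions so it does not assume anything about how you call PyPDF2 or pdfminer3k.

```python
def extract_with_fallback(path, extractors):
    """Return (name, text) from the first extractor that yields non-blank text.

    `extractors` is a list of (name, function) pairs. Each function takes a
    file path and returns the extracted text as a string. Hypothetical helper.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Treat a parser failure like "no text found" and try the next one.
            continue
        if text and text.strip():
            return name, text
    # No extractor produced any text.
    return None, ''
```

You would call it with wrappers of your own, for example `extract_with_fallback('sample.pdf', [('PyPDF2', read_with_pypdf2), ('pdfminer3k', read_with_pdfminer)])`, where both function names are placeholders for code you write around each package.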

In both examples, we read the file ‘sample.pdf’ and print its text to the Python console. Make sure ‘sample.pdf’ is in your working directory, or otherwise specify the full path to the file.
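
If you are not sure which directory Python is running in, a quick check before parsing can save some confusion. This is a minimal sketch using only the standard library. The function name `resolve_pdf` is our own and not part of either package.

```python
from pathlib import Path


def resolve_pdf(name):
    """Return the absolute path to `name`, or raise if the file is missing."""
    path = Path(name).expanduser().resolve()
    if not path.is_file():
        # Include the working directory in the message to make the fix obvious.
        raise FileNotFoundError(
            f'{path} not found (working directory: {Path.cwd()})'
        )
    return path
```

Calling `resolve_pdf('sample.pdf')` returns the full path if the file exists, and otherwise raises an error that shows where Python was looking.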

PyPDF2

import PyPDF2

# Specify the PDF file
file = 'sample.pdf'

# Open the file in binary mode and create a reader
pdf_file = open(file, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)

# Read and print the number of pages
number_of_pages = read_pdf.getNumPages()

print(number_of_pages)

# Read and print the first page of the PDF file
page = read_pdf.getPage(0)
page_content = page.extractText()

print(page_content)
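
The example above reads only the first page. To get the text of the whole document, you can loop over the page count. The sketch below uses only the reader calls already shown (`getNumPages()`, `getPage()` and `extractText()`). The function name `extract_all_pages` is our own. As far as we know, PyPDF2 3.0 and later dropped these camelCase names in favour of `PdfReader`, `reader.pages` and `extract_text()`, so this assumes an older release like the one used in this post.

```python
def extract_all_pages(reader):
    """Join the text of every page of a PyPDF2 PdfFileReader-like object."""
    pages = []
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        pages.append(page.extractText())
    # Separate pages with a newline so they do not run together.
    return '\n'.join(pages)


# Example usage (assumes PyPDF2 is imported and 'sample.pdf' exists):
# with open('sample.pdf', 'rb') as pdf_file:
#     print(extract_all_pages(PyPDF2.PdfFileReader(pdf_file)))
```

Opening the file in a `with` block also makes sure it is closed again once the text has been read.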

More information: Documentation, GitHub

Pdfminer3k

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

with open('sample.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ''
    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                text_on_page = lt_obj.get_text()
                text += text_on_page

    print(text)

More information: GitHub

Further reading: How to work with Python and Excel files
