Getting text out of PDF files can be an absolute pain. Fortunately, the quality of scans is getting better, which makes parsing them a little easier. In this post we show you two Python packages for working with PDF files. Neither of them is perfect, but you can get decent results with both. In our experience, PyPDF2 is faster and gives better output than pdfminer3k. However, pdfminer3k seems to be better at reading some PDF files where PyPDF2 doesn’t recognize any text at all. The results depend heavily on the PDF files you are trying to parse, so you might want to try both packages. Both can be installed with ‘pip’:
    pip install pdfminer3k
    pip install PyPDF2

In both examples, we will read the file ‘sample.pdf’ and print its text to the Python console. Make sure ‘sample.pdf’ is in your working directory, or otherwise specify the full path to the file.
PyPDF2
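    import PyPDF2

    # Note: this uses the older camelCase PyPDF2 names. PyPDF2 3.0 removed them
    # in favour of PdfReader, len(reader.pages), reader.pages[n] and extract_text().

    # Specify the PDF file
    file = 'sample.pdf'

    # Open the file in binary mode
    pdf_file = open(file, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)

    # Read and print the number of pages
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)

    # Read and print the first page of the PDF file (pages are zero-indexed)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)

    # Close the file once we are done with it
    pdf_file.close()

More information: Documentation, GitHub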
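The example above prints only the first page. Below is a minimal sketch for collecting the text of every page, assuming PyPDF2 2.x or later, where the reader exposes a `pages` sequence and each page has `extract_text()`. The helper names `extract_pages` and `read_pdf_text` are our own and not part of the library.

```python
def extract_pages(reader):
    """Return the text of every page of a PyPDF2-style reader as a list of strings."""
    texts = []
    for page in reader.pages:
        # Fall back to an empty string when a page yields no text,
        # for example a page that contains only a scanned image.
        texts.append(page.extract_text() or '')
    return texts


def read_pdf_text(path):
    """Open the PDF at `path` and return a list with one string per page."""
    # Imported here so extract_pages() can be reused without PyPDF2 installed.
    # Assumption: PyPDF2 2.x or later, which provides PdfReader.
    from PyPDF2 import PdfReader

    with open(path, 'rb') as pdf_file:
        # Extract while the file is still open, since pages are read lazily.
        return extract_pages(PdfReader(pdf_file))
```

Calling `read_pdf_text('sample.pdf')` should then give you a list you can join or inspect page by page.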
Pdfminer3k
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    with open('sample.pdf', 'rb') as fp:
        # Connect the parser and the document to each other
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        # An empty string means the PDF has no password
        doc.initialize('')

        # Set up the layout analysis pipeline
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        text = ''
        # Process each page contained in the document.
        for page in doc.get_pages():
            interpreter.process_page(page)
            layout = device.get_result()
            for lt_obj in layout:
                # Only text boxes and text lines carry text
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    text_on_page = lt_obj.get_text()
                    text += text_on_page

    print(text)

More information: GitHub
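Because the results depend so much on the file, it can help to try one package first and fall back to the other when it finds no text. The sketch below shows one way to do that. `first_nonempty_text` is a hypothetical helper of ours, and it assumes you have wrapped each example above in a function that takes a path and returns a string.

```python
def first_nonempty_text(path, extractors):
    """Try each (name, function) pair in order.

    Returns (name, text) for the first extractor that yields
    non-whitespace text, or (None, '') if none of them does.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Treat a parser error like "no text found" and try the next one.
            continue
        if text and text.strip():
            return name, text
    return None, ''


# Hypothetical usage, where extract_with_pypdf2 and extract_with_pdfminer
# are your own wrappers around the two examples and each return a string:
#
# name, text = first_nonempty_text('sample.pdf', [
#     ('PyPDF2', extract_with_pypdf2),
#     ('pdfminer3k', extract_with_pdfminer),
# ])
```

Putting PyPDF2 first matches our experience that it is usually the faster of the two; swap the order if your files tend to work better with pdfminer3k.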
Further reading: How to work with Python and Excel files