Wednesday, April 24, 2019

Stack Abuse: Working with PDFs in Python: Reading and Splitting

The PDF Document Format

Today, the Portable Document Format (PDF) is among the most commonly used data formats. Adobe defined the structure of PDF documents in the early 1990s. The idea behind the format is that a transmitted document looks exactly the same for both parties involved in the communication: the creator, author, or sender on one side, and the receiver on the other. PDF is the successor of the PostScript format, and is standardized as ISO 32000-2:2017.

Processing PDF Documents

For Linux, powerful command-line tools such as pdftk and pdfgrep are available. As a developer, though, it is exciting to build your own software in Python on top of PDF libraries that are freely available.

This article is the beginning of a short series that covers these helpful Python libraries. In Part One we will focus on the manipulation of existing PDFs. You will learn how to read and extract the content (both text and images), rotate single pages, and split documents into their individual pages. Part Two will cover adding a watermark based on overlays. Part Three will focus exclusively on writing/creating PDFs, and will also include both deleting and re-combining single pages into a new document.

Tools and Libraries

The range of available solutions for Python-related PDF tools, modules, and libraries is a bit confusing, and it takes a moment to figure out what is what, and which projects are maintained continuously. Based on our research these are the candidates that are up-to-date:

  • PyPDF2: A Python library to extract document information and content, split documents page-by-page, merge documents, crop pages, and add watermarks. PyPDF2 supports both unencrypted and encrypted documents.

  • PDFMiner: Written entirely in Python, and works well with Python 2.4. For Python 3, use the fork PDFMiner.six. Both packages allow you to parse, analyze, and convert PDF documents. This includes support for PDF 1.7 as well as CJK languages (Chinese, Japanese, and Korean), and various font types (Type1, TrueType, Type3, and CID).

  • PDFQuery: It describes itself as "a fast and friendly PDF scraping library" which is implemented as a wrapper around PDFMiner, lxml, and pyquery. Its design aim is "to reliably extract data from sets of PDFs with as little code as possible."

  • tabula-py: It is a simple Python wrapper of tabula-java, which can read tables from PDFs and convert them into Pandas DataFrames. It also enables you to convert a PDF file into a CSV/TSV/JSON file.

  • pdflib for Python: An extension of the Poppler library that offers Python bindings for it. It allows you to parse, analyze, and convert PDF documents. Not to be confused with its commercial counterpart that has the same name.

  • PyFPDF: A library for PDF document generation under Python. Ported from the FPDF PHP library, a well-known PDFlib-extension replacement with many examples, scripts, and derivatives.

  • PDFTables: A commercial service that extracts data from tables contained in PDF documents. It offers an API so that PDFTables can be used as SaaS.

  • PyX - the Python graphics package: PyX is a Python package for the creation of PostScript, PDF, and SVG files. It combines an abstraction of the PostScript drawing model with a TeX/LaTeX interface. Complex tasks like creating 2D and 3D plots in publication-ready quality are built out of these primitives.

  • ReportLab: An ambitious, industrial-strength library largely focused on precise creation of PDF documents. Available freely as an Open Source version as well as a commercial, enhanced version named ReportLab PLUS.

  • PyMuPDF (aka "fitz"): Python bindings for MuPDF, which is a lightweight PDF and XPS viewer. The library can access files in PDF, XPS, OpenXPS, epub, comic and fiction book formats, and it is known for its top performance and high rendering quality.

  • pdfrw: A pure Python-based PDF parser to read and write PDF. It faithfully reproduces vector formats without rasterization. In conjunction with ReportLab, it helps to re-use portions of existing PDFs in new PDFs created with ReportLab.

Library         Used for
PyPDF2          Reading
PyMuPDF         Reading
pdflib          Reading
PDFTables       Reading
tabula-py       Reading
PDFMiner.six    Reading
PDFQuery        Reading
pdfrw           Reading, Writing/Creating
ReportLab       Writing/Creating
PyX             Writing/Creating
PyFPDF          Writing/Creating

Below we will focus on PyPDF2 and PyMuPDF, and explain how to extract text and images in the easiest way possible. To understand how to use PyPDF2, we had to combine the official documentation with many examples from other resources. In contrast, the official PyMuPDF documentation is much clearer, which makes getting started with the library considerably faster.

Extracting Text with PyPDF2

PyPDF2 can be installed as a regular software package, or using pip3 (for Python3). The tests here are based on the package for the upcoming Debian GNU/Linux release 10 "Buster". The name of the Debian package is python3-pypdf2.

Listing 1 first imports the PdfFileReader class. Next, using this class, it opens the document, and extracts the document information using the getDocumentInfo() method, the number of pages using getNumPages(), and the content of the first page.

Please note that PyPDF2 starts counting the pages with 0, and that's why the call pdf.getPage(0) retrieves the first page of the document. Finally, the extracted information is printed to stdout.

Listing 1: Extracting the document information and content.

#!/usr/bin/python

from PyPDF2 import PdfFileReader

pdf_document = "example.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()
    pages = pdf.getNumPages()

    print(info)
    print("number of pages: %i" % pages)

    # page numbers start at 0, so this is the first page
    page1 = pdf.getPage(0)
    print(page1)
    print(page1.extractText())

Fig. 1: Extracted text from a PDF file using PyPDF2

As shown in Figure 1 above, the extracted text is printed as one continuous stream. There are no paragraph or sentence separations. As stated in the PyPDF2 documentation, all text data is returned in the order in which it appears in the content stream of the page, and relying on that order may lead to some surprises. The result mainly depends on the internal structure of the PDF document, and on how the stream of PDF instructions was produced by the PDF writer process.
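
Since the text arrives in content-stream order, some clean-up is usually needed before it can be processed further. The following is a minimal sketch rather than a definitive recipe. It assumes the same PyPDF2 method names as Listing 1 (newer PyPDF2 releases name them differently), and the helper names normalize_whitespace() and extract_all_pages() are made up for illustration. It only collapses stray whitespace; it cannot restore paragraphs that PyPDF2 did not deliver.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def extract_all_pages(reader):
    """Return the cleaned-up text of every page, in page order.

    `reader` is assumed to behave like the PdfFileReader object from
    Listing 1, i.e. to offer getNumPages() and getPage(), with page
    objects that offer extractText().
    """
    texts = []
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)  # page indices start at 0
        texts.append(normalize_whitespace(page.extractText()))
    return texts
```

With the reader from Listing 1, a call such as extract_all_pages(pdf) would yield one cleaned-up string per page, provided it happens while the file is still open.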

Extracting Text with PyMuPDF

PyMuPDF is available from the PyPI website, and you can install the package with the following command in a terminal:

$ pip3 install PyMuPDF

Displaying document information, printing the number of pages, and extracting the text of a PDF document is done in a similar way as with PyPDF2 (see Listing 2). The module to be imported is named fitz, and goes back to the previous name of PyMuPDF.

Listing 2: Extracting content from a PDF document using PyMuPDF.

#!/usr/bin/python

import fitz

pdf_document = "example.pdf"
doc = fitz.open(pdf_document)
print("number of pages: %i" % doc.pageCount)
print(doc.metadata)

# page numbers start at 0, so this loads the first page
page1 = doc.loadPage(0)
page1text = page1.getText("text")
print(page1text)

The nice thing about PyMuPDF is that it keeps the original document structure intact - entire paragraphs with linebreaks are kept as they are in the PDF document (see Figure 2).

Fig. 2: Extracted text data
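
Listing 2 reads only the first page. To collect the text of a whole document, the same calls can be wrapped in a loop. This is a hedged sketch: it relies on the loadPage() and getText() names used in Listing 2 (later PyMuPDF releases rename these methods), and the function names iter_page_texts() and document_text() as well as the "--- page N ---" header line are illustrative choices, not part of PyMuPDF.

```python
def iter_page_texts(doc):
    """Yield (page_number, text) pairs, with page numbers starting at 1.

    `doc` is assumed to behave like the object returned by fitz.open()
    in Listing 2.
    """
    for index in range(len(doc)):
        page = doc.loadPage(index)  # PyMuPDF counts pages from 0
        yield index + 1, page.getText("text")


def document_text(doc):
    """Join the text of all pages, each preceded by a header line."""
    parts = []
    for number, text in iter_page_texts(doc):
        parts.append("--- page %i ---\n" % number)
        # make sure every page ends with a newline before the next header
        parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts)
```

With PyMuPDF installed, print(document_text(fitz.open("example.pdf"))) would print the whole document page by page.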

Extracting Images from PDFs with PyMuPDF

PyMuPDF simplifies extracting images from PDF documents using the method getPageImageList(). Listing 3 is based on an example from the PyMuPDF wiki page, and extracts and saves all the images from the PDF as PNG files on a page-by-page basis. If an image has a CMYK colorspace, it will be converted to RGB, first.

Listing 3: Extracting images.

#!/usr/bin/python

import fitz

pdf_document = fitz.open("file.pdf")  
for current_page in range(len(pdf_document)):  
    for image in pdf_document.getPageImageList(current_page):
        xref = image[0]
        pix = fitz.Pixmap(pdf_document, xref)
        if pix.n < 5:        # this is GRAY or RGB
            pix.writePNG("page%s-%s.png" % (current_page, xref))
        else:                # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("page%s-%s.png" % (current_page, xref))
            pix1 = None
        pix = None

Run on a 400-page PDF, this Python script extracted 117 images in less than 3 seconds, which is amazing. The individual images are stored in PNG format. To keep the original image format and size instead of converting to PNG, have a look at the extended versions of the scripts in the PyMuPDF wiki.

Fig. 3: Extracted images
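
As a rough idea of what keeping the original format can look like, the sketch below uses the document method extractImage(), which in the PyMuPDF versions matching the listings above returns a dictionary with, among others, the keys "ext" (file extension) and "image" (raw bytes). Treat this as an assumption to check against the documentation of your installed version, since later releases rename the method. The function names are hypothetical, and special cases such as images with transparency masks are ignored.

```python
def iter_original_images(doc):
    """Yield (file_name, raw_bytes) for every image, page by page.

    `doc` is assumed to behave like the object returned by fitz.open()
    in Listing 3.
    """
    for page_index in range(len(doc)):
        for image in doc.getPageImageList(page_index):
            xref = image[0]
            info = doc.extractImage(xref)  # assumed keys: "ext", "image"
            name = "page%s-%s.%s" % (page_index, xref, info["ext"])
            yield name, info["image"]


def save_original_images(doc):
    """Write every image to disk without converting it."""
    for name, data in iter_original_images(doc):
        with open(name, "wb") as out:
            out.write(data)
```

Separating the naming from the writing keeps the loop easy to try out without touching the disk.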

Splitting PDFs into Pages with PyPDF2

For this example, both the PdfFileReader and the PdfFileWriter classes first need to be imported. Then we open the PDF file, create a reader object, and loop over all the pages using the reader object's getNumPages method.

Inside the for loop, we create a new instance of PdfFileWriter, which does not contain any pages yet. We then add the current page to our writer object using the pdf_writer.addPage() method. This method accepts a page object, which we get using the PdfFileReader.getPage() method.

The next step is to create a unique filename, which we do by using the original file name plus the word "page", plus the page number. We add 1 to the current page number because PyPDF2 counts the page numbers starting at zero.

Finally, we open the new file in "write binary" mode (mode wb), and use the write() method of the pdf_writer object to save the extracted page to disk.

Listing 4: Splitting a PDF into single pages.

#!/usr/bin/python

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_document = "example.pdf"
pdf = PdfFileReader(pdf_document)

for page in range(pdf.getNumPages()):
    # a fresh, empty writer for every output file
    pdf_writer = PdfFileWriter()
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)

    # page + 1 because PyPDF2 counts pages from zero
    outputFilename = "example-page-{}.pdf".format(page + 1)
    with open(outputFilename, "wb") as out:
        pdf_writer.write(out)

    print("created", outputFilename)

Fig. 4: Splitting a PDF
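
Listing 4 hard-codes the "example" part of the output name. The naming step described above (original file name, the word "page", page number plus one) could be derived from the input path instead. The helper below is only an illustration; page_filename() is a made-up name and not part of PyPDF2.

```python
import os


def page_filename(pdf_path, page_index):
    """Build the output name for one page of a split document.

    `page_index` is the zero-based index used by PyPDF2; the file name
    uses the page number a reader would expect, starting at 1.
    """
    base, extension = os.path.splitext(os.path.basename(pdf_path))
    return "{}-page-{}{}".format(base, page_index + 1, extension)
```

Inside the loop of Listing 4, outputFilename = page_filename(pdf_document, page) would then replace the hard-coded format string.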

Find All Pages Containing Text

This use case is quite a practical one, and works similarly to pdfgrep. Using PyMuPDF, the script reports all the page numbers that contain the given search string. The pages are loaded one after the other, and the searchFor() method detects all occurrences of the search string on each page. In case of a match, a corresponding message is printed to stdout.

Listing 5: Search for a given text.

#!/usr/bin/python

import fitz

filename = "example.pdf"
search_term = "invoice"
pdf_document = fitz.open(filename)

for current_page in range(len(pdf_document)):
    page = pdf_document.loadPage(current_page)
    # searchFor() returns a list of hits, which is empty if there are none
    if page.searchFor(search_term):
        # note: the page number printed here is zero-based
        print("%s found on page %i" % (search_term, current_page))

Figure 5 below shows the search result for the term "Debian GNU/Linux" in a 400-page book.

Fig. 5: Searching a PDF document
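
If the matches are needed for further processing rather than just printed, the loop from Listing 5 can be turned into a function. This is a sketch under the assumption that searchFor() returns a list with one entry per hit, as in the PyMuPDF versions used for the listings above (later releases rename the method). The name find_pages() is hypothetical, and the page numbers it returns start at 1 instead of 0.

```python
def find_pages(doc, search_term):
    """Return (page_number, hit_count) pairs for pages containing the term.

    `doc` is assumed to behave like the object returned by fitz.open()
    in Listing 5. Page numbers in the result start at 1.
    """
    matches = []
    for index in range(len(doc)):
        hits = doc.loadPage(index).searchFor(search_term)
        if hits:
            matches.append((index + 1, len(hits)))
    return matches
```

With PyMuPDF installed, find_pages(fitz.open("example.pdf"), "invoice") would return an empty list if the term does not occur anywhere.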

Conclusion

The methods shown here are quite powerful: a comparatively small number of lines of code is enough to obtain a useful result. More use cases are examined in Part Two (coming soon!), which covers adding a watermark to a PDF.


