Monday, May 25, 2020

Creating and Modifying PDF Files in Python

The PDF, or Portable Document Format, is one of the most common formats for sharing documents over the Internet. PDFs can contain text, images, tables, forms, and rich media like videos and animations, all in a single file.

This abundance of content types can make working with PDFs difficult. There are a lot of different kinds of data to decode when opening a PDF file! Fortunately, the Python ecosystem has some great packages for reading, manipulating, and creating PDF files.

In this tutorial, you’ll learn how to:

  • Read text from a PDF
  • Split a PDF into multiple files
  • Concatenate and merge PDF files
  • Rotate and crop pages in a PDF file
  • Encrypt and decrypt PDF files with passwords
  • Create a PDF file from scratch

Along the way, you’ll have several opportunities to deepen your understanding by following along with the examples.

Extracting Text From a PDF

In this section, you’ll learn how to read a PDF file and extract the text using the PyPDF2 package. Before you can do that, though, you need to install it with pip:

$ python3 -m pip install PyPDF2

Verify the installation by running the following command in your terminal:

$ python3 -m pip show PyPDF2
Name: PyPDF2
Version: 1.26.0
Summary: PDF toolkit
Home-page: http://mstamy2.github.com/PyPDF2
Author: Mathieu Fenniak
Author-email: biziqe@mathieu.fenniak.net
License: UNKNOWN
Location: c:\users\david\python38-32\lib\site-packages
Requires:
Required-by:

Pay particular attention to the version information. At the time of writing, the latest version of PyPDF2 was 1.26.0. If you have IDLE open, then you’ll need to restart it before you can use the PyPDF2 package.
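If you'd rather check the version from inside Python, one option is the standard library's importlib.metadata module, available in Python 3.8 and later. The following is a minimal sketch. The installed_version() helper is a hypothetical name introduced here for illustration and is not part of PyPDF2 or pip:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as 1.26.0 if PyPDF2 is installed, else None
print(installed_version("PyPDF2"))
```

If this prints None, the package isn't installed in the Python environment that ran the code.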

Opening a PDF File

Let’s get started by opening a PDF and reading some information about it. You’ll use the Pride_and_Prejudice.pdf file located in the practice_files/ folder in the companion repository.

Open IDLE’s interactive window and import the PdfFileReader class from the PyPDF2 package:

>>> from PyPDF2 import PdfFileReader

To create a new instance of the PdfFileReader class, you’ll need the path to the PDF file that you want to open. Let’s get that now using the pathlib module:

>>> from pathlib import Path
>>> pdf_path = (
...     Path.home()
...     / "creating-and-modifying-pdfs"
...     / "practice_files"
...     / "Pride_and_Prejudice.pdf"
... )

The pdf_path variable now contains the path to a PDF version of Jane Austen’s Pride and Prejudice.
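Before handing the path to another library, it can help to inspect it. The sketch below rebuilds the same path and looks at a few standard pathlib attributes. It assumes the companion repository sits in your home directory, so .exists() will report False if the folder is somewhere else:

```python
from pathlib import Path

pdf_path = (
    Path.home()
    / "creating-and-modifying-pdfs"
    / "practice_files"
    / "Pride_and_Prejudice.pdf"
)

print(pdf_path.name)         # The final path component (the file name)
print(pdf_path.suffix)       # The file extension, including the dot
print(pdf_path.parent.name)  # The name of the containing folder
print(pdf_path.exists())     # True only if the file is really there
```

Checking .exists() first is a cheap way to catch a misplaced folder before you try to open the file.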

Now create the PdfFileReader instance:

>>> pdf = PdfFileReader(str(pdf_path))

You convert pdf_path to a string because PdfFileReader doesn’t know how to read from a pathlib.Path object.
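As a quick illustration of that conversion, the sketch below uses a shortened, relative path (an assumption made for brevity) and shows that str() produces a plain string. The standard library's os.fspath() is an alternative that gives the same result for a Path object:

```python
import os
from pathlib import Path

# A shortened relative path, used here only for illustration
pdf_path = Path("practice_files") / "Pride_and_Prejudice.pdf"

path_string = str(pdf_path)

print(type(path_string).__name__)          # str
print(os.fspath(pdf_path) == path_string)  # True
```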

Recall from chapter 12, “File Input and Output,” that all open files should be closed before a program terminates. The PdfFileReader object does all of this for you, so you don’t need to worry about opening or closing the PDF file!
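For comparison, here is a minimal sketch of what managing a file yourself looks like. It doesn't use PyPDF2 at all. The sample.bin file is a hypothetical placeholder written to a temporary folder, and its contents are not a valid PDF:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical placeholder file, not a real PDF
    sample_path = Path(tmp_dir) / "sample.bin"
    sample_path.write_bytes(b"%PDF-1.4 placeholder bytes, not a real PDF")

    # The with statement closes the file when the block ends
    with sample_path.open("rb") as stream:
        header = stream.read(8)
        print(stream.closed)  # False

    print(stream.closed)  # True
    print(header)         # b'%PDF-1.4'
```

The with statement guarantees that the file is closed, even if an error occurs while reading it.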

Read the full article at https://realpython.com/creating-modifying-pdf/ »


from Real Python