Daily Python: Stack Abuse: Using borb to Create E-books From Project Gutenberg

The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.

In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).

Specifically, we'll take a look at how to convert a UTF-8 encoded book (from Project Gutenberg) to a PDF document.

Project Gutenberg eBooks may be freely used in the United States because most are not protected by U.S. copyright law. They may not be free of copyright in other countries.

Installing borb

borb can be downloaded from source on GitHub, or installed via pip:

$ pip install borb

Installing unidecode

For this project we will also use unidecode, a wonderful little library that transliterates Unicode text to ASCII. Keep in mind that not every Unicode character can be represented as an ASCII character.

In principle this is a lossy conversion, so there will be some discrepancy every time you convert:

$ pip install unidecode
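
To see what "lossy" means in practice, here is a minimal sketch that uses only the standard library. It is a stand-in, not unidecode itself: the helper name strip_to_ascii is hypothetical, and it simply decomposes accented characters and drops whatever has no ASCII equivalent, whereas unidecode transliterates far more characters.

```python
import unicodedata


def strip_to_ascii(text: str) -> str:
    """Crude ASCII conversion (hypothetical helper, NOT unidecode)."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Whatever is still not ASCII (combining marks, curly quotes, ...) is dropped
    return decomposed.encode("ascii", "ignore").decode("ascii")


original = "“Café” déjà vu"
converted = strip_to_ascii(original)
print(converted)              # Cafe deja vu
print(converted == original)  # False: the accents and quotes cannot be recovered
```

Once the accents and quotes are gone, there is no way to get the original string back from the ASCII result, which is why the conversion is called lossy.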

Creating a PDF Document with borb

Creating a PDF document using borb typically follows the same steps every time:

from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

import typing
import re

from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout

# Create empty Document
pdf = Document()

# Create empty Page
page = Page()

# Add Page to Document
pdf.append_page(page)

# Create PageLayout
layout: PageLayout = SingleColumnLayout(page)

Creating E-books with borb

Note: We'll be dealing with raw-text books. Each book has a different structure and requires a different approach to rendering. Styling is highly subjective and book-dependent, though the general process is the same.

The book we'll be downloading is UTF-8 encoded. Not every font supports every character. In fact, the PDF spec defines 14 standard fonts (which every reader/writer is expected to support), none of which supports the full Unicode range.

So, to make our lives a bit easier, we're going to be using this little utility function to convert a str to ASCII:

from unidecode import unidecode

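def to_ascii(s: str) -> str:
    s_out: str = ""
    for c in s:
        # Map curly quotes (and a stray 'â') to a plain double quote
        if c == '“' or c == '”' or c == 'â':
            s_out += '"'
        else:
            # Let unidecode transliterate everything else
            s_out += unidecode(c)
    return s_out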
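
The article does not explain why to_ascii also maps 'â' to a quote. One plausible reason (an assumption) is mojibake: if UTF-8 bytes are decoded with a single-byte encoding such as Latin-1, each curly quote turns into three characters, the first of which is 'â'. The sketch below reproduces that effect with the standard library only.

```python
# A left curly quote, as it appears in correctly decoded text
curly_quote = "“"

# UTF-8 stores it as three bytes
utf8_bytes = curly_quote.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9c'

# Decoding those bytes as Latin-1 yields one character per byte
garbled = utf8_bytes.decode("latin-1")
print(len(garbled))       # 3
print(garbled[0] == "â")  # True

# Decoding with the right codec restores the original character
print(utf8_bytes.decode("utf-8") == curly_quote)  # True
```

If this is what happens with your download, setting the encoding attribute of the requests response to "utf-8" before reading .text may avoid the garbling in the first place.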

Next, in our main method, we're going to be downloading the UTF-8 book.

In our example, we'll be using "The Mysterious Affair at Styles" by Agatha Christie, which can be easily obtained in raw format from Project Gutenberg:

# Define which ebook to fetch
url = 'https://www.gutenberg.org/files/863/863-0.txt'

# Download text
import requests
txt = requests.get(url).text
print("Downloaded %d bytes of text..." % len(txt))

# Split to lines
lines_of_text: typing.List[str] = re.split('\r\n', txt)
lines_of_text = [to_ascii(x) for x in lines_of_text]

# Debug
print("This ebook contains %d lines... " % len(lines_of_text))

This prints:

Downloaded 361353 bytes of text...
This ebook contains 8892 lines...

The first lines of text are a general header added by Project Gutenberg. We don't really want that in our ebook so we're going to simply delete it, by checking whether a line starts with a certain pattern and slicing it off via the slice notation:

# Skip header
header_offset: int = 0
for i in range(0, len(lines_of_text)):
  if lines_of_text[i].startswith("*** START OF THE PROJECT GUTENBERG EBOOK"):
    header_offset = i + 1
    break
while lines_of_text[header_offset].isspace():
  header_offset += 1
lines_of_text = lines_of_text[header_offset:]
print("The first %d lines are the gutenberg header..." % header_offset)

This prints:

The first 24 lines are the gutenberg header...
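
The header-skipping step can also be wrapped in a small function and tried on a few made-up lines. This is a sketch, not the article's code: skip_header and the sample lines are hypothetical. It tests for blank lines with strip(), because "".isspace() is False and a completely empty line would otherwise not be skipped.

```python
import typing

START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"


def skip_header(lines: typing.List[str], marker: str = START_MARKER) -> typing.List[str]:
    """Return the lines after the marker line and any blank lines following it."""
    offset = 0
    for i, line in enumerate(lines):
        if line.startswith(marker):
            offset = i + 1
            break
    # Skip blank lines; strip() handles both "" and whitespace-only lines
    while offset < len(lines) and not lines[offset].strip():
        offset += 1
    return lines[offset:]


# Made-up sample input, not the real Gutenberg file
sample = [
    "The Project Gutenberg eBook of Example",
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "",
    "   ",
    "CHAPTER I.",
    "It was a dark night.",
]
print(skip_header(sample))  # ['CHAPTER I.', 'It was a dark night.']
```

The bounds check in the while loop also keeps the function from running past the end of the list when everything after the marker is blank.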

Similarly, the last lines of text are just a copyright notice. We'll delete that as well:

# Skip footer
total_lines: int = len(lines_of_text)
footer_offset: int = total_lines
for i in range(0, total_lines):
    if "*** END OF THE PROJECT GUTENBERG EBOOK" in lines_of_text[i]:
        footer_offset = i
        break
lines_of_text = lines_of_text[0:footer_offset]
# Use the length from before slicing, otherwise the count is always 0
print("The last %d lines are the gutenberg footer .." % (total_lines - footer_offset))

With that out of the way, we're going to process the main body of text.

This code took some trial and error, and if you're working with a different book, it will take some trial and error too.

Figuring out when to insert a chapter title, when to start a new paragraph, what the table of contents is, etc. depends on the book as well. This is an opportunity to play around with borb a bit, and try to parse the input yourself with a different book:

from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.canvas.layout.text.heading import Heading
from borb.pdf.canvas.color.color import HexColor
from decimal import Decimal

# Main processing loop
i: int = 0
while i < len(lines_of_text):

    # Collect consecutive non-empty lines into one paragraph
    paragraph_text: str = ""
    while i < len(lines_of_text) and not len(lines_of_text[i]) == 0:
        paragraph_text += lines_of_text[i]
        paragraph_text += " "
        i += 1

    # Empty line
    if len(paragraph_text) == 0:
        i += 1
        continue

    # Space
    if paragraph_text.isspace():
        i += 1
        continue

    # Contains the word 'CHAPTER' multiple times (likely to be table of contents)
    if sum([1 for x in paragraph_text.split(' ') if 'CHAPTER' in x]) > 2:
        i += 1
        continue

    # Debug
    print("Processing line %d / %d" % (i, len(lines_of_text)))

    # Outline: start each chapter on a new page
    if paragraph_text.startswith("CHAPTER"):
        print("Adding Header of %d bytes .." % len(paragraph_text))
        try:
            page = Page()
            pdf.append_page(page)
            layout = SingleColumnLayout(page)
            layout.add(Heading(paragraph_text, font_color=HexColor("13505B"), font_size=Decimal(20)))
        except Exception:
            # Skip headings that cannot be added to the layout
            pass
        continue

    # Default
    try:
        layout.add(Paragraph(paragraph_text))
    except Exception:
        # Skip paragraphs that cannot be added to the layout
        pass

    # Default behaviour
    i += 1
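
The grouping and classification logic in the loop above can be tried without borb. The sketch below is a simplified illustration under assumptions: group_paragraphs and classify are hypothetical helpers, the sample lines are made up, and the rules (blank lines separate paragraphs, more than two 'CHAPTER' words means a table of contents, a leading 'CHAPTER' means a heading) mirror the ones used for this particular book.

```python
import typing


def group_paragraphs(lines: typing.List[str]) -> typing.List[str]:
    """Join consecutive non-blank lines into paragraphs."""
    paragraphs: typing.List[str] = []
    current: typing.List[str] = []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def classify(paragraph: str) -> str:
    """Label a paragraph as 'toc', 'heading' or 'body'."""
    if sum(1 for word in paragraph.split(" ") if "CHAPTER" in word) > 2:
        return "toc"
    if paragraph.startswith("CHAPTER"):
        return "heading"
    return "body"


# Made-up sample input
sample = [
    "CHAPTER I. CHAPTER II. CHAPTER III.",
    "",
    "CHAPTER I.",
    "",
    "First line of a paragraph",
    "second line of the same paragraph.",
]
for paragraph in group_paragraphs(sample):
    print(classify(paragraph), "|", paragraph)
```

Separating the text analysis from the PDF calls like this makes it easier to tune the rules for a different book before any rendering is involved.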

All that's left is to store the final PDF document:

with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)
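
As a quick sanity check after writing the file, you can confirm that it starts with the %PDF- marker that PDF files begin with. The helper name looks_like_pdf is hypothetical, and it only inspects the first bytes; it does not validate the document.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header marker."""
    with open(path, "rb") as file_handle:
        return file_handle.read(5) == b"%PDF-"


# Demonstration on a throwaway file rather than the real output.pdf
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
    tmp_path = tmp.name

result = looks_like_pdf(tmp_path)
print(result)  # True
os.remove(tmp_path)
```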

Conclusion

In this guide you've learned how to process a large piece of text and create a PDF out of it automatically using borb.

Creating books from raw text files is not a standard process, and you'll have to test things out and play around with the loops and the way you treat text to get it right.

from Planet Python

Monday, November 1, 2021
