A while ago I wrote about how to extract text from PDF documents in Python using the PDFMiner library. However, in a recent project I had some trouble using PDFMiner to extract text, possibly because the documents I was working with were scanned PDFs, which store each page as an image rather than as text. In this case the answer is to use OCR-based text extraction, and that’s exactly what the textract library is able to do by making use of the Tesseract OCR engine.
Using textract is extremely straightforward:
import textract
pdffile = "myfile.pdf"
text = textract.process(pdffile, method='tesseract', language='eng')
Et voilà.
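One detail worth knowing: textract.process returns bytes rather than a str, so you will usually want to decode the result. Here is a minimal sketch, assuming UTF-8 output (textract's default encoding). The helper name pdf_to_text and its extractor parameter are my own inventions, not part of textract; the parameter is only there so the function can be tried out without textract installed.

```python
def pdf_to_text(path, extractor=None, encoding="utf-8"):
    """Run OCR-based extraction on `path` and return a decoded string.

    `extractor` is a hypothetical hook: any callable that takes a path and
    returns bytes. When omitted, textract's tesseract method is used.
    """
    if extractor is None:
        import textract  # imported lazily so the helper loads without it

        def extractor(p):
            return textract.process(p, method="tesseract", language="eng")

    raw = extractor(path)
    # textract.process returns bytes; decode to get a normal Python string.
    return raw.decode(encoding, errors="replace").strip()
```

With textract installed, pdf_to_text("myfile.pdf") should give you the same text as the snippet above, just as a regular string.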
However… although using textract is easy, installing it is not.
Installing Textract
Here are my notes for the steps I needed to go through to get textract on my laptop.
I am using a MacBook Pro running macOS Mojave 10.14.5 and I’m using a clean Python virtual environment.
Step 1 Try to install textract using pip. Wait for the error message. If you don’t get one then good for you, otherwise move to Step 2.
Step 2 The error message is probably telling you that you don’t have the swig library.
brew install swig
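If you want to know up front whether the build tools are there, a quick check from Python is possible. This is only a sketch: the function name missing_tools is mine, and including tesseract in the list is my assumption (the tesseract method needs the tesseract binary at run time, even though it is not what triggers this particular pip error).

```python
import shutil


def missing_tools(tools=("swig", "tesseract")):
    """Return the names in `tools` that are not found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these first (e.g. with brew):", ", ".join(missing))
    else:
        print("swig and tesseract are both on the PATH.")
```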
Step 3 Once again you try to pip install textract. The error now looks like this:
deps/sphinxbase/src/libsphinxad/ad_openal.c:43:10: fatal error: 'al.h' file not found
#include <al.h>
^~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
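The build can’t find the OpenAL headers, because on macOS they live inside the OpenAL framework (OpenAL/al.h) rather than at the top of the include path. You need to change the include lines in the sphinxbase code that ships with the pocketsphinx library. I found that the easiest way to do this is to install pocketsphinx from source. First clone the source code: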
git clone --recursive https://github.com/bambocher/pocketsphinx-python/
but before you install the library:
cd pocketsphinx-python/deps/sphinxbase/src/libsphinxad/
and in ad_openal.c change:
#include <al.h>
#include <alc.h>
to
#include <OpenAL/al.h>
#include <OpenAL/alc.h>
then go back to the top of the repository and install the library:
cd ../../../..
python setup.py install
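If you would rather not edit ad_openal.c by hand, the include change in this step can be scripted and run before the install. A rough sketch, assuming the file contains the two include lines exactly as shown above; the function names patch_openal_includes and patch_file are made up for this example.

```python
import re
from pathlib import Path

# Matches `#include <al.h>` and `#include <alc.h>`, but not lines that
# already point into the OpenAL framework directory.
_INCLUDE_RE = re.compile(r"^([ \t]*#include[ \t]*)<(alc?\.h)>", re.MULTILINE)


def patch_openal_includes(source):
    """Return `source` with the bare OpenAL includes prefixed by OpenAL/."""
    return _INCLUDE_RE.sub(r"\1<OpenAL/\2>", source)


def patch_file(path):
    """Patch the file in place; return True if anything was changed."""
    p = Path(path)
    original = p.read_text()
    patched = patch_openal_includes(original)
    if patched != original:
        p.write_text(patched)
    return patched != original
```

From the directory where you cloned the repository, that would be patch_file("pocketsphinx-python/deps/sphinxbase/src/libsphinxad/ad_openal.c"). Running it twice is harmless, since lines that already say OpenAL/ are left alone.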
Step 4 Now we’re ready to install textract. However, if we try to pip install it then it will try to fetch a different version of pocketsphinx and fail again.
To stop it doing that, grab the textract source tarball from here and untar it:
tar -xvzf textract-1.6.1.tar.gz
then go into the requirements directory:
cd textract-1.6.1/requirements/
open the file named python and change:
pocketsphinx==0.1.3
to
pocketsphinx==0.1.15
then go back up to the textract-1.6.1 directory and install textract:
cd ..
python setup.py install
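The version change in the requirements file can also be done with a few lines of Python instead of a text editor. This is just a sketch: repin is a name I made up, and it assumes the requirement sits on its own line in the package==version form shown above.

```python
import re


def repin(requirements_text, package, new_version):
    """Replace the `package==<version>` line with the new pinned version."""
    pattern = re.compile(
        rf"^{re.escape(package)}==\S+[ \t]*$", re.MULTILINE
    )
    # A function is used as the replacement so the new text is inserted
    # literally, with no backslash or group-reference processing.
    return pattern.sub(lambda m: f"{package}=={new_version}", requirements_text)
```

To use it, read textract-1.6.1/requirements/python into a string, pass it through repin(text, "pocketsphinx", "0.1.15"), and write the result back to the same file before running the install.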
from ALL YOUR BASE ARE BELONG TO US