Introduction
Spelling mistakes are common, and most people are used to software indicating if a mistake was made. From autocorrect on our phones, to red underlining in text editors, spell checking is an essential feature for many different products.
The first program to implement spell checking was written in 1971 for the DEC PDP-10. Called SPELL, it was capable of performing only simple comparisons of words and detecting one or two letter differences. As hardware and software advanced, so have spell checkers. Modern spell checkers are capable of handling morphology and using statistics to improve suggestions.
Python offers many modules for this purpose, so a simple spell checker can be written in about 20 minutes.
One of these libraries is TextBlob, a natural language processing library that provides an intuitive API to work with.
In this article we'll take a look at how to implement spelling correction in Python with TextBlob.
Installation
First, we'll need to install TextBlob, since it doesn't come preinstalled. Open up a console and install it using pip:
$ pip install textblob
This should install everything we need for this project. Upon finishing the installation, the console output should include something like:
Successfully installed click-7.1.2 joblib-0.17.0 nltk-3.5 regex-2020.11.13 textblob-0.15.3
TextBlob is built on top of NLTK, so it also comes with the installation.
The correct() Function
The most straightforward way to correct input text is to use the correct() method. The example text we'll be using is a paragraph from Charles Darwin's "On the Origin of Species", which is in the public domain, saved in a file called text.txt.
In addition, we'll add some deliberate spelling mistakes:
As far as I am abl to judg, after long attnding to the sbject, the condiions of lfe apear to act in two ways—directly on the whle organsaton or on certin parts alne and indirectly by afcting the reproducte sstem. Wit respct to te dirct action, we mst bea in mid tht in every cse, as Profesor Weismann hs latly insistd, and as I have inidently shwn in my wrk on "Variatin undr Domesticcation," thcere arae two factrs: namly, the natre of the orgnism and the natture of the condiions. The frmer sems to be much th mre importannt; foor nealy siimilar variations sometimes aris under, as far as we cn juddge, disimilar conditios; annd, on te oter hannd, disssimilar variatioons arise undder conditions which aappear to be nnearly uniiform. The efffects on tthe offspring arre ieither definnite or in definite. They maay be considdered as definnite whhen allc or neearly all thhe ofefspring off inadividuals exnposed tco ceertain conditionas duriing seveal ggenerations aree moodified in te saame maner.
It's full of spelling mistakes, in almost every word. Let's write up a simple script, using TextBlob, to correct these mistakes and print the result back to the console:
from textblob import TextBlob

with open("text.txt", "r") as f:  # Opening the test file with the intention to read
    text = f.read()  # Reading the file

textBlb = TextBlob(text)  # Making our first textblob
textCorrected = textBlb.correct()  # Correcting the text
print(textCorrected)
If you've worked with TextBlob before, this flow will look familiar to you. We've read the file and its contents, and constructed a TextBlob instance by passing the contents to the constructor.
Then, we run the correct() method on that instance to perform spelling correction.
After running the script above, you should get an output similar to:
Is far as I am all to judge, after long attending to the subject, the conditions of life appear to act in two ways—directly on the while organisation or on certain parts alone and indirectly by acting the reproduce system. It respect to te direct action, we must be in mid the in every case, as Professor Weismann he lately insisted, and as I have evidently shown in my work on "Variation under Domesticcation," there are two facts: namely, the nature of the organism and the nature of the conditions. The former seems to be much th are important; for nearly similar variations sometimes arms under, as far as we in judge, similar condition; and, on te other hand, disssimilar variations arise under conditions which appear to be nearly uniform. The effects on the offspring are either definite or in definite. They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in te same manner.
How Correct is TextBlob's Spelling Correction?
As we can see, the text still has some spelling errors. Words like "abl" were supposed to be "able", not "all". Even with these, though, it's still better than the original.
Now comes the question, how much better is it?
The following code snippet is a simple script that tests how good TextBlob is at correcting errors, based on this example:
from textblob import TextBlob

# A function that compares two texts and returns
# the number of matches and differences
def compare(text1, text2):
    l1 = text1.split()
    l2 = text2.split()
    good = 0
    bad = 0
    for i in range(0, len(l1)):
        if l1[i] != l2[i]:
            bad += 1
        else:
            good += 1
    return (good, bad)

# Helper function to calculate the percentage of misspelled words
def percentageOfBad(x):
    return (x[1] / (x[0] + x[1])) * 100
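Before running the helpers on whole files, it can be useful to try the same idea on a short string where the counts are easy to verify by hand. The following is a minimal sketch, not the script used for the results in this article: the snake_case names and the sample strings are our own, and it pairs words with zip, so texts with different word counts are compared only up to the shorter one instead of raising an IndexError.

```python
def compare(text1, text2):
    """Count position-by-position word matches and mismatches."""
    good = bad = 0
    # zip stops at the shorter text, so unequal lengths cannot raise IndexError
    for w1, w2 in zip(text1.split(), text2.split()):
        if w1 == w2:
            good += 1
        else:
            bad += 1
    return (good, bad)


def percentage_of_bad(counts):
    """Share of mismatched words, as a percentage."""
    good, bad = counts
    return bad / (good + bad) * 100


# Hypothetical sample strings, shortened from the example paragraph
typos = "the condiions of lfe apear to act"
original = "the conditions of life appear to act"

result = compare(typos, original)
print(result)  # (4, 3)
print(round(percentage_of_bad(result), 1))  # 42.9
```

Keep in mind that a position-by-position comparison only makes sense when both texts have the same number of words, since a single inserted or dropped word shifts every comparison after it.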
Now, with those two functions, let's run a quick analysis:
with open("test.txt", "r") as f1:  # test.txt contains the same typo-filled text from the last example
    t1 = f1.read()

with open("original.txt", "r") as f2:  # original.txt contains the text from the actual book
    t2 = f2.read()

t3 = TextBlob(t1).correct()

mistakesCompOriginal = compare(t1, t2)
originalCompCorrected = compare(t2, t3)
mistakesCompCorrected = compare(t1, t3)

print("Mistakes compared to original ", mistakesCompOriginal)
print("Original compared to corrected ", originalCompCorrected)
print("Mistakes compared to corrected ", mistakesCompCorrected, "\n")

print("Percentage of mistakes in the test: ", percentageOfBad(mistakesCompOriginal), "%")
print("Percentage of mistakes in the corrected: ", percentageOfBad(originalCompCorrected), "%")
print("Percentage of fixed mistakes: ", percentageOfBad(mistakesCompCorrected), "%", "\n")
Running it will print out:
Mistakes compared to original (126, 194)
Original compared to corrected (269, 51)
Mistakes compared to corrected (145, 175)
Percentage of mistakes in the test: 60.62499999999999 %
Percentage of mistakes in the corrected: 15.937499999999998 %
Percentage of fixed mistakes: 54.6875 %
As we can see, the correct() method managed to get our spelling mistake percentage from 60.6% to 15.9%, which is pretty decent. However, there's a bit of a catch: it changed 54.7% of the words, so why is there still a 15.9% mistake rate?
The answer is overcorrection. Sometimes, it can change a word that is spelled correctly, like the first word in our example text, where "As" was corrected to "Is". Other times, it just doesn't have enough information about the word and the context to tell which word the user was intending to type, so it guesses that it should replace "whle" with "while" instead of "whole".
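To see where the remaining mistakes come from, we can tally what happened to each word instead of only counting matches. The following is a rough sketch under our own assumptions: the classify function and its category names are hypothetical, and the three strings are shortened samples built from words in the example above, aligned word for word.

```python
def classify(original, typos, corrected):
    """Tally what happened to each word position across the three texts."""
    tally = {"fixed": 0, "missed": 0, "overcorrected": 0, "untouched": 0}
    triples = zip(original.split(), typos.split(), corrected.split())
    for want, had, got in triples:
        if had == want:
            # The word was already right: did the corrector leave it alone?
            tally["untouched" if got == want else "overcorrected"] += 1
        else:
            # The word was misspelled: did the corrector recover the intended word?
            # "missed" covers both unchanged typos and wrong replacements.
            tally["fixed" if got == want else "missed"] += 1
    return tally


# Shortened samples built from words in the example paragraph and its output
original = "As far as I am able to judge on the whole"
typos = "As far as I am abl to judg on the whle"
corrected = "Is far as I am all to judge on the while"

tally = classify(original, typos, corrected)
print(tally)
# {'fixed': 1, 'missed': 2, 'overcorrected': 1, 'untouched': 7}
```

Here "As" becoming "Is" lands in the overcorrected bucket, while "abl" and "whle" count as missed because the replacement is a real word but not the intended one.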
There is no perfect spelling corrector because so much of language is contextual, so keep that in mind. In most use cases, there are far fewer mistakes than in our example, so TextBlob should be able to work well enough for the average user.
Training TextBlob with Custom Datasets
What if you want to spellcheck another language which isn't supported by TextBlob out of the box? Or maybe you want to get just a little bit more precise? Well, there might be a way to achieve this. It all comes down to the way spell checking works in TextBlob.
TextBlob uses statistics of word usage in English to make smart suggestions on which words to correct. It keeps these statistics in a file called en-spelling.txt, but it also allows you to make your very own word usage statistics file.
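To make the role of these statistics more concrete, here is a minimal sketch of a frequency-based corrector in the style of Peter Norvig's well-known spelling corrector, which TextBlob's documentation cites as the basis for its approach. This is not TextBlob's actual code: the COUNTS values are made-up toy numbers, the function names are our own, and only candidates one edit away are considered.

```python
import string
from collections import Counter

# Hypothetical toy word counts; a real model would be built from a large corpus
COUNTS = Counter({"whole": 40, "while": 120, "able": 54, "all": 300, "life": 80})


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Pick the most frequent known word among the one-edit candidates."""
    if word in COUNTS:
        return word  # known words are left alone
    candidates = [w for w in edits1(word) if w in COUNTS]
    if not candidates:
        return word  # nothing known nearby, leave the word as-is
    return max(candidates, key=COUNTS.__getitem__)


print(correct("lfe"))   # life
print(correct("whle"))  # while
print(correct("abl"))   # all
```

With these toy counts, "whle" becomes "while" and "abl" becomes "all" simply because those words are more frequent than "whole" and "able". A different set of word counts would shift those choices, which is why the statistics file matters.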
Let's try to make one for our Darwin example. We'll use all the words in "On the Origin of Species" to train. You can use any text, just make sure it has enough words that are relevant to the text you wish to correct.
In our case, the rest of the book will provide great context and additional information that TextBlob would need to be more accurate in the correction.
Let's rewrite the script:
from textblob.en import Spelling
import re

textToLower = ""

with open("originOfSpecies.txt", "r") as f1:  # Open our source file
    text = f1.read()  # Read the file
    textToLower = text.lower()  # Lower all the capital letters

words = re.findall("[a-z]+", textToLower)  # Find all the words and place them into a list
oneString = " ".join(words)  # Join them into one string

pathToFile = "train.txt"  # The path we want to store our stats file at
spelling = Spelling(path = pathToFile)  # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)  # Train
If we look into the train.txt file, we'll see:
a 3389
abdomen 3
aberrant 9
aberration 5
abhorrent 1
abilities 1
ability 4
abjectly 1
able 54
ably 5
abnormal 17
abnormally 2
abodes 2
...
This indicates that the word "a" shows up as a word 3389 times, while "ably" shows up only 5 times. To test out this trained model, we'll use suggest(text) instead of correct(text), which returns a list of word-confidence tuples. The first element in the list will be the word it's most confident about, so we can access it via suggest(text)[0][0].
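To illustrate how a list of word-confidence tuples can come out of a file like this, here is a small sketch that parses a few "word count" lines and ranks candidate words by their share of the total count. It is an assumption about how such confidences could be derived, not a copy of TextBlob's implementation, and load_counts and rank are hypothetical helper names.

```python
# A few lines in the "word count" format shown above (counts from the excerpt)
SAMPLE = """able 54
ably 5
abnormal 17"""


def load_counts(text):
    """Parse 'word count' lines into a dict of word -> int."""
    counts = {}
    for line in text.splitlines():
        word, count = line.split()
        counts[word] = int(count)
    return counts


def rank(candidates, counts):
    """Return (word, confidence) tuples for known candidates, best first."""
    known = [w for w in candidates if w in counts]
    total = sum(counts[w] for w in known)
    scored = [(w, counts[w] / total) for w in known]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


counts = load_counts(SAMPLE)
suggestions = rank(["ably", "able"], counts)
print(suggestions[0][0])  # able
```

Because "able" appears 54 times and "ably" only 5, "able" gets the higher confidence and ends up first in the list, which is the element suggest(text)[0][0] would pick in the same situation.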
Note that this might be slower, so go word by word while spell-checking, as dumping huge amounts of data can result in a crash:
from textblob.en import Spelling

pathToFile = "train.txt"
spelling = Spelling(path = pathToFile)

with open("test.txt", "r") as f:
    text = f.read()

words = text.split()
corrected = []
for word in words:
    corrected.append(spelling.suggest(word)[0][0])  # Spell checking word by word

print(" ".join(corrected))
And now, this will result in:
As far as I am all to judge after long attending to the subject the conditions of life appear to act in two ways—directly on the whole organisation or on certain parts alone and indirectly by acting the reproduce system It respect to the direct action we most be in mid the in every case as Professor Weismann as lately insisted and as I have incidently shown in my work on "Variatin under Domesticcation," there are two facts namely the nature of the organism and the nature of the conditions The former seems to be much th are important for nearly similar variations sometimes arise under as far as we in judge dissimilar conditions and on the other hand dissimilar variations arise under conditions which appear to be nearly uniform The effects on the offspring are either definite or in definite They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in the same manner.
This fixes around 2 out of 3 misspelled words, which is pretty good, considering it runs without much context.
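One side effect visible in the word-by-word output is that most commas, colons and semicolons are gone, presumably because punctuation attached to a token gets treated as part of the misspelling. A possible workaround, sketched below under our own assumptions, is to split each token into leading punctuation, the word itself, and trailing punctuation, and to correct only the word. The correct_token helper, the TOKEN pattern and the toy_correct stand-in are all hypothetical.

```python
import re

# Leading punctuation, the word itself, trailing punctuation
TOKEN = re.compile(r"^(\W*)(.*?)(\W*)$")


def correct_token(token, correct_word):
    """Correct only the word part of a token and reattach its punctuation."""
    lead, core, trail = TOKEN.match(token).groups()
    if not core:
        return token  # pure punctuation, nothing to correct
    return lead + correct_word(core) + trail


# Stand-in corrector for illustration; not a real spelling model
FIXES = {"judg": "judge", "sbject": "subject"}


def toy_correct(word):
    return FIXES.get(word, word)


text = "to judg, after the sbject,"
result = " ".join(correct_token(t, toy_correct) for t in text.split())
print(result)  # to judge, after the subject,
```

With the trained model from this section, the stand-in could be swapped for a function that returns spelling.suggest(word)[0][0], so that the rest of the loop stays the same.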
Conclusion
In this article we used TextBlob to implement a basic spelling corrector, both with the stock prediction model and a custom one.
Correcting man-made spelling errors has become a common task for software developers. Even though it has become easier and more efficient via data mining, many spelling mistakes need context to be corrected.
In conclusion, proofreaders are probably not going to get automated out of work any time soon, though some basic correction can be automated to save time and effort.
from Planet Python