Introduction
In this tutorial, we'll explain how to replace words in text sequences, with Python using the FlashText module, which provides one of the most efficient ways of replacing a large set of words in a textual document.
How Does the FlashText Algorithm Work?
The FlashText module is based on its proprietary algorithm, the FlashText algorithm. In essence, it is based on a Python implementation of the Aho–Corasick algorithm.
The fundemental takeaway of the algorithm is to reduce the time spent finding a large number of keywords in the text, by minimizing the number of times that the text is scanned.
A naive algorithm consists of comparing each keyword to every single word in the whole text - akin to Linear Search. That approach is extremely inefficient, because it needs to scan the entire text document
n
times - wheren
is the number of keywords.
On the other hand, FlashText opts for a different approach, intending to scan the text only once.
The key to the efficiency of the FlashText algorithm is that it stores all of the keywords, paired with corresponding replacement words in a dictionary. Then, instead of scanning text once for every keyword in the dictionary, it scans the text only once. In that one scan over the text, words are matched to the dictionary keys and if present - replaced with the key's value.
How to Install FlashText
Installing FlashText is fairly straightforward, via pip
:
pip install flashtext
How to Use FlashText
Let's first take a look at the FlashText
API and some of the key classes within it.
The KeywordProcessor Class
The main class we'll be working with, which takes care of the processing of keywords is the KeywordProcessor
class. Let's import it directly from FlashText
and initialize it:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
The previous line creates the KeywordProcessor
object that will work in the case-insensitive mode.
Case-insensitive mode will match
abc
withABC
, as well as withaBc
,ABc
, etc.
Case-sensitive mode will only match
aaa
with exactly the same string,aaa
.
Alternatively, we can create a KeywordProcessor
instance in the case-sensitive mode:
keyword_processor= KeywordProcessor(case_sensitive=True)
Defining the Keywords Dictionary
In the FlashText module, we use keywords to define words that need to be replaced. The KeywordProcessor
object contains a dictionary containing all of the defined keywords.
There are two ways of adding keywords to the dictionary: in bulk or one-by-one.
Firstly, let's take a look at how to add keywords one-by-one:
keyword_processor.add_keyword(<keyword>, <replacementWord>)
<replacementWord>
is an optional argument. When not specified, just the<keyword>
is added to the dictionary and there is no way of replacing it with another replacement word.So, if you want to use FlashText to replace words in a text, we highly recommend you to specify the
<replacementWord>
argument.
If we have more than a couple of keywords, adding them one-by-one can be a little time-consuming. An alternative, much more commonly used even for small lists of keywords is to add keywords in bulk:
keyword_dictionary = {
'replacementWord1': ['list', 'of', 'keywords', 'for', 'replacementWord1'],
'replacementWord2': ['list', 'of', 'keywords', 'for', 'replacementWord2'],
...
'replacementWordN': ['list', 'of', 'keywords', 'for', 'replacementWordN']
}
keyword_processor.add_keywords_from_dict(keyword_dictionary )
Each key
in the dictionary is a string keyword. Each value
must be a list. Alternatively, you can provide keywords through a List
:
keyword_processor.add_keywords_from_list(['list', 'of', 'keywords'])
Though, with this approach - you just add the keywords without replacement words. Or, if a text file contains key-value pairs following a key=>value
syntax:
keyword1=>value1
keyword2=>value2
We can import them through the keywords_from_file()
function:
keyword_processor.add_keywords_from_file('keyword_list.txt')
A popular approach, which allows you for the most flexibility and great readability is using a dictionary. It's also the most natural match for the algorithm, given the fact that it all ultimately ends up in a dictionary.
Now, let's take a look at a quick example. Imagine we have a textual document and we want to minimize the usage of synonyms in order to standardize the vocabulary used. In essence, we want to replace all occurrences of words such as awful, terrible and horrible (list of keywords) with the word bad (replacement word), and all occurrences of words such as fine, excellent and great, with the word good.
We would add those keywords
and replacement_words
to the keyword_dictionary
:
keyword_dictionary = {
"bad": ["awful", "terrible", "horrible"],
"good": ["fine", "excellent", "great"]
}
And, finally, add the keyword_dictionary
to the keyword_processor
object:
keyword_processor.add_keywords_from_dict(keyword_dictionary)
Replace Keywords with Replacement Words
Once we've loaded the keywords and their respective replacement words into the KeywordProcessor
instance, we can execute replace_keywords()
function, which scans the provided text, and executes the replacement:
new_text = keywordProcessor.replace_keywords("Passed String")
It parses through the provided text, replacing all of the keywords within it with their matched values, and returns a new string.
Now, we typically don't work with string literals here - but rather with documents. We'll want to open a document, read the lines within it, and pass those in as a string to the replace_keywords()
function.
Note: For really long files, that might not fit into your local machine's memory - you might want to consider reading a file line-by-line.
In any case, let's load in a text file and execute the replace_keywords()
function on the contents:
# Open the long textual document `data.txt`
with open('data.txt', 'r+') as file:
# Load the content from `data.txt` to a variable as a string
content = file.read()
# Replace all desired keywords from `data.txt` and store it in the new variable
new_content = keyword_processor.replace_keywords(content)
# Replace the old content
file.seek(0)
file.truncate()
# Write the alternated content to the original file
file.write(new_content)
So if we feed a text file such as, text.txt
:
The breakfast was terrific! I really loved the eggs, you're a great cook.
With the following keywords and replacement words:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_dictionary = {
"good": ["terrific", "great"],
"eggs": ["hash browns"]
}
keyword_processor.add_keywords_from_dict(keyword_dictionary)
with open('data.txt', 'r+') as file:
content = file.read()
new_content = keyword_processor.replace_keywords(content)
file.seek(0)
file.truncate()
file.write(new_content)
It'd result in an altered text.txt
file:
The breakfast was good! I really loved the hash browns, you're a good cook.
Other Useful Functionalities of the FlashText Module
Let's make a dummy keyword_processor
and keyword_dictionary
to illustrate some of the other useful functionalities of the FlashText
module:
keywordProcessor = KeywordProcessor()
keywordDictionary = {
"bad": ["awful", "terrible", "horrible"],
"good": ["fine", "excellent", "great"]
}
keywordProcessor.add_keywords_from_dict(keywordDictionary)
To get a list of all keywords in the KeywordProcessor
instance, we use the get_all_keywords()
function:
# List all added keywords
print(keywordProcessor.get_all_keywords())
Which results in:
{'awful': 'bad', 'terrible': 'bad', 'horrible': 'bad', 'fine': 'good', 'excellent': 'good', 'great': 'good'}
To check if a keyword is present in the KeywordProcessor
, we can use the in
operator:
'bad' in keywordProcessor
# Output: true
# keyword `bad` is PRESENT in the keywordProcessor
'red' in keywordProcessor
# Output: false
# keyword `red` is NOT PRESENT in the keywordProcessor
'awful' in keywordProcessor
# Output: false
# keyword `awful` is NOT THE KEYWORD in the keywordProcessor
# instead, it IS REPLACEMENT WORD
And to access a replacement_word
based on a certain keyword
:
keywordProcessor['fine']
# Output: 'good'
keywordProcessor['excelent']
# Output: 'good'
keywordProcessor['goood']
# Output: None
# There is no keyword `goood` in the keywordProcessor
And finally, to remove keywords from a KeywordProcessor
, we use the remove_keyword()
function:
keyword_processor.remove_keyword('fine')
# This will remove `fine` from the keywordProcessor
Alternatively, we can specify list or dictionary of keyword-value pairs that we want to remove, and use them to remove specified elements:
# Using a dictionary to remove keywords
keywordProcessor.remove_keywords_from_dict({"bad": ["awful", "terrible"]})
# This will remove keywords `awful` and `terrible` from the keywordProcessor
# Using a list to remove keywords
keywordProcessor.remove_keywords_from_list(["fine", "excellent"])
# This will remove keywords `fine` and `excellent` from the keywordProcessor
FlashText vs Regular Expressions
FlashText was primarily created as an alternative to regular expressions, so it would be useful to compare the two of them. In fact, it was created as a response to a question on StackOverflow.
When comparing the speed of execution - FlashText is the clear winner. It takes around the same time for the same text, with a small and large number of keywords. On the other hand, with regular expressions - the execution time increases proportionally to the number of keywords to be replaced.
As the author of FlashText notes - for large queries, it can take regular expressions days to execute, while FlashText does it in 15 minutes:
Credit: Vikash Singh, author of FlashText, at FreeCodeCamp
Benchmarks have shown that FlashText takes less time than RegEx to find and replace keywords when the number of keywords exceeds 500.
Though, when it comes to the special character matching, FlashText has no chance of beating regular expressions. Even more, FlashText doesn't even have the support for that kind of matching - it can only match plain keywords without any special characters.
Special character matching describes the situations where some characters in the word have a special meaning and can match multiple different characters.
For example, character\d
describes digits from 0 to 9, meaning that the keywordnumber\d
will match wordsnumber0
,number1
,number2
, etc.
FlashText is not designed with special character matching in mind, so it can only match equal string literals. The wordnumber1
can only be matched with the keywordnumber1
and there is no way of using a single keyword to match multiple different words.
Conclusion
As we've already seen, FlashText is a very simple yet powerful tool. It is pretty lightweight, easy to learn, and very time-efficient no matter the number of keywords to be replaced.
As with any other tool, the key is to know what is the best use-case scenario for it. If you have more than 500 keywords to be replaced and those keywords are simple, without any special character matching, there is no reason not to go with the FlashText over regular expressions.
On the other hand, if you have less than 500 keywords or some kind of special character matching, you should probably ditch FlashText and go with good old regular expressions given their flexibility and syntax.
from Planet Python
via read more
No comments:
Post a Comment