Thursday, June 3, 2021

Stack Abuse: Searching and Replacing Words in Python with FlashText

Introduction

In this tutorial, we'll explain how to replace words in text sequences, with Python using the FlashText module, which provides one of the most efficient ways of replacing a large set of words in a textual document.

How Does the FlashText Algorithm Work?

The FlashText module is based on its proprietary algorithm, the FlashText algorithm. In essence, it is based on a Python implementation of the Aho–Corasick algorithm.

The fundemental takeaway of the algorithm is to reduce the time spent finding a large number of keywords in the text, by minimizing the number of times that the text is scanned.

A naive algorithm consists of comparing each keyword to every single word in the whole text - akin to Linear Search. That approach is extremely inefficient, because it needs to scan the entire text document n times - where n is the number of keywords.

On the other hand, FlashText opts for a different approach, intending to scan the text only once.

The key to the efficiency of the FlashText algorithm is that it stores all of the keywords, paired with corresponding replacement words in a dictionary. Then, instead of scanning text once for every keyword in the dictionary, it scans the text only once. In that one scan over the text, words are matched to the dictionary keys and if present - replaced with the key's value.

How to Install FlashText

Installing FlashText is fairly straightforward, via pip:

pip install flashtext

How to Use FlashText

Let's first take a look at the FlashText API and some of the key classes within it.

The KeywordProcessor Class

The main class we'll be working with, which takes care of the processing of keywords is the KeywordProcessor class. Let's import it directly from FlashText and initialize it:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

The previous line creates the KeywordProcessor object that will work in the case-insensitive mode.

Case-insensitive mode will match abc with ABC, as well as with aBc, ABc, etc.

Case-sensitive mode will only match aaa with exactly the same string, aaa.

Alternatively, we can create a KeywordProcessor instance in the case-sensitive mode:

keyword_processor= KeywordProcessor(case_sensitive=True)

Defining the Keywords Dictionary

In the FlashText module, we use keywords to define words that need to be replaced. The KeywordProcessor object contains a dictionary containing all of the defined keywords.

There are two ways of adding keywords to the dictionary: in bulk or one-by-one.

Firstly, let's take a look at how to add keywords one-by-one:

keyword_processor.add_keyword(<keyword>, <replacementWord>)

<replacementWord> is an optional argument. When not specified, just the <keyword> is added to the dictionary and there is no way of replacing it with another replacement word.

So, if you want to use FlashText to replace words in a text, we highly recommend you to specify the <replacementWord> argument.

If we have more than a couple of keywords, adding them one-by-one can be a little time-consuming. An alternative, much more commonly used even for small lists of keywords is to add keywords in bulk:

keyword_dictionary = {
    'replacementWord1': ['list', 'of', 'keywords', 'for', 'replacementWord1'],
    'replacementWord2': ['list', 'of', 'keywords', 'for', 'replacementWord2'],
    ...
    'replacementWordN': ['list', 'of', 'keywords', 'for', 'replacementWordN']
}

keyword_processor.add_keywords_from_dict(keyword_dictionary )

Each key in the dictionary is a string keyword. Each value must be a list. Alternatively, you can provide keywords through a List:

keyword_processor.add_keywords_from_list(['list', 'of', 'keywords'])

Though, with this approach - you just add the keywords without replacement words. Or, if a text file contains key-value pairs following a key=>value syntax:

keyword1=>value1
keyword2=>value2

We can import them through the keywords_from_file() function:

keyword_processor.add_keywords_from_file('keyword_list.txt')

A popular approach, which allows you for the most flexibility and great readability is using a dictionary. It's also the most natural match for the algorithm, given the fact that it all ultimately ends up in a dictionary.

Now, let's take a look at a quick example. Imagine we have a textual document and we want to minimize the usage of synonyms in order to standardize the vocabulary used. In essence, we want to replace all occurrences of words such as awful, terrible and horrible (list of keywords) with the word bad (replacement word), and all occurrences of words such as fine, excellent and great, with the word good.

We would add those keywords and replacement_words to the keyword_dictionary:

keyword_dictionary = {
    "bad": ["awful", "terrible", "horrible"],
    "good": ["fine", "excellent", "great"]
}

And, finally, add the keyword_dictionary to the keyword_processor object:

keyword_processor.add_keywords_from_dict(keyword_dictionary)

Replace Keywords with Replacement Words

Once we've loaded the keywords and their respective replacement words into the KeywordProcessor instance, we can execute replace_keywords() function, which scans the provided text, and executes the replacement:

new_text = keywordProcessor.replace_keywords("Passed String")

It parses through the provided text, replacing all of the keywords within it with their matched values, and returns a new string.

Now, we typically don't work with string literals here - but rather with documents. We'll want to open a document, read the lines within it, and pass those in as a string to the replace_keywords() function.

Note: For really long files, that might not fit into your local machine's memory - you might want to consider reading a file line-by-line.

In any case, let's load in a text file and execute the replace_keywords() function on the contents:

# Open the long textual document `data.txt`
with open('data.txt', 'r+') as file:
    # Load the content from `data.txt` to a variable as a string
    content = file.read()
    # Replace all desired keywords from `data.txt` and store it in the new variable
    new_content = keyword_processor.replace_keywords(content)
    # Replace the old content
    file.seek(0)
    file.truncate()
    # Write the alternated content to the original file 
    file.write(new_content)

So if we feed a text file such as, text.txt:

The breakfast was terrific! I really loved the eggs, you're a great cook.

With the following keywords and replacement words:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

keyword_dictionary = {
    "good": ["terrific", "great"],
    "eggs": ["hash browns"]
}

keyword_processor.add_keywords_from_dict(keyword_dictionary)

with open('data.txt', 'r+') as file:
    content = file.read()
    new_content = keyword_processor.replace_keywords(content)
    file.seek(0)
    file.truncate()
    file.write(new_content)

It'd result in an altered text.txt file:

The breakfast was good! I really loved the hash browns, you're a good cook.

Other Useful Functionalities of the FlashText Module

Let's make a dummy keyword_processor and keyword_dictionary to illustrate some of the other useful functionalities of the FlashText module:

keywordProcessor = KeywordProcessor()
keywordDictionary = {
    "bad": ["awful", "terrible", "horrible"],
    "good": ["fine", "excellent", "great"]
}
keywordProcessor.add_keywords_from_dict(keywordDictionary)

To get a list of all keywords in the KeywordProcessor instance, we use the get_all_keywords() function:

# List all added keywords
print(keywordProcessor.get_all_keywords())

Which results in:

{'awful': 'bad', 'terrible': 'bad', 'horrible': 'bad', 'fine': 'good', 'excellent': 'good', 'great': 'good'}

To check if a keyword is present in the KeywordProcessor, we can use the in operator:

'bad' in keywordProcessor
# Output: true
# keyword `bad` is PRESENT in the keywordProcessor

'red' in keywordProcessor
# Output: false
# keyword `red` is NOT PRESENT in the keywordProcessor

'awful' in keywordProcessor
# Output: false
# keyword `awful` is NOT THE KEYWORD in the keywordProcessor
# instead, it IS REPLACEMENT WORD

And to access a replacement_word based on a certain keyword:

keywordProcessor['fine']
# Output: 'good'

keywordProcessor['excelent']
# Output: 'good'

keywordProcessor['goood']
# Output: None
# There is no keyword `goood` in the keywordProcessor

And finally, to remove keywords from a KeywordProcessor, we use the remove_keyword() function:

keyword_processor.remove_keyword('fine')
# This will remove `fine` from the keywordProcessor

​Alternatively, we can specify list or dictionary of keyword-value pairs that we want to remove, and use them to remove specified elements:

# Using a dictionary to remove keywords
keywordProcessor.remove_keywords_from_dict({"bad": ["awful", "terrible"]})
# This will remove keywords `awful` and `terrible` from the keywordProcessor

# Using a list to remove keywords
keywordProcessor.remove_keywords_from_list(["fine", "excellent"])
# This will remove keywords `fine` and `excellent` from the keywordProcessor

FlashText vs Regular Expressions

FlashText was primarily created as an alternative to regular expressions, so it would be useful to compare the two of them. In fact, it was created as a response to a question on StackOverflow.

When comparing the speed of execution - FlashText is the clear winner. It takes around the same time for the same text, with a small and large number of keywords. On the other hand, with regular expressions - the execution time increases proportionally to the number of keywords to be replaced.

As the author of FlashText notes - for large queries, it can take regular expressions days to execute, while FlashText does it in 15 minutes:

replace-benchmark

Credit: Vikash Singh, author of FlashText, at FreeCodeCamp

Benchmarks have shown that FlashText takes less time than RegEx to find and replace keywords when the number of keywords exceeds 500.

Though, when it comes to the special character matching, FlashText has no chance of beating regular expressions. Even more, FlashText doesn't even have the support for that kind of matching - it can only match plain keywords without any special characters.

Special character matching describes the situations where some characters in the word have a special meaning and can match multiple different characters.

For example, character \d describes digits from 0 to 9, meaning that the keyword number\d will match words number0, number1, number2, etc.

FlashText is not designed with special character matching in mind, so it can only match equal string literals. The word number1 can only be matched with the keyword number1 and there is no way of using a single keyword to match multiple different words.

Conclusion

As we've already seen, FlashText is a very simple yet powerful tool. It is pretty lightweight, easy to learn, and very time-efficient no matter the number of keywords to be replaced.

As with any other tool, the key is to know what is the best use-case scenario for it. If you have more than 500 keywords to be replaced and those keywords are simple, without any special character matching, there is no reason not to go with the FlashText over regular expressions.

On the other hand, if you have less than 500 keywords or some kind of special character matching, you should probably ditch FlashText and go with good old regular expressions given their flexibility and syntax.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...