Introduction
The constant growth of data on the Internet creates a demand for tools that can process textual information quickly, with little effort from the ordinary user.
Moreover, it's important that such a text analysis tool can handle both low-level and high-level NLP tasks, such as counting word frequencies, performing sentiment analysis, or detecting patterns in the relationships between words.
TextBlob is a great lightweight library for a wide variety of NLP tasks.
In this tutorial, we'll shed some light on how to perform N-gram detection in Python using TextBlob.
What are N-Grams?
N-grams represent a contiguous sequence of N items from a given text. Broadly speaking, such items do not necessarily have to be words; they can also be phonemes, syllables or letters, depending on what you'd like to accomplish.
In Natural Language Processing, however, N-grams most commonly refer to sequences of words, where N stands for the number of words you are looking for.
The following types of N-grams are usually distinguished:
- Unigram - An N-gram with just one string inside (for example, a single word such as YouTube or TikTok from a given sentence, e.g. YouTube is launching a new short-form video format that seems an awful lot like TikTok).
- 2-gram or Bigram - Typically a combination of two strings or words that appear next to each other in a document: short-form video or video format would likely be results of a bigram search over a certain corpus of texts (and not format video or video short-form, since the word order is preserved).
- 3-gram or Trigram - An N-gram containing exactly three elements that are processed together (e.g. short-form video format or new short-form video), and so on. A short code sketch after this list illustrates all three types.
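To make these definitions concrete, here is a minimal, library-free sketch that slices a whitespace-tokenized sentence into unigrams, bigrams and trigrams (a real tokenizer would also handle punctuation and casing; the helper name extract_ngrams is just for illustration):
# Library-free sketch: slide a window of size n over the list of words
def extract_ngrams(text, n):
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "YouTube is launching a new short-form video format"

print(extract_ngrams(sentence, 1))  # Unigrams: ('YouTube',), ('is',), ...
print(extract_ngrams(sentence, 2))  # Bigrams: ('YouTube', 'is'), ('is', 'launching'), ...
print(extract_ngrams(sentence, 3))  # Trigrams: ('YouTube', 'is', 'launching'), ...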
N-grams found their primary application in probabilistic language models, where they are used to estimate the probability of the next item in a word sequence.
This approach to language modeling assumes a tight relationship between the positions of the elements in a string, calculating the occurrence of the next word with respect to the ones before it. In particular, an N-gram model predicts each word based on the N-1 words that precede it.
For instance, a trigram model (with N = 3) will predict the next word in a string based on the preceding N-1 = 2 words.
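To illustrate the counting behind such a model, here is a small hedged sketch (plain Python, not part of TextBlob) that estimates bigram probabilities from raw counts in a toy corpus:
from collections import Counter

# Toy corpus for estimating P(next word | previous word)
toy_corpus = "technology is best when it brings people together and technology is everywhere".split()

bigram_counts = Counter(zip(toy_corpus, toy_corpus[1:]))
unigram_counts = Counter(toy_corpus)

def bigram_probability(previous, word):
    # Maximum-likelihood estimate: count(previous, word) / count(previous)
    return bigram_counts[(previous, word)] / unigram_counts[previous]

print(bigram_probability("technology", "is"))  # 1.0 - "technology" is always followed by "is"
print(bigram_probability("is", "best"))        # 0.5 - "is" precedes "best" in half of its occurrences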
Another industry application of N-gram models is plagiarism detection, where the N-grams obtained from two different texts are compared with each other to figure out the degree of similarity between the analyzed documents.
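As a rough illustration of that use case (not a production plagiarism detector), the overlap between two texts can be scored with a Jaccard similarity over their N-gram sets:
def ngram_set(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(text_a, text_b, n=3):
    # Share of N-grams the two texts have in common
    a, b = ngram_set(text_a, n), ngram_set(text_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

original = "technology is best when it brings people together"
suspect = "technology is best when it connects people together"

print(jaccard_similarity(original, suspect))  # ~0.33 - a third of the trigrams overlap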
N-gram Detection in Python Using TextBlob
Analysis of a Sentence
To start detecting N-grams in Python, you will first have to install the TextBlob package. Note that this library is applicable for both Python 2 and Python 3.
We'll also want to download the required text corpora for it to work with:
$ pip install -U textblob
$ python -m textblob.download_corpora
Once the environment is set up, you are ready to load the package and compute N-grams in a sample sentence. To begin with, we will look at the N-grams in a quote from Matt Mullenweg: Technology is best when it brings people together.
Let's get started:
from textblob import TextBlob
# Sample sentence for N-gram detection
sentence = "Technology is best when it brings people together"
We've created a sentence string containing the sentence we want to analyze. We've then passed that string to the TextBlob constructor, injecting it into the TextBlob instance that we'll run operations on:
ngram_object = TextBlob(sentence)
Now, let's run N-gram detection. For starters, let's do 2-gram detection. This is specified in the argument list of the ngrams() function call:
ngrams = ngram_object.ngrams(n=2) # Computing Bigrams
print(ngrams)
The ngrams() function returns a list of tuples of n successive words. In our sentence, a bigram model will give us the following set of strings:
[WordList(['Technology', 'is']),
WordList(['is', 'best']),
WordList(['best', 'when']),
WordList(['when', 'it']),
WordList(['it', 'brings']),
WordList(['brings', 'people']),
WordList(['people', 'together'])]
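Each WordList in this output behaves like a regular Python list of strings, so if plain strings are more convenient you can join the items back together:
# Convert each WordList into a plain space-separated string
bigram_strings = [" ".join(gram) for gram in ngrams]
print(bigram_strings)
# ['Technology is', 'is best', 'best when', 'when it', 'it brings', 'brings people', 'people together']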
Document Analysis
Despite its simple nature, the TextBlob library also provides a range of advanced features for analysis. More often than not, we aren't working with single sentences for N-gram detection. It's much more common to work with documents, articles or larger corpora.
In our next example, we will use an article from the CNBC news portal regarding Bill Gates.
Let's create a text document and call it something along the lines of Input.txt for the next analysis:
# Opening and reading the `Input.txt` file
with open("Input.txt") as input_file:
    corpus = input_file.read()
Then, as usual, we'll instantiate a TextBlob instance by passing the corpus to the constructor, and run the ngrams() function:
ngram_object = TextBlob(corpus)
trigrams = ngram_object.ngrams(n=3) # Computing Trigrams
print(trigrams)
This will print out the trigrams of the content we've provided. However, note that the output can differ depending on how you handle punctuation marks:
[WordList(['Bill', 'Gates', 'says']),
WordList(['Gates', 'says', 'that']),
WordList(['says', 'that', 'antitrust']),
WordList(['that', 'antitrust', 'regulators']),
WordList(['antitrust', 'regulators', 'should'])
<...>]
In comparison, a bigram analysis of the same article will provide us with a different list:
ngram_object = TextBlob(corpus)
bigrams = ngram_object.ngrams(n=2) # Computing Bigrams
print(bigrams)
A snippet from the output:
[WordList(['Bill', 'Gates']),
WordList(['Gates', 'says']),
WordList(['says', 'that']),
WordList(['that', 'antitrust']),
WordList(['antitrust', 'regulators'])
<...>]
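For a document of this size, the raw list is usually less interesting than the frequency of each N-gram. Assuming the corpus loaded above, a quick way to tally the most common bigrams is collections.Counter:
from collections import Counter

# Count how often each bigram appears in the article
bigram_counts = Counter(" ".join(gram) for gram in ngram_object.ngrams(n=2))

# The five most frequent bigrams in the document
print(bigram_counts.most_common(5))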
Conclusion
N-Grams detection is a simple and common task in a lot of NLP projects. In this article, we've gone over how to perform N-Gram detection in Python using TextBlob.