This is the eighth article in my series on Python for NLP. In the previous article, I explained how Python's TextBlob library can be used to perform a variety of NLP tasks, ranging from tokenization and POS tagging to text classification and sentiment analysis. In this article, we will explore Python's Pattern library, another extremely useful Natural Language Processing library.
The Pattern library is a multipurpose library capable of handling the following tasks:
- Natural Language Processing: Performing tasks such as tokenization, stemming, POS tagging, sentiment analysis, etc.
- Data Mining: It contains APIs to mine data from sites like Twitter, Facebook, Wikipedia, etc.
- Machine Learning: Contains machine learning models such as SVM, KNN, and perceptron, which can be used for classification, regression, and clustering tasks.
In this article, we will see the first two applications of the Pattern library from the above list. We will explore the use of the Pattern Library for NLP by performing tasks such as tokenization, stemming and sentiment analysis. We will also see how the Pattern library can be used for web mining.
Installing the Library
To install the library, you can use the following pip command:
$ pip install pattern
Alternatively, if you are using the Anaconda distribution of Python, you can install the library with the following conda command:
$ conda install -c asmeurer pattern
Pattern Library Functions for NLP
In this section, we will see some of the NLP applications of the Pattern Library.
Tokenizing, POS Tagging, and Chunking
In the NLTK and spaCy libraries, there are separate functions for tokenizing, POS tagging, and finding noun phrases in text documents. In the Pattern library, on the other hand, the all-in-one parse method takes a text string as an input parameter and returns the tokens in the string along with their POS tags.
The parse method also tells us whether a token belongs to a noun phrase or a verb phrase, and whether it is a subject or an object. You can also retrieve lemmatized tokens by setting the lemmata parameter to True. The syntax of the parse method, along with the default values of its parameters, is as follows:
parse(string,
    tokenize=True,     # Split punctuation marks from words?
    tags=True,         # Parse part-of-speech tags? (NN, JJ, ...)
    chunks=True,       # Parse chunks? (NP, VP, PNP, ...)
    relations=False,   # Parse chunk relations? (-SBJ, -OBJ, ...)
    lemmata=False,     # Parse lemmata? (ate => eat)
    encoding='utf-8',  # Input string encoding.
    tagset=None        # Penn Treebank II (default) or UNIVERSAL.
)
Let's see the parse method in action:
from pattern.en import parse
from pattern.en import pprint
pprint(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True))
To use the parse method, you have to import the en module from the pattern library. The en module contains English language NLP functions. If you use the pprint method to print the output of the parse method on the console, you should see the following output:
     WORD   TAG    CHUNK   ROLE   ID   PNP   LEMMA
        I   PRP    NP      SBJ    1    -     i
    drove   VBD    VP      -      1    -     drive
       my   PRP$   NP      OBJ    1    -     my
      car   NN     NP ^    OBJ    1    -     car
       to   TO     -       -      -    -     to
      the   DT     NP      -      -    -     the
 hospital   NN     NP ^    -      -    -     hospital
yesterday   NN     NP ^    -      -    -     yesterday
In the output, you can see the tokenized words along with their POS tag, the chunk that the tokens belong to, and the role. You can also see the lemmatized form of the tokens.
If you call the split method on the object returned by the parse method, the output will be a list of sentences, where each sentence is a list of tokens and each token is a list containing the word followed by its tags.
For instance, look at the following script:
from pattern.en import parse
from pattern.en import pprint
print(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True).split())
The output of the script above looks like this:
[[['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'], ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'], ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'], ['car', 'NN', 'I-NP', 'O', 'NP-OBJ-1', 'car'], ['to', 'TO', 'O', 'O', 'O', 'to'], ['the', 'DT', 'B-NP', 'O', 'O', 'the'], ['hospital', 'NN', 'I-NP', 'O', 'O', 'hospital'], ['yesterday', 'NN', 'I-NP', 'O', 'O', 'yesterday']]]
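The nested list returned by split can be processed with ordinary list operations. The following is only a sketch: it does not call Pattern, but pastes in the output shown above, and it assumes the field order word, POS tag, chunk tag, PNP tag, relation, lemma, which is inferred from that output. The helper name noun_phrases is made up for illustration.

```python
# Output of parse(...).split() copied from the example above:
# a list of sentences, each a list of tokens, each a list of fields.
parsed = [[['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'],
           ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'],
           ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'],
           ['car', 'NN', 'I-NP', 'O', 'NP-OBJ-1', 'car'],
           ['to', 'TO', 'O', 'O', 'O', 'to'],
           ['the', 'DT', 'B-NP', 'O', 'O', 'the'],
           ['hospital', 'NN', 'I-NP', 'O', 'O', 'hospital'],
           ['yesterday', 'NN', 'I-NP', 'O', 'O', 'yesterday']]]


def noun_phrases(sentence):
    """Group words into noun phrases using the B-NP / I-NP chunk tags."""
    phrases, current = [], []
    for word, pos, chunk, pnp, relation, lemma in sentence:
        if chunk == 'B-NP':                  # a new noun phrase begins
            if current:
                phrases.append(' '.join(current))
            current = [word]
        elif chunk == 'I-NP' and current:    # continue the open phrase
            current.append(word)
        else:                                # anything else closes the phrase
            if current:
                phrases.append(' '.join(current))
            current = []
    if current:
        phrases.append(' '.join(current))
    return phrases


sentence = parsed[0]
lemmas = [token[-1] for token in sentence]   # the lemma is the last field
print(lemmas)
print(noun_phrases(sentence))
```

The B- prefix marks the first word of a chunk and the I- prefix marks a word inside the same chunk, which is why "my" and "car" end up in one phrase.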
Pluralizing and Singularizing the Tokens
The pluralize and singularize methods are used to convert singular words to plurals and vice versa, respectively.
from pattern.en import pluralize, singularize
print(pluralize('leaf'))
print(singularize('theives'))
The output looks like this:
leaves
theif
Converting Adjective to Comparative and Superlative Degrees
You can retrieve the comparative and superlative degrees of an adjective using the comparative and superlative functions. For instance, the comparative degree of good is better and the superlative degree of good is best. Let's see this in action:
from pattern.en import comparative, superlative
print(comparative('good'))
print(superlative('good'))
Output:
better
best
Finding N-Grams
N-grams are contiguous sequences of "n" words in a sentence. For instance, for the sentence "He goes to hospital", the 2-grams would be (He goes), (goes to) and (to hospital). N-grams can play a crucial role in text classification and language modeling.
In the Pattern library, the ngrams method is used to find all the n-grams in a text string. The first parameter to the ngrams method is the text string. The size of each n-gram, i.e. the number of words it contains, is passed to the n parameter of the method. Look at the following example:
from pattern.en import ngrams
print(ngrams("He goes to hospital", n=2))
Output:
[('He', 'goes'), ('goes', 'to'), ('to', 'hospital')]
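To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not Pattern's implementation: the function name simple_ngrams is made up, and the sketch splits on whitespace only, so it ignores punctuation.

```python
def simple_ngrams(text, n=2):
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    # Zipping n copies of the word list, each shifted by one more position,
    # yields every window of n adjacent words.
    return list(zip(*(words[i:] for i in range(n))))


print(simple_ngrams("He goes to hospital", n=2))
print(simple_ngrams("He goes to hospital", n=3))
```

A sentence with w words yields w - n + 1 n-grams, so the four-word sentence above produces three 2-grams and two 3-grams.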
Finding Sentiments
Sentiment refers to an opinion or feeling towards a certain thing. The Pattern library offers functionality to find sentiment from a text string.
In Pattern, the sentiment function is used to find the polarity (positivity or negativity) of a text along with its subjectivity.
Depending upon the most commonly occurring positive (good, best, excellent, etc.) and negative (bad, awful, pathetic, etc.) adjectives, a sentiment score between -1 and 1 is assigned to the text. This sentiment score is also called the polarity.
In addition to the sentiment score, subjectivity is also returned. The subjectivity value can be between 0 and 1. Subjectivity quantifies how much of the text is personal opinion as opposed to factual information. A higher subjectivity means that the text contains personal opinion rather than factual information.
from pattern.en import sentiment
print(sentiment("This is an excellent movie to watch. I really love it"))
When you run the above script, you should see the following output:
(0.75, 0.8)
The sentence "This is an excellent movie to watch. I really love it" has a sentiment of 0.75, which shows that it is highly positive. Similarly, the subjectivity of 0.8 refers to the fact that the sentence is a personal opinion of the user.
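The general idea of scoring text from a word list can be sketched in a few lines of plain Python. This is a toy illustration only and not how Pattern computes its scores: the lexicon, its values and the function name toy_sentiment are all invented here, so the numbers will not match the output above.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# The scores are invented for illustration and are not Pattern's values.
TOY_LEXICON = {
    'excellent': (1.0, 1.0),
    'love': (0.5, 0.5),
    'awful': (-1.0, 1.0),
    'bad': (-0.75, 0.75),
}


def toy_sentiment(text):
    """Average the scores of the known words; (0.0, 0.0) if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return (0.0, 0.0)
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return (polarity, subjectivity)


print(toy_sentiment("This is an excellent movie to watch. I really love it"))
print(toy_sentiment("The plot was awful"))
```

Words that are not in the lexicon are simply skipped, which is why a purely factual sentence with no scored words comes out as neutral and objective in this sketch.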
Checking if a Statement is a Fact
The modality function from the Pattern library can be used to find the degree of certainty in a text string. The modality function returns a value between -1 and 1. For facts, the modality function returns a value greater than 0.5.
Here is an example of it in action:
from pattern.en import parse, Sentence
from pattern.en import modality
text = "Paris is the capital of France"
sent = parse(text, lemmata=True)
sent = Sentence(sent)
print(modality(sent))
Output:
1.0
In the script above, we first import the parse method along with the Sentence class. On the second line, we import the modality function. The parse method takes text as input and returns a tokenized form of the text, which is then passed to the Sentence class constructor. The modality function takes the Sentence class object and returns the modality of the sentence.
Since the text string "Paris is the capital of France" is a fact, in the output, you will see a value of 1.
Similarly, for a sentence which is not certain, the value returned by the modality function is closer to 0.0. Look at the following script:
text = "I think we can complete this task"
sent = parse(text, lemmata=True)
sent = Sentence(sent)
print(modality(sent))
Output:
0.25
Since the string in the above example is not very certain, the modality of the above string will be 0.25.
Spelling Corrections
The suggest method can be used to check whether a word is spelled correctly. If the word is spelled correctly, the suggest method returns the word itself with a probability of 1.0. Otherwise, it returns the possible corrections for the word along with their probability of correctness.
Look at the following example:
from pattern.en import suggest
print(suggest("Whitle"))
In the script above, we have the word Whitle, which is incorrectly spelled. In the output, you will see possible suggestions for this word.
[('While', 0.6459209419680404), ('White', 0.2968881412952061), ('Title', 0.03280067283431455), ('Whistle', 0.023549201009251473), ('Chile', 0.0008410428931875525)]
According to the suggest method, there is a probability of 0.64 that the word is "While", a probability of 0.29 that it is "White", and so on.
Now let's spell a word correctly:
from pattern.en import suggest
print(suggest("Fracture"))
Output:
[('Fracture', 1.0)]
From the output, you can see that there is a 100% chance that the word is spelled correctly.
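Since the result is a list of (word, probability) pairs, picking a correction is a matter of taking the most probable pair. The sketch below works on the list copied from the output above rather than calling Pattern; the helper best_correction and its min_probability cutoff are assumptions made for illustration.

```python
# Suggestions copied from the output above: (candidate, probability) pairs.
suggestions = [('While', 0.6459209419680404), ('White', 0.2968881412952061),
               ('Title', 0.03280067283431455), ('Whistle', 0.023549201009251473),
               ('Chile', 0.0008410428931875525)]


def best_correction(candidates, min_probability=0.5):
    """Return the most probable candidate, or None if it is below the cutoff."""
    word, probability = max(candidates, key=lambda pair: pair[1])
    return word if probability >= min_probability else None


print(best_correction(suggestions))                       # While
print(best_correction(suggestions, min_probability=0.9))  # None
print(best_correction([('Fracture', 1.0)]))               # Fracture
```

A cutoff like this is one way to avoid replacing a word automatically when no candidate clearly stands out.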
Working with Numbers
The Pattern library contains functions that can be used to convert numbers in the form of text strings into their numeric counterparts and vice versa. To convert from text to numeric representation, the number function is used. Similarly, to convert back from numbers to their corresponding text representation, the numerals function is used. Look at the following script:
from pattern.en import number, numerals
print(number("one hundred and twenty two"))
print(numerals(256.390, round=2))
Output:
122
two hundred and fifty-six point thirty-nine
In the output, you will see 122, which is the numeric representation of the text "one hundred and twenty two". Similarly, you should see "two hundred and fifty-six point thirty-nine", which is the text representation of the number 256.390 rounded to two decimal places.
Remember, the round parameter of the numerals function takes an integer: the number of decimal places to which the number is rounded.
The quantify function is used to get a word count estimation of the items in a list, which provides a phrase for referring to the group. If a list has 3-8 similar items, the quantify function will quantify it as "several". Two items are quantified as "a pair of".
from pattern.en import quantify
print(quantify(['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'mango', 'mango']))
In the list, we have three apples, three bananas, and two mangoes. The output of the quantify function for this list looks like this:
several bananas, several apples and a pair of mangoes
Similarly, the following example demonstrates the other word count estimations.
from pattern.en import quantify
print(quantify({'strawberry': 200, 'peach': 15}))
print(quantify('orange', amount=1200))
Output:
hundreds of strawberries and a number of peaches
thousands of oranges
Pattern Library Functions for Data Mining
In the previous section, we saw some of the most commonly used functions of the Pattern library for NLP. In this section, we will see how the Pattern library can be used to perform a variety of data mining tasks.
The web module of the Pattern library is used for web mining tasks.
Accessing Web Pages
The URL object is used to retrieve contents from webpages. It has several methods that can be used to open a webpage, download the contents of a webpage and read a webpage.
You can directly use the download method to download the HTML contents of any webpage. The following script downloads the HTML source code for the Wikipedia article on artificial intelligence.
from pattern.web import download
page_html = download('https://en.wikipedia.org/wiki/Artificial_intelligence', unicode=True)
You can also download files from webpages, for example images, using the URL class:
from pattern.web import URL, extension
page_url = URL('https://upload.wikimedia.org/wikipedia/commons/f/f1/RougeOr_football.jpg')
file = open('football' + extension(page_url.page), 'wb')
file.write(page_url.download())
file.close()
In the script above, we first create a URL object for the image. Next, we call the extension method on the page name, which returns the file extension. The file extension is appended to the string "football". The open method opens this path for writing in binary mode and, finally, the download() method downloads the image, which is written to a file in the current working directory.
Finding URLs within Text
You can use the find_urls method to extract URLs from text strings. Here is an example:
from pattern.web import find_urls
print(find_urls('To search anything, go to www.google.com', unique=True))
In the output, you will see the URL for the Google website as shown below:
['www.google.com']
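For readers curious about what such a helper involves, here is a rough sketch using only the standard re module. It is not Pattern's implementation; the pattern below is deliberately simple and the name simple_find_urls is made up. Real-world URL matching has many more edge cases than this handles.

```python
import re

# Rough pattern: "http://", "https://" or a leading "www.", followed by
# any run of characters that are not whitespace or angle brackets.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>]+", re.IGNORECASE)


def simple_find_urls(text, unique=True):
    """Return URL-like substrings in order of appearance."""
    # Trailing punctuation usually belongs to the sentence, not the URL.
    found = [match.rstrip('.,;:!?)') for match in URL_PATTERN.findall(text)]
    if unique:
        found = list(dict.fromkeys(found))  # drop repeats, keep order
    return found


print(simple_find_urls('To search anything, go to www.google.com'))
print(simple_find_urls('See https://example.com/a. Also www.example.org, '
                       'and https://example.com/a'))
```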
Making Asynchronous Requests for Webpages
Webpages can be very large, and it can take quite a bit of time to download the complete contents of a webpage, which can block a user from performing any other task in the application until the download finishes. However, the web module of the Pattern library contains the asynchronous function, which downloads the contents of a webpage in parallel. The asynchronous function runs in the background so that the user can interact with the application while the webpage is being downloaded.
Let's take a very simple example of the asynchronous function:
from pattern.web import asynchronous, time, Google, find_urls
asyn_req = asynchronous(Google().search, 'artificial intelligence', timeout=4)
while not asyn_req.done:
    time.sleep(0.1)
    print('searching...')
print(asyn_req.value)
print(find_urls(asyn_req.value, unique=True))
In the above script, we retrieve the first page of Google search results for the query "artificial intelligence". While the request runs in the background, the while loop keeps executing and prints a status message. Once the request is done, the results are printed using the value attribute of the object returned by the asynchronous function. Finally, we extract the URLs from the results and print them on the screen.
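The same start-then-poll pattern can be reproduced with the standard library, which may help clarify what happens behind the scenes. The sketch below uses concurrent.futures instead of Pattern, and slow_search is a hypothetical stand-in for a slow network request; it does not contact any search engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_search(query):
    """Hypothetical stand-in for a slow network request."""
    time.sleep(0.3)
    return ['result for ' + query]


with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns immediately; the work runs in a background thread.
    future = executor.submit(slow_search, 'artificial intelligence')
    polls = 0
    while not future.done():      # the main thread stays free in the meantime
        time.sleep(0.1)
        polls += 1
        print('searching...')
    results = future.result()

print(results)
```

In a real application the polling loop would be replaced by whatever work the main thread needs to do while waiting.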
Getting Search Engine Results with APIs
The Pattern library contains the SearchEngine class, which is the base class for the classes that can be used to call the APIs of different search engines and websites such as Google, Bing, Facebook, Wikipedia, Twitter, etc. The SearchEngine constructor accepts three parameters:
- license: The developer license key for the corresponding search engine or website
- throttle: The time delay between successive requests to the server
- language: Specifies the language for the results
The search method of the SearchEngine class is used to send a search query to the search engine. The search method can take the following parameters:
- query: The search string
- type: The type of data you want to search for. It can take three values: SEARCH, NEWS and IMAGE.
- start: The page from which you want to start the search
- count: The number of results per page
The search engine classes that inherit from the SearchEngine class, along with its search method, are: Google, Bing, Twitter, Facebook, Wikipedia, and Flickr.
The search method returns a result object for each item found. The result object can then be used to retrieve information about that search result. The attributes of the result object are url, title, text, language, author, and date.
Now let's see a very simple example of how we can search for something on Google via the Pattern library. Remember, to make this example work, you will have to use your developer license key for the Google API.
from pattern.web import Google
google = Google(license=None)
for search_result in google.search('artificial intelligence'):
    print(search_result.url)
    print(search_result.text)
In the script above, we create an object of the Google class. In the constructor of Google, pass your own license key to the license parameter. Next, we pass the string artificial intelligence to the search method. By default, the first 10 results from the first page are returned; we iterate over them and display the url and text of each result on the screen.
The process is similar for the Bing search engine; you only have to replace the Google class with Bing in the script above.
Let's now search Twitter for the latest tweets that contain the text "artificial intelligence". The following script requests three batches of three tweets each:
from pattern.web import Twitter
twitter = Twitter()
index = None
for j in range(3):
    for tweet in twitter.search('artificial intelligence', start=index, count=3):
        print(tweet.text)
        index = tweet.id
In the script above, we first import the Twitter class from the pattern.web module and create a Twitter object. Next, we iterate over the tweets returned by its search method and display the text of each tweet on the console. The id of the last tweet printed is passed as the start parameter of the next search. You do not need any license key to run the above script.
Converting HTML Data to Plain Text
The download method of the URL class returns data in the form of HTML. However, if you want to do a semantic analysis of the text, for instance sentiment classification, you need cleaned data without HTML tags. You can clean the data with the plaintext method. The method takes the HTML content returned by the download method as a parameter and returns the cleaned text.
Look at the following script:
from pattern.web import URL, plaintext
html_content = URL('https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/').download()
cleaned_page = plaintext(html_content.decode('utf-8'))
print(cleaned_page)
In the output, you should see the cleaned text from the webpage:
https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/.
It is important to remember that if you are using Python 3, you will need to call the decode('utf-8') method to convert the data from bytes to string format.
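The two steps involved, decoding bytes and stripping markup, can be sketched with the standard library alone. This is not how Pattern's plaintext works internally; the TextExtractor class and simple_plaintext function are made-up names, and the HTML snippet is a small hand-written example rather than a downloaded page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def simple_plaintext(html_bytes, encoding='utf-8'):
    """Decode raw bytes to a string, then strip the markup."""
    extractor = TextExtractor()
    extractor.feed(html_bytes.decode(encoding))
    extractor.close()
    return ' '.join(extractor.parts)


raw = (b'<html><head><style>p {color: red}</style></head>'
       b'<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>')
print(simple_plaintext(raw))
```

Passing the undecoded bytes object straight to the parser would fail in Python 3, which is the same reason the decode('utf-8') call is needed in the script above.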
Parsing PDF Documents
The Pattern library contains the PDF object, which can be used to parse a PDF document. PDF (Portable Document Format) is a cross-platform file format that contains images, text, and fonts in a stand-alone document.
Let's see how a PDF document can be parsed with the PDF object:
from pattern.web import URL, PDF
pdf_doc = URL('http://demo.clab.cs.cmu.edu/NLP/syllabus_f18.pdf').download()
print(PDF(pdf_doc.decode('utf-8')))
In the script, we download a document using the download function. Next, the downloaded PDF document is passed to the PDF class, and the result is printed on the console.
Clearing the Cache
The results returned by methods such as SearchEngine.search() and URL.download() are, by default, stored in the local cache. To clear the cache after downloading an HTML document, we can use the clear method of the cache object, as shown below:
from pattern.web import cache
cache.clear()
Conclusion
The Pattern library is one of the most useful natural language processing libraries in Python. Although it is not as well-known as spaCy or NLTK, it contains functionality such as finding superlatives and comparatives, and fact and opinion detection, which distinguishes it from other NLP libraries.
In this article, we studied the application of the Pattern library to natural language processing, data mining and web scraping. We saw how to perform basic NLP tasks such as tokenization, lemmatization and sentiment analysis with the Pattern library. Finally, we also saw how to use Pattern for making search engine queries, mining online tweets and cleaning HTML documents.