Friday, May 3, 2019

Stack Abuse: Python for NLP: Getting Started with the StanfordCoreNLP Library

This is the ninth article in my series of articles on Python for NLP. In the previous article, we saw how Python's Pattern library can be used to perform a variety of NLP tasks ranging from tokenization to POS tagging, and text classification to sentiment analysis. Before that we explored the TextBlob library for performing similar natural language processing tasks.

In this article, we will explore the StanfordCoreNLP library, another extremely handy library for natural language processing. We will walk through its different features with the help of examples. So without wasting any further time, let's get started.

Setting up the Environment

The installation process for StanfordCoreNLP is not as straightforward as for the other Python libraries, because the core library itself is written in Java. Therefore, make sure you have Java installed on your system; the latest version of Java can be downloaded for free.

Once you have Java installed, you need to download the JAR files for the StanfordCoreNLP library. These JAR files contain the models used to perform the different NLP tasks. To get the JAR files for the English models, download and unzip the archive from the official StanfordCoreNLP website.

The next step is to run the server that will serve the requests sent by the Python wrapper to the StanfordCoreNLP library. Navigate into the folder where you unzipped the JAR files and execute the following command at the command prompt:

$ java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000

The above command starts the StanfordCoreNLP server. The -mx6g parameter specifies that the memory used by the server should not exceed 6 gigabytes. Note that you need to be running a 64-bit system to allocate a heap as large as 6 GB; if you are running a 32-bit system, you might have to reduce the memory dedicated to the server.

Once you run the above command, you should see the following output:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 8
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000

The server is running at port 9000.

Now the final step is to install the Python wrapper for the StanfordCoreNLP library. The wrapper we will be using is pycorenlp. The following command installs the wrapper library:

$ pip install pycorenlp

Now we are all set to connect to the StanfordCoreNLP server and perform the desired NLP tasks.

To connect to the server, we have to pass the address of the StanfordCoreNLP server that we initialized earlier to the StanfordCoreNLP class of the pycorenlp module. The object returned can then be used to perform NLP tasks. Look at the following script:

from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')  
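
Before moving on, it can be worth confirming that the wrapper can actually reach the server. A minimal sanity check (not part of the original setup, and assuming the server started above is still listening on port 9000) is to annotate a short test string and verify that a dictionary comes back:

# Sanity check: annotate a short string and confirm the server responds.
test_output = nlp_wrapper.annotate(
    "The quick brown fox jumps over the lazy dog.",
    properties={'annotators': 'tokenize,ssplit', 'outputFormat': 'json', 'timeout': 1000})

# If the server is reachable, the result is a dictionary with a 'sentences' key.
print(isinstance(test_output, dict) and 'sentences' in test_output)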

Performing NLP Tasks

In this section, we will briefly explore the use of the StanfordCoreNLP library for performing common NLP tasks.

Lemmatization, POS Tagging and Named Entity Recognition

Lemmatization, part-of-speech tagging, and named entity recognition are the most basic NLP tasks. The StanfordCoreNLP library supports pipeline functionality that can be used to perform these tasks in a structured way.

In the following script, we will create an annotator which first splits a document into sentences and then further splits the sentences into words or tokens. The words are then annotated with the POS and named entity recognition tags.

doc = "Ronaldo has moved from Real Madrid to Juventus. While messi still plays for Barcelona"  
annot_doc = nlp_wrapper.annotate(doc,  
    properties={
        'annotators': 'ner, pos',
        'outputFormat': 'json',
        'timeout': 1000,
    })

In the script above we have a document with two sentences. We call the annotate method of the StanfordCoreNLP wrapper object that we initialized earlier, passing the text along with a properties dictionary containing three entries. The annotators property specifies the type of annotation we want to perform on the text. We pass 'ner, pos' as its value, which specifies that we want to annotate our document for POS tags and named entities.

The outputFormat property defines the format in which you want the annotated text returned. The possible values are json for JSON objects, xml for XML format, text for plain text, and serialize for serialized data.

The final property is the timeout in milliseconds, which defines how long the wrapper should wait for the server's response before timing out.

In the output, you should see a JSON object as follows:

{'sentences': [{'index': 0, 'entitymentions': [{'docTokenBegin': 0, 'docTokenEnd': 1, 'tokenBegin': 0, 'tokenEnd': 1, 'text': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'ner': 'PERSON'}, {'docTokenBegin': 4, 'docTokenEnd': 6, 'tokenBegin': 4, 'tokenEnd': 6, 'text': 'Real Madrid', 'characterOffsetBegin': 23, 'characterOffsetEnd': 34, 'ner': 'ORGANIZATION'}, {'docTokenBegin': 7, 'docTokenEnd': 8, 'tokenBegin': 7, 'tokenEnd': 8, 'text': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'Ronaldo', 'originalText': 'Ronaldo', 'lemma': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'pos': 'NNP', 'ner': 'PERSON', 'before': '', 'after': ' '}, {'index': 2, 'word': 'has', 'originalText': 'has', 'lemma': 'have', 'characterOffsetBegin': 8, 'characterOffsetEnd': 11, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'moved', 'originalText': 'moved', 'lemma': 'move', 'characterOffsetBegin': 12, 'characterOffsetEnd': 17, 'pos': 'VBN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'from', 'originalText': 'from', 'lemma': 'from', 'characterOffsetBegin': 18, 'characterOffsetEnd': 22, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'Real', 'originalText': 'Real', 'lemma': 'real', 'characterOffsetBegin': 23, 'characterOffsetEnd': 27, 'pos': 'JJ', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Madrid', 'originalText': 'Madrid', 'lemma': 'Madrid', 'characterOffsetBegin': 28, 'characterOffsetEnd': 34, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 7, 'word': 'to', 'originalText': 'to', 'lemma': 'to', 'characterOffsetBegin': 35, 'characterOffsetEnd': 37, 'pos': 'TO', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 8, 'word': 'Juventus', 'originalText': 'Juventus', 'lemma': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}, {'index': 9, 'word': '.', 'originalText': '.', 'lemma': '.', 'characterOffsetBegin': 46, 'characterOffsetEnd': 47, 'pos': '.', 'ner': 'O', 'before': '', 'after': ' '}]}, {'index': 1, 'entitymentions': [{'docTokenBegin': 14, 'docTokenEnd': 15, 'tokenBegin': 5, 'tokenEnd': 6, 'text': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'While', 'originalText': 'While', 'lemma': 'while', 'characterOffsetBegin': 48, 'characterOffsetEnd': 53, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 2, 'word': 'messi', 'originalText': 'messi', 'lemma': 'messus', 'characterOffsetBegin': 54, 'characterOffsetEnd': 59, 'pos': 'NNS', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'still', 'originalText': 'still', 'lemma': 'still', 'characterOffsetBegin': 60, 'characterOffsetEnd': 65, 'pos': 'RB', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'plays', 'originalText': 'plays', 'lemma': 'play', 'characterOffsetBegin': 66, 'characterOffsetEnd': 71, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'for', 'originalText': 'for', 'lemma': 'for', 'characterOffsetBegin': 72, 'characterOffsetEnd': 75, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Barcelona', 'originalText': 'Barcelona', 'lemma': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}]}]}

If you look at the above output carefully, you can find the POS tag, named entity tag, and lemmatized form of each word.
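
As a side note, if you only want a human-readable dump rather than a parseable structure, you can ask for plain text instead of JSON. The sketch below reuses the same doc and nlp_wrapper objects; with 'outputFormat' set to 'text', the annotate method returns a string rather than a dictionary:

# Sketch: request the annotations as plain text instead of JSON.
text_output = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'ner, pos',
        'outputFormat': 'text',
        'timeout': 1000,
    })
print(text_output)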

Lemmatization

Let's now explore the annotated results. We'll first print the lemmas for the words in the two sentences of our document:

for sentence in annot_doc["sentences"]:  
    for word in sentence["tokens"]:
        print(word["word"] + " => " + word["lemma"])

In the script above, the outer loop iterates through each sentence in the document and the inner loop iterates through each word in the sentence. Inside the inner loop, the word and its corresponding lemmatized form are printed to the console. The output looks like this:

Ronaldo=>Ronaldo  
has=>have  
moved=>move  
from=>from  
Real=>real  
Madrid=>Madrid  
to=>to  
Juventus=>Juventus  
.=>.
While=>while  
messi=>messus  
still=>still  
plays=>play  
for=>for  
Barcelona=>Barcelona  

For example, you can see that the word "moved" has been lemmatized to "move"; similarly, the word "plays" has been lemmatized to "play".
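
Building on the same loop, you could also join the lemmas back together to obtain a lemmatized version of each sentence. This is just an illustrative sketch, not something from the original output:

# Sketch: reconstruct each sentence from the lemmas of its tokens.
for sentence in annot_doc["sentences"]:
    lemmatized_sentence = " ".join(word["lemma"] for word in sentence["tokens"])
    print(lemmatized_sentence)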

POS Tagging

In the same way, we can find the POS tags for each word. Look at the following script:

for sentence in annot_doc["sentences"]:  
    for word in sentence["tokens"]:
        print (word["word"] + "=>" + word["pos"])

In the output, you should see the following results:

Ronaldo=>NNP  
has=>VBZ  
moved=>VBN  
from=>IN  
Real=>JJ  
Madrid=>NNP  
to=>TO  
Juventus=>NNP  
.=>.
While=>IN  
messi=>NNS  
still=>RB  
plays=>VBZ  
for=>IN  
Barcelona=>NNP  

The tag set used for POS tagging is the Penn Treebank tag set.
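
If you want a quick overview of how often each tag occurs in the document, one simple option (an illustrative sketch using Python's standard library, not shown in the original article) is to tally the tags with collections.Counter:

from collections import Counter

# Count how many times each POS tag appears across the document.
pos_counts = Counter(word["pos"]
                     for sentence in annot_doc["sentences"]
                     for word in sentence["tokens"])
print(pos_counts.most_common())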

Named Entity Recognition

To find named entities in our document, we can use the following script:

for sentence in annot_doc["sentences"]:  
    for word in sentence["tokens"]:
        print (word["word"] + "=>" + word["ner"])

The output looks like this:

Ronaldo=>PERSON  
has=>O  
moved=>O  
from=>O  
Real=>ORGANIZATION  
Madrid=>ORGANIZATION  
to=>O  
Juventus=>ORGANIZATION  
.=>O
While=>O  
messi=>O  
still=>O  
plays=>O  
for=>O  
Barcelona=>ORGANIZATION  

We can see that Ronaldo has been identified as a PERSON while Barcelona has been identified as an ORGANIZATION, which in this case is correct.
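
Instead of checking every token individually, you can also read the entitymentions key that appears in the JSON output above, which groups multi-token entities such as "Real Madrid" into single mentions. A short sketch:

# Sketch: list the grouped entity mentions rather than per-token NER tags.
for sentence in annot_doc["sentences"]:
    for mention in sentence["entitymentions"]:
        print(mention["text"] + " => " + mention["ner"])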

Sentiment Analysis

To find the sentiment of a sentence, all you have to do is pass sentiment as the value for the annotators property. Look at the following script:

doc = "I like this chocolate. This chocolate is not good. The chocolate is delicious. Its a very tasty chocolate. This is so bad"  
annot_doc = nlp_wrapper.annotate(doc,  
    properties={
       'annotators': 'sentiment',
       'outputFormat': 'json',
       'timeout': 1000,
    })

To find the sentiment, we can iterate over each sentence and then read its sentimentValue property. The sentimentValue is a score from 0 to 4, where 0 corresponds to very negative sentiment and 4 corresponds to very positive sentiment. The sentiment property can be used to get the sentiment in verbal form, i.e. positive, negative, or neutral.

The following script finds the sentiment for each sentence in the document we defined above.

for sentence in annot_doc["sentences"]:  
    print ( " ".join([word["word"] for word in sentence["tokens"]]) + " => " \
        + str(sentence["sentimentValue"]) + " = "+ sentence["sentiment"])

Output:

I like this chocolate . => 2 = Neutral  
This chocolate is not good . => 1 = Negative  
The chocolate is delicious . => 3 = Positive  
Its a very tasty chocolate . => 3 = Positive  
This is so bad => 1 = Negative  
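
If you want a rough overall picture rather than per-sentence labels, one simple option (an illustrative sketch, not from the original article) is to tally the verbal sentiment labels across the document:

from collections import Counter

# Tally the verbal sentiment labels over all sentences in the document.
sentiment_counts = Counter(sentence["sentiment"]
                           for sentence in annot_doc["sentences"])
print(sentiment_counts)  # Expected: Counter({'Negative': 2, 'Positive': 2, 'Neutral': 1})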

Conclusion

StanfordCoreNLP is another extremely handy library for natural language processing. In this article, we studied how to set up the environment to run StanfordCoreNLP. We then explored the use of the StanfordCoreNLP library for common NLP tasks such as lemmatization, POS tagging, and named entity recognition, and finally, we rounded off the article with sentiment analysis using StanfordCoreNLP.


