Sentiment analysis is a powerful tool that allows computers to understand the underlying subjective tone of a piece of writing. This is something that humans have difficulty with, and as you might imagine, it isn’t always so easy for computers, either. But with the right tools and Python, you can use sentiment analysis to better understand the sentiment of a piece of writing.
Why would you want to do that? Sentiment analysis has plenty of uses, from gauging how stock traders feel about a particular company based on social media data to aggregating reviews, which is what you’ll get to do by the end of this tutorial.
In this tutorial, you’ll learn:
- How to use natural language processing (NLP) techniques
- How to use machine learning to determine the sentiment of text
- How to use spaCy to build an NLP pipeline that feeds into a sentiment analysis classifier
This tutorial is ideal for beginning machine learning practitioners who want a project-focused guide to building sentiment analysis pipelines with spaCy.
You should be familiar with basic machine learning techniques like binary classification as well as the concepts behind them, such as training loops, data batches, and weights and biases. If you’re unfamiliar with machine learning, then you can kickstart your journey by learning about logistic regression.
When you’re ready, you can follow along with the examples in this tutorial by downloading the source code from the link below:
Get the Source Code: Click here to get the source code you’ll use to learn about sentiment analysis with natural language processing in this tutorial.
Using Natural Language Processing to Preprocess and Clean Text Data
Any sentiment analysis workflow begins with loading data. But what do you do once the data’s been loaded? You need to process it through a natural language processing pipeline before you can do anything interesting with it.
The necessary steps include (but aren’t limited to) the following:
- Tokenizing text to break it down into sentences, words, or other units
- Removing stop words like “if,” “but,” “or,” and so on
- Normalizing words by condensing all forms of a word into a single form
- Vectorizing text by turning the text into a numerical representation for consumption by your classifier
All these steps serve to reduce the noise inherent in any human-readable text and improve the accuracy of your classifier’s results. There are lots of great tools to help with this, such as the Natural Language Toolkit, TextBlob, and spaCy. For this tutorial, you’ll use spaCy.
Note: spaCy is a very powerful tool with many features. For a deep dive into many of these features, check out Natural Language Processing With spaCy.
Before you go further, make sure you have spaCy and its English model installed:
$ pip install spacy
$ python -m spacy download en_core_web_sm
The first command installs spaCy, and the second uses spaCy to download its English language model. spaCy supports a number of different languages, which are listed on the spaCy website.
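If you want to make sure the download worked, a quick check is to load the model and list the components in its default processing pipeline. Here’s a minimal sketch; the exact component names depend on your spaCy version:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names  # component names vary by spaCy version
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

If spacy.load() raises an OSError, then the model isn’t installed in the environment you’re running.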
Next, you’ll learn how to use spaCy to help with the preprocessing steps you learned about earlier, starting with tokenization.
Tokenizing
Tokenization is the process of breaking down chunks of text into smaller pieces. spaCy comes with a default processing pipeline that begins with tokenization, making this process a snap. In spaCy, you can do either sentence tokenization or word tokenization:
- Word tokenization breaks text down into individual words.
- Sentence tokenization breaks text down into individual sentences.
In this tutorial, you’ll use word tokenization to separate the text into individual words. First, you’ll load the text into spaCy, which does the work of tokenization for you:
>>> import spacy
>>> text = """
... Dave watched as the forest burned up on the hill,
... only a few miles from his house. The car had
... been hastily packed and Marta was inside trying to round
... up the last of the pets. "Where could she be?" he wondered
... as he continued to wait for Marta to appear with the pets.
... """
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp(text)
>>> token_list = [token for token in doc]
>>> token_list
[
, Dave, watched, as, the, forest, burned, up, on, the, hill, ,,
, only, a, few, miles, from, his, house, ., The, car, had,
, been, hastily, packed, and, Marta, was, inside, trying, to, round,
, up, the, last, of, the, pets, ., ", Where, could, she, be, ?, ", he, wondered,
, as, he, continued, to, wait, for, Marta, to, appear, with, the, pets, .,
]
In this code, you set up some example text to tokenize, load spaCy’s English model, and then tokenize the text by calling the nlp object on it. This model includes a default processing pipeline that you can customize, as you’ll see later in the project section.
After that, you generate a list of tokens and print it. As you may have noticed, “word tokenization” is a slightly misleading term, as captured tokens include punctuation and other nonword strings.
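Those extra tokens point straight at two steps from the earlier preprocessing list: removing stop words and normalizing words. spaCy sets Boolean attributes such as is_stop, is_punct, and is_space on every token, and lemma_ holds each token’s base form, so both steps boil down to a single filtering pass. Here’s a minimal sketch that reuses the doc object from above:

>>> filtered_tokens = [
...     token.lemma_.lower()
...     for token in doc
...     if not (token.is_stop or token.is_punct or token.is_space)
... ]
>>> filtered_tokens[:5]  # exact lemmas can vary by spaCy version and model
['dave', 'watch', 'forest', 'burn', 'hill']

Notice that watched and burned come back as their lemmas, watch and burn, so different inflections of the same word collapse into a single form.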
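The remaining step, vectorizing, is also handled by the pipeline. Every token exposes a .vector attribute containing a dense numerical array that a classifier can consume. Keep in mind that en_core_web_sm doesn’t bundle static word vectors, so these numbers come from the model’s context-sensitive layer, and the dimensionality depends on the model you load:

>>> doc[1]  # doc[0] is the leading newline token
Dave
>>> doc[1].vector.shape  # dimensionality depends on the model
(96,)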
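Finally, if you want sentence tokenization instead, then you don’t need a second pass over the text: the same pipeline run already predicted sentence boundaries, which the doc object exposes through its .sents attribute:

>>> for sentence in doc.sents:  # .sents yields one Span per sentence
...     print(sentence.text.strip())

The number of sentences you get back depends on the model, since the boundaries are predicted by the pipeline rather than found with a fixed rule.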