Introduction
A commonly requested skill for software development positions is experience with NoSQL databases, including MongoDB. This tutorial will explore collecting data using an API, storing it in a MongoDB database, and doing some analysis of the data.
However, before jumping into the code, let's take a moment to go over MongoDB and APIs to make sure we understand how we'll be handling the data we collect.
MongoDB and NoSQL
MongoDB is a form of NoSQL database, enabling the storage of data in non-relational forms. NoSQL databases are best understood by comparing them to their progenitor/rivals - SQL databases.
SQL stands for Structured Query Language, and it refers to a family of relational database management systems. A relational database stores data in tables made up of rows and columns, with each row identified by a unique key. Values in the database can be retrieved by looking up the corresponding key, and relationships between tables are expressed through those keys. This is how SQL databases store data, but NoSQL databases can store data in non-relational ways.
NoSQL stands for "Not Only SQL", which refers to the fact that although SQL-esque queries can be done with NoSQL systems, they can also do things SQL databases struggle with. NoSQL databases have a wider range of storage options for the data they handle, and because the data is less rigidly structured it can be retrieved in more ways, making some operations quicker. NoSQL databases can also make the addition of new nodes or fields simpler in comparison to SQL databases.
There are many popular NoSQL frameworks, including MongoDB, OrientDB, InfinityDB, Aerospike, and CosmosDB. MongoDB is one specific NoSQL framework which stores data in the form of documents, acting as a document-oriented database.
MongoDB is popular because of its versatility, its easy cloud integration, and its ability to be used for a wide variety of tasks. MongoDB stores data in JSON-like documents (internally, a binary representation called BSON). Queries of MongoDB databases are also expressed in the JSON format, and because both the storage and retrieval commands are based on JSON, it is simple to remember and compose commands for MongoDB.
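To make that concrete, here's a minimal sketch of what a stored document and a matching query look like (the review document below is hypothetical, invented purely for illustration):

# A hypothetical review stored as a JSON-style document
review = {
    "title": "Some Game",
    "score": "9.0",
    "genres": ["action", "adventure"]
}

# Queries are JSON-style documents too; this would match every
# review whose score field equals "9.0" (PyMongo syntax)
# db.reviews.find({"score": "9.0"})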
What are APIs?
APIs are Application Programming Interfaces, and their function is to make communication between clients and servers easier. APIs are often created so that outsiders can collect an application's data without needing to know the details of how the application itself is built.
APIs are also a helpful way of controlling the flow of information from a server, encouraging those interested in accessing its information to use official channels to do so, rather than constructing a web scraper. The most common APIs for websites are REST (Representational State Transfer) APIs, which make use of standard HTTP requests and responses to send, receive, delete, and modify data. We'll be accessing a REST API and making our requests in HTTP format for this tutorial.
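As a quick, hedged sketch of what that looks like in Python with the requests library (the URL below is a placeholder, not a real endpoint):

import requests

# A GET request retrieves data; query parameters go in a dictionary
response = requests.get("https://api.example.com/reviews",
                        params={"format": "json", "limit": 10})
data = response.json()  # parse the JSON response body into Python objects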
What API will we be Using?
The API we'll be using is GameSpot's API. GameSpot is one of the biggest video game review sites on the web, and its API documentation can be found at https://www.gamespot.com/api/.
Getting Set Up
Before we begin, you should be sure to get yourself an API key for GameSpot. You should also be sure to have MongoDB and its Python library installed. The installation instructions for Mongo can be found in the official MongoDB documentation.
The PyMongo library can be installed simply by running:
$ pip install pymongo
You may also wish to install the MongoDB Compass program, which lets you easily visualize and edit aspects of MongoDB databases with a GUI.
Creating the MongoDB Database
We can now start our project by creating the MongoDB database. First, we'll handle our imports. We'll import MongoClient from PyMongo, as well as requests and pandas:
from pymongo import MongoClient
import requests
import pandas as pd
When creating a database with MongoDB, we first need to connect to the client and then use the client to create the database we desire:
client = MongoClient('127.0.0.1', 27017)
db_name = 'gamespot_reviews'
# connect to the database
db = client[db_name]
MongoDB can store multiple data collections within a single database, so we also need to define the name of the collection we want to use:
# open the specific collection
reviews = db.reviews
That's it. Our database and collection have been created and we are ready to start inserting data. Note that MongoDB creates databases and collections lazily, so neither will actually appear on the server until the first document is inserted. That was pretty simple, wasn't it?
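If you'd like to sanity-check the connection at this point, PyMongo can list what the server currently knows about (remember that our still-empty database won't show up until after the first insert):

# Databases currently present on the server
print(client.list_database_names())

# Collections inside our database (empty until the first insert)
print(db.list_collection_names())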
Using the API
We're now ready to make use of the GameSpot API to collect data. By taking a look at the documentation for the API, we can determine the format that our requests need to be in.
We need to make our requests to a base URL that contains our API key. GameSpot's API has multiple resources of its own that we can pull data from. For instance, they have a resource that lists data about games like release date and consoles.
However, we're interested in their resource for game reviews, and we'll be pulling a few specific fields from it. GameSpot also asks that you specify a unique user agent identifier when making requests, which we'll do by creating a header that we pass in to the requests function:
headers = {
    "User-Agent": "[YOUR IDENTIFIER] API Access"
}

review_base = "http://www.gamespot.com/api/reviews/?api_key=[YOUR API KEY HERE]&format=json"
We'll want the following data fields: id, title, score, deck, body, good, and bad:

review_fields = "id,title,score,deck,body,good,bad"
GameSpot only returns 100 results at a time. For this reason, in order to get a decent number of reviews to analyze, we'll need to create a range of offsets and loop through them, retrieving 100 results per request.
You can select any number you want. I chose to get all their reviews, which cap out at 14,900:
pages_list = list(range(0, 14900, 100))
We're going to create a function that joins together the base URL, the list of fields we want to return, a sorting scheme (ascending or descending), and the offset for the query.
We'll take the number of pages we want to loop through, and then for every 100 entries we'll create a new URL and request the data:
def get_games(url_base, num_pages, fields, collection):
    field_list = "&field_list=" + fields + "&sort=score:desc" + "&offset="

    for page in num_pages:
        # Build the URL for this batch of 100 results
        url = url_base + field_list + str(page)
        print(url)

        response = requests.get(url, headers=headers).json()
        print(response)

        # The "results" property holds the actual review data
        video_games = response['results']

        for i in video_games:
            collection.insert_one(i)
            print("Data Inserted")
Recall that MongoDB stores data as JSON. For that reason we need to convert our response data to JSON format using the json() method.

After the data has been converted into JSON, we'll get the "results" property from the response, as this is the portion which actually contains our data of interest. We'll then go through the 100 different results and insert each of them into our collection using the insert_one() command from PyMongo. You could also put them all in a list and use insert_many() instead.
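As a minimal sketch, that alternative would replace the inner loop inside get_games with one batched call:

# video_games is already a list of documents, so it can be
# inserted in a single round trip to the server
collection.insert_many(video_games)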
Let's now call the function and have it collect the data:
get_games(review_base, pages_list, review_fields, reviews)
Why don't we check that our data has been inserted into our database as we expect? We can view the database and its contents directly with the Compass program, which confirms the data has been properly inserted.
We can also make some database retrievals and print them. To do that, we'll just create an empty list to store our entries and use the .find() command on the "reviews" collection.

When using the find function from PyMongo, the retrieval needs to be formatted as JSON as well. The parameters given to the find function take the form of a field and a value.
By default, MongoDB always returns the _id field (its own unique ID field, not the ID we pulled from GameSpot), but we can tell it to suppress this by giving it a 0 value. The fields we do want returned, like the score field in this case, should be given a 1 value:
scores = []

for score in list(reviews.find({}, {"_id":0, "score": 1})):
    scores.append(score)

print(scores[:900])
Here's what was successfully pulled and printed:
[{'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'} ...
We can also easily convert the query results to a DataFrame using Pandas:
scores_data = pd.DataFrame(scores, index=None)
print(scores_data.head(20))
Here's what was returned:
   score
0   10.0
1   10.0
2   10.0
3   10.0
4   10.0
5   10.0
6   10.0
7   10.0
8   10.0
9   10.0
10  10.0
11  10.0
12  10.0
13  10.0
14  10.0
15  10.0
16  10.0
17   9.9
18   9.9
19   9.9
Before we start analyzing the data, let's take a moment to see how we could potentially join two collections together. As mentioned, GameSpot has multiple resources to pull data from, and we may want to get values from a second collection, like one populated from their Games resource.
MongoDB is a NoSQL database, so unlike SQL it isn't designed to handle relations between tables and join fields together. However, there is an aggregation stage which can approximate a database join: $lookup.

The $lookup stage mimics a database join, and it is used by specifying a pipeline which names the collection you want to join elements from (from), as well as the fields you want matched from both the input documents (localField) and the "from" documents (foreignField).
Finally, you choose a name (as) for the joined foreign documents, and they will be displayed under this new name in the query response. If you had a second collection called games and wanted to join it with the reviews in a query, it could be done like this:
pipeline = [{
    '$lookup': {
        'from': 'reviews',
        'localField': 'id',
        'foreignField': 'score',
        'as': 'score'
    }
}]

for doc in games.aggregate(pipeline):
    print(doc)
Analyzing the Data
Now we can get around to analyzing and visualizing some of the data found within our newly created database. Let's make sure we have all the imports we'll need for analysis.
from pymongo import MongoClient
import pymongo
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import string
import en_core_web_sm
import seaborn as sns
Let's say we want to do some analysis of the words found in GameSpot's game reviews. We have that information in our database, we just have to get it.
We can start by collecting the top 40 (or whatever number you want) reviews from our database using the find() function like before, but this time we'll specify that we want to sort by the score field, in descending order:
db_name = 'gamespot_reviews'

client = MongoClient('127.0.0.1', 27017)
db = client[db_name]
reviews = db.reviews

review_bodies = []

for body in list(reviews.find({}, {"_id":0, "body": 1}).sort("score", pymongo.DESCENDING).limit(40)):
    review_bodies.append(body)
We'll turn that response into a Pandas DataFrame and convert it to a string. Then we'll use BeautifulSoup to extract all the values within the <p> HTML tags that contain the review text:
reviews_data = pd.DataFrame(review_bodies, index=None)

def extract_comments(review_html):
    soup = BeautifulSoup(str(review_html), "html.parser")
    comments = soup.find_all('p')
    return comments

review_entries = extract_comments(str(review_bodies))
print(review_entries[:500])
The print statement shows that the review text has been collected:
[<p>For anyone who hasn't actually seen the game on a TV right in front of them, the screenshots look too good to be true. In fact, when you see NFL 2K for the first time right in front of you...]
Now that we have the review text data, we want to analyze it in several different ways. Let's try getting some intuition for what kinds of words are commonly used in the top 40 reviews. We can do this several different ways:
- We can create a word cloud
- We can count all of the words and sort by their number of occurrences
- We can do named entity recognition
Before we can do any analysis of the data though, we have to preprocess it.
To preprocess the data, we want to create a function to filter the entries. The text data is still full of all kinds of tags and non-standard characters, and we want to remove those by getting the raw text of the review comments. We'll be using regular expressions to substitute the non-standard characters with blank spaces.
We'll also use the stop words list from NLTK (highly common words that add little meaning to text) and remove those words from our reviews by creating a list to hold all the words, appending a word to it only if it is not in the list of stop words.
Word Cloud
Let's grab a subset of the review words to visualize as a corpus. If the corpus is too large, it can cause some problems when generating the word cloud. For example, I've filtered out the first 5000 words:
# If the stop word corpus isn't installed yet, run nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))

def filter_entries(entries, stopwords):
    # Strip HTML tags, then replace non-alphanumeric characters with spaces
    text_entries = BeautifulSoup(str(entries), "lxml").text
    subbed_entries = re.sub('[^A-Za-z0-9]+', ' ', text_entries)
    split_entries = subbed_entries.split()

    entries_words = []
    for word in split_entries:
        if word not in stopwords:
            entries_words.append(word)

    return entries_words

review_words = filter_entries(review_entries, stop_words)
review_words = review_words[5000:]
We can now make a word cloud very easily using the pre-made wordcloud library we imported earlier.
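A minimal sketch of generating and displaying the cloud from the review_words list built above (the size and color settings below are just reasonable defaults, not values taken from the original):

# Join the filtered words back into a single string for the generator
corpus = ' '.join(review_words)

cloud = WordCloud(width=800, height=600, background_color='white').generate(corpus)

plt.figure(figsize=(10, 8))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()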
This word cloud does give us some information on what kinds of words are commonly used in the top reviews.
It is unfortunately still full of common words, which is why it would be a good idea to filter the review words with a tf-idf weighting scheme, but for the purposes of this simple demonstration, this is good enough.
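As a hedged sketch of that idea, one way to rank words by distinctiveness is scikit-learn's TfidfVectorizer (scikit-learn isn't part of this tutorial's imports, so treat this as an optional extra):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each extracted <p> entry as its own document
docs = [str(entry) for entry in review_entries]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# Average tf-idf weight per term across the documents; the
# highest-weighted terms are the most distinctive ones
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, mean_weights), key=lambda t: t[1], reverse=True)[:20])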
We do, in fact, have some information about what kinds of concepts are talked about in game reviews: gameplay, story, characters, world, action, locations, etc.
We can confirm for ourselves that these words are commonly found in game reviews by looking at one of the top 40 reviews we selected: Mike Mahardy's review of Uncharted 4.
Sure enough, the review discusses action, gameplay, characters, and story.
The size of the words gives us intuition about how commonly words appear in these reviews, but we can also just count how often certain words show up.
Counter
We can get a list of the most common words by splitting the words up and adding them to a dictionary of words along with their individual count, which will be incremented every time the same word is seen.
We then just need to use Counter and its most_common() function:
# Strip punctuation from words before counting them
translator = str.maketrans('', '', string.punctuation)

def get_word_counts(words_list):
    word_count = {}

    for word in words_list:
        word = word.translate(translator).lower()
        if word not in stop_words:
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] += 1

    return word_count

review_word_count = get_word_counts(review_words)
review_word_count = Counter(review_word_count)

review_list = review_word_count.most_common()
print(review_list)
Here's the counts of some of the most common words:
[('game', 1231), ('one', 405), ('also', 308), ('time', 293), ('games', 289), ('like', 285), ('get', 278), ('even', 271), ('well', 224), ('much', 212), ('new', 200), ('play', 199), ('level', 195), ('different', 195), ('players', 193) ...]
Named Entity Recognition
We can also do named entity recognition using en_core_web_sm, a language model included with spaCy. The various concepts and linguistic features that it can detect are listed in the spaCy documentation.
We need to load the language model, then grab the list of detected named entities and concepts from the document (our list of words):

# Load the spaCy model before we can apply it to our text
nlp = en_core_web_sm.load()

doc = nlp(str(review_words))
labels = [x.label_ for x in doc.ents]
items = [x.text for x in doc.ents]
We can print out the found entities as well as a count of the entities.
# Example of named entities and their categories
print([(X.text, X.label_) for X in doc.ents])
# All categories and their counts
print(Counter(labels))
# Most common named entities
print(Counter(items).most_common(20))
Here's what is printed:
[('Nintendo', 'ORG'), ('NES', 'ORG'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('15', 'CARDINAL'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('Super', 'WORK_OF_ART') ...]
Counter({'PERSON': 1227, 'CARDINAL': 496, 'ORG': 478, 'WORK_OF_ART': 204, 'ORDINAL': 200, 'NORP': 110, 'PRODUCT': 88, 'GPE': 63, 'TIME': 12, 'DATE': 12, 'LOC': 12, 'QUANTITY': 4 ...})
[('first', 147), ('two', 110), ('Metal', 85), ('Solid', 82), ('GTAIII', 78), ('Warcraft', 72), ('2', 59), ('Mario', 56), ('four', 54), ('three', 42), ('NBA', 41) ...]
Let's say we wanted to plot the most common recognized terms for different categories, like persons and organizations. We just need to make a function to get the counts of the different classes of entities and then use it to get the entities we desire.
We'll get a list of named entities/people, organizations, and GPEs (geopolitical entities such as countries and cities):
def word_counter(doc, ent_name, col_name):
    ent_list = []

    for ent in doc.ents:
        if ent.label_ == ent_name:
            ent_list.append(ent.text)

    df = pd.DataFrame(data=ent_list, columns=[col_name])
    return df

review_persons = word_counter(doc, 'PERSON', 'Named Entities')
review_org = word_counter(doc, 'ORG', 'Organizations')
review_gpe = word_counter(doc, 'GPE', 'GPEs')
Now all we have to do is plot the counts with a function:
def plot_categories(column, df, num):
    sns.countplot(x=column, data=df,
                  order=df[column].value_counts().iloc[0:num].index)
    plt.xticks(rotation=-45)
    plt.show()

plot_categories("Named Entities", review_persons, 30)
plot_categories("Organizations", review_org, 30)
plot_categories("GPEs", review_gpe, 30)
Let's take a look at the plots that were generated.
As would be expected of named entities, most of the results returned are names of video game characters. This isn't perfect, as it does misclassify some terms like "Xbox" as being a named entity rather than an organization, but this still gives us some idea of what characters are discussed in the top reviews.
The organization plot shows some proper game developers and publishers, like PlayStation and Nintendo, but it also tags things like "480p" as being an organization.
Above is the plot for GPEs, or geopolitical entities (countries, cities, and the like). It looks like "Hollywood" and "Miami" pop up often in reviews of games. (Settings for games? Or maybe the reviewer is describing something in-game as Hollywood-style?)
As you can see, carrying out named entity recognition and concept recognition isn't perfect, but it can give you some intuition about what kinds of topics are discussed in a body of text.
Plotting Numerical Values
Finally, we can try plotting numerical values from the database. Let's get the score values from the reviews collection, count them up, and then plot them:
scores = []

for score in list(reviews.find({}, {"_id":0, "score": 1})):
    scores.append(score)

scores = pd.DataFrame(scores, index=None).reset_index()

sns.countplot(x="score", data=scores)
plt.xticks(rotation=-90)
plt.show()
Above is the graph for the total number of review scores given, running from 0 to 9.9. It looks like the most commonly given scores were 7 and 8, which makes sense intuitively, as seven is often considered average on a ten-point review scale.
Conclusion
Collecting, storing, retrieving, and analyzing data are skills that are highly in-demand in today's world, and MongoDB is one of the most commonly used NoSQL database platforms.
Knowing how to use NoSQL databases and how to interpret the data in them will equip you to carry out many common data analysis tasks.