Saturday, November 28, 2020

Stack Abuse: Seaborn Distribution/Histogram Plot - Tutorial and Examples

Introduction

Seaborn is one of the most widely used data visualization libraries in Python and is built on top of Matplotlib. It offers a simple, intuitive, yet highly customizable API for data visualization.

In this tutorial, we'll take a look at how to plot a histogram in Seaborn. We'll cover how to change Histogram bin sizes, how to plot Kernel Density Estimation plots on top of Histograms, and how to show density data instead of count data.

Import Data

We'll be using the Netflix Shows dataset and visualizing the distributions from there.

Let's import Pandas and load in the dataset:

import pandas as pd

df = pd.read_csv('netflix_titles.csv')

How to Plot a Histogram with Seaborn?

Well, Seaborn doesn't have a function called hist(), as Matplotlib does. Instead, Seaborn has different types of distribution plots that you might want to use.

These plot types include Distribution Plots (displot()) and Count Plots (countplot()). The displot() function is the closest you'd get to Matplotlib's hist() function, since its default behavior is to plot a histogram. Seaborn 0.11 also added an axes-level histplot() function, which displot() uses for that default histogram.

Note: In Seaborn 0.11, the older distplot() function was deprecated in favor of displot() and histplot(). If you're using an older version, you'll have to use the older function instead.

Let's start plotting.

Plot Histogram/Distribution Plot (displot) with Seaborn

Let's go ahead and import the required modules and generate a Histogram/Distribution Plot.

We'll visualize the distribution of the release_year feature, to see which release years are the most represented in the Netflix catalog:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data
df = pd.read_csv('netflix_titles.csv')
# Extract feature we're interested in
data = df['release_year']

# Generate histogram/distribution plot
sns.displot(data)

plt.show()

Now, if we run the code, we'll be greeted with a histogram showing the count of the occurrences of these release_year values:

[Figure: histogram plot seaborn]

Plot Histogram with Density Information with Seaborn

Now, as with Matplotlib, the default histogram approach is to count the number of occurrences. Instead, you can visualize the distribution of the release_year values as a normalized density, where the bar heights are scaled so that the total area of the bars equals 1.

Let's modify the displot() call to change that:

# Extract feature we're interested in
data = df['release_year']

# Generate histogram/distribution plot
sns.displot(data, stat = 'density')

plt.show()

The only thing we need to change is to provide the stat argument, and let it know that we'd like to see the density, instead of the 'count'.

Now, instead of the count we've seen before, we'll be presented with the density of entries:

[Figure: histogram density information seaborn]
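To make the meaning of a density histogram concrete, here is a small NumPy sketch of the normalization itself. It uses made-up years rather than the Netflix data, and the choice of 10 bins is arbitrary. Each bar height is the bin count divided by the total count times the bin width, so the bar areas add up to 1.

```python
import numpy as np

# Made-up stand-in for df['release_year'] (not the real Netflix data)
rng = np.random.default_rng(0)
years = rng.integers(1990, 2021, size=500)

# Plain counts per bin, using an arbitrary 10 bins
counts, edges = np.histogram(years, bins=10)
widths = np.diff(edges)

# Density normalization: height = count / (total count * bin width)
density = counts / (counts.sum() * widths)

# The bar areas (height * width) sum to 1
area = float((density * widths).sum())
print(area)
```

Note that the heights are not percentages: with bins narrower than 1 unit, a density bar can be taller than 1 while the total area still equals 1.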

Change Histogram Plot Bin Size with Seaborn

Sometimes, the automatic bin sizes don't work very well for us. They're too big or too small. By default, the size is chosen based on the sample size and the observed variance in the data, but the result can be different from what we'd like to bring to light.

In our plot, they're a bit too small and awkwardly placed with gaps between them. We can change the bin size either by setting the binwidth for each bin, or by setting the number of bins:

data = df['release_year']

sns.displot(data, binwidth = 3)

plt.show()

This will make each bin encompass data in ranges of 3 years:

[Figure: change histogram bin sizes seaborn]

Or, we can set a fixed number of bins:

data = df['release_year']

sns.displot(data, bins = 30)

plt.show()

Now, the data will be packed into 30 bins. Depending on the range of your dataset, this will either be a lot of bins or very few:

[Figure: histogram bin number seaborn]
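The two options are related by simple arithmetic, which the sketch below works through with NumPy on made-up years (not the Netflix data). A fixed number of bins divides the data range evenly, while a fixed width determines roughly how many bins are needed to cover the range. Seaborn's exact edge placement may differ slightly, so treat the numbers as an approximation.

```python
import numpy as np

# Made-up stand-in for df['release_year'] (not the real Netflix data)
rng = np.random.default_rng(0)
years = rng.integers(1940, 2021, size=500)

span = years.max() - years.min()

# Fixed number of bins: 30 equal-width bins need 31 edges
edges_30 = np.histogram_bin_edges(years, bins=30)
width_30 = span / 30

# Fixed bin width: roughly how many 3-year bins cover the same range
bins_for_width_3 = int(np.ceil(span / 3))

print(f"30 bins -> each about {width_30:.2f} years wide")
print(f"3-year bins -> about {bins_for_width_3} bins")
```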

Another great way to get rid of the awkward gaps is to set the discrete argument to True, which gives each integer year its own bar, centered on that year:

data = df['release_year']

sns.displot(data, discrete=True)

plt.show()

This results in:

[Figure: histogram discrete data seaborn]
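With discrete=True, the bins default to a width of 1 and the bars are centered on the data values, so for integer years each bar is simply a per-year count. A minimal sketch of that counting, using a handful of made-up years rather than the Netflix data:

```python
from collections import Counter

# Made-up release years (not the real Netflix data)
years = [2018, 2019, 2019, 2020, 2020, 2020, 2016]

# One count per distinct year, which is what each discrete bar represents
counts = Counter(years)

# Walk the full range so years with no entries show up as 0
for year in range(min(years), max(years) + 1):
    print(year, counts.get(year, 0))
```

Years with no entries, such as 2017 here, would appear as an empty slot between bars rather than being merged into a neighboring bin.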

Plot Histogram with KDE

A common plot to show alongside a Histogram is the Kernel Density Estimation plot. It's smooth, and you don't lose any information by packing ranges of values into bins. You can set a larger bin size, overlay a KDE plot over the Histogram and have all the relevant information on screen.

Thankfully, since this was a really common thing to do, Seaborn lets us plot a KDE plot simply by setting the kde argument to True:

data = df['release_year']

sns.displot(data, discrete = True, kde = True)

plt.show()

This now results in:

[Figure: plot histogram with kde seaborn]
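For intuition about what the KDE curve represents, here is a hand-rolled Gaussian kernel density estimate in NumPy. This is an illustrative sketch, not Seaborn's implementation: the function name gaussian_kde_sketch is hypothetical, the bandwidth of 2.0 is hand-picked (Seaborn chooses one automatically), and the years are made-up data.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Made-up stand-in for df['release_year'] (not the real Netflix data)
rng = np.random.default_rng(0)
years = rng.integers(1990, 2021, size=500)

# Evaluate the estimate on a grid that extends well past the data
grid = np.linspace(1970, 2040, 701)
density = gaussian_kde_sketch(years, grid, bandwidth=2.0)

# Like a density histogram, the curve integrates to roughly 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows individual years more closely.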

Plot Joint Plot Histogram with Seaborn

Sometimes, you might want to visualize multiple features against each other, along with their distributions. For example, we might want to visualize the distribution of the show ratings, as well as their release years. If we were looking to see if Netflix started adding more kid-friendly content over the years, this would be a great pairing for a Joint Plot.

Let's make a jointplot():

df = pd.read_csv('netflix_titles.csv')
df.dropna(inplace=True)

sns.jointplot(x = "rating", y = "release_year", data = df)

plt.show()

We've dropped null values here since Seaborn will have trouble converting them to usable values.

Here, the margins of the plot show a Histogram for the rating feature and a Histogram for the release_year feature, while the center plots the two features against each other:

[Figure: joint histogram plot seaborn]

We can see that most of the added entries are TV-MA. However, there are also a lot of TV-14 entries, so there's a nice selection of shows for the entire family.
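A reading like this can be cross-checked numerically. The sketch below uses pandas on a few made-up rows that only mimic the rating and release_year columns, so the counts are illustrative and say nothing about the real dataset. It tallies entries per rating and builds a year-by-rating table.

```python
import pandas as pd

# Made-up rows that only mimic the 'rating' and 'release_year' columns
df = pd.DataFrame({
    "rating": ["TV-MA", "TV-14", "TV-MA", "TV-PG", None, "TV-MA", "TV-14"],
    "release_year": [2019, 2018, 2020, 2019, 2017, 2020, 2020],
})

# Drop only rows missing one of the two columns we plot
clean = df.dropna(subset=["rating", "release_year"])

# Entries per rating, most common first
rating_counts = clean["rating"].value_counts()

# Rows are release years, columns are ratings, cells are entry counts
table = pd.crosstab(clean["release_year"], clean["rating"])

print(rating_counts)
print(table)
```

Restricting dropna() to the plotted columns with subset keeps rows that are only missing values in unrelated columns, which a bare dropna() would discard.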

Conclusion

In this tutorial, we've gone over several ways to plot a histogram using Seaborn and Python.

If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.

Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.



from Planet Python