Thursday, November 26, 2020

Stack Abuse: Matplotlib Histogram Plot - Tutorial and Examples

Introduction

Matplotlib is one of the most widely used data visualization libraries in Python. From simple to complex visualizations, it's the go-to library for most.

In this tutorial, we'll take a look at how to plot a histogram in Matplotlib. Histograms are a great way to visualize the distribution of data: each bar groups values into a range, and taller bars show that more data falls in that range.

A histogram displays the shape and spread of continuous sample data.
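To make the idea of grouping values into ranges concrete, here is a minimal pure-Python sketch of the counting a histogram performs. The sample values and bin edges are invented for illustration, and the rule that the last range also includes its upper edge is assumed to mirror NumPy's convention.

```python
# Hypothetical sample values, invented for illustration only.
values = [1, 2, 2, 3, 5, 6, 6, 6, 8, 9]

# Three equal-width ranges: [0, 3), [3, 6) and [6, 9].
edges = [0, 3, 6, 9]
counts = [0] * (len(edges) - 1)

for v in values:
    for i in range(len(counts)):
        last = i == len(counts) - 1
        # Every range is half-open, except the last one,
        # which also includes its upper edge.
        if edges[i] <= v < edges[i + 1] or (last and v == edges[i + 1]):
            counts[i] += 1
            break

# Print a rough text histogram: one '#' per value in the range.
for i, c in enumerate(counts):
    print(f"{edges[i]}-{edges[i + 1]}: {'#' * c}")
```

Each printed row plays the role of one bar: the longer the row, the more values fell into that range.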

Import Data

We'll be using the Netflix Shows dataset and visualizing the distributions from there.

Let's import Pandas and load in the dataset:

import pandas as pd

df = pd.read_csv('netflix_titles.csv')
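Before plotting, it can help to check what the column actually contains. The sketch below uses a few made-up rows in place of netflix_titles.csv (only the release_year column name comes from this tutorial), so treat it as an illustration rather than a description of the real file.

```python
import io

import pandas as pd

# Hypothetical rows standing in for netflix_titles.csv.
sample_csv = io.StringIO(
    "title,release_year\n"
    "Show A,2008\n"
    "Show B,2016\n"
    "Show C,2019\n"
    "Show D,2019\n"
)
df = pd.read_csv(sample_csv)

years = df['release_year']
print(years.min(), years.max())            # range the histogram will cover
print(years.isna().sum())                  # number of missing values
print(years.value_counts().sort_index())   # titles per year
```

The minimum and maximum tell you how wide the x-axis will be, which matters once we start choosing bin sizes.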

Plot a Histogram Plot in Matplotlib

Now, with the dataset loaded in, let's import Matplotlib's PyPlot module and visualize the distribution of release_years of the shows that are live on Netflix:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('netflix_titles.csv')
plt.hist(df['release_year'])

plt.show()

Here, we've got a minimum-setup scenario. We load the data into a DataFrame (df), then use the PyPlot module and call the hist() function to plot a histogram for the release_year feature. By default, this'll count the number of occurrences of these years, group them into ranges (bins), and plot one bar per range.

Running this code results in:

matplotlib simple histogram plot tutorial

Here, hist() has used its default of 10 equal-width bins. Since the data spans roughly 100 years, each bar covers a range of about 10 years. For example, we can see that ~750 shows were released between 2000 and 2010. At the same time, ~5000 were released between 2010 and 2020.

These are pretty big ranges for the movie industry, so it makes more sense to visualize this with ranges smaller than 10 years.
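If you'd like to check which ranges hist() picked, you can inspect its return values. The sketch below is an illustration on made-up data (one hypothetical title per year from 1925 to 2020), not the Netflix dataset, and it selects the Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical release years: one title per year from 1925 to 2020.
years = np.arange(1925, 2021)

# hist() returns the bar heights, the bin edges and the drawn patches.
counts, edges, patches = plt.hist(years)  # bins left at its default

print(len(counts))      # number of bins
print(np.diff(edges))   # width of each bin, in years
```

The bin width is simply the data range divided by the number of bins, so it only lands on a round number like 10 years if the data happens to span a multiple of 10.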

Change Histogram Bin Size in Matplotlib

Let's visualize the distribution in batches of 1 year, since this is a much more realistic time frame for movie and show releases.

We'll import numpy, as it'll help us calculate the size of the bins:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('netflix_titles.csv')
data = df['release_year']

# np.arange() excludes the stop value, so we go up to max + 2 to give the latest year its own bin
plt.hist(data, bins = np.arange(min(data), max(data) + 2, 1))

plt.show()

This time around, we've extracted the DataFrame column into a data variable, just to make it a bit easier to work with.

We've passed the data to the hist() function, and set the bins argument. It accepts a sequence of bin edges, which you can set manually if you'd like, especially if you want a non-uniform bin distribution.

Since we'd like to pool these entries into equal time spans (1 year each), we'll create a NumPy array of bin edges that starts with the lowest value (min(data)) and goes in increments of 1. np.arange() leaves out its stop value, so we stop at max(data) + 2: the last edge is then max(data) + 1, and the highest year gets a bin of its own instead of sharing one with the year before it.

This time around, running this code results in:

change histogram bin size in matplotlib
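When building yearly bin edges with np.arange(), the stop value deserves a second look, because np.arange() excludes it and the last bin of a histogram includes both of its edges. The sketch below uses np.histogram() as a numeric stand-in for plt.hist() on made-up years, and compares a stop value of max + 1 with max + 2.

```python
import numpy as np

# Hypothetical release years, invented for illustration.
data = np.array([2017, 2018, 2018, 2019, 2019, 2019, 2020])

# Stop at max + 1: the last edge is 2020, so 2019 and 2020 share the final bin.
short_edges = np.arange(data.min(), data.max() + 1, 1)
short_counts, _ = np.histogram(data, bins=short_edges)

# Stop at max + 2: the last edge is 2021, so every year gets its own bin.
full_edges = np.arange(data.min(), data.max() + 2, 1)
full_counts, _ = np.histogram(data, bins=full_edges)

print(short_edges, short_counts)
print(full_edges, full_counts)
```

With the shorter set of edges, the final bar silently combines the last two years, which is easy to miss on a plot.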

Instead of a list, you can give a single bins value. This will be the total number of bins in the plot. Using 1 will result in 1 bar for the entire plot.

Say we want to have 20 bins. We'd use:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('netflix_titles.csv')
data = df['release_year']

plt.hist(data, bins = 20)

plt.show()

This results in 20 equal bins, with data within those bins pooled and visualized in their respective bars:

change histogram bin size uniformly in matplotlib

This results in roughly 5-year intervals, considering we've got ~100 years' worth of data: splitting it into 20 bins means that each one covers about 5 years.
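The width of each bin follows from the data range and the number of bins. Here is a small sketch using np.histogram_bin_edges() on a hypothetical range of years (1925 to 2020, which may not match the real dataset):

```python
import numpy as np

# Hypothetical release years; only the minimum and maximum matter here.
years = np.array([1925, 1960, 1990, 2005, 2020])

# With an integer, the edges are spaced evenly between min and max.
edges = np.histogram_bin_edges(years, bins=20)
width = edges[1] - edges[0]

print(len(edges) - 1)  # number of bins
print(width)           # years covered by each bin
```

The same arithmetic, (max - min) / bins, tells you in advance whether a given bin count produces intervals that are easy to interpret.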

Plot Histogram with Density

Sometimes, instead of the count of the features, we'd want to check what the density of each bar/bin is. That is, how common it is to see a value in that range within the dataset. With density enabled, the bar heights are scaled so that the total area of the bars is 1. Since we're working with 1-year intervals, the density of each bin equals the probability that a movie/show was released in that year.

To do this, we can simply set the density argument to True:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
# np.arange() excludes the stop value, so max + 2 gives the latest year its own bin
bins = np.arange(min(data), max(data) + 2, 1)

plt.hist(data, bins = bins, density = True)
plt.ylabel('Density')
plt.xlabel('Year')

plt.show()

Now, instead of the count we've seen before, we'll be presented with the density of entries:

histogram plot with density matplotlib

We can see that ~18% of the entries were released in 2018, followed by ~14% in 2019.
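The density value of a bin is its count divided by the total number of entries times the bin width, which is why it only reads directly as a percentage when the bins are 1 unit wide. The sketch below checks this with np.histogram() on made-up years; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical release years, invented for illustration.
data = np.array([2017, 2018, 2018, 2019, 2019, 2019, 2020, 2020])

# 1-year bins: density equals each year's share of the entries.
edges_1y = np.arange(data.min(), data.max() + 2, 1)
counts, _ = np.histogram(data, bins=edges_1y)
density, _ = np.histogram(data, bins=edges_1y, density=True)

# The same values computed by hand: count / (total * bin width).
manual = counts / (counts.sum() * np.diff(edges_1y))
print(density, manual)

# 2-year bins: the heights no longer sum to 1, but the area still does.
edges_2y = np.arange(2017, 2022, 2)
density_2y, _ = np.histogram(data, bins=edges_2y, density=True)
print(density_2y.sum(), (density_2y * np.diff(edges_2y)).sum())
```

So if you change the bin width, multiply the density by that width to get back the share of entries in each bin.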

Customizing Histogram Plots in Matplotlib

Other than these settings, there's a plethora of arguments you can set to customize the way your plot looks. Let's change a few of the common options people like to tweak:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
# np.arange() excludes the stop value, so max + 2 gives the latest year its own bin
bins = np.arange(min(data), max(data) + 2, 1)

plt.hist(data, bins = bins, density = True, histtype = 'step', alpha = 0.5,
         align = 'right', orientation = 'horizontal', log = True)

plt.show()

Here, we've set various arguments:

  • bins - The number of bins in the plot, or a sequence of bin edges (as used here)
  • density - Whether PyPlot uses count or density to populate the plot
  • histtype - The type of histogram plot (default is bar, though other values such as step or stepfilled are available)
  • alpha - The alpha/transparency of the bars or lines
  • align - How the bars are aligned relative to the bin edges (left, mid or right), default is mid
  • orientation - Horizontal/Vertical orientation, default is vertical
  • log - Whether the count/density axis should be put on a logarithmic scale or not

This now results in:

customize matplotlib histogram

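Since we've set align to right, the bars are centered on the right edge of each bin instead of the middle. With the horizontal orientation, the years run along the vertical axis, so the outline is offset a bit along that axis, past the 2020 bin.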
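When several of these options are combined, it's worth confirming which axis each one affects. The sketch below uses Matplotlib's Axes.hist() method on made-up years (not the Netflix dataset) and selects the Agg backend only so that it runs without a display; the axis labels are our own additions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical release years, invented for illustration.
years = np.array([2016, 2017, 2018, 2018, 2019, 2019, 2019, 2020])
bins = np.arange(years.min(), years.max() + 2, 1)

fig, ax = plt.subplots()
density, edges, _ = ax.hist(years, bins=bins, density=True, histtype='step',
                            orientation='horizontal', log=True)

# With a horizontal orientation, the years sit on the y-axis
# and the density values run along the x-axis.
ax.set_xlabel('Density (log scale)')
ax.set_ylabel('Year')

print(ax.get_xscale(), ax.get_yscale())
```

In an interactive session, you'd call plt.show() at the end to display the figure. Labeling the axes is especially helpful here, since the horizontal orientation swaps the roles readers expect from the earlier plots.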

Conclusion

In this tutorial, we've gone over several ways to plot a histogram using Matplotlib and Python.

If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.

Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.



from Planet Python
