Monday, December 21, 2020

Stack Abuse: Seaborn Violin Plot - Tutorial and Examples

Introduction

Seaborn is one of the most widely used data visualization libraries in Python, and is built on top of Matplotlib. It offers a simple, intuitive, yet highly customizable API for data visualization.

In this tutorial, we'll take a look at how to plot a Violin Plot in Seaborn.

Violin plots are used to visualize data distributions, displaying the range, median, and distribution of the data.

Violin plots show the same summary statistics as box plots, but they also include Kernel Density Estimations that represent the shape/distribution of the data.

Importing Data

To start with, we’ll want to choose a dataset that is suited to the creation of violin plots.

The dataset should have continuous, numerical features, because Violin Plots are used to visualize the range, median, and distribution of continuous data.

Violin Plots essentially show the same summary statistics as box plots, but they also include additional information. The shape of the “Violin” in a Violin Plot is a Kernel Density Estimation that represents the shape/distribution of the data.
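To make the idea of a Kernel Density Estimation more concrete, here is a minimal sketch that uses only NumPy. The sample values, the fixed bandwidth, and the helper name gaussian_kde_sketch are assumptions made for illustration; Seaborn picks its own bandwidth internally.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid` (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up life expectancy values, for illustration only
samples = np.array([28.8, 30.3, 45.0, 52.1, 60.4, 66.2, 70.9, 75.0, 78.5, 79.4])
grid = np.linspace(0, 110, 1101)
density = gaussian_kde_sketch(samples, grid, bandwidth=5.0)

# Summary statistics of the kind the inner box of a violin marks
q1, median, q3 = np.percentile(samples, [25, 50, 75])
```

The density array is what gives the violin its outline: wide where many samples sit close together, narrow where they are sparse.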

For this tutorial, we will be working with the Gapminder dataset.

We’ll start by importing Seaborn, the PyPlot module from Matplotlib, and Pandas:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We’ll then need to import the data. We’ll print the head of the dataset to ensure that the data has been properly loaded and to take a look at the names of the columns/features.

We’ll also check to make sure that there is no missing data:

# on_bad_lines="skip" replaces the deprecated error_bad_lines=False (pandas 1.3+)
dataframe = pd.read_csv("gapminder_full.csv", on_bad_lines="skip", encoding="ISO-8859-1")
print(dataframe.head())
print(dataframe.isnull().values.any())

This results in:

       country  year  population continent  life_exp     gdp_cap
0  Afghanistan  1952     8425333      Asia    28.801  779.445314
1  Afghanistan  1957     9240934      Asia    30.332  820.853030
2  Afghanistan  1962    10267083      Asia    31.997  853.100710
3  Afghanistan  1967    11537966      Asia    34.020  836.197138
4  Afghanistan  1972    13079460      Asia    36.088  739.981106
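The single boolean from isnull().values.any() does not say where a gap is. As a rough sketch, the snippet below counts missing values per column on a tiny in-memory stand-in for the CSV file. The blank life_exp value is introduced on purpose for illustration and is not part of the real dataset.

```python
import io
import pandas as pd

# A tiny stand-in for gapminder_full.csv; the empty life_exp field in the
# second row is an assumption added here to have something to detect
csv_text = """country,year,population,continent,life_exp,gdp_cap
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,,820.853030
Afghanistan,1962,10267083,Asia,31.997,853.100710
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop rows that lack a value in the column we intend to plot
cleaned = sample.dropna(subset=["life_exp"])
```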

We’ll select the features we need, both categorical and continuous, and save them as variables to pass in to the Seaborn functions:

country = dataframe.country
continent = dataframe.continent
population = dataframe.population
life_exp = dataframe.life_exp
gdp_cap = dataframe.gdp_cap

Plotting a Simple Violin Plot in Seaborn

Now that we have loaded in the data and selected our features of choice, we can create the violin plot.

In order to create a violin plot, we just use the violinplot() function in Seaborn.

We pass in the variables we want to visualize. We can pass in just the X variable, in which case Seaborn draws a single horizontal violin whose width reflects the estimated density of the values:

sns.violinplot(x=life_exp)

plt.show()

Alternatively, you don't need to extract the features beforehand. By supplying the data argument, and assigning it to our DataFrame, you can simply reference the variable name, which is then matched to the dataset:

sns.violinplot(x="life_exp", data = dataframe)

This produces the exact same result.

Please note: In this plot, Seaborn is showing the distribution of life expectancy across all countries, as we've only supplied the life_exp variable. Most of the time, we'll want to also separate a variable like this based on another variable, such as country or continent.

Plotting Violin Plot with X and Y Variables

Here we will pass in a categorical X-variable and a continuous Y-variable, as there is a specific distribution we would like to see segmented by type.

In this dataset, we've got a lot of countries. If we plot them all, there will be too many to practically view and the figure will be way too overcrowded. We could subset the dataset and just plot, say, 10 countries.
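A subset like that could be built along these lines. This is only a sketch: the frame below is made up and merely reuses the tutorial's column names, and n_countries is set to 3 so the result is easy to inspect (it would be 10 on the full dataset).

```python
import pandas as pd

# Small made-up frame with the same column names as the tutorial's dataset
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "life_exp": [50.1, 55.3, 61.0, 64.2, 70.5, 72.8, 45.9, 49.0],
})

# Keep only the rows that belong to the first N distinct countries
n_countries = 3
selected = df["country"].unique()[:n_countries]
subset = df[df["country"].isin(selected)]
```

The resulting subset can then be passed to violinplot() through the data argument, with x="country" and y="life_exp".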

Alternatively, we can group the data by continent instead:

sns.violinplot(x=continent, y=life_exp, data=dataframe)

Customizing The Plot

We can customize our violin plot in a few different ways.

Change Violin Plot Labels with Seaborn

Let's say we'd like to add some titles and labels to our plot to assist others in interpreting the data. Although Seaborn will automatically label the X and Y axes, we may want to change the labels.

This can be done with the set_title(), set_xlabel(), and set_ylabel() functions of the Axes object that violinplot() returns. We just pass the title we want to give our plot into the set_title() function.

In order to label the axes, we use the set() function and provide labels to the xlabel and ylabel arguments, or use the wrapper set_xlabel()/set_ylabel() functions:

ax = sns.violinplot(x=continent, y=life_exp)
ax.set_title("Life Expectancy By Continent")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Continent")

plt.show()
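The set() variant mentioned above might look like the following sketch. The DataFrame holds made-up values that stand in for the Gapminder columns, and the label texts are arbitrary choices.

```python
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Gapminder columns
df = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "life_exp": [28.8, 30.3, 45.0, 52.1, 60.4, 66.2, 70.9, 75.0, 78.5, 79.4],
})

ax = sns.violinplot(x="continent", y="life_exp", data=df)

# Axes.set() accepts several properties as keyword arguments in one call
ax.set(title="Life Expectancy By Continent",
       xlabel="Continent",
       ylabel="Life Expectancy (years)")
```

As in the other snippets, calling plt.show() afterwards displays the figure.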

Change Violin Plot Color with Seaborn

One way that we can customize the plot is to assign it specific colors. We can create a list of pre-chosen colors and pass them into the palette parameter:

colors_list = ['#78C850', '#F08030',  '#6890F0',  '#A8B820',  '#F8D030', '#E0C068', '#C03028', '#F85888', '#98D8D8']

ax = sns.violinplot(x=continent, y=life_exp, palette=colors_list)
ax.set_title("Life Expectancy By Continent")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Continent")

plt.show()
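When the order of a color list is hard to keep in sync with the categories, a dictionary is one alternative. The sketch below assumes made-up data with only two continents and reuses two of the hex codes from the list above.

```python
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Gapminder columns
df = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "life_exp": [28.8, 30.3, 45.0, 52.1, 60.4, 66.2, 70.9, 75.0, 78.5, 79.4],
})

# Map each category to a color explicitly instead of relying on list order
continent_colors = {"Asia": "#78C850", "Europe": "#6890F0"}

ax = sns.violinplot(x="continent", y="life_exp", data=df,
                    palette=continent_colors)
```

Recent Seaborn releases may warn that palette should be combined with a hue variable; the plot is still drawn.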

Overlaying Swarmplot Over Violin Plot in Seaborn

We could even overlay a Swarmplot onto the Violin Plot in order to see the distribution and samples of the points comprising that distribution. In order to do this, we just create a single figure object and then create two different plots:

colors_list = ['#78C850', '#F08030',  '#6890F0',  '#A8B820',  '#F8D030', '#E0C068', '#C03028', '#F85888', '#98D8D8']

plt.figure(figsize=(10,6))
sns.violinplot(x=continent, y=life_exp, palette=colors_list)
sns.swarmplot(x=continent, y=life_exp, color="k", alpha=0.8)
plt.title("Life Expectancy By Continent")
plt.ylabel("Gapminder Life Expectancy")
plt.xlabel("Continent")

plt.show()

Change Violin Plot Style with Seaborn

We can easily change the style and color palette of our plot by using the set_style() and set_palette() functions respectively.

Seaborn supports a number of different options to change the style and palette of the figure:

plt.figure(figsize=(10,6))
sns.set_palette("RdBu")
sns.set_style("darkgrid")
sns.violinplot(x=continent, y=life_exp, data=dataframe)
sns.swarmplot(x=continent, y=life_exp, data=dataframe, color="k", alpha=0.8)
plt.title("Life Expectancy By Continent")
plt.ylabel("Gapminder Life Expectancy")
plt.xlabel("Continent")

plt.show()
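Note that set_style() changes the style for every plot created afterwards. If the change should apply to one figure only, Seaborn's axes_style() can be used as a context manager, as in this sketch with made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Gapminder columns
df = pd.DataFrame({
    "continent": ["Asia"] * 5 + ["Europe"] * 5,
    "life_exp": [28.8, 30.3, 45.0, 52.1, 60.4, 66.2, 70.9, 75.0, 78.5, 79.4],
})

# The style applies only to axes created inside the with-block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.violinplot(x="continent", y="life_exp", data=df, ax=ax)
```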

Subplotting Violin Plots with Seaborn

Finally, if we wanted to split the columns up into their own subplots, we could do this by creating a figure and then using the add_gridspec() function to create a grid where we can place our subplot.

We then just use the add_subplot() function and specify where in the grid we want to place the current subplot, creating the plot as we normally would, using the axes object.

Here, we can either set y=variable, or use data=variable.

fig = plt.figure(figsize=(6, 6))
gs = fig.add_gridspec(1, 3)

ax = fig.add_subplot(gs[0, 0])

sns.violinplot(data=population)
ax.set_xlabel("Population")

ax = fig.add_subplot(gs[0, 1])
sns.violinplot(data=life_exp)
ax.set_xlabel("Life Exp.")

ax = fig.add_subplot(gs[0, 2])
sns.violinplot(data=gdp_cap)
ax.set_xlabel("GDP per Capita")

fig.tight_layout()
plt.show()
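An alternative sketch of the same layout uses plt.subplots() and passes each Axes to violinplot() through its ax argument, so no plot depends on which axes happens to be current. The values below are made up and only reuse the tutorial's column names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Gapminder columns
df = pd.DataFrame({
    "population": [8.4e6, 9.2e6, 1.0e7, 1.2e7, 1.3e7, 1.5e7],
    "life_exp": [28.8, 30.3, 32.0, 34.0, 36.1, 38.4],
    "gdp_cap": [779.4, 820.9, 853.1, 836.2, 740.0, 786.1],
})

columns = ["population", "life_exp", "gdp_cap"]
fig, axes = plt.subplots(1, 3, figsize=(9, 4))

for ax, column in zip(axes, columns):
    # ax=ax draws each violin on its own subplot
    sns.violinplot(y=df[column], ax=ax)
    ax.set_xlabel(column)
    ax.set_ylabel("")

fig.tight_layout()
```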

Grouping Violin Plots by Hue

A really useful thing to do with Violin Plots is to group by hue. If you have a categorical variable that has two values (typically, a true/false-style variable), you can group plots by hue.

For example, you could have a dataset of people, and an employment column, with employed and unemployed as values. You can then group Violin Plots by "hue" - these two flavors of employment.

Since the Gapminder dataset doesn't have a column like this, we can make one ourselves. Let's separate the European countries from the rest and calculate the mean life expectancy across the whole dataset.

Then, we can assign a Yes/No value to a new column - above_average_life_exp - for each European row. If the life expectancy in that row is higher than the dataset-wide average, this value is Yes; otherwise it is No:

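# Separate European countries from the original dataset
# (.copy() avoids a SettingWithCopyWarning when adding a column later)
europe = dataframe.loc[dataframe["continent"] == "Europe"].copy()

# Calculate mean of the `life_exp` variable across the whole dataset
avg_life_exp = dataframe["life_exp"].mean()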

# Declare an empty list
above_average_life_exp = []

# Iterate through the rows in the dataset, assigning Yes/No
# Depending on the value of the variable in the iterated row
for index, row in europe.iterrows():
        if row["life_exp"] > avg_life_exp:
                above_average_life_exp.append("Yes")
        else:
                above_average_life_exp.append("No")

# Add new column to dataset
europe["above_average_life_exp"] = above_average_life_exp

Now, if we print our dataset, we have something along the lines of:

             country  year  population continent  life_exp       gdp_cap avle
12           Albania  1952     1282697    Europe    55.230   1601.056136  No
13           Albania  1957     1476505    Europe    59.280   1942.284244  No
14           Albania  1962     1728137    Europe    64.820   2312.888958  Yes
15           Albania  1967     1984060    Europe    66.220   2760.196931  Yes
16           Albania  1972     2263554    Europe    67.690   3313.422188  Yes
...              ...   ...         ...       ...       ...           ...  ...
1603  United Kingdom  1987    56981620    Europe    75.007  21664.787670  Yes
1604  United Kingdom  1992    57866349    Europe    76.420  22705.092540  Yes
1605  United Kingdom  1997    58808266    Europe    77.218  26074.531360  Yes
1606  United Kingdom  2002    59912431    Europe    78.471  29478.999190  Yes
1607  United Kingdom  2007    60776238    Europe    79.425  33203.261280  Yes

The variable name is truncated to avle for brevity's sake.
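The row-by-row loop used to build this column can also be written without explicit iteration. The sketch below uses numpy.where() on four of the Albania values shown above; the threshold of 60.0 is a placeholder chosen for illustration, not the real dataset mean.

```python
import numpy as np
import pandas as pd

# Life expectancy values copied from the Albania rows in the output above
europe_sample = pd.DataFrame({
    "country": ["Albania"] * 4,
    "life_exp": [55.23, 59.28, 64.82, 66.22],
})

# Placeholder threshold (an assumption), standing in for avg_life_exp
threshold = 60.0

# Vectorized Yes/No assignment, one comparison for the whole column
europe_sample["above_average_life_exp"] = np.where(
    europe_sample["life_exp"] > threshold, "Yes", "No"
)
```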

Now, let's select a smaller subset of these rows using europe.tail(50) and plot Violin Plots grouped by the new column we've inserted:

europe = europe.tail(50)

ax = sns.violinplot(x=europe.country, y=europe.life_exp, hue=europe.above_average_life_exp)
ax.set_title("Life Expectancy By Country")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Nations")

plt.show()

Now, entries with a lower-than-average life expectancy are colored orange, while the other entries are colored blue. Even this doesn't tell us everything, though. Maybe we'd like to check how many of Turkey's observations have a lower-than-average life expectancy.

Here's where splitting kicks in.

Splitting Violin Plots by Hue

Seaborn Violin Plots let you pass in the split argument, which can be set to either True or False.

If you set it to True, and a hue argument is present, it'll split the Violins between the hue values.

In our case, the left side of each violin will represent entries with higher-than-average life expectancy, while the right side will represent entries with lower-than-average life expectancy:
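A minimal sketch of such a call is shown here. The country names, life expectancy values, and the threshold of 60.0 are all made up for illustration. hue_order is passed on the assumption that the first listed level is drawn on the left half of each violin.

```python
import pandas as pd
import seaborn as sns

# Made-up data: two hypothetical countries, eight observations each
df = pd.DataFrame({
    "country": ["Country A"] * 8 + ["Country B"] * 8,
    "life_exp": [48.0, 52.5, 56.0, 59.0, 61.5, 64.0, 68.0, 71.0,
                 55.0, 57.5, 58.5, 59.5, 66.0, 70.0, 74.0, 78.0],
})

# Placeholder threshold (an assumption), standing in for the dataset mean
threshold = 60.0
df["above_average_life_exp"] = [
    "Yes" if value > threshold else "No" for value in df["life_exp"]
]

# split=True draws the two hue levels as the two halves of one violin
ax = sns.violinplot(x="country", y="life_exp",
                    hue="above_average_life_exp",
                    hue_order=["Yes", "No"],
                    split=True, data=df)
ax.set_title("Life Expectancy By Country")
```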

Conclusion

In this tutorial, we've gone over several ways to plot a Violin Plot using Seaborn and Python. We've also covered how to change the labels and colors, overlay Swarmplots, subplot multiple Violin Plots, and finally, how to group plots by hue and create split Violin Plots based on a variable.

If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.

Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.


