Introduction
Seaborn is one of the most widely used data visualization libraries in Python and is built on top of Matplotlib. It offers a simple, intuitive, yet highly customizable API for data visualization.
In this tutorial, we'll take a look at how to plot a Violin Plot in Seaborn.
Violin plots are used to visualize data distributions, displaying the range, median, and distribution of the data.
Violin plots show the same summary statistics as box plots, but they also include Kernel Density Estimations that represent the shape/distribution of the data.
Importing Data
To start with, we’ll want to choose a dataset that is suited to the creation of violin plots.
The dataset should have continuous, numerical features, since Violin Plots visualize the distribution of continuous data.
Violin Plots show the same summary statistics as box plots, but they also include additional information. The shape of the “Violin” is a Kernel Density Estimation, a smoothed estimate of how the values are distributed, mirrored around a central axis.
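To make those two ingredients concrete, here is a minimal NumPy sketch that computes box-plot style quartiles and a hand-rolled Gaussian kernel density estimate. The synthetic sample, the variable names, and the use of Scott's rule for the bandwidth are assumptions made for illustration; Seaborn's own implementation may differ in detail.

```python
import numpy as np

# Hypothetical sample standing in for a continuous feature such as life expectancy
rng = np.random.default_rng(0)
values = rng.normal(loc=60.0, scale=10.0, size=200)

# Summary statistics that a box plot (and the inside of a violin) displays
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Gaussian kernel density estimate, evaluated on a grid of 100 points
n = values.size
bandwidth = values.std(ddof=1) * n ** (-1 / 5)  # Scott's rule for 1-D data (assumed)
grid = np.linspace(values.min(), values.max(), 100)
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")
```

Plotting `density` against `grid`, and mirroring it around an axis, gives the outline of a violin.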
For this tutorial, we will be working with the Gapminder dataset.
We’ll start by importing Seaborn, the PyPlot module from Matplotlib, and Pandas:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
We’ll then need to import the data. We’ll print the head of the dataset to ensure that the data has been properly loaded and to take a look at the names of the columns/features.
We’ll also check to make sure that there is no missing data:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
dataframe = pd.read_csv("gapminder_full.csv", on_bad_lines="skip", encoding="ISO-8859-1")
print(dataframe.head())
print(dataframe.isnull().values.any())
This results in:
       country  year  population  continent  life_exp     gdp_cap
0  Afghanistan  1952     8425333       Asia    28.801  779.445314
1  Afghanistan  1957     9240934       Asia    30.332  820.853030
2  Afghanistan  1962    10267083       Asia    31.997  853.100710
3  Afghanistan  1967    11537966       Asia    34.020  836.197138
4  Afghanistan  1972    13079460       Asia    36.088  739.981106
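If the missing-data check ever returns True, it helps to see which columns are affected. The sketch below uses a small, made-up frame with one deliberately missing value (it is not the real Gapminder file) to show a per-column count and one possible way to drop incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing life_exp value, for illustration only
sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania"],
    "life_exp": [28.801, np.nan, 55.230],
    "gdp_cap": [779.445314, 820.853030, 1601.056136],
})

# Count missing values per column instead of a single True/False
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop rows where the feature we intend to plot is missing
cleaned = sample.dropna(subset=["life_exp"])
print(len(cleaned))
```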
We’ll select the features we need, both categorical and continuous, and save them as variables to pass in to the Seaborn functions:
country = dataframe.country
continent = dataframe.continent
population = dataframe.population
life_exp = dataframe.life_exp
gdp_cap = dataframe.gdp_cap
Plotting a Simple Violin Plot in Seaborn
Now that we have loaded in the data and selected our features of choice, we can create the violin plot.
To create a violin plot, we just use the violinplot() function in Seaborn.
We pass in the variable we want to visualize. We can pass in just the X variable, and the function will estimate the density of its values and draw it as the width of the violin:
sns.violinplot(x=life_exp)
plt.show()
Alternatively, you don't need to extract the features beforehand. By supplying the data argument and assigning our DataFrame to it, you can simply reference the column name, which is then matched to the dataset:
sns.violinplot(x="life_exp", data=dataframe)
This produces the exact same result.
Please note: Here, Seaborn is plotting the distribution of life expectancy across all countries, as we've only supplied the life_exp variable. Most of the time, we'll also want to separate a variable like this based on another variable, such as country or continent.
Plotting Violin Plot with X and Y Variables
Here we will pass in a categorical X-variable and a continuous Y-variable, since we'd like to see the distribution segmented by category.
In this dataset, we've got a lot of countries. If we plot them all, there will be too many to practically view and the figure will be way too overcrowded. We could subset the dataset and just plot, say, 10 countries.
Alternatively, we can plot by continent:
sns.violinplot(x=continent, y=life_exp, data=dataframe)
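The country-subset option mentioned above could look roughly like the following sketch. The small frame and the list of selected countries are made up for illustration; with the real data you would filter `dataframe` in the same way.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Gapminder columns used in this tutorial
data = pd.DataFrame({
    "country": ["Albania"] * 3 + ["Turkey"] * 3 + ["Sweden"] * 3,
    "life_exp": [55.2, 64.8, 67.7, 43.6, 54.3, 61.0, 71.9, 73.7, 75.4],
})

# Keep only the countries we want to compare
selected_countries = ["Albania", "Turkey"]
subset = data[data["country"].isin(selected_countries)]

# One violin per remaining country
ax = sns.violinplot(x="country", y="life_exp", data=subset)
```

Call plt.show() afterwards (with matplotlib.pyplot imported as plt) to display the figure.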
Customizing The Plot
We can customize our violin plot in a few different ways.
Change Violin Plot Labels with Seaborn
Let's say we'd like to add a title and labels to our plot to help others interpret the data. Although Seaborn will automatically label the X and Y axes, we may want to change the labels.
This can be done with the set_title(), set_xlabel() and set_ylabel() functions of the Axes object that violinplot() returns. We just pass the title we want to give our plot into the set_title() function.
In order to label the axes, we can use the set() function and provide labels to the xlabel and ylabel arguments, or use the set_xlabel()/set_ylabel() functions directly:
ax = sns.violinplot(x=continent, y=life_exp)
ax.set_title("Life Expectancy By Continent")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Continent")
plt.show()
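The set() variant mentioned above is not shown in the snippet, so here is a minimal sketch of it. The tiny frame is a made-up stand-in for the Gapminder columns, and the label text is only an example.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Gapminder columns used in this tutorial
data = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "life_exp": [28.8, 45.1, 60.3, 55.2, 68.9, 76.4],
})

ax = sns.violinplot(x="continent", y="life_exp", data=data)

# Set the title and both axis labels in a single call
ax.set(
    title="Life Expectancy By Continent",
    xlabel="Continent",
    ylabel="Gapminder Life Expectancy",
)
```

Both approaches end up setting the same properties on the Axes, so which one to use is mostly a matter of taste.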
Change Violin Plot Color with Seaborn
One way that we can customize the plot is to assign it specific colors. We can create a list of pre-chosen colors and pass them into the palette parameter:
colors_list = ['#78C850', '#F08030', '#6890F0', '#A8B820', '#F8D030', '#E0C068', '#C03028', '#F85888', '#98D8D8']
ax = sns.violinplot(x=continent, y=life_exp, palette=colors_list)
ax.set_title("Life Expectancy By Continent")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Continent")
plt.show()
Overlaying Swarmplot Over Violin Plot in Seaborn
We could even overlay a Swarmplot onto the Violin Plot in order to see the distribution and samples of the points comprising that distribution. In order to do this, we just create a single figure object and then create two different plots:
colors_list = ['#78C850', '#F08030', '#6890F0', '#A8B820', '#F8D030', '#E0C068', '#C03028', '#F85888', '#98D8D8']
plt.figure(figsize=(10,6))
sns.violinplot(x=continent, y=life_exp, palette=colors_list)
sns.swarmplot(x=continent, y=life_exp, color="k", alpha=0.8)
plt.title("Life Expectancy By Continent")
plt.ylabel("Gapminder Life Expectancy")
plt.xlabel("Continent")
plt.show()
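With many observations per category, a swarm plot can become crowded. One possible variation, sketched below on synthetic data, is to hide the inner box of the violin with inner=None and overlay a jittered strip plot instead. The data and styling values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the continent and life_exp columns
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "continent": np.repeat(["Africa", "Asia", "Europe"], 100),
    "life_exp": np.concatenate([
        rng.normal(50, 9, 100),
        rng.normal(60, 11, 100),
        rng.normal(72, 5, 100),
    ]),
})

fig, ax = plt.subplots(figsize=(10, 6))
# inner=None removes the box/quartile marks so the points stay readable
sns.violinplot(x="continent", y="life_exp", data=data, inner=None, ax=ax)
# Jittered points are cheaper to lay out than a swarm for large samples
sns.stripplot(x="continent", y="life_exp", data=data,
              color="k", alpha=0.5, size=3, jitter=True, ax=ax)
ax.set_title("Life Expectancy By Continent")
```

Passing the same ax to both calls makes it explicit that the two plots share one set of axes.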
Change Violin Plot Style with Seaborn
We can easily change the style and color palette of our plot by using the set_style() and set_palette() functions respectively.
Seaborn supports a number of different options to change the style and palette of the figure:
plt.figure(figsize=(10,6))
sns.set_palette("RdBu")
sns.set_style("darkgrid")
sns.violinplot(x=continent, y=life_exp, data=dataframe)
sns.swarmplot(x=continent, y=life_exp, data=dataframe, color="k", alpha=0.8)
plt.title("Life Expectancy By Continent")
plt.ylabel("Gapminder Life Expectancy")
plt.xlabel("Continent")
plt.show()
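Note that set_style() changes the style for every plot created afterwards. If the style should only apply to a single figure, one option is Seaborn's axes_style() context manager, sketched here on synthetic data (the frame and the chosen style are assumptions for illustration).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the continent and life_exp columns
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "continent": np.repeat(["Africa", "Europe"], 50),
    "life_exp": np.concatenate([rng.normal(50, 9, 50), rng.normal(72, 5, 50)]),
})

# The style applies only to axes created inside the with-block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.violinplot(x="continent", y="life_exp", data=data, ax=ax)
```

The axes have to be created inside the block, since the style is picked up when the axes are constructed.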
Subplotting Violin Plots with Seaborn
Finally, if we wanted to split the columns up into their own subplots, we could do this by creating a figure and then using the add_gridspec() function to create a grid where we can place our subplots.
We then just use the add_subplot() function and specify where in the grid we want to place the current subplot, creating the plot as we normally would, using the axes object.
Here, we can either set y=variable, or use data=variable.
fig = plt.figure(figsize=(6, 6))
gs = fig.add_gridspec(1, 3)
ax = fig.add_subplot(gs[0, 0])
sns.violinplot(data=population)
ax.set_xlabel("Population")
ax = fig.add_subplot(gs[0, 1])
sns.violinplot(data=life_exp)
ax.set_xlabel("Life Exp.")
ax = fig.add_subplot(gs[0, 2])
sns.violinplot(data=gdp_cap)
ax.set_xlabel("GDP per Capita")
fig.tight_layout()
plt.show()
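For a simple row of plots, an alternative to a grid spec is plt.subplots(), passing each Axes to violinplot() explicitly through the ax argument. The sketch below uses synthetic columns; the values and label text are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for three continuous Gapminder columns
rng = np.random.default_rng(7)
data = pd.DataFrame({
    "population": rng.lognormal(mean=15, sigma=1.5, size=200),
    "life_exp": rng.normal(60, 12, size=200),
    "gdp_cap": rng.lognormal(mean=8, sigma=1.0, size=200),
})

labels = {
    "population": "Population",
    "life_exp": "Life Exp.",
    "gdp_cap": "GDP per Capita",
}

# One row, three columns; each violin is drawn on its own Axes
fig, axes = plt.subplots(1, 3, figsize=(9, 4))
for ax, (column, label) in zip(axes, labels.items()):
    sns.violinplot(y=data[column], ax=ax)
    ax.set_xlabel(label)
    ax.set_ylabel("")
fig.tight_layout()
```

Passing ax explicitly avoids relying on which subplot happens to be the current one.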
Grouping Violin Plots by Hue
A really useful thing to do with Violin Plots is to group by hue. If you have a categorical variable with two values (typically, a true/false-style variable), you can group plots by hue.
For example, you could have a dataset of people with an employment column that has employed and unemployed as values. You can then group Violin Plots by "hue" - these two flavors of employment.
Since the Gapminder dataset doesn't have a column like this, we can make one ourselves. Let's take a subset of countries, say the European ones, and compare each entry against the mean life expectancy of the whole dataset.
Then, we can assign a Yes/No value to a new column - above_average_life_exp - for each row. If the life expectancy in that row is higher than the dataset-wide average, this value is Yes, and otherwise No:
# Separate European countries from the original dataset
# (.copy() avoids a SettingWithCopyWarning when adding a column later)
europe = dataframe.loc[dataframe["continent"] == "Europe"].copy()
# Calculate the mean of the `life_exp` variable across the whole dataset
avg_life_exp = dataframe["life_exp"].mean()
# Declare an empty list
above_average_life_exp = []
# Iterate through the rows in the dataset, assigning Yes/No
# depending on the value of the variable in the iterated row
for index, row in europe.iterrows():
    if row["life_exp"] > avg_life_exp:
        above_average_life_exp.append("Yes")
    else:
        above_average_life_exp.append("No")
# Add new column to dataset
europe["above_average_life_exp"] = above_average_life_exp
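The same column can also be built without an explicit loop. The sketch below uses NumPy's where() on a small made-up frame; the rows and the threshold value are hypothetical, whereas the tutorial computes the threshold from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the European subset
europe_sample = pd.DataFrame({
    "country": ["Albania", "Albania", "Turkey"],
    "life_exp": [55.23, 64.82, 43.585],
})

# Hypothetical threshold; in the tutorial this is the mean of life_exp
avg_life_exp = 59.0

# Vectorized Yes/No assignment, one value per row
europe_sample["above_average_life_exp"] = np.where(
    europe_sample["life_exp"] > avg_life_exp, "Yes", "No"
)
print(europe_sample)
```

On larger frames this is typically much faster than iterrows(), and it keeps the values aligned with the rows they were computed from.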
Now, if we print our dataset, we have something along the lines of:
             country  year  population  continent  life_exp       gdp_cap  avle
  12         Albania  1952     1282697     Europe    55.230   1601.056136    No
  13         Albania  1957     1476505     Europe    59.280   1942.284244    No
  14         Albania  1962     1728137     Europe    64.820   2312.888958   Yes
  15         Albania  1967     1984060     Europe    66.220   2760.196931   Yes
  16         Albania  1972     2263554     Europe    67.690   3313.422188   Yes
 ...             ...   ...         ...        ...       ...           ...   ...
1603  United Kingdom  1987    56981620     Europe    75.007  21664.787670   Yes
1604  United Kingdom  1992    57866349     Europe    76.420  22705.092540   Yes
1605  United Kingdom  1997    58808266     Europe    77.218  26074.531360   Yes
1606  United Kingdom  2002    59912431     Europe    78.471  29478.999190   Yes
1607  United Kingdom  2007    60776238     Europe    79.425  33203.261280   Yes
The column name is truncated to avle for brevity's sake.
Now, let's select a smaller subset of these countries using europe.tail(50) and plot Violin Plots grouped by the new column we've inserted:
europe = europe.tail(50)
ax = sns.violinplot(x=europe.country, y=europe.life_exp, hue=europe.above_average_life_exp)
ax.set_title("Life Expectancy By Country")
ax.set_ylabel("Gapminder Life Expectancy")
ax.set_xlabel("Nations")
plt.show()
In the resulting plot, entries with a lower-than-average life expectancy are colored orange, while the others are colored blue. Even this doesn't tell us everything, though. Maybe we'd like to check how many of the entries for Turkey have a lower-than-average life expectancy.
Here's where splitting kicks in.
Splitting Violin Plots by Hue
Seaborn Violin Plots let you pass in the split argument, which can be set to either True or False.
If you set it to True, and a hue argument is present, it'll split the Violins between the hue values.
In our case, the left side of each violin will represent entries with higher-than-average life expectancy, while the right side will represent entries with lower-than-average life expectancy.
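A minimal sketch of a split violin plot is shown below. It uses synthetic data with two placeholder countries rather than the Gapminder subset, and it passes hue_order so that the "Yes" half should be drawn first, on the left. All names and values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: two placeholder countries with overlapping distributions
rng = np.random.default_rng(3)
data = pd.DataFrame({
    "country": np.repeat(["Country A", "Country B"], 60),
    "life_exp": np.concatenate([rng.normal(66, 8, 60), rng.normal(62, 8, 60)]),
})

# Yes/No column relative to the overall mean, as in the grouping section
threshold = data["life_exp"].mean()
data["above_average_life_exp"] = np.where(data["life_exp"] > threshold, "Yes", "No")

# split=True draws the two hue levels as the two halves of each violin
ax = sns.violinplot(
    x="country",
    y="life_exp",
    hue="above_average_life_exp",
    hue_order=["Yes", "No"],
    split=True,
    data=data,
)
ax.set_title("Life Expectancy By Country")
```

With the tutorial's data, the equivalent call would reference the europe DataFrame and its above_average_life_exp column. Call plt.show() afterwards to display the figure.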
Conclusion
In this tutorial, we've gone over several ways to plot a Violin Plot using Seaborn and Python. We've also covered how to change the labels and colors, as well as how to overlay Swarmplots, subplot multiple Violin Plots, and finally, how to group plots by hue and create split Violin Plots based on a variable.
If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.
Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.