Introduction
Seaborn is one of the most widely used data visualization libraries in Python, and is built as an extension to Matplotlib. It offers a simple, intuitive, yet highly customizable API for data visualization.
In this tutorial, we'll take a look at how to plot a Box Plot in Seaborn.
Box plots are used to visualize summary statistics of a dataset, displaying attributes of the distribution such as its median, quartiles, and range.
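To make these summary statistics concrete, here is a minimal sketch of the numbers a box plot typically encodes. It uses only the standard library; the helper name box_stats and the sample values are made up for illustration, and the 1.5 x IQR whisker rule is assumed to match the default used by Seaborn's boxplot() (its whis argument).

```python
from statistics import quantiles


def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws (hypothetical helper, for illustration only)."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between neighbouring data points
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme points that are still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values, not from the dataset
stats = box_stats(sample)
print(stats)
```

The box spans q1 to q3 with a line at the median, the whiskers reach the two whisker values, and anything in outliers is drawn as an individual point.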
Import Data
We’ll need to select a dataset with continuous features in order to create a Box Plot, because Box Plots display summary statistics for continuous variables - such as the median, quartiles, and range. We’ll be working with the Forest Fires dataset.
We’ll begin with importing Pandas to load and parse the dataset. We’ll obviously want to import Seaborn as well. Finally, we’ll import the Pyplot module from Matplotlib, so that we can show the visualizations:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
Let's use Pandas to read the CSV file and check what our DataFrame looks like by printing its head. Additionally, we'll want to check if the dataset contains any missing values:
dataframe = pd.read_csv("forestfires.csv")
print(dataframe.head())
print(dataframe.isnull().values.any())
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
False
The second print statement prints False, which means that there isn't any missing data. If there were, we'd have to handle missing DataFrame values.
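If the check had come back True, two common ways of handling the gaps are dropping incomplete rows or filling them in. The sketch below assumes a small made-up frame (the column names mirror the dataset, the values do not), and which option is appropriate depends on the data:

```python
import pandas as pd

# Made-up frame with gaps, for illustration only
sample = pd.DataFrame({
    "temp": [8.2, float("nan"), 14.6, 8.3],
    "RH": [51.0, 33.0, float("nan"), 97.0],
})

# Count the missing values in each column
print(sample.isnull().sum())

# Option 1: discard every row that has at least one missing value
dropped = sample.dropna()

# Option 2: fill each gap with the median of its column
filled = sample.fillna(sample.median())
```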
After we check for the consistency of our dataset, we want to select the continuous features that we want to visualize. We’ll save these as their own variables for convenience:
FFMC = dataframe["FFMC"]
DMC = dataframe["DMC"]
DC = dataframe["DC"]
RH = dataframe["RH"]
ISI = dataframe["ISI"]
temp = dataframe["temp"]
Plotting a Box Plot in Seaborn
Now that we have loaded in the data and selected the features that we want to visualize, we can create the Box Plots!
We can create the Box Plot just by using Seaborn’s boxplot() function. We pass in the variable we want to visualize:
sns.boxplot(x=DMC)
plt.show()
If we want to visualize just the distribution of a single continuous variable, we can provide our chosen variable as the x argument. If we do this, Seaborn draws a single horizontal box with the variable's values along the X-axis, as we can see in the previous image.
However, if there’s a specific distribution that we want to see segmented by category, we can provide a continuous variable together with a categorical one. Here, DMC goes on the X-axis and the day of the week on the Y-axis, so the boxes are drawn horizontally:
day = dataframe["day"]
sns.boxplot(x=DMC, y=day)
plt.show()
This time around, we can see a Box Plot generated for each day of the week, as specified in the dataset.
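Which variable goes on which axis decides the orientation of the boxes. The following sketch draws both layouts side by side, assuming a small made-up sample in place of the real dataset and a headless Matplotlib backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values, for illustration only
sample = pd.DataFrame({
    "day": ["fri", "fri", "fri", "sat", "sat", "sat", "sun", "sun", "sun"],
    "DMC": [26.2, 33.3, 41.0, 43.7, 50.1, 38.9, 51.3, 47.2, 60.4],
})

fig, (ax_h, ax_v) = plt.subplots(1, 2, figsize=(10, 4))
# Continuous x, categorical y: horizontal boxes
sns.boxplot(data=sample, x="DMC", y="day", ax=ax_h)
# Categorical x, continuous y: vertical boxes
sns.boxplot(data=sample, x="day", y="DMC", ax=ax_v)
fig.tight_layout()
# plt.show()  # uncomment when running interactively
```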
If we want to visualize multiple columns at the same time, what do we provide to the x and y arguments? Well, we provide the labels for the data we want, and provide the actual data using the data argument.
We can create a new DataFrame containing just the data we want to visualize, melt() it into long format, and pass the result to the data argument, providing labels such as x='variable' and y='value':
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()
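The 'variable' and 'value' labels come from melt() itself: by default it stacks every column into two new columns with those names. A minimal sketch, using only the first two FFMC and DMC rows from the head printed earlier:

```python
import pandas as pd

# First two rows of FFMC and DMC from the printed head
wide = pd.DataFrame({"FFMC": [86.2, 90.6], "DMC": [26.2, 35.4]})

# Each original column name becomes an entry in "variable",
# and each cell becomes an entry in "value"
long = pd.melt(wide)
print(long)
```

Each row of the long frame is one observation, which is why Seaborn can group by 'variable' on the X-axis and plot 'value' on the Y-axis.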
Customize a Seaborn Box Plot
Change Box Plot Colors
Seaborn automatically assigns different colors to different variables so we can easily tell them apart. We can also supply our own list of colors if we'd like to specify them.
After choosing a list of colors with hex values (or any valid Matplotlib color), we can pass them into the palette argument:
day = dataframe["day"]
colors = ['#78C850', '#F08030', '#6890F0','#F8D030', '#F85888', '#705898', '#98D8D8']
sns.boxplot(x=DMC, y=day, palette=colors)
plt.show()
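Recent Seaborn releases may warn when palette is passed without hue. One possible way around that, sketched here on made-up data, is to assign the categorical column to hue as well and map each category to a color with a dictionary. The day_colors mapping and the sample frame are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values, for illustration only
sample = pd.DataFrame({
    "day": ["fri", "fri", "fri", "sat", "sat", "sat", "sun", "sun", "sun"],
    "DMC": [26.2, 33.3, 41.0, 43.7, 50.1, 38.9, 51.3, 47.2, 60.4],
})

# Explicit category-to-color mapping (hypothetical choice of colors)
day_colors = {"fri": "#78C850", "sat": "#F08030", "sun": "#6890F0"}

fig, ax = plt.subplots()
# dodge=False keeps one full-width box per day even though hue is set
sns.boxplot(data=sample, x="DMC", y="day", hue="day",
            palette=day_colors, dodge=False, ax=ax)

# Some versions add a legend that only repeats the axis labels; remove it if present
if ax.get_legend() is not None:
    ax.get_legend().remove()
# plt.show()  # uncomment when running interactively
```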
Customize Axis Labels
We can adjust the X-axis and Y-axis labels easily, such as changing the font size, changing the label text, or rotating them to make ticks easier to read. Seaborn's boxplot() function returns a Matplotlib Axes object, which we can use to set them:
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df))
boxplot.axes.set_title("Distribution of Forest Fire Conditions", fontsize=16)
boxplot.set_xlabel("Conditions", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14)
plt.show()
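Rotating the tick labels goes through the same Axes object. A minimal sketch, assuming Matplotlib's tick_params() method and using only the five rows from the head printed earlier rather than the full dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The five rows shown in the printed head
sample = pd.DataFrame({
    "FFMC": [86.2, 90.6, 90.6, 91.7, 89.3],
    "DMC": [26.2, 35.4, 43.7, 33.3, 51.3],
    "DC": [94.3, 669.1, 686.9, 77.5, 102.2],
    "ISI": [5.1, 6.7, 6.7, 9.0, 9.6],
})

fig, ax = plt.subplots()
sns.boxplot(x="variable", y="value", data=pd.melt(sample), ax=ax)
ax.set_xlabel("Conditions", fontsize=14)
# Rotate the X-axis tick labels by 45 degrees
ax.tick_params(axis="x", labelrotation=45)
# plt.show()  # uncomment when running interactively
```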
Ordering Box Plots
If we want to view the boxes in a specific order, we can do that by making use of the order argument, and supplying the column names in the order we want to see them in:
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df), order=["DC", "DMC", "FFMC", "ISI"])
boxplot.axes.set_title("Distribution of Forest Fire Conditions", fontsize=16)
boxplot.set_xlabel("Conditions", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14)
plt.show()
Creating Subplots
If we wanted to separate out the plots for the individual features into their own subplots, we could do that by creating a figure and axes with the subplots() function from Matplotlib. Then, we access the individual axes objects via their index. The boxplot() function accepts an ax argument, specifying which axes it should be plotted on:
fig, axes = plt.subplots(1, 2)
sns.boxplot(x=day, y=DMC, orient='v', ax=axes[0])
sns.boxplot(x=day, y=DC, orient='v', ax=axes[1])
plt.show()
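With more than a couple of features, indexing each axes by hand gets repetitive. One option is to loop over the feature names and the axes together, as in this sketch on made-up data (the features list and sample frame are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values, for illustration only
sample = pd.DataFrame({
    "day": ["fri", "fri", "fri", "sat", "sat", "sat"],
    "DMC": [26.2, 33.3, 41.0, 43.7, 50.1, 38.9],
    "DC": [94.3, 77.5, 120.0, 686.9, 650.2, 701.4],
})

features = ["DMC", "DC"]
fig, axes = plt.subplots(1, len(features), figsize=(10, 4))

# Pair each feature with its own axes and draw one Box Plot per pair
for ax, feature in zip(axes, features):
    sns.boxplot(data=sample, x="day", y=feature, ax=ax)
    ax.set_title(feature)

fig.tight_layout()  # keep neighbouring labels from overlapping
# plt.show()  # uncomment when running interactively
```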
Boxplot With Data Points
We could even overlay a Strip Plot onto the Box Plot in order to see the individual samples that make up the distribution, in a bit more detail.
In order to do this, we just create the two plots one after the other. The stripplot() will be overlaid on the boxplot(), since they're drawn on the same axes/figure:
df = pd.DataFrame(data=dataframe, columns=["FFMC", "DMC", "DC", "ISI"])
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(df), order=["DC", "DMC", "FFMC", "ISI"])
boxplot = sns.stripplot(x="variable", y="value", data=pd.melt(df), marker="o", alpha=0.3, color="black", order=["DC", "DMC", "FFMC", "ISI"])
boxplot.axes.set_title("Distribution of Forest Fire Conditions", fontsize=16)
boxplot.set_xlabel("Conditions", fontsize=14)
boxplot.set_ylabel("Values", fontsize=14)
plt.show()
Conclusion
In this tutorial, we've gone over several ways to plot a Box Plot using Seaborn and Python. We've also covered how to customize the colors, labels, and ordering, as well as how to overlay Strip Plots and arrange multiple Box Plots as subplots.
If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.
Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.