Pandas Box Plots
Box plots (also known as box and whisker plots) are powerful visualization tools that help you understand the distribution of your data. They are particularly useful for identifying outliers and comparing distributions across different groups. In this tutorial, we'll explore how to create box plots using Pandas' built-in plotting functionality.
Introduction to Box Plots
Box plots provide a standardized way to display the distribution of data based on a five-number summary:
- Minimum (excluding outliers)
- First quartile (Q1 - 25th percentile)
- Median (Q2 - 50th percentile)
- Third quartile (Q3 - 75th percentile)
- Maximum (excluding outliers)
Outliers are typically plotted as individual points beyond the whiskers.
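The five numbers above can be computed directly with pandas. The following is a minimal sketch, assuming the common 1.5 × IQR convention for deciding which points count as outliers; the sample data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=0, scale=1, size=200), name="value")

# Quartiles via Series.quantile (default linear interpolation)
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed outlier rule: anything beyond 1.5 * IQR from the box is an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The "minimum" and "maximum" are the most extreme values inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
summary = pd.Series({
    "min (excl. outliers)": inside.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max (excl. outliers)": inside.max(),
})
print(summary)
```

Printing `summary` next to a box plot of the same data is a quick way to check that you are reading the plot correctly.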
Setting Up Your Environment
Before we begin, make sure you have the necessary packages installed:
pip install pandas matplotlib
Let's import the libraries we'll need:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set the style for better-looking plots (the 'seaborn-v0_8' name requires Matplotlib 3.6+)
plt.style.use('seaborn-v0_8')
Creating Basic Box Plots
Let's start with a simple example by creating a dataset and visualizing it with a box plot:
# Create a sample DataFrame
np.random.seed(42) # For reproducibility
df = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 100),
    'Group B': np.random.normal(1, 1, 100),
    'Group C': np.random.normal(-1, 1, 100)
})
# Create a basic box plot
box_plot = df.plot.box()
plt.title('Basic Box Plot')
plt.ylabel('Value')
plt.show()
This produces three boxes side by side, one for each column of the DataFrame.
From this visualization, you can immediately see that Group B has the highest median value, Group C has the lowest, and Group A is in the middle.
Understanding Box Plot Components
Let's break down what each part of a box plot represents:
- Box: The box spans from Q1 (25th percentile) to Q3 (75th percentile), representing the interquartile range (IQR)
- Line inside the box: Represents the median (50th percentile)
- Whiskers: Typically extend to the smallest and largest values within 1.5 times the IQR from the box edges
- Points outside whiskers: Considered outliers
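To see the numbers behind each component, you can ask Matplotlib for the statistics it would draw. This sketch uses `matplotlib.cbook.boxplot_stats`; the sample data, including the two planted extreme values, is hypothetical.

```python
import numpy as np
import pandas as pd
from matplotlib.cbook import boxplot_stats

# Hypothetical data: 100 normal values plus two planted extreme values
rng = np.random.default_rng(1)
sample = pd.Series(np.append(rng.normal(0, 1, 100), [8.0, -9.0]))

# boxplot_stats returns one dict per dataset; whis=1.5 is the default multiplier
stats = boxplot_stats(sample.to_numpy(), whis=1.5)[0]

print("Box:     ", stats["q1"], "to", stats["q3"])
print("Median:  ", stats["med"])
print("Whiskers:", stats["whislo"], "to", stats["whishi"])
print("Outliers:", stats["fliers"])
```

The two planted values should appear under `fliers`, while the whiskers stop at the most extreme observations that are still within 1.5 × IQR of the box.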
Customizing Box Plots
Pandas allows you to customize box plots to better suit your needs:
# Create a more customized box plot
fig, ax = plt.subplots(figsize=(10, 6))
df.plot.box(
    ax=ax,
    color={'boxes': 'darkgreen', 'whiskers': 'darkblue', 'medians': 'red', 'caps': 'black'},
    grid=True,
    vert=True,  # True for vertical box plots, False for horizontal
    sym='r+',   # Symbol for outliers
)
ax.set_title('Customized Box Plot', fontsize=15)
ax.set_ylabel('Value', fontsize=12)
plt.show()
This draws the same data with dark green boxes, dark blue whiskers, red medians, black caps, a background grid, and red plus signs for outliers.
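For finer control, another option is to style the Matplotlib artists after the plot is drawn. The sketch below assumes you use `DataFrame.boxplot` with `return_type='dict'`, which returns the line objects grouped by component; the `demo` DataFrame is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with the same shape as the tutorial's df
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Group A": rng.normal(0, 1, 100),
    "Group B": rng.normal(1, 1, 100),
    "Group C": rng.normal(-1, 1, 100),
})

fig, ax = plt.subplots(figsize=(10, 6))

# return_type='dict' returns the Matplotlib artists keyed by component name
parts = demo.boxplot(ax=ax, return_type="dict")

# Recolor and thicken only the median lines
for line in parts["medians"]:
    line.set_color("red")
    line.set_linewidth(2.5)

# Whiskers come in pairs (lower and upper), so there are two per box
print({name: len(artists) for name, artists in parts.items()})
# Call plt.show() to display the figure in an interactive session
```

This approach is handy when the `color` dictionary is not enough, for example when you want a different line width for a single component.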
Grouped Box Plots
You can also create box plots to compare groups across categories:
# Create a sample DataFrame with multiple categories
np.random.seed(42)
data = {
    'category': ['A']*100 + ['B']*100 + ['C']*100,
    'measurement1': np.concatenate([
        np.random.normal(0, 1, 100),
        np.random.normal(1, 1, 100),
        np.random.normal(-1, 1, 100)
    ]),
    'measurement2': np.concatenate([
        np.random.normal(5, 1, 100),
        np.random.normal(6, 1, 100),
        np.random.normal(4, 1, 100)
    ])
}
df_grouped = pd.DataFrame(data)
# Create grouped box plots; pandas creates the figure and draws one panel per column
axes = df_grouped.boxplot(column=['measurement1', 'measurement2'], by='category', figsize=(12, 7))
plt.suptitle('Measurements by Category', fontsize=15)  # Replace the default "Boxplot grouped by category" title
for ax in np.ravel(axes):
    ax.set_xlabel('Category', fontsize=12)
    ax.set_ylabel('Value', fontsize=12)
plt.show()
This produces one panel per measurement, each comparing the three categories.
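If you want the numbers behind a grouped box plot, a `groupby` gives you the quartiles for each category. This is a small sketch on hypothetical data that mirrors the structure of `df_grouped` for a single measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the structure of df_grouped
rng = np.random.default_rng(42)
long_df = pd.DataFrame({
    "category": np.repeat(["A", "B", "C"], 100),
    "measurement1": np.concatenate([rng.normal(m, 1, 100) for m in (0, 1, -1)]),
})

# Q1, median, and Q3 per category: the bottom, middle line, and top of each box
quartiles = (
    long_df.groupby("category")["measurement1"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "median", "Q3"]
print(quartiles)
```

Each row of `quartiles` corresponds to one box in the grouped plot, which makes it easier to quote exact values alongside the figure.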
Horizontal Box Plots
Sometimes, horizontal box plots work better, especially with long category names:
# Create a horizontal box plot
fig, ax = plt.subplots(figsize=(10, 6))
df.plot.box(ax=ax, vert=False) # vert=False draws the boxes horizontally
ax.set_title('Horizontal Box Plot', fontsize=15)
ax.set_xlabel('Value', fontsize=12)
plt.show()
Real-World Application: Analyzing Iris Dataset
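Let's apply what we've learned to a real dataset: the famous Iris dataset. The example loads it from scikit-learn, so install that package first (pip install scikit-learn) if you don't already have it.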
# Load the Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]
# Create box plots for each feature by species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
features = iris.feature_names
for i, feature in enumerate(features):
    iris_df.boxplot(column=feature, by='species', ax=axes[i])
    axes[i].set_title(f'{feature} by Species', fontsize=14)
    axes[i].set_xlabel('')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.suptitle('Iris Flower Features by Species', fontsize=16, y=1.0)
plt.show()
This creates a 2×2 grid of box plots, one per feature, each comparing the three iris species.
From these visualizations, we can see that:
- Petal length and width are excellent discriminators between species
- Setosa has the smallest measurements for petal dimensions with very little variance
- Versicolor and virginica show more overlap in sepal dimensions
Adding Notches to Box Plots
Notches provide a visual indication of the confidence interval around the median:
# Box plot with notches
fig, ax = plt.subplots(figsize=(10, 6))
df.plot.box(ax=ax, notch=True) # Add notches to the plot
ax.set_title('Box Plot with Notches', fontsize=15)
ax.set_ylabel('Value', fontsize=12)
plt.show()
If the notches of two boxes do not overlap, this is evidence (at roughly the 95% confidence level) that their medians differ.
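To make the notch less of a black box, here is a sketch of how its limits can be computed. It assumes Matplotlib's default (non-bootstrap) behaviour, which to my understanding is the median ± 1.57 × IQR / √n, and compares the hand calculation with `matplotlib.cbook.boxplot_stats`. The sample data is hypothetical.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical sample
rng = np.random.default_rng(7)
x = rng.normal(0, 1, 100)

# Hand calculation of the notch limits (assumed default formula)
q1, med, q3 = np.percentile(x, [25, 50, 75])
half_width = 1.57 * (q3 - q1) / np.sqrt(len(x))
notch_lo, notch_hi = med - half_width, med + half_width

# The same limits as reported by Matplotlib ('cilo' and 'cihi')
stats = boxplot_stats(x)[0]
print("By hand:   ", notch_lo, notch_hi)
print("Matplotlib:", stats["cilo"], stats["cihi"])
```

Because the half-width shrinks with √n, notches become narrower as the sample grows, so small differences in medians are easier to detect with more data.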
Comparing Box Plots with Other Visualizations
Let's compare box plots with other common visualization methods to highlight their strengths:
# Create a dataset with skewed distribution
np.random.seed(42)
skewed_data = np.random.exponential(scale=2.0, size=1000)
df_skewed = pd.DataFrame({'skewed': skewed_data})
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Histogram
df_skewed.plot.hist(ax=axes[0], bins=30)
axes[0].set_title('Histogram')
# Box plot
df_skewed.plot.box(ax=axes[1])
axes[1].set_title('Box Plot')
# KDE (Kernel Density Estimate) plot
df_skewed.plot.kde(ax=axes[2])
axes[2].set_title('KDE Plot')
plt.tight_layout()
plt.show()
While histograms and KDE plots show the full distribution, box plots excel at:
- Identifying outliers
- Comparing distributions across multiple groups
- Showcasing summary statistics in a compact form
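One caveat worth checking on skewed data: the 1.5 × IQR rule flags points only where the tail is long. The sketch below regenerates the same exponential sample as above and counts the flagged points on each side; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same skewed sample as in the comparison above
np.random.seed(42)
skewed = pd.Series(np.random.exponential(scale=2.0, size=1000))

q1, q3 = skewed.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond the fences are what the box plot draws as outliers
low_outliers = skewed[skewed < q1 - 1.5 * iqr]
high_outliers = skewed[skewed > q3 + 1.5 * iqr]

print("Flagged below the box:", len(low_outliers))
print("Flagged above the box:", len(high_outliers))
```

For a right-skewed distribution like this one, the flagged points all sit above the box. They reflect the long tail of the distribution rather than measurement errors, which is why it helps to look at a histogram or KDE plot alongside the box plot.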
Summary
Box plots are versatile visualization tools in Pandas that help you:
- Understand the distribution of your data through the five-number summary
- Identify outliers in your dataset
- Compare distributions across different groups or categories
- Present statistical information in a compact, standardized format
When working with box plots in Pandas, remember:
- The basic syntax is df.plot.box() or df.boxplot()
- You can customize colors, orientations, and styles to enhance clarity
- Grouped box plots allow for easy comparison across categories
- Notched box plots can indicate statistical significance in median differences
Additional Resources
- Pandas Visualization Documentation
- Matplotlib Boxplot Documentation
- Seaborn Boxplot Tutorial (for more advanced box plots)
Exercises
- Create a box plot for the Titanic dataset (available in Seaborn) comparing passenger ages across different classes.
- Generate a dataset with three different distributions (normal, uniform, and exponential) and visualize them using box plots.
- Create a horizontal box plot for a dataset with at least 10 categories.
- Use box plots to identify outliers in a real-world dataset of your choice, then analyze what those outliers represent.
- Create a notched box plot and interpret whether the differences between groups are statistically significant.
Happy visualizing!