Pandas Box Plots

Box plots (also known as box and whisker plots) are powerful visualization tools that help you understand the distribution of your data. They are particularly useful for identifying outliers and comparing distributions across different groups. In this tutorial, we'll explore how to create box plots using Pandas' built-in plotting functionality.

Introduction to Box Plots

Box plots provide a standardized way to display the distribution of data based on a five-number summary:

  1. Minimum (excluding outliers)
  2. First quartile (Q1 - 25th percentile)
  3. Median (Q2 - 50th percentile)
  4. Third quartile (Q3 - 75th percentile)
  5. Maximum (excluding outliers)

Outliers are typically plotted as individual points beyond the whiskers.
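The five numbers can also be computed directly, which is a handy cross-check on what a plot shows. The sketch below uses a made-up Series called `values` (not part of this tutorial's datasets). Note that `min` and `max` here are the raw extremes, whereas the whisker ends in a box plot exclude outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=0, scale=1, size=200), name='values')

# Quartiles plus the raw extremes of the sample
summary = values.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
summary.index = ['min', 'Q1', 'median', 'Q3', 'max']
print(summary)
```

If the sample contains no outliers, these values coincide with the whisker ends, box edges, and median line of the plot.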

[Figure: Box Plot Anatomy]

Setting Up Your Environment

Before we begin, make sure you have the necessary packages installed:

bash
pip install pandas matplotlib

Let's import the libraries we'll need:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the style for better-looking plots
# (the 'seaborn-v0_8' style name requires Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8')

Creating Basic Box Plots

Let's start with a simple example by creating a dataset and visualizing it with a box plot:

python
# Create a sample DataFrame
np.random.seed(42)  # For reproducibility
df = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 100),
    'Group B': np.random.normal(1, 1, 100),
    'Group C': np.random.normal(-1, 1, 100)
})

# Create a basic box plot
box_plot = df.plot.box()
plt.title('Basic Box Plot')
plt.ylabel('Value')
plt.show()

This produces a box plot that looks like this:

[Figure: Basic Box Plot]

From this visualization, you can immediately see that Group B has the highest median value, Group C has the lowest, and Group A is in the middle.
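A reading taken from a plot is easy to confirm numerically. As a small sketch, the same DataFrame is rebuilt here so the block runs on its own, and the medians are sorted from highest to lowest:

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's sample data with the same seed
np.random.seed(42)
df = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 100),
    'Group B': np.random.normal(1, 1, 100),
    'Group C': np.random.normal(-1, 1, 100)
})

# Median of each column, highest first
medians = df.median().sort_values(ascending=False)
print(medians)
```

The order of the printed index should match the order of the median lines in the plot.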

Understanding Box Plot Components

Let's break down what each part of a box plot represents:

  • Box: The box spans from Q1 (25th percentile) to Q3 (75th percentile), representing the interquartile range (IQR)
  • Line inside the box: Represents the median (50th percentile)
  • Whiskers: Typically extend to the smallest and largest values within 1.5 times the IQR from the box edges
  • Points outside whiskers: Considered outliers
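The components above can be worked out by hand. The following is a minimal sketch on a small, made-up Series (`values` is hypothetical, with one deliberately extreme point), assuming the common 1.5 × IQR whisker rule:

```python
import pandas as pd

# Hypothetical data; 30 is deliberately far from the rest
values = pd.Series([2, 3, 4, 5, 5, 6, 7, 8, 9, 30])

# Box edges and interquartile range
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f'Box: {q1} to {q3} (IQR = {iqr})')
print(f'Whiskers: {whisker_low} to {whisker_high}')
print(f'Outliers: {outliers.tolist()}')
```

Note that the whiskers end at actual data points, not at the fences themselves.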

Customizing Box Plots

Pandas allows you to customize box plots to better suit your needs:

python
# Create a more customized box plot
fig, ax = plt.subplots(figsize=(10, 6))

df.plot.box(ax=ax,
            color={'boxes': 'darkgreen',
                   'whiskers': 'darkblue',
                   'medians': 'red',
                   'caps': 'black'},
            grid=True,
            vert=True,  # True for vertical box plots, False for horizontal
            sym='r+')   # Symbol for outliers

ax.set_title('Customized Box Plot', fontsize=15)
ax.set_ylabel('Value', fontsize=12)
plt.show()

This creates a more visually appealing box plot with custom colors and formatting:

[Figure: Customized Box Plot]
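For finer control than the `color` dictionary offers, one option is to style the Matplotlib artists after plotting. This is a sketch rather than the only way to do it: it assumes `DataFrame.boxplot` with `return_type='dict'`, which returns the drawn lines keyed by component, and it uses a two-column frame purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 100),
    'Group B': np.random.normal(1, 1, 100)
})

fig, ax = plt.subplots(figsize=(8, 5))

# return_type='dict' hands back the Matplotlib lines, keyed by component name
artists = df.boxplot(ax=ax, return_type='dict')

# Emphasize the medians
for line in artists['medians']:
    line.set_color('red')
    line.set_linewidth(2)

# Draw the whiskers dashed
for line in artists['whiskers']:
    line.set_linestyle('--')

ax.set_title('Styling Artists After Plotting')
```

Call `plt.show()` afterwards to display the figure. Each box contributes one median line and two whiskers, so the lists grow with the number of columns.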

Grouped Box Plots

You can also create box plots to compare groups across categories:

python
# Create a sample DataFrame with multiple categories
np.random.seed(42)
data = {
    'category': ['A'] * 100 + ['B'] * 100 + ['C'] * 100,
    'measurement1': np.concatenate([
        np.random.normal(0, 1, 100),
        np.random.normal(1, 1, 100),
        np.random.normal(-1, 1, 100)
    ]),
    'measurement2': np.concatenate([
        np.random.normal(5, 1, 100),
        np.random.normal(6, 1, 100),
        np.random.normal(4, 1, 100)
    ])
}

df_grouped = pd.DataFrame(data)

# Create grouped box plots. With several columns and `by`, pandas draws one
# subplot per column, so let it create the figure instead of passing a single ax.
axes = df_grouped.boxplot(column=['measurement1', 'measurement2'],
                          by='category', figsize=(12, 7))

plt.suptitle('Measurements by Category', fontsize=15)  # Replaces the default suptitle
for ax in np.ravel(axes):
    ax.set_xlabel('Category', fontsize=12)
    ax.set_ylabel('Value', fontsize=12)
plt.show()

This draws one subplot per measurement, each comparing that measurement across the three categories:

[Figure: Grouped Box Plot]
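The numbers behind a grouped box plot can be obtained with `groupby`. The sketch below rebuilds the same grouped data so it runs on its own and computes the quartiles for each category; the name `quartiles` is only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df_grouped = pd.DataFrame({
    'category': ['A'] * 100 + ['B'] * 100 + ['C'] * 100,
    'measurement1': np.concatenate([
        np.random.normal(0, 1, 100),
        np.random.normal(1, 1, 100),
        np.random.normal(-1, 1, 100)
    ]),
    'measurement2': np.concatenate([
        np.random.normal(5, 1, 100),
        np.random.normal(6, 1, 100),
        np.random.normal(4, 1, 100)
    ])
})

# Q1, median, and Q3 for every category and measurement
quartiles = (
    df_grouped
    .groupby('category')[['measurement1', 'measurement2']]
    .quantile([0.25, 0.5, 0.75])
)
print(quartiles)
```

Each row of the result corresponds to a box edge or median line for one category, which makes it easy to quote exact values alongside the plot.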

Horizontal Box Plots

Sometimes, horizontal box plots work better, especially with long category names:

python
# Create a horizontal box plot
fig, ax = plt.subplots(figsize=(10, 6))
df.plot.box(ax=ax, vert=False)  # vert=False draws the boxes horizontally

ax.set_title('Horizontal Box Plot', fontsize=15)
ax.set_xlabel('Value', fontsize=12)
plt.show()

[Figure: Horizontal Box Plot]

Real-World Application: Analyzing Iris Dataset

Let's apply what we've learned to a real dataset, the famous Iris dataset:

python
# Load the Iris dataset
# (requires scikit-learn, which is not in the install step above: pip install scikit-learn)
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]

# Create box plots for each feature by species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

features = iris.feature_names

for i, feature in enumerate(features):
    iris_df.boxplot(column=feature, by='species', ax=axes[i])
    axes[i].set_title(f'{feature} by Species', fontsize=14)
    axes[i].set_xlabel('')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.suptitle('Iris Flower Features by Species', fontsize=16, y=1.0)
plt.show()

This creates a set of box plots comparing the four features of iris flowers across three species:

[Figure: Iris Box Plots]

From these visualizations, we can see that:

  • Petal length and width are excellent discriminators between species
  • Setosa has the smallest measurements for petal dimensions with very little variance
  • Versicolor and virginica show more overlap in sepal dimensions

Adding Notches to Box Plots

Notches provide a visual indication of the confidence interval around the median:

python
# Box plot with notches
fig, ax = plt.subplots(figsize=(10, 6))
df.plot.box(ax=ax, notch=True) # Add notches to the plot

ax.set_title('Box Plot with Notches', fontsize=15)
ax.set_ylabel('Value', fontsize=12)
plt.show()

If the notches of two boxes do not overlap, this is commonly read as strong evidence (roughly at the 95% confidence level) that the two medians differ.

[Figure: Notched Box Plot]
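To make the notch rule concrete, the interval can be approximated by hand. This sketch assumes the commonly used formula median ± 1.57 × IQR / √n, which, to my understanding, is what Matplotlib applies by default when no bootstrapping is requested. The helper `notch_interval` is a hypothetical name, not a pandas or Matplotlib function.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 100),
    'Group B': np.random.normal(1, 1, 100),
    'Group C': np.random.normal(-1, 1, 100)
})


def notch_interval(values):
    """Approximate notch bounds: median +/- 1.57 * IQR / sqrt(n)."""
    q1, med, q3 = values.quantile([0.25, 0.5, 0.75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(values))
    return med - half_width, med + half_width


intervals = {name: notch_interval(df[name]) for name in df.columns}
for name, (low, high) in intervals.items():
    print(f'{name}: [{low:.3f}, {high:.3f}]')

# Two intervals overlap when each one starts before the other ends
a_low, a_high = intervals['Group A']
b_low, b_high = intervals['Group B']
overlap = a_low <= b_high and b_low <= a_high
print('Group A and Group B notches overlap:', overlap)
```

Treat the overlap check as a quick visual heuristic; a formal test is still the better choice when the conclusion matters.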

Comparing Box Plots with Other Visualizations

Let's compare box plots with other common visualization methods to highlight their strengths:

python
# Create a dataset with skewed distribution
np.random.seed(42)
skewed_data = np.random.exponential(scale=2.0, size=1000)
df_skewed = pd.DataFrame({'skewed': skewed_data})

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Histogram
df_skewed.plot.hist(ax=axes[0], bins=30)
axes[0].set_title('Histogram')

# Box plot
df_skewed.plot.box(ax=axes[1])
axes[1].set_title('Box Plot')

# KDE (Kernel Density Estimate) plot
df_skewed.plot.kde(ax=axes[2])
axes[2].set_title('KDE Plot')

plt.tight_layout()
plt.show()

[Figure: Comparison Plots]

While histograms and KDE plots show the full distribution, box plots excel at:

  • Identifying outliers
  • Comparing distributions across multiple groups
  • Showcasing summary statistics in a compact form
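The outlier point is easy to see with the skewed data. As a rough sketch, the snippet below regenerates the same exponential sample and counts the points beyond each 1.5 × IQR fence, which are the points a box plot would draw individually:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
skewed = pd.Series(np.random.exponential(scale=2.0, size=1000))

q1, q3 = skewed.quantile([0.25, 0.75])
iqr = q3 - q1

# Count points beyond each fence
n_low = int((skewed < q1 - 1.5 * iqr).sum())
n_high = int((skewed > q3 + 1.5 * iqr).sum())

print(f'Points below the lower fence: {n_low}')
print(f'Points above the upper fence: {n_high}')
```

Because the exponential distribution has a long right tail and no values below zero, all flagged points sit on the upper side. In a skewed sample, such points may simply reflect the shape of the distribution rather than errors in the data.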

Summary

Box plots are versatile visualization tools in Pandas that help you:

  • Understand the distribution of your data through the five-number summary
  • Identify outliers in your dataset
  • Compare distributions across different groups or categories
  • Present statistical information in a compact, standardized format

When working with box plots in Pandas, remember:

  1. The basic syntax is df.plot.box() or df.boxplot()
  2. You can customize colors, orientations, and styles to enhance clarity
  3. Grouped box plots allow for easy comparison across categories
  4. Notched box plots can indicate statistical significance in median differences

Additional Resources

Exercises

  1. Create a box plot for the Titanic dataset (available in Seaborn) comparing passenger ages across different classes.
  2. Generate a dataset with three different distributions (normal, uniform, and exponential) and visualize them using box plots.
  3. Create a horizontal box plot for a dataset with at least 10 categories.
  4. Use box plots to identify outliers in a real-world dataset of your choice, then analyze what those outliers represent.
  5. Create a notched box plot and interpret whether the differences between groups are statistically significant.

Happy visualizing!


