Pandas Histogram
Introduction
Histograms are one of the most fundamental tools in data visualization, providing a graphical representation of the distribution of numerical data. In this tutorial, we'll explore how to create histograms using the Pandas library, which provides a convenient interface to matplotlib for creating these visualizations.
A histogram divides the data into bins (intervals) and shows the frequency of observations that fall into each bin. This gives you a visual understanding of:
- The central tendency of your data
- The spread or dispersion of your data
- The presence of outliers
- The shape of the distribution (symmetric, skewed, bimodal, etc.)
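To make the idea of binning concrete, here is a minimal sketch that counts how many observations fall into each bin without drawing anything. The eight measurements and the bin edges are made up purely for illustration; `np.histogram` is the NumPy function that performs the counting.

```python
import numpy as np
import pandas as pd

# Eight hypothetical measurements, chosen only for illustration
data = pd.Series([1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.7])

# Four equal-width bins covering the interval from 1 to 5
edges = np.array([1, 2, 3, 4, 5])
counts, _ = np.histogram(data, bins=edges)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

The heights of the bars in a histogram are exactly these per-bin counts.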
Basic Histogram in Pandas
Creating a Simple Histogram
Let's start by creating a simple histogram using Pandas' built-in plotting functionality. First, we'll import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set the style for better visualizations
plt.style.use('ggplot')
Now, let's create some sample data and visualize it using a histogram:
# Create a DataFrame with random data
np.random.seed(42) # For reproducibility
df = pd.DataFrame({
'values': np.random.normal(0, 1, 1000) # 1000 samples from a normal distribution
})
# Create a basic histogram
df['values'].hist()
plt.title('Basic Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
The output is a histogram showing the distribution of our randomly generated data, which should approximate a normal distribution.
Understanding the Histogram Method
The .hist() method in Pandas is a convenient wrapper around matplotlib's histogram functionality. Here's the basic syntax:
Series.hist(bins=10, range=None, figsize=None, grid=True, ...)
Some important parameters include:
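- bins: Number of bins or a sequence of bin edges (default is 10)
- range: Tuple specifying the (min, max) of the bins
- figsize: Tuple specifying the figure dimensions
- grid: Whether to show grid lines
- alpha: Transparency level
- color: Color of the bars
Note that range, alpha, and color are not named parameters of .hist() itself; they are passed through as keyword arguments to matplotlib's hist function.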
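As a rough sketch of how `bins` and `range` behave, the example below draws the same data twice: once with explicit bin edges and once with a fixed number of bins restricted to an interval. The variable names (`values`, `ax_edges`, `ax_range`) and the chosen edges are illustrative assumptions, not part of the tutorial's data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
values = pd.Series(np.random.normal(0, 1, 1000))

fig, (ax_edges, ax_range) = plt.subplots(1, 2, figsize=(10, 4))

# A sequence is interpreted as bin edges: 7 edges give 6 bins of unequal width
values.hist(bins=[-4, -2, -1, 0, 1, 2, 4], ax=ax_edges)
ax_edges.set_title('Explicit bin edges')

# An integer plus range gives equal-width bins over that interval only;
# observations outside (-2, 2) are ignored
values.hist(bins=8, range=(-2, 2), ax=ax_range)
ax_range.set_title('8 bins on (-2, 2)')

plt.tight_layout()
plt.show()
```

Explicit edges are handy when the bins should line up with meaningful thresholds, while `range` is a quick way to zoom in on part of the data.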
Customizing Histograms
Adjusting the Number of Bins
The number of bins can significantly affect how your histogram looks and the insights you can draw from it:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Different bin sizes
df['values'].hist(bins=5, ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('5 Bins')
df['values'].hist(bins=10, ax=axes[0, 1], color='green')
axes[0, 1].set_title('10 Bins')
df['values'].hist(bins=20, ax=axes[1, 0], color='purple')
axes[1, 0].set_title('20 Bins')
df['values'].hist(bins=50, ax=axes[1, 1], color='orange')
axes[1, 1].set_title('50 Bins')
plt.tight_layout()
plt.show()
This creates a 2x2 grid of histograms with different bin counts.
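If you would rather not pick the bin count by eye, one option is to let a rule of thumb suggest it. The sketch below uses NumPy's `np.histogram_bin_edges` with the Freedman-Diaconis estimator (`bins="fd"`); treating this rule as a sensible default is an assumption, and other estimators may suit your data better.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
values = pd.Series(np.random.normal(0, 1, 1000))

# Ask NumPy for bin edges chosen by the Freedman-Diaconis rule
edges = np.histogram_bin_edges(values, bins="fd")
print(f"Suggested number of bins: {len(edges) - 1}")
```

The resulting array can be passed straight to the plotting call, for example `values.hist(bins=edges)`.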
Customizing Appearance
You can customize various aspects of the histogram:
df['values'].hist(
bins=30,
color='teal',
alpha=0.7,
edgecolor='black',
grid=False,
figsize=(10, 6)
)
plt.title('Customized Histogram', fontsize=16)
plt.xlabel('Values', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.axvline(x=df['values'].mean(), color='red', linestyle='--', label='Mean')  # Mark the sample mean
plt.legend()
plt.show()
This produces a more visually appealing histogram with customized colors, borders, and an additional reference line.
Advanced Histogram Techniques
Creating Multiple Histograms
Pandas makes it easy to compare distributions by plotting multiple histograms:
# Create a DataFrame with multiple columns
np.random.seed(42)
df_multi = pd.DataFrame({
'Group A': np.random.normal(0, 1, 1000),
'Group B': np.random.normal(2, 1.5, 1000),
'Group C': np.random.exponential(2, 1000)
})
# Plot histograms for all columns
df_multi.hist(bins=20, figsize=(12, 6), layout=(1, 3), alpha=0.7)
plt.tight_layout()
plt.show()
This creates separate histograms for each column in our DataFrame.
Overlaid Histograms
To directly compare distributions, you might want to overlay histograms:
plt.figure(figsize=(10, 6))
# Plot histograms on the same axes
plt.hist(df_multi['Group A'], bins=30, alpha=0.5, label='Group A')
plt.hist(df_multi['Group B'], bins=30, alpha=0.5, label='Group B')
plt.title('Comparison of Distributions')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
This shows both distributions overlaid for direct comparison.
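The same kind of overlay can also be produced through Pandas itself rather than by calling matplotlib directly. The following is a minimal sketch using `DataFrame.plot.hist`, which draws the selected columns on a single set of axes; the DataFrame is rebuilt here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
df_multi = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 1000),
    'Group B': np.random.normal(2, 1.5, 1000),
})

# One call overlays both columns and adds a legend from the column names
ax = df_multi.plot.hist(bins=30, alpha=0.5, figsize=(10, 6))
ax.set_title('Comparison of Distributions')
ax.set_xlabel('Values')
plt.show()
```

One difference to be aware of: here the bin edges are computed once from all selected columns, so both groups share the same bins, whereas two separate `plt.hist` calls each compute their own edges.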
Normalized Histograms
Sometimes you want to compare distributions with different sample sizes. In this case, you can normalize the histograms to show density instead of frequency:
plt.figure(figsize=(10, 6))
# Plot normalized histograms
plt.hist(df_multi['Group A'], bins=30, alpha=0.5, density=True, label='Group A')
plt.hist(df_multi['Group C'], bins=30, alpha=0.5, density=True, label='Group C')
plt.title('Normalized Distributions')
plt.xlabel('Values')
plt.ylabel('Density')
plt.legend()
plt.show()
The output shows probability density rather than raw counts.
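To see what "density" means in practice, the sketch below checks numerically that the bars of a normalized histogram cover a total area of 1. It uses `np.histogram` with `density=True`, which applies the same normalization as the plotting call; the variable names are illustrative.

```python
import numpy as np

np.random.seed(42)
samples = np.random.exponential(2, 1000)

# density=True rescales the counts so that bar height times bar width sums to 1
density, edges = np.histogram(samples, bins=30, density=True)
area = np.sum(density * np.diff(edges))

print(f"Total area under the histogram: {area:.6f}")
```

Because every normalized histogram has the same total area, two samples of very different sizes can be compared on the same vertical scale.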
Real-World Applications
Analyzing Distribution of Housing Prices
Let's create an example with real-world context:
# Simulating housing price data
np.random.seed(42)
house_prices = pd.DataFrame({
'Price': np.random.lognormal(mean=6, sigma=0.3, size=1000) * 10000
})
# Create a histogram of house prices
house_prices['Price'].hist(bins=30, edgecolor='black', figsize=(10, 6))
plt.title('Distribution of Housing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.grid(alpha=0.3)
plt.show()
# Calculate and display statistics
print("Housing Prices Statistics:")
print(house_prices['Price'].describe())
Output:
Housing Prices Statistics:
count 1000.000000
mean 545236.803194
std 177951.345581
min 198266.317736
25% 417470.937599
50% 513256.304328
75% 639446.286508
max 1628936.450517
The histogram shows that housing prices follow a right-skewed distribution, which is common for price data.
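A quick numerical cross-check of the skew is possible as well. The sketch below regenerates the same simulated prices and looks at two common indicators of right skew: a mean that exceeds the median, and a positive sample skewness from `Series.skew()`. Reading these two numbers as evidence of skew is a rule of thumb rather than a formal test.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
prices = pd.Series(np.random.lognormal(mean=6, sigma=0.3, size=1000) * 10000)

# In a right-skewed distribution the long upper tail pulls the mean above the median
print(f"Mean above median: {prices.mean() > prices.median()}")
print(f"Sample skewness: {prices.skew():.2f}")
```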
Comparing Test Scores Across Classes
Let's create another practical example:
# Simulating test scores for two classes
np.random.seed(42)
# The classes differ in size, so wrap each array in a Series;
# the shorter column is then padded with NaN instead of raising a ValueError
test_scores = pd.DataFrame({
'Class A': pd.Series(np.random.normal(75, 15, 40).clip(0, 100)),
'Class B': pd.Series(np.random.normal(68, 20, 35).clip(0, 100))
})
# Create a figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Individual histograms (Series.hist ignores missing values)
test_scores['Class A'].hist(bins=10, ax=axes[0], alpha=0.7, color='blue')
axes[0].set_title('Class A Test Scores')
axes[0].set_xlabel('Score')
axes[0].set_ylabel('Frequency')
test_scores['Class B'].hist(bins=10, ax=axes[1], alpha=0.7, color='green')
axes[1].set_title('Class B Test Scores')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Frequency')
# Overlaid histogram (drop the NaN padding before passing Class B to matplotlib)
axes[2].hist(test_scores['Class A'], bins=10, alpha=0.5, color='blue', label='Class A')
axes[2].hist(test_scores['Class B'].dropna(), bins=10, alpha=0.5, color='green', label='Class B')
axes[2].set_title('Comparison of Test Scores')
axes[2].set_xlabel('Score')
axes[2].set_ylabel('Frequency')
axes[2].legend()
plt.tight_layout()
plt.show()
# Print statistics for comparison
print("Class A Statistics:")
print(test_scores['Class A'].describe())
print("\nClass B Statistics:")
print(test_scores['Class B'].describe())
Output:
Class A Statistics:
count 40.000000
mean 75.629264
std 13.327318
min 45.289561
25% 67.916804
50% 75.512119
75% 85.350025
max 100.000000
Class B Statistics:
count 35.000000
mean 69.846493
std 18.067334
min 32.390229
25% 55.792956
50% 69.525851
75% 83.440112
max 100.000000
From these histograms, we can see that Class A has higher average scores and less variation compared to Class B.
Summary
Histograms are powerful visualization tools for understanding the distribution of your data. With Pandas, creating histograms is straightforward and highly customizable. In this tutorial, you've learned:
- How to create basic histograms using Pandas' .hist() method
- How to customize the appearance of histograms by adjusting bins, colors, and other parameters
- Advanced techniques like creating multiple histograms and overlaying distributions
- How to normalize histograms to compare distributions with different sample sizes
- Real-world applications of histograms in data analysis
Histograms are just the beginning of your data visualization journey with Pandas. They provide valuable insights into your data's distribution, helping you make informed decisions in your data analysis process.
Additional Resources and Exercises
Resources
- Pandas Documentation on Plotting
- Matplotlib Histogram Documentation
- Towards Data Science: Data Visualization with Pandas
Exercises
- Basic Exercise: Load a dataset of your choice (e.g., from seaborn.load_dataset()) and create a histogram of a numerical column.
- Intermediate Exercise: Compare the distributions of a numerical variable across different categorical groups using multiple histograms.
- Advanced Exercise: Create a function that takes a DataFrame and automatically generates histograms for all numerical columns with appropriate bin sizes based on the Freedman-Diaconis rule.
- Challenge: Create a dashboard-style visualization with multiple histograms showing different aspects of a dataset, including both raw frequencies and normalized distributions.