Pandas Histogram

Introduction

Histograms are one of the most fundamental tools in data visualization, providing a graphical representation of the distribution of numerical data. In this tutorial, we'll explore how to create histograms using the Pandas library, which provides a convenient interface to matplotlib for creating these visualizations.

A histogram divides the data into bins (intervals) and shows the frequency of observations that fall into each bin. This gives you a visual understanding of:

  • The central tendency of your data
  • The spread or dispersion of your data
  • The presence of outliers
  • The shape of the distribution (symmetric, skewed, bimodal, etc.)
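To make the binning step concrete, here is a minimal sketch that counts observations per bin with NumPy's np.histogram, which performs the same kind of counting a histogram plot displays. The sample values and the choice of four bins over the range 1 to 9 are made up for illustration.

```python
import numpy as np

# Hypothetical sample of 12 observations (illustrative values only)
data = np.array([1.2, 1.9, 2.1, 2.4, 2.8, 3.0, 3.3, 3.7, 4.1, 4.6, 5.5, 9.0])

# Split the interval [1, 9] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(data, bins=4, range=(1, 9))

# Each bin is half-open [left, right), except the last one, which also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {n} observations")
```

Most values land in the first two bins, while the single value at 9.0 sits alone in the last bin, which is how an outlier typically shows up in a histogram.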

Basic Histogram in Pandas

Creating a Simple Histogram

Let's start by creating a simple histogram using Pandas' built-in plotting functionality. First, we'll import the necessary libraries:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the style for better visualizations
plt.style.use('ggplot')

Now, let's create some sample data and visualize it using a histogram:

python
# Create a DataFrame with random data
np.random.seed(42) # For reproducibility
df = pd.DataFrame({
    'values': np.random.normal(0, 1, 1000)  # 1000 samples from a normal distribution
})

# Create a basic histogram
df['values'].hist()
plt.title('Basic Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

The output is a histogram of the randomly generated data, which should approximate the bell shape of a normal distribution.

Understanding the Histogram Method

The .hist() method in Pandas is a convenient wrapper around matplotlib's histogram functionality. Here's the basic syntax:

python
Series.hist(by=None, ax=None, grid=True, figsize=None, bins=10, ..., **kwargs)

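Some important parameters include the following. Note that range, alpha, and color are not pandas parameters; they are forwarded through **kwargs to matplotlib's hist function:

  • bins: Number of bins or a sequence of bin edges (default is 10)
  • range: Tuple specifying the (min, max) covered by the bins
  • figsize: Tuple specifying the figure dimensions in inches
  • grid: Whether to show grid lines (default is True)
  • alpha: Transparency level of the bars, from 0 to 1
  • color: Color of the bars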
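As a rough illustration of these parameters, the sketch below draws the same Series twice: once with an integer bins plus range, and once with explicit bin edges. The data, the variable names, and the Agg backend (chosen here only so the code runs without a display) are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 500))  # hypothetical sample data

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Integer bins plus range: 8 equal-width bins covering -4 to 4
s.hist(bins=8, range=(-4, 4), ax=ax_left, alpha=0.6, color="teal")
ax_left.set_title("bins=8, range=(-4, 4)")

# A sequence of edges: 4 bins of unequal width
s.hist(bins=[-4, -1, 0, 1, 4], ax=ax_right, grid=False, edgecolor="black")
ax_right.set_title("explicit bin edges")

fig.tight_layout()
```

Passing explicit edges is useful when the bins should line up with meaningful thresholds rather than with the range of the data.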

Customizing Histograms

Adjusting the Number of Bins

The number of bins can significantly affect how your histogram looks and the insights you can draw from it:

python
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Same data plotted with four different bin counts
df['values'].hist(bins=5, ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('5 Bins')

df['values'].hist(bins=10, ax=axes[0, 1], color='green')
axes[0, 1].set_title('10 Bins')

df['values'].hist(bins=20, ax=axes[1, 0], color='purple')
axes[1, 0].set_title('20 Bins')

df['values'].hist(bins=50, ax=axes[1, 1], color='orange')
axes[1, 1].set_title('50 Bins')

plt.tight_layout()
plt.show()

This creates a 2x2 grid of histograms with different bin counts. Too few bins hide the shape of the distribution, while too many make the plot noisy.
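Instead of picking a bin count by eye, you can let a rule of thumb suggest one. The sketch below uses NumPy's np.histogram_bin_edges with the 'sturges', 'fd' (Freedman-Diaconis), and 'auto' rules; the sample data is generated here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0, 1, 1000)  # hypothetical sample

# Compare how many bins each rule suggests for the same data
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")

# The edges are a plain array, so they can be passed on as the bins argument,
# for example pd.Series(data).hist(bins=fd_edges)
fd_edges = np.histogram_bin_edges(data, bins="fd")
```

These rules are starting points rather than guarantees; it is still worth checking the resulting plot against a couple of nearby bin counts.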

Customizing Appearance

You can customize various aspects of the histogram:

python
df['values'].hist(
    bins=30,
    color='teal',
    alpha=0.7,
    edgecolor='black',
    grid=False,
    figsize=(10, 6)
)

plt.title('Customized Histogram', fontsize=16)
plt.xlabel('Values', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.axvline(x=df['values'].mean(), color='red', linestyle='--', label='Mean')
plt.legend()
plt.show()

This produces a histogram with custom colors, bar borders, and a dashed reference line labeled as the mean.

Advanced Histogram Techniques

Creating Multiple Histograms

Pandas makes it easy to compare distributions by plotting multiple histograms:

python
# Create a DataFrame with multiple columns
np.random.seed(42)
df_multi = pd.DataFrame({
    'Group A': np.random.normal(0, 1, 1000),
    'Group B': np.random.normal(2, 1.5, 1000),
    'Group C': np.random.exponential(2, 1000)
})

# Plot histograms for all columns
df_multi.hist(bins=20, figsize=(12, 6), layout=(1, 3), alpha=0.7)
plt.tight_layout()
plt.show()

This creates a separate histogram for each numeric column in the DataFrame, with the column name as the subplot title.
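DataFrame.hist() returns the matplotlib Axes it drew on as a NumPy array, which makes it possible to label every subplot afterwards. The following is a small sketch under that assumption; the DataFrame df_demo, the sample sizes, and the Agg backend are choices made only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so the sketch runs headless
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    "Group A": rng.normal(0, 1, 300),
    "Group B": rng.normal(2, 1.5, 300),
    "Group C": rng.exponential(2, 300),
})

# One subplot per column, laid out in a single row
axes = df_demo.hist(bins=20, layout=(1, 3), figsize=(12, 4))

# Flatten the array of Axes and label each subplot
for ax in np.ravel(axes):
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")

titles = [ax.get_title() for ax in np.ravel(axes)]
```

Working with the returned Axes avoids relying on plt.xlabel() and plt.ylabel(), which only affect the most recently active subplot.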

Overlaid Histograms

To directly compare distributions, you might want to overlay histograms:

python
plt.figure(figsize=(10, 6))

# Plot histograms on the same axes
plt.hist(df_multi['Group A'], bins=30, alpha=0.5, label='Group A')
plt.hist(df_multi['Group B'], bins=30, alpha=0.5, label='Group B')

plt.title('Comparison of Distributions')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

This draws both distributions on the same axes. The alpha=0.5 setting keeps the bars semi-transparent so that overlapping regions remain visible.
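One detail worth knowing when overlaying: passing bins=30 to each call computes the bin edges separately for each dataset, so the bars of the two groups generally do not line up. A possible way around this, sketched below with illustrative data, is to compute one shared set of edges and reuse it for both groups.

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(0, 1, 1000)    # hypothetical data, similar to Group A above
group_b = rng.normal(2, 1.5, 1000)  # hypothetical data, similar to Group B above

# One set of 30 equal-width bins spanning the combined range of both groups
shared_edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=30)

# Counting with the same edges makes the two histograms directly comparable bin by bin
counts_a, _ = np.histogram(group_a, bins=shared_edges)
counts_b, _ = np.histogram(group_b, bins=shared_edges)

# For plotting, the same array can be reused, for example:
# plt.hist(group_a, bins=shared_edges, alpha=0.5, label='Group A')
# plt.hist(group_b, bins=shared_edges, alpha=0.5, label='Group B')
```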

Normalized Histograms

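Sometimes you want to compare distributions with different sample sizes or scales. In this case, you can normalize the histograms with density=True, which rescales the bar heights so that the total area of the bars equals 1 and the y-axis shows density instead of frequency: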

python
plt.figure(figsize=(10, 6))

# Plot normalized histograms
plt.hist(df_multi['Group A'], bins=30, alpha=0.5, density=True, label='Group A')
plt.hist(df_multi['Group C'], bins=30, alpha=0.5, density=True, label='Group C')

plt.title('Normalized Distributions')
plt.xlabel('Values')
plt.ylabel('Density')
plt.legend()
plt.show()

The output shows probability density rather than raw counts, so the two histograms can be compared on the same vertical scale.
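To check what normalization does, the sketch below computes density histograms for two samples of very different sizes and verifies that the bar areas (height times width) sum to 1 in both cases. The sample sizes of 200 and 5000 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = {
    "small": rng.normal(0, 1, 200),   # hypothetical small sample
    "large": rng.normal(0, 1, 5000),  # hypothetical large sample
}

areas = {}
for name, sample in samples.items():
    # density=True divides each count by (sample size * bin width)
    density, edges = np.histogram(sample, bins=30, density=True)
    # Total area = sum of bar height * bar width
    areas[name] = float(np.sum(density * np.diff(edges)))
    print(f"{name}: total area = {areas[name]:.6f}")
```

Because both totals come out as 1 up to floating-point rounding, the bar heights are comparable even though one sample is 25 times larger than the other.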

Real-World Applications

Analyzing Distribution of Housing Prices

Let's create an example with real-world context:

python
# Simulating housing price data
np.random.seed(42)
house_prices = pd.DataFrame({
    'Price': np.random.lognormal(mean=6, sigma=0.3, size=1000) * 10000
})

# Create a histogram of house prices
house_prices['Price'].hist(bins=30, edgecolor='black', figsize=(10, 6))
plt.title('Distribution of Housing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.grid(alpha=0.3)
plt.show()

# Calculate and display statistics
print("Housing Prices Statistics:")
print(house_prices['Price'].describe())

Output (exact values depend on the simulated sample):

Housing Prices Statistics:
count       1000.000000
mean      545236.803194
std       177951.345581
min       198266.317736
25%       417470.937599
50%       513256.304328
75%       639446.286508
max      1628936.450517

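The histogram shows that housing prices follow a right-skewed distribution: most values cluster at the lower end, a long tail stretches toward high prices, and the mean sits above the median. This shape is common for price data.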
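Skewness can also be checked numerically rather than only by eye. The sketch below draws a log-normal sample and compares its mean and median, then computes the sample skewness with Series.skew(). The seed, sample size, and variable names are assumptions for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical log-normal sample, using the same shape parameters as the example above
prices = pd.Series(rng.lognormal(mean=6, sigma=0.3, size=5000))

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive for a right-skewed sample

print(f"mean > median: {mean_price > median_price}")
print(f"skewness: {skewness:.2f}")
```

A positive skewness together with a mean above the median is the numeric counterpart of the long right tail seen in the histogram.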

Comparing Test Scores Across Classes

Let's create another practical example:

python
# Simulating test scores for two classes
np.random.seed(42)
# The classes have different sizes (40 and 35), so each array is wrapped in a
# Series; the shorter column is then padded with NaN instead of raising an error
test_scores = pd.DataFrame({
    'Class A': pd.Series(np.random.normal(75, 15, 40).clip(0, 100)),
    'Class B': pd.Series(np.random.normal(68, 20, 35).clip(0, 100))
})

# Create a figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Individual histograms
test_scores['Class A'].hist(bins=10, ax=axes[0], alpha=0.7, color='blue')
axes[0].set_title('Class A Test Scores')
axes[0].set_xlabel('Score')
axes[0].set_ylabel('Frequency')

test_scores['Class B'].hist(bins=10, ax=axes[1], alpha=0.7, color='green')
axes[1].set_title('Class B Test Scores')
axes[1].set_xlabel('Score')
axes[1].set_ylabel('Frequency')

# Overlaid histogram (drop the NaN padding before passing data to matplotlib)
axes[2].hist(test_scores['Class A'].dropna(), bins=10, alpha=0.5, color='blue', label='Class A')
axes[2].hist(test_scores['Class B'].dropna(), bins=10, alpha=0.5, color='green', label='Class B')
axes[2].set_title('Comparison of Test Scores')
axes[2].set_xlabel('Score')
axes[2].set_ylabel('Frequency')
axes[2].legend()

plt.tight_layout()
plt.show()

# Print statistics for comparison
print("Class A Statistics:")
print(test_scores['Class A'].describe())
print("\nClass B Statistics:")
print(test_scores['Class B'].describe())

Output (exact values depend on the simulated sample):

Class A Statistics:
count     40.000000
mean      75.629264
std       13.327318
min       45.289561
25%       67.916804
50%       75.512119
75%       85.350025
max      100.000000

Class B Statistics:
count     35.000000
mean      69.846493
std       18.067334
min       32.390229
25%       55.792956
50%       69.525851
75%       83.440112
max      100.000000

From these histograms, we can see that Class A has higher average scores and less variation compared to Class B.
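Because the two classes have different numbers of students (40 and 35), the columns cannot be built from raw arrays of unequal length. One way to handle this, sketched below with freshly generated illustrative data, is to wrap each array in a Series so that pandas aligns them on the index and pads the shorter column with NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
class_a = rng.normal(75, 15, 40).clip(0, 100)  # hypothetical scores, 40 students
class_b = rng.normal(68, 20, 35).clip(0, 100)  # hypothetical scores, 35 students

# Series are aligned on their index, so rows 35 to 39 of 'Class B' become NaN
scores = pd.DataFrame({
    "Class A": pd.Series(class_a),
    "Class B": pd.Series(class_b),
})

# count() ignores NaN, so it reports the true class sizes
counts = scores.count()
print(counts)

# Drop the padding before handing a column to matplotlib directly
class_b_values = scores["Class B"].dropna()
```

The pandas .hist() method drops missing values on its own, whereas passing a column that contains NaN straight to plt.hist() can raise an error, so calling dropna() first is the safer habit.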

Summary

Histograms are powerful visualization tools for understanding the distribution of your data. With Pandas, creating histograms is straightforward and highly customizable. In this tutorial, you've learned:

  • How to create basic histograms using Pandas' .hist() method
  • How to customize the appearance of histograms by adjusting bins, colors, and other parameters
  • Advanced techniques like creating multiple histograms and overlaying distributions
  • How to normalize histograms to compare distributions with different sample sizes
  • Real-world applications of histograms in data analysis

Histograms are just the beginning of your data visualization journey with Pandas. They provide valuable insights into your data's distribution, helping you make informed decisions in your data analysis process.

Additional Resources and Exercises

Exercises

  1. Basic Exercise: Load a dataset of your choice (e.g., from seaborn.load_dataset()) and create a histogram of a numerical column.

  2. Intermediate Exercise: Compare the distributions of a numerical variable across different categorical groups using multiple histograms.

  3. Advanced Exercise: Create a function that takes a DataFrame and automatically generates histograms for all numerical columns with appropriate bin sizes based on the Freedman-Diaconis rule.

  4. Challenge: Create a dashboard-style visualization with multiple histograms showing different aspects of a dataset, including both raw frequencies and normalized distributions.


