Pandas Scatter Plots

Scatter plots are one of the most effective visualization techniques for exploring relationships between two numerical variables in your data. In this tutorial, we'll explore how to create scatter plots using Pandas, a powerful data manipulation library in Python that offers convenient visualization capabilities built on top of Matplotlib.

Introduction to Scatter Plots

A scatter plot displays values for two variables as points on a two-dimensional graph. Each point represents an observation where:

The position on the x-axis represents the value of one variable
The position on the y-axis represents the value of another variable

Scatter plots are excellent for:

Identifying correlations between variables
Detecting outliers in your data
Recognizing patterns and clusters
Visualizing the distribution of data points

Let's dive into creating scatter plots with Pandas!

Basic Scatter Plot with Pandas

First, let's set up our environment and create a simple scatter plot:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the style for better visualization
# (the 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6)
plt.style.use('seaborn-v0_8')

# Create a sample DataFrame
np.random.seed(42)  # For reproducibility
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100)
})

# Create a basic scatter plot
df.plot.scatter(x='x', y='y')
plt.title('Basic Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()

This code generates a simple scatter plot of 100 random points, with both coordinates drawn from a standard normal distribution.
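It helps to know that plot.scatter() returns the Matplotlib Axes it drew on, so titles and labels can also be set on that object instead of through plt. The following is a minimal sketch of the same plot in that style; the names demo, ax, points, and n_points are illustrative and not part of the tutorial's code.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: 100 points from a standard normal distribution
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'x': rng.normal(0, 1, 100),
    'y': rng.normal(0, 1, 100),
})

# plot.scatter returns the Axes object that holds the plot
ax = demo.plot.scatter(x='x', y='y')

# Configure the plot through the Axes rather than through plt
ax.set_title('Basic Scatter Plot')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.grid(True)

# The points are stored on the Axes as a single PathCollection
points = ax.collections[0]
n_points = len(points.get_offsets())
```

Call plt.show() afterwards to display the figure. Working with the returned Axes becomes useful once a figure contains more than one plot, because each call then targets a specific set of axes.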

Customizing Scatter Plots

Changing Point Size and Color

You can customize the appearance of your scatter plot by adjusting parameters like point size, color, and transparency:

# Create a scatter plot with customized appearance
df.plot.scatter(
    x='x', 
    y='y',
    s=100,                # Point size
    c='darkblue',         # Point color
    alpha=0.5,            # Transparency
    figsize=(10, 6)       # Figure size
)

plt.title('Customized Scatter Plot', fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

Color Mapping Based on Values

You can color points based on a third variable to add another dimension to your visualization:

# Create a DataFrame with a third variable
df['category'] = np.random.randint(0, 5, 100)  # Values from 0 to 4

# Create a scatter plot with color mapping
# plot.scatter returns a Matplotlib Axes, and pandas adds the colorbar
# automatically when 'c' names a numeric column
ax = df.plot.scatter(
    x='x',
    y='y',
    c='category',              # Color based on 'category' values
    cmap='viridis',            # Color map
    s=80,                      # Point size
    figsize=(10, 6)            # Figure size
)

plt.title('Scatter Plot with Color Mapping', fontsize=16)
plt.grid(True)
plt.show()

In this example, we're using the 'category' column to determine the color of each point, creating a visual representation of three variables at once.
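To see what the color mapping does with the values, the sketch below reproduces the two steps by hand: the values are first normalized to the range 0 to 1, and the colormap then turns each normalized value into an RGBA color. This assumes the default behavior where the normalization spans the minimum and maximum of the data; the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# The same value range as the 'category' column (0 to 4)
values = np.array([0, 1, 2, 3, 4])

# Step 1: rescale the values linearly so min -> 0.0 and max -> 1.0
norm = Normalize(vmin=values.min(), vmax=values.max())
scaled = norm(values)

# Step 2: look up each scaled value in the colormap
cmap = plt.get_cmap('viridis')
rgba = cmap(scaled)  # one (red, green, blue, alpha) row per value

for value, color in zip(values, rgba):
    print(value, np.round(color, 3))
```

Because the mapping is continuous, integer codes such as these are drawn as evenly spaced samples of the colormap rather than as unrelated category colors.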

Varying Point Sizes

You can also vary the size of points based on a third variable:

# Create a DataFrame with a sizing variable
df['size_var'] = np.random.randint(10, 100, 100)

# Create a scatter plot with varying point sizes
df.plot.scatter(
    x='x',
    y='y',
    s=df['size_var'],     # Point size varies with 'size_var'
    alpha=0.6,            # Transparency
    figsize=(10, 6)       # Figure size
)

plt.title('Scatter Plot with Varying Point Sizes', fontsize=16)
plt.grid(True)
plt.show()
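Matplotlib interprets s as the marker area in points squared, so raw values with a wide range can produce markers that are invisible or that cover the whole plot. One possible approach is to rescale the variable to a fixed range of areas first. The helper scale_sizes and the column names below are hypothetical, shown only as a sketch.

```python
import pandas as pd

def scale_sizes(values, min_area=20.0, max_area=300.0):
    """Linearly rescale values to marker areas (in points squared)."""
    values = pd.Series(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: use one mid-range size for every point
        return pd.Series((min_area + max_area) / 2, index=values.index)
    return min_area + (values - values.min()) / span * (max_area - min_area)

# Illustrative data with a sizing variable that spans three orders of magnitude
sizes_demo = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 1.0, 4.0, 3.0],
    'population': [5, 50, 500, 5000],
})
sizes_demo['marker_area'] = scale_sizes(sizes_demo['population'])

# Pass the rescaled areas to 's'
ax = sizes_demo.plot.scatter(
    x='x',
    y='y',
    s=sizes_demo['marker_area'],
    alpha=0.6,
)
```

A linear rescale keeps the ordering of the values but not their ratios, so the plot should say what the marker size represents.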

Real-World Examples

Example 1: Analyzing Car Fuel Efficiency

Let's analyze the relationship between car weight and fuel efficiency:

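# Import the fuel efficiency dataset (seaborn downloads it on first use)
from seaborn import load_dataset
cars = load_dataset('mpg')

# 'origin' holds strings ('usa', 'japan', 'europe'), which a colormap
# cannot use directly, so encode them as integer codes
# (alphabetical order: europe=0, japan=1, usa=2)
cars['origin_code'] = cars['origin'].astype('category').cat.codes

# Create a scatter plot to analyze relationship
# pandas adds a colorbar for 'origin_code' automatically
cars.plot.scatter(
    x='weight',           # Car weight
    y='mpg',              # Miles per gallon
    c='origin_code',      # Color by origin (encoded as integers)
    cmap='Set1',          # Color palette
    s=50,                 # Point size
    figsize=(10, 6),      # Figure size
    alpha=0.7             # Transparency
)

plt.title('Car Weight vs. Fuel Efficiency', fontsize=16)
plt.xlabel('Car Weight (pounds)')
plt.ylabel('Fuel Efficiency (miles per gallon)')
plt.grid(True, alpha=0.3)
plt.show()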

This visualization helps us see the negative correlation between car weight and fuel efficiency, while also showing differences between cars from different countries.
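A correlation seen in a scatter plot can be backed by a number using Series.corr(), which computes the Pearson correlation coefficient by default. The sketch below uses synthetic data as a stand-in for the car dataset so that it runs without a download; the coefficients used to generate it are made up for illustration and are not estimates from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: heavier cars get fewer miles per gallon, plus noise
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, 200)
mpg = 46 - 0.0075 * weight + rng.normal(0, 3, 200)
synthetic = pd.DataFrame({'weight': weight, 'mpg': mpg})

# Pearson correlation coefficient: -1 is a perfect negative linear
# relationship, 0 is no linear relationship, +1 is a perfect positive one
r = synthetic['weight'].corr(synthetic['mpg'])
print(f"Pearson r between weight and mpg: {r:.2f}")
```

With the real dataset, the same call would be cars['weight'].corr(cars['mpg']).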

Example 2: Housing Price Analysis

Let's explore housing price data to see relationships between house size and price:

# Create a housing dataset
np.random.seed(42)
housing = pd.DataFrame({
    'size_sqft': np.random.normal(2000, 500, 100),
    'price': np.random.normal(300000, 100000, 100),
    'age_years': np.random.randint(1, 50, 100)
})

# Add some correlation
housing['price'] = housing['price'] + housing['size_sqft'] * 100 - housing['age_years'] * 1000
housing['neighborhood'] = np.random.choice(['Downtown', 'Suburbs', 'Rural'], 100)

# Map neighborhoods to numeric values for coloring
neighborhood_map = {'Downtown': 0, 'Suburbs': 1, 'Rural': 2}
housing['neighborhood_code'] = housing['neighborhood'].map(neighborhood_map)

# Create the scatter plot
# plot.scatter returns an Axes; colorbar=False skips the default colorbar
ax = housing.plot.scatter(
    x='size_sqft',
    y='price',
    c='neighborhood_code',
    cmap='viridis',
    s=housing['age_years']*2,
    alpha=0.7,
    colorbar=False,
    figsize=(12, 7)
)

plt.title('House Size vs. Price by Neighborhood', fontsize=16)
plt.xlabel('House Size (square feet)')
plt.ylabel('Price (USD)')
plt.grid(True, alpha=0.3)

# The plotted points carry the colormap and normalization used for coloring
points = ax.collections[0]

# Create a custom legend for neighborhoods
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w',
           markerfacecolor=points.cmap(points.norm(code)),
           markersize=10, label=name)
    for name, code in neighborhood_map.items()
]
plt.legend(handles=legend_elements, title='Neighborhood')
plt.show()

In this example, we're visualizing:

House size vs. price on the x and y axes
Different neighborhoods using different colors
House age through the size of each point
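An alternative way to color by a categorical variable is to draw one scatter call per group on a shared Axes and let Matplotlib build the legend from the labels. The sketch below assumes a small synthetic dataset; the DataFrame homes and the color choices are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: size, neighborhood, and a price that grows with size
rng = np.random.default_rng(1)
homes = pd.DataFrame({
    'size_sqft': rng.normal(2000, 500, 60),
    'neighborhood': rng.choice(['Downtown', 'Suburbs', 'Rural'], 60),
})
homes['price'] = 100000 + homes['size_sqft'] * 100 + rng.normal(0, 30000, 60)

colors = {'Downtown': 'tab:blue', 'Suburbs': 'tab:orange', 'Rural': 'tab:green'}

fig, ax = plt.subplots(figsize=(10, 6))

# Draw each neighborhood separately on the same Axes, with its own label
for name, group in homes.groupby('neighborhood'):
    group.plot.scatter(
        x='size_sqft',
        y='price',
        color=colors[name],
        label=name,
        alpha=0.7,
        ax=ax,
    )

# The legend is built from the labels passed above
legend = ax.legend(title='Neighborhood')
labels = [text.get_text() for text in legend.get_texts()]
```

This avoids the numeric encoding and the hand-built legend, at the cost of one plotting call per category.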

Scatter Plot Matrix

For a multivariate analysis, you can create a scatter plot matrix to examine relationships between multiple variables at once:

# Create a scatter matrix
pd.plotting.scatter_matrix(
    cars[['mpg', 'weight', 'acceleration', 'displacement']], 
    figsize=(12, 10),
    diagonal='kde',       # Show kernel density estimate on diagonal
    alpha=0.7,            # Transparency
    marker='o',           # Marker style
    s=30                  # Marker size
)

plt.suptitle('Scatter Plot Matrix of Car Features', fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.95)
plt.show()

This creates a grid of scatter plots showing the relationships between multiple variables simultaneously, with kernel density plots along the diagonal.
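scatter_matrix() returns a two-dimensional NumPy array of Axes, one per panel, which makes it possible to adjust individual panels after the grid is drawn. The sketch below uses synthetic columns a, b, and c (illustrative names) and diagonal='hist' so that it does not depend on SciPy.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: 'b' depends on 'a', while 'c' is independent noise
rng = np.random.default_rng(2)
base = rng.normal(0, 1, 150)
features = pd.DataFrame({
    'a': base,
    'b': 2 * base + rng.normal(0, 0.5, 150),
    'c': rng.normal(0, 1, 150),
})

axes = pd.plotting.scatter_matrix(
    features,
    figsize=(8, 8),
    diagonal='hist',   # Histograms on the diagonal
    alpha=0.7,
)

# axes[i, j] shows column j on the x-axis and column i on the y-axis
axes[1, 0].set_facecolor('lightyellow')  # highlight the 'b' vs. 'a' panel

panel_xlabel = axes[2, 0].get_xlabel()
panel_ylabel = axes[2, 0].get_ylabel()
```

Call plt.show() afterwards to display the grid.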

Adding Trendlines to Scatter Plots

Adding a trendline can help visualize the relationship between variables:

# Create a scatter plot with a trendline
fig, ax = plt.subplots(figsize=(10, 6))

# Create the scatter plot
cars.plot.scatter(
    x='weight', 
    y='mpg', 
    alpha=0.6, 
    s=50, 
    ax=ax
)

# Add a polynomial trendline (numpy was already imported as np above)
from numpy.polynomial.polynomial import Polynomial

x = cars['weight']
y = cars['mpg']
p = Polynomial.fit(x, y, 2)  # Fit a 2nd-degree polynomial
x_new = np.linspace(x.min(), x.max(), 100)
y_new = p(x_new)

# Plot the trendline
ax.plot(x_new, y_new, 'r-', linewidth=2)

plt.title('Car Weight vs. MPG with Trendline', fontsize=16)
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)
plt.show()
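For a straight-line trend, a first-degree fit with np.polyfit is enough, and the coefficient of determination (R squared) gives a rough measure of how well the line describes the points. The sketch below uses synthetic data generated around the line y = 3x + 5; the DataFrame trend_df and the other names are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data scattered around the line y = 3x + 5
rng = np.random.default_rng(3)
x_values = rng.uniform(0, 10, 80)
trend_df = pd.DataFrame({
    'x': x_values,
    'y': 3.0 * x_values + 5.0 + rng.normal(0, 2, 80),
})

# Degree-1 fit: polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(trend_df['x'], trend_df['y'], 1)

# R squared = 1 - (residual sum of squares / total sum of squares)
predicted = slope * trend_df['x'] + intercept
ss_res = ((trend_df['y'] - predicted) ** 2).sum()
ss_tot = ((trend_df['y'] - trend_df['y'].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Draw the points, then the fitted line on the same Axes
ax = trend_df.plot.scatter(x='x', y='y', alpha=0.6)
x_line = np.linspace(trend_df['x'].min(), trend_df['x'].max(), 100)
ax.plot(
    x_line,
    slope * x_line + intercept,
    'r-',
    linewidth=2,
    label=f'y = {slope:.2f}x + {intercept:.2f} (R^2 = {r_squared:.2f})',
)
ax.legend()
```

The same pattern applies to the car data by replacing trend_df['x'] and trend_df['y'] with cars['weight'] and cars['mpg'].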

Summary

Scatter plots in Pandas offer a powerful way to visualize relationships between variables in your data. In this tutorial, you've learned:

How to create basic scatter plots using plot.scatter()
How to customize point appearance with color, size, and transparency
How to use color mapping to represent additional dimensions
How to create scatter plot matrices for multivariate analysis
How to add trendlines to better visualize relationships
How to apply these techniques to real-world datasets

Scatter plots are an essential tool in data exploration and analysis, helping you uncover patterns and relationships that might not be obvious in raw data.

Additional Resources

Exercises

Create a scatter plot using a dataset of your choice, coloring points based on a categorical variable.
Generate a scatter plot that visualizes three numerical variables (x-axis, y-axis, and point size).
Create a scatter plot matrix for a dataset with at least 4 numerical variables.
Implement a scatter plot with a linear regression line to show the trend between two variables.
Create an animated scatter plot that shows how data points change over time using multiple frames.

Happy visualizing!
