Pandas Scatter Plots
Scatter plots are one of the most effective visualization techniques for exploring relationships between two numerical variables in your data. In this tutorial, we'll explore how to create scatter plots using Pandas, a powerful data manipulation library in Python that offers convenient visualization capabilities built on top of Matplotlib.
Introduction to Scatter Plots
A scatter plot displays values for two variables as points on a two-dimensional graph. Each point represents an observation where:
- The position on the x-axis represents the value of one variable
- The position on the y-axis represents the value of another variable
Scatter plots are excellent for:
- Identifying correlations between variables
- Detecting outliers in your data
- Recognizing patterns and clusters
- Visualizing the distribution of data points
Let's dive into creating scatter plots with Pandas!
Basic Scatter Plot with Pandas
First, let's set up our environment and create a simple scatter plot:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
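# Set the style for better visualization
# (Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; use 'seaborn' on older versions)
plt.style.use('seaborn-v0_8')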
# Create a sample DataFrame
np.random.seed(42) # For reproducibility
df = pd.DataFrame({
'x': np.random.normal(0, 1, 100),
'y': np.random.normal(0, 1, 100)
})
# Create a basic scatter plot
df.plot.scatter(x='x', y='y')
plt.title('Basic Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()
This code generates a simple scatter plot of 100 random points, with both coordinates drawn from a standard normal distribution.
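One detail worth knowing is that `df.plot.scatter()` returns the Matplotlib `Axes` it drew on, so titles and labels can also be set through that object instead of through `plt`. The following is a minimal sketch that reuses the same random data as above; the variable names `ax` and `points` are only conventions chosen for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # For reproducibility
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100)
})

# plot.scatter() returns the Axes that holds the plot
ax = df.plot.scatter(x='x', y='y')

# Configure the plot through the Axes rather than through pyplot
ax.set_title('Basic Scatter Plot')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.grid(True)

# The drawn points are stored on the Axes as a PathCollection
points = ax.collections[0]
print(len(points.get_offsets()), "points plotted")
```

Call `plt.show()` (after `import matplotlib.pyplot as plt`) to display the figure in a script. Working with the returned `Axes` tends to be more predictable once a figure contains several subplots.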
Customizing Scatter Plots
Changing Point Size and Color
You can customize the appearance of your scatter plot by adjusting parameters like point size, color, and transparency:
# Create a scatter plot with customized appearance
df.plot.scatter(
x='x',
y='y',
s=100, # Point size
c='darkblue', # Point color
alpha=0.5, # Transparency
figsize=(10, 6) # Figure size
)
plt.title('Customized Scatter Plot', fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
Color Mapping Based on Values
You can color points based on a third variable to add another dimension to your visualization:
# Create a DataFrame with a third variable
df['category'] = np.random.randint(0, 5, 100) # Values from 0 to 4
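# Create a scatter plot with color mapping
# (plot.scatter() returns a Matplotlib Axes, not the scatter artist)
ax = df.plot.scatter(
x='x',
y='y',
c='category', # Color based on 'category' values
cmap='viridis', # Color map
s=80, # Point size
figsize=(10, 6) # Figure size
)
# pandas adds a colorbar automatically when 'c' names a numeric column,
# so no separate plt.colorbar() call is needed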
plt.title('Scatter Plot with Color Mapping', fontsize=16)
plt.grid(True)
plt.show()
In this example, we're using the 'category' column to determine the color of each point, creating a visual representation of three variables at once.
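When `c` names a numeric column, pandas draws the colorbar itself and places it on a second `Axes` in the same figure. One way to relabel that colorbar is to reach for this extra `Axes`, as in the sketch below. Taking the last entry of `fig.axes` is an assumption that holds for a figure containing only this one plot; the label text is made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'category': np.random.randint(0, 5, 100)  # Values from 0 to 4
})

ax = df.plot.scatter(x='x', y='y', c='category', cmap='viridis',
                     s=80, figsize=(10, 6))

# The figure now holds two Axes: the main plot first, the colorbar last
fig = ax.figure
colorbar_ax = fig.axes[-1]

# A vertical colorbar shows its label as the y-label of its own Axes
colorbar_ax.set_ylabel('Category (0 to 4)')

ax.set_title('Scatter Plot with Color Mapping', fontsize=16)
```

Passing `colorbar=False` to `plot.scatter()` suppresses the automatic colorbar when it is not wanted.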
Varying Point Sizes
You can also vary the size of points based on a third variable:
# Create a DataFrame with a sizing variable
df['size_var'] = np.random.randint(10, 100, 100)
# Create a scatter plot with varying point sizes
df.plot.scatter(
x='x',
y='y',
s=df['size_var'], # Point size varies with 'size_var'
alpha=0.6, # Transparency
figsize=(10, 6) # Figure size
)
plt.title('Scatter Plot with Varying Point Sizes', fontsize=16)
plt.grid(True)
plt.show()
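Matplotlib interprets `s` as the marker area in points squared, so raw data values often produce markers that are too small or too large. A possible approach is to rescale the column into a chosen size range first. In the sketch below, the helper `scale_sizes`, its default range, and the `population` column are all hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def scale_sizes(values, min_size=20, max_size=300):
    """Linearly rescale values into [min_size, max_size] for use as `s`."""
    values = pd.Series(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: give every point the midpoint size
        return pd.Series((min_size + max_size) / 2, index=values.index)
    return min_size + (values - values.min()) / span * (max_size - min_size)

np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'population': np.random.randint(1_000, 1_000_000, 100)  # hypothetical column
})

# Rescale the raw values before passing them as marker areas
sizes = scale_sizes(df['population'])
ax = df.plot.scatter(x='x', y='y', s=sizes, alpha=0.6, figsize=(10, 6))
ax.set_title('Point Sizes Rescaled from a Data Column')
```

Because `s` is an area, doubling a value doubles the marker's area, not its diameter. Keep that in mind when readers are expected to compare sizes visually.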
Real-World Examples
Example 1: Analyzing Car Fuel Efficiency
Let's analyze the relationship between car weight and fuel efficiency:
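# Import the fuel efficiency dataset (downloaded on first use, so internet access is required)
from seaborn import load_dataset
cars = load_dataset('mpg')
# 'origin' holds strings ('usa', 'europe', 'japan'); convert it to a categorical
# so that pandas (1.3 or newer) can map each category to a color
cars['origin'] = cars['origin'].astype('category')
# Create a scatter plot to analyze relationship
cars.plot.scatter(
x='weight', # Car weight
y='mpg', # Miles per gallon
c='origin', # Color by origin (country)
cmap='Set1', # Color palette
s=50, # Point size
figsize=(10, 6), # Figure size
alpha=0.7 # Transparency
)
plt.title('Car Weight vs. Fuel Efficiency', fontsize=16)
plt.xlabel('Car Weight (pounds)')
plt.ylabel('Fuel Efficiency (miles per gallon)')
plt.grid(True, alpha=0.3)
# pandas draws the colorbar itself, with one tick per origin,
# so no plt.colorbar() call is needed
plt.show()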
This visualization helps us see the negative correlation between car weight and fuel efficiency, while also showing differences between cars from different countries.
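A scatter plot suggests a correlation; a correlation coefficient puts a number on it. The sketch below uses `Series.corr()`, which computes the Pearson coefficient by default. To stay runnable without a download, it uses a synthetic stand-in for the car data: the `demo` DataFrame and the coefficients that generate it are made up and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the car data (invented values, for illustration only)
rng = np.random.default_rng(42)
weight = rng.normal(3000, 800, 200)
mpg = 45 - 0.007 * weight + rng.normal(0, 3, 200)
demo = pd.DataFrame({'weight': weight, 'mpg': mpg})

# Pearson correlation ranges from -1 (perfect negative linear relationship)
# through 0 (no linear relationship) to +1 (perfect positive one)
r = demo['weight'].corr(demo['mpg'])
print(f"Pearson r between weight and mpg: {r:.2f}")
```

With the real dataset loaded, the equivalent call would be `cars['weight'].corr(cars['mpg'])`.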
Example 2: Housing Price Analysis
Let's explore housing price data to see relationships between house size and price:
# Create a housing dataset
np.random.seed(42)
housing = pd.DataFrame({
'size_sqft': np.random.normal(2000, 500, 100),
'price': np.random.normal(300000, 100000, 100),
'age_years': np.random.randint(1, 50, 100)
})
# Add some correlation
housing['price'] = housing['price'] + housing['size_sqft'] * 100 - housing['age_years'] * 1000
housing['neighborhood'] = np.random.choice(['Downtown', 'Suburbs', 'Rural'], 100)
# Map neighborhoods to numeric values for coloring
neighborhood_map = {'Downtown': 0, 'Suburbs': 1, 'Rural': 2}
housing['neighborhood_code'] = housing['neighborhood'].map(neighborhood_map)
# Create the scatter plot (plot.scatter() returns a Matplotlib Axes)
ax = housing.plot.scatter(
x='size_sqft',
y='price',
c='neighborhood_code',
cmap='viridis',
s=housing['age_years']*2,
alpha=0.7,
colorbar=False, # Skip the automatic colorbar; a legend is added below
figsize=(12, 7)
)
plt.title('House Size vs. Price by Neighborhood', fontsize=16)
plt.xlabel('House Size (square feet)')
plt.ylabel('Price (USD)')
plt.grid(True, alpha=0.3)
# Create a custom legend for neighborhoods
from matplotlib.lines import Line2D
points = ax.collections[0] # The scatter artist, which carries the colormap and norm
legend_elements = [
Line2D([0], [0], marker='o', color='w',
markerfacecolor=points.cmap(points.norm(code)),
markersize=10, label=name)
for name, code in neighborhood_map.items()
]
plt.legend(handles=legend_elements, title='Neighborhood')
plt.show()
In this example, we're visualizing:
- House size vs. price on the x and y axes
- Different neighborhoods using different colors
- House age through the size of each point
Scatter Plot Matrix
For multivariate analysis, you can create a scatter plot matrix to examine the pairwise relationships between several variables at once:
# Create a scatter matrix
pd.plotting.scatter_matrix(
cars[['mpg', 'weight', 'acceleration', 'displacement']],
figsize=(12, 10),
diagonal='kde', # Show kernel density estimate on diagonal (requires SciPy)
alpha=0.7, # Transparency
marker='o', # Marker style
s=30 # Marker size
)
plt.suptitle('Scatter Plot Matrix of Car Features', fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.95)
plt.show()
This creates a grid of scatter plots showing the relationships between multiple variables simultaneously, with kernel density plots along the diagonal.
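`scatter_matrix()` returns a two-dimensional NumPy array of `Axes`, one per cell of the grid, which makes it possible to adjust individual panels afterwards. The sketch below rotates the axis labels so that long column names do not overlap. The DataFrame `data` and its columns are placeholders invented for this example, and `diagonal='hist'` is used so that SciPy is not needed.

```python
import numpy as np
import pandas as pd

# Placeholder data: three independent columns plus one correlated with 'a'
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
data['d'] = data['a'] * 2 + rng.normal(scale=0.5, size=200)

axes = pd.plotting.scatter_matrix(data, figsize=(8, 8),
                                  diagonal='hist', alpha=0.6)

# axes[i, j] plots column j on the x-axis against column i on the y-axis
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(45)  # Tilt the x-axis labels
    ax.yaxis.label.set_rotation(0)   # Keep the y-axis labels horizontal
    ax.yaxis.label.set_ha('right')   # Right-align so they do not overlap the ticks

print(axes.shape)
```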
Adding Trendlines to Scatter Plots
Adding a trendline can help visualize the relationship between variables:
# Create a scatter plot with a trendline
fig, ax = plt.subplots(figsize=(10, 6))
# Create the scatter plot
cars.plot.scatter(
x='weight',
y='mpg',
alpha=0.6,
s=50,
ax=ax
)
# Add a polynomial trendline (NumPy is already imported as np above)
from numpy.polynomial.polynomial import Polynomial
x = cars['weight']
y = cars['mpg']
p = Polynomial.fit(x, y, 2) # Fit a 2nd-degree polynomial
x_new = np.linspace(x.min(), x.max(), 100)
y_new = p(x_new)
# Plot the trendline
ax.plot(x_new, y_new, 'r-', linewidth=2)
plt.title('Car Weight vs. MPG with Trendline', fontsize=16)
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles Per Gallon')
plt.grid(True, alpha=0.3)
plt.show()
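For a straight-line trend, the same `Polynomial.fit` call can be used with degree 1. One subtlety is that `fit()` works in a rescaled domain internally, so its raw coefficients are not the slope and intercept in the original units; `convert()` maps them back. The sketch below uses synthetic data with invented coefficients in place of the car dataset.

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

# Synthetic data (invented values standing in for weight and mpg)
rng = np.random.default_rng(1)
x = rng.uniform(1500, 5000, 300)
y = 46 - 0.0075 * x + rng.normal(0, 2, 300)

# Degree 1 gives a straight line
line = Polynomial.fit(x, y, 1)

# convert() returns the polynomial in the original x units;
# .coef lists coefficients from lowest degree to highest
intercept, slope = line.convert().coef
print(f"mpg ~ {intercept:.1f} + ({slope:.4f}) * weight")

# The fitted object is callable, so it can produce points for plotting
x_new = np.linspace(x.min(), x.max(), 100)
y_new = line(x_new)
```

The resulting `x_new` and `y_new` can be drawn on an existing scatter plot with `ax.plot(x_new, y_new, 'r-')`, in the same way as the polynomial trendline above.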
Summary
Scatter plots in Pandas offer a powerful way to visualize relationships between variables in your data. In this tutorial, you've learned:
- How to create basic scatter plots using plot.scatter()
- How to customize point appearance with color, size, and transparency
- How to use color mapping to represent additional dimensions
- How to create scatter plot matrices for multivariate analysis
- How to add trendlines to better visualize relationships
- How to apply these techniques to real-world datasets
Scatter plots are an essential tool in data exploration and analysis, helping you uncover patterns and relationships that might not be obvious in raw data.
Additional Resources
- Pandas Visualization Documentation
- Matplotlib Scatter Plot Documentation
- Seaborn's Enhanced Scatter Plots
Exercises
- Create a scatter plot using a dataset of your choice, coloring points based on a categorical variable.
- Generate a scatter plot that visualizes three numerical variables (x-axis, y-axis, and point size).
- Create a scatter plot matrix for a dataset with at least 4 numerical variables.
- Implement a scatter plot with a linear regression line to show the trend between two variables.
- Create an animated scatter plot that shows how data points change over time using multiple frames.
Happy visualizing!