Pandas Advanced Indexing
Understanding how to efficiently access and manipulate data is essential for any data analysis task. While pandas provides basic indexing capabilities, mastering advanced indexing techniques will significantly improve your data manipulation skills and make your code more efficient. This guide will walk you through the powerful indexing methods available in pandas.
Introduction to Advanced Indexing
Pandas offers several sophisticated ways to access data beyond simple row and column selection. Advanced indexing allows you to:
- Work with multi-level indexes
- Filter data using boolean conditions
- Select data using arrays (fancy indexing)
- Slice data along multiple dimensions
These techniques can dramatically simplify complex data manipulation tasks, making your code cleaner and more efficient.
Prerequisites
Before diving into advanced indexing, make sure you have pandas installed:
# Install pandas if you haven't already
# !pip install pandas
import pandas as pd
import numpy as np
Hierarchical Indexing (MultiIndex)
Hierarchical indexing (also known as MultiIndex) allows you to have multiple levels of indexes on a single axis. This is particularly useful when dealing with high-dimensional data in a two-dimensional DataFrame.
Creating a DataFrame with MultiIndex
# Create a MultiIndex DataFrame
arrays = [
['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
['San Francisco', 'Los Angeles', 'New York City', 'Buffalo', 'Austin', 'Dallas']
]
index = pd.MultiIndex.from_arrays(arrays, names=['State', 'City'])
data = pd.DataFrame({
'Population': [884363, 3967000, 8336817, 278349, 964177, 1345047],
'GDP (billions)': [592.3, 710.9, 1500.1, 43.8, 148.6, 534.3]
}, index=index)
print(data)
Output:
Population GDP (billions)
State City
California San Francisco 884363 592.3
Los Angeles 3967000 710.9
New York New York City 8336817 1500.1
Buffalo 278349 43.8
Texas Austin 964177 148.6
Dallas 1345047 534.3
Accessing Data with MultiIndex
Selecting by outer index level
# Select data for California
california_data = data.loc['California']
print(california_data)
Output:
Population GDP (billions)
City
San Francisco 884363 592.3
Los Angeles 3967000 710.9
Selecting by multiple index levels
# Select data for New York City
nyc_data = data.loc[('New York', 'New York City')]
print(nyc_data)
Output:
Population 8336817.0
GDP (billions) 1500.1
Name: (New York, New York City), dtype: float64
Cross-section selection with xs()
The xs()
method allows for more flexible cross-sectional selection:
# Select all cities in Texas
texas_cities = data.xs('Texas', level='State')
print(texas_cities)
Output:
Population GDP (billions)
City
Austin 964177 148.6
Dallas 1345047 534.3
Manipulating MultiIndex DataFrames
Adding a level to MultiIndex
# Add a Country level
data_with_country = data.copy()
data_with_country['Country'] = 'USA'
data_with_country = data_with_country.set_index('Country', append=True)
print(data_with_country.head())
Output:
Population GDP (billions)
State City Country
California San Francisco USA 884363 592.3
Los Angeles USA 3967000 710.9
New York New York City USA 8336817 1500.1
Buffalo USA 278349 43.8
Texas Austin USA 964177 148.6
Swapping index levels
# Swap levels to have City as the first level
swapped_index = data.swaplevel('State', 'City')
print(swapped_index.head())
Output:
Population GDP (billions)
City State
San Francisco California 884363 592.3
Los Angeles California 3967000 710.9
New York City New York 8336817 1500.1
Buffalo New York 278349 43.8
Austin Texas 964177 148.6
Stacking and unstacking levels
The stack()
and unstack()
methods transform between a "long" and "wide" format:
# Unstacking moves the innermost index level to become columns
unstacked = data.unstack(level='City')
print(unstacked.head())
# Stacking does the opposite - converts columns back to an index level
restacked = unstacked.stack()
print(restacked.equals(data)) # Should be True
Boolean Indexing
Boolean indexing is a powerful technique that allows you to select data based on conditions.
Basic Boolean Filtering
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
})
# Filter rows where column A > 2
filtered = df[df['A'] > 2]
print(filtered)
Output:
A B C
2 3 30 300
3 4 40 400
4 5 50 500
Combining Multiple Conditions
# Filter with multiple conditions
# Rows where A > 2 AND B < 50
filtered = df[(df['A'] > 2) & (df['B'] < 50)]
print(filtered)
Output:
A B C
2 3 30 300
3 4 40 400
# Rows where A > 2 OR C > 400
filtered = df[(df['A'] > 2) | (df['C'] > 400)]
print(filtered)
Output:
A B C
2 3 30 300
3 4 40 400
4 5 50 500
Using query()
Method
The query()
method provides a more readable syntax for boolean filtering:
# Same as df[(df['A'] > 2) & (df['B'] < 50)]
filtered = df.query('A > 2 and B < 50')
print(filtered)
Output:
A B C
2 3 30 300
3 4 40 400
Working with isin()
for Multiple Values
# Filter rows where A is in a list of values
filtered = df[df['A'].isin([1, 3, 5])]
print(filtered)
Output:
A B C
0 1 10 100
2 3 30 300
4 5 50 500
Using Boolean Masks for Advanced Filtering
# Create a mask for even values in column A
mask = df['A'] % 2 == 0
filtered = df[mask]
print(filtered)
Output:
A B C
1 2 20 200
3 4 40 400
Fancy Indexing
Fancy indexing involves using arrays of integers or labels to select data.
Using Arrays of Integers
# Select specific rows by position
selected = df.iloc[[0, 2, 4]]
print(selected)
Output:
A B C
0 1 10 100
2 3 30 300
4 5 50 500
Using Arrays of Labels
# Create a DataFrame with string indices
df_labeled = df.copy()
df_labeled.index = ['a', 'b', 'c', 'd', 'e']
# Select specific rows by label
selected = df_labeled.loc[['a', 'c', 'e']]
print(selected)
Output:
A B C
a 1 10 100
c 3 30 300
e 5 50 500
Combining Row and Column Selection
# Select specific rows and columns
selected = df_labeled.loc[['a', 'c', 'e'], ['A', 'C']]
print(selected)
Output:
A C
a 1 100
c 3 300
e 5 500
Advanced Slicing
Pandas provides flexible slicing capabilities that go beyond basic indexing.
Slicing with Labels
# Slice from 'b' to 'd'
sliced = df_labeled.loc['b':'d']
print(sliced)
Output:
A B C
b 2 20 200
c 3 30 300
d 4 40 400
Slicing with MultiIndex
# Slice in a MultiIndex DataFrame
states_slice = data.loc['California':'New York']
print(states_slice)
Output:
Population GDP (billions)
State City
California San Francisco 884363 592.3
Los Angeles 3967000 710.9
New York New York City 8336817 1500.1
Buffalo 278349 43.8
Conditional Slicing with where()
# Replace values that don't meet the condition with NaN
filtered = df.where(df > 10)
print(filtered)
Output:
A B C
0 NaN 10.0 100.0
1 NaN 20.0 200.0
2 NaN 30.0 300.0
3 NaN 40.0 400.0
4 NaN 50.0 500.0
Real-World Applications
Let's look at some practical examples of using advanced indexing in real-world scenarios.
Example 1: Analysis of Sales Data
# Create a sales dataset
sales_data = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'),
'Region': ['North', 'South', 'East', 'West'] * 3,
'Product': ['A', 'B', 'C'] * 4,
'Sales': np.random.randint(100, 1000, 12),
'Units': np.random.randint(10, 100, 12)
})
# Set multi-level index
sales_indexed = sales_data.set_index(['Region', 'Product', 'Date'])
print(sales_indexed.head())
# Find highest sales in each region
highest_sales = sales_indexed.xs('A', level='Product')['Sales'].groupby('Region').max()
print("\nHighest sales for Product A by region:")
print(highest_sales)
# Get sales data for the North region in Q1
q1_north_sales = sales_indexed.loc['North'].loc[:, '2023-01-01':'2023-03-31']
print("\nQ1 sales in North region:")
print(q1_north_sales)
Example 2: Data Cleaning with Boolean Indexing
# Create a dataset with some missing values
df_missing = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda', 'Max'],
'Age': [28, np.nan, 34, 29, np.nan],
'Salary': [50000, 60000, np.nan, 55000, 52000],
'Department': ['IT', 'HR', 'IT', 'Marketing', 'Finance']
})
# Find rows with any missing values
rows_with_na = df_missing[df_missing.isna().any(axis=1)]
print("Rows with missing values:")
print(rows_with_na)
# Find rows with complete data
complete_rows = df_missing[~df_missing.isna().any(axis=1)]
print("\nRows with complete data:")
print(complete_rows)
# Filter IT department with salary above 45000
it_high_salary = df_missing[(df_missing['Department'] == 'IT') &
(df_missing['Salary'] > 45000)]
print("\nIT department with high salary:")
print(it_high_salary)
Example 3: Time Series Analysis with Advanced Indexing
# Create a time series dataset
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
ts_data = pd.DataFrame({
'Date': dates,
'Value': np.random.normal(0, 1, len(dates))
})
# Set date as index
ts_indexed = ts_data.set_index('Date')
# Get data for specific month using partial string indexing
march_data = ts_indexed['2023-03']
print("March data (first 5 rows):")
print(march_data.head())
# Get data for specific date range
q2_data = ts_indexed['2023-04-01':'2023-06-30']
print("\nQ2 data summary:")
print(q2_data.describe())
# Filter days where the value is above 2 standard deviations
outliers = ts_indexed[np.abs(ts_indexed['Value']) > 2]
print(f"\nFound {len(outliers)} outlier days")
Summary
Advanced indexing in pandas is a powerful toolset that allows you to efficiently access, filter, and manipulate data in various ways:
- Hierarchical Indexing (MultiIndex) enables working with multi-dimensional data in a 2D structure
- Boolean Indexing allows filtering data based on conditions
- Fancy Indexing makes it possible to select data using arrays of indexes or labels
- Advanced Slicing provides flexible ways to access portions of your data
Mastering these techniques will significantly improve your data manipulation capabilities and make your code more efficient and concise.
Additional Resources
To deepen your understanding of pandas advanced indexing, check out these resources:
- Pandas Official Documentation on Indexing
- Pandas Official Documentation on MultiIndex
- 10 Minutes to Pandas - Indexing Section
Exercises
Test your knowledge with these exercises:
-
Create a MultiIndex DataFrame with data about different car models from various manufacturers, and use advanced indexing to find the average price per manufacturer.
-
Given a dataset of student scores across multiple subjects, use boolean indexing to find students who scored above 90 in Math but below 70 in English.
-
Use fancy indexing to extract every third row from a dataset and only select specific columns.
-
Create a time series dataset with daily temperatures for a year, and use advanced slicing to compare summer vs. winter averages.
-
Combine boolean and hierarchical indexing to find products that have sales above average in each region from a sales dataset.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)