Pandas Advanced Indexing

Understanding how to efficiently access and manipulate data is essential for any data analysis task. While pandas provides basic indexing capabilities, mastering advanced indexing techniques will significantly improve your data manipulation skills and make your code more efficient. This guide will walk you through the powerful indexing methods available in pandas.

Introduction to Advanced Indexing

Pandas offers several sophisticated ways to access data beyond simple row and column selection. Advanced indexing allows you to:

Work with multi-level indexes
Filter data using boolean conditions
Select data using arrays (fancy indexing)
Slice data along multiple dimensions

These techniques can dramatically simplify complex data manipulation tasks, making your code cleaner and more efficient.

Prerequisites

Before diving into advanced indexing, make sure you have pandas installed:

# Install pandas if you haven't already
# !pip install pandas

import pandas as pd
import numpy as np

Hierarchical Indexing (MultiIndex)

Hierarchical indexing (also known as MultiIndex) allows you to have multiple levels of indexes on a single axis. This is particularly useful when dealing with high-dimensional data in a two-dimensional DataFrame.

Creating a DataFrame with MultiIndex

# Create a MultiIndex DataFrame
arrays = [
    ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    ['San Francisco', 'Los Angeles', 'New York City', 'Buffalo', 'Austin', 'Dallas']
]
index = pd.MultiIndex.from_arrays(arrays, names=['State', 'City'])
data = pd.DataFrame({
    'Population': [884363, 3967000, 8336817, 278349, 964177, 1345047],
    'GDP (billions)': [592.3, 710.9, 1500.1, 43.8, 148.6, 534.3]
}, index=index)

print(data)

Output:

                       Population  GDP (billions)
State      City                                  
California San Francisco    884363          592.3
           Los Angeles     3967000          710.9
New York   New York City   8336817         1500.1
           Buffalo          278349           43.8
Texas      Austin           964177          148.6
           Dallas          1345047          534.3

Accessing Data with MultiIndex

Selecting by outer index level

# Select data for California
california_data = data.loc['California']
print(california_data)

Output:

              Population  GDP (billions)
City                                    
San Francisco     884363          592.3
Los Angeles      3967000          710.9

Selecting by multiple index levels

# Select data for New York City
nyc_data = data.loc[('New York', 'New York City')]
print(nyc_data)

Output:

Population      8336817.0
GDP (billions)   1500.1
Name: (New York, New York City), dtype: float64

Cross-section selection with `xs()`

The xs() method allows for more flexible cross-sectional selection:

# Select all cities in Texas
texas_cities = data.xs('Texas', level='State')
print(texas_cities)

Output:

        Population  GDP (billions)
City                              
Austin      964177          148.6
Dallas     1345047          534.3

Manipulating MultiIndex DataFrames

Adding a level to MultiIndex

# Add a Country level
data_with_country = data.copy()
data_with_country['Country'] = 'USA'
data_with_country = data_with_country.set_index('Country', append=True)
print(data_with_country.head())

Output:

                             Population  GDP (billions)
State      City           Country                      
California San Francisco USA          884363        592.3
           Los Angeles   USA         3967000        710.9
New York   New York City USA         8336817       1500.1
           Buffalo       USA          278349         43.8
Texas      Austin        USA          964177        148.6

Swapping index levels

# Swap levels to have City as the first level
swapped_index = data.swaplevel('State', 'City')
print(swapped_index.head())

Output:

                     Population  GDP (billions)
City           State                           
San Francisco   California       884363        592.3
Los Angeles     California      3967000        710.9
New York City   New York       8336817       1500.1
Buffalo         New York        278349         43.8
Austin          Texas           964177        148.6

Stacking and unstacking levels

The stack() and unstack() methods transform between a "long" and "wide" format:

# Unstacking moves the innermost index level to become columns
unstacked = data.unstack(level='City')
print(unstacked.head())

# Stacking does the opposite - converts columns back to an index level
restacked = unstacked.stack()
print(restacked.equals(data))  # Should be True

Boolean Indexing

Boolean indexing is a powerful technique that allows you to select data based on conditions.

Basic Boolean Filtering

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Filter rows where column A > 2
filtered = df[df['A'] > 2]
print(filtered)

Output:

Combining Multiple Conditions

# Filter with multiple conditions
# Rows where A > 2 AND B < 50
filtered = df[(df['A'] > 2) & (df['B'] < 50)]
print(filtered)

Output:

   A   B    C
2  3  30  300
3  4  40  400

# Rows where A > 2 OR C > 400
filtered = df[(df['A'] > 2) | (df['C'] > 400)]
print(filtered)

Output:

Using `query()` Method

The query() method provides a more readable syntax for boolean filtering:

# Same as df[(df['A'] > 2) & (df['B'] < 50)]
filtered = df.query('A > 2 and B < 50')
print(filtered)

Output:

   A   B    C
2  3  30  300
3  4  40  400

Working with `isin()` for Multiple Values

# Filter rows where A is in a list of values
filtered = df[df['A'].isin([1, 3, 5])]
print(filtered)

Output:

Using Boolean Masks for Advanced Filtering

# Create a mask for even values in column A
mask = df['A'] % 2 == 0
filtered = df[mask]
print(filtered)

Output:

   A   B    C
1  2  20  200
3  4  40  400

Fancy Indexing

Fancy indexing involves using arrays of integers or labels to select data.

Using Arrays of Integers

# Select specific rows by position
selected = df.iloc[[0, 2, 4]]
print(selected)

Output:

Using Arrays of Labels

# Create a DataFrame with string indices
df_labeled = df.copy()
df_labeled.index = ['a', 'b', 'c', 'd', 'e']

# Select specific rows by label
selected = df_labeled.loc[['a', 'c', 'e']]
print(selected)

Output:

   A   B    C
a  1  10  100
c  3  30  300
e  5  50  500

Combining Row and Column Selection

# Select specific rows and columns
selected = df_labeled.loc[['a', 'c', 'e'], ['A', 'C']]
print(selected)

Output:

Advanced Slicing

Pandas provides flexible slicing capabilities that go beyond basic indexing.

Slicing with Labels

# Slice from 'b' to 'd'
sliced = df_labeled.loc['b':'d']
print(sliced)

Output:

   A   B    C
b  2  20  200
c  3  30  300
d  4  40  400

Slicing with MultiIndex

# Slice in a MultiIndex DataFrame
states_slice = data.loc['California':'New York']
print(states_slice)

Output:

                       Population  GDP (billions)
State      City                                  
California San Francisco    884363          592.3
           Los Angeles     3967000          710.9
New York   New York City   8336817         1500.1
           Buffalo          278349           43.8

Conditional Slicing with `where()`

# Replace values that don't meet the condition with NaN
filtered = df.where(df > 10)
print(filtered)

Output:

     A     B      C
NaN  10.0  100.0
NaN  20.0  200.0
NaN  30.0  300.0
NaN  40.0  400.0
NaN  50.0  500.0

Real-World Applications

Let's look at some practical examples of using advanced indexing in real-world scenarios.

Example 1: Analysis of Sales Data

# Create a sales dataset
sales_data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=12, freq='M'),
    'Region': ['North', 'South', 'East', 'West'] * 3,
    'Product': ['A', 'B', 'C'] * 4,
    'Sales': np.random.randint(100, 1000, 12),
    'Units': np.random.randint(10, 100, 12)
})

# Set multi-level index
sales_indexed = sales_data.set_index(['Region', 'Product', 'Date'])
print(sales_indexed.head())

# Find highest sales in each region
highest_sales = sales_indexed.xs('A', level='Product')['Sales'].groupby('Region').max()
print("\nHighest sales for Product A by region:")
print(highest_sales)

# Get sales data for the North region in Q1
q1_north_sales = sales_indexed.loc['North'].loc[:, '2023-01-01':'2023-03-31']
print("\nQ1 sales in North region:")
print(q1_north_sales)

Example 2: Data Cleaning with Boolean Indexing

# Create a dataset with some missing values
df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Max'],
    'Age': [28, np.nan, 34, 29, np.nan],
    'Salary': [50000, 60000, np.nan, 55000, 52000],
    'Department': ['IT', 'HR', 'IT', 'Marketing', 'Finance']
})

# Find rows with any missing values
rows_with_na = df_missing[df_missing.isna().any(axis=1)]
print("Rows with missing values:")
print(rows_with_na)

# Find rows with complete data
complete_rows = df_missing[~df_missing.isna().any(axis=1)]
print("\nRows with complete data:")
print(complete_rows)

# Filter IT department with salary above 45000
it_high_salary = df_missing[(df_missing['Department'] == 'IT') & 
                         (df_missing['Salary'] > 45000)]
print("\nIT department with high salary:")
print(it_high_salary)

Example 3: Time Series Analysis with Advanced Indexing

# Create a time series dataset
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
ts_data = pd.DataFrame({
    'Date': dates,
    'Value': np.random.normal(0, 1, len(dates))
})

# Set date as index
ts_indexed = ts_data.set_index('Date')

# Get data for specific month using partial string indexing
march_data = ts_indexed['2023-03']
print("March data (first 5 rows):")
print(march_data.head())

# Get data for specific date range
q2_data = ts_indexed['2023-04-01':'2023-06-30']
print("\nQ2 data summary:")
print(q2_data.describe())

# Filter days where the value is above 2 standard deviations
outliers = ts_indexed[np.abs(ts_indexed['Value']) > 2]
print(f"\nFound {len(outliers)} outlier days")

Summary

Advanced indexing in pandas is a powerful toolset that allows you to efficiently access, filter, and manipulate data in various ways:

Hierarchical Indexing (MultiIndex) enables working with multi-dimensional data in a 2D structure
Boolean Indexing allows filtering data based on conditions
Fancy Indexing makes it possible to select data using arrays of indexes or labels
Advanced Slicing provides flexible ways to access portions of your data

Mastering these techniques will significantly improve your data manipulation capabilities and make your code more efficient and concise.

Additional Resources

To deepen your understanding of pandas advanced indexing, check out these resources:

Exercises

Test your knowledge with these exercises:

Create a MultiIndex DataFrame with data about different car models from various manufacturers, and use advanced indexing to find the average price per manufacturer.
Given a dataset of student scores across multiple subjects, use boolean indexing to find students who scored above 90 in Math but below 70 in English.
Use fancy indexing to extract every third row from a dataset and only select specific columns.
Create a time series dataset with daily temperatures for a year, and use advanced slicing to compare summer vs. winter averages.
Combine boolean and hierarchical indexing to find products that have sales above average in each region from a sales dataset.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Advanced Indexing​

Prerequisites​

Hierarchical Indexing (MultiIndex)​

Creating a DataFrame with MultiIndex​

Accessing Data with MultiIndex​

Selecting by outer index level​

Selecting by multiple index levels​

Cross-section selection with xs()​

Manipulating MultiIndex DataFrames​

Adding a level to MultiIndex​

Swapping index levels​

Stacking and unstacking levels​

Boolean Indexing​

Basic Boolean Filtering​

Combining Multiple Conditions​

Using query() Method​

Working with isin() for Multiple Values​

Using Boolean Masks for Advanced Filtering​

Fancy Indexing​

Using Arrays of Integers​

Using Arrays of Labels​

Combining Row and Column Selection​

Advanced Slicing​

Slicing with Labels​

Slicing with MultiIndex​

Conditional Slicing with where()​

Real-World Applications​

Example 1: Analysis of Sales Data​

Example 2: Data Cleaning with Boolean Indexing​

Example 3: Time Series Analysis with Advanced Indexing​

Summary​

Additional Resources​

Exercises​

Introduction to Advanced Indexing

Prerequisites

Hierarchical Indexing (MultiIndex)

Creating a DataFrame with MultiIndex

Accessing Data with MultiIndex

Selecting by outer index level

Selecting by multiple index levels

Cross-section selection with `xs()`

Manipulating MultiIndex DataFrames

Adding a level to MultiIndex

Swapping index levels

Stacking and unstacking levels

Boolean Indexing

Basic Boolean Filtering

Combining Multiple Conditions

Using `query()` Method

Working with `isin()` for Multiple Values

Using Boolean Masks for Advanced Filtering

Fancy Indexing

Using Arrays of Integers

Using Arrays of Labels

Combining Row and Column Selection

Advanced Slicing

Slicing with Labels

Slicing with MultiIndex

Conditional Slicing with `where()`

Real-World Applications

Example 1: Analysis of Sales Data

Example 2: Data Cleaning with Boolean Indexing

Example 3: Time Series Analysis with Advanced Indexing

Summary

Additional Resources

Exercises