Pandas Advanced Selection
In this tutorial, we'll explore advanced techniques for selecting and filtering data in pandas. Once you've mastered the basics of indexing and selection, these advanced methods give you more power and flexibility when working with complex datasets.
Introduction
Pandas provides numerous ways to select and filter data beyond simple indexing. These advanced selection techniques allow you to:
- Extract data based on complex conditions
- Work with hierarchical indexes
- Apply sophisticated filtering logic to your DataFrames
- Combine multiple selection criteria
These skills are essential for effective data analysis and manipulation in real-world applications.
Prerequisites
Before diving into advanced selection techniques, make sure you're familiar with:
- Basic pandas operations
- DataFrame and Series objects
- Basic indexing and selection with .loc and .iloc
Let's import pandas and NumPy to get started:
import pandas as pd
import numpy as np
Boolean Indexing
Boolean indexing is one of the most powerful selection techniques in pandas. It lets you filter data based on conditions.
Basic Boolean Filtering
Let's create a sample DataFrame:
# Create a simple DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'age': [28, 34, 19, 42, 31],
    'city': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'salary': [50000, 65000, 30000, 85000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
To select rows based on a condition, use a boolean expression:
# Select rows where age is over 30
over_30 = df[df['age'] > 30]
print(over_30)
Output:
name age city salary
1 Anna 34 Paris 65000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
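For range checks, Series.between() is a compact alternative to writing two comparisons; here is a quick sketch using the same df:
# Select rows where age falls between 25 and 35 (inclusive)
mid_age = df[df['age'].between(25, 35)]
print(mid_age)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
4 Bob 31 Tokyo 60000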
Multiple Conditions
You can combine multiple conditions using the logical operators & (and), | (or), and ~ (not):
# Select people over 30 years old AND with salary above 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print(filtered_df)
Output:
name age city salary
3 Linda 42 London 85000
# Select people from Paris OR Tokyo
cities_filter = df[(df['city'] == 'Paris') | (df['city'] == 'Tokyo')]
print(cities_filter)
Output:
name age city salary
1 Anna 34 Paris 65000
4 Bob 31 Tokyo 60000
Note: When combining multiple conditions, always wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.
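The ~ operator negates a condition. For example, to select everyone who is not based in Berlin:
# Select people who are NOT from Berlin
not_berlin = df[~(df['city'] == 'Berlin')]
print(not_berlin)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000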
Using isin() for Multiple Values
To check whether values belong to a specific set, use isin():
# Select rows where city is in a list of cities
selected_cities = ['New York', 'London', 'Berlin']
city_filter = df[df['city'].isin(selected_cities)]
print(city_filter)
Output:
name age city salary
0 John 28 New York 50000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000
Filtering on Strings
Pandas provides vectorized string methods through the .str accessor:
# Filter names that start with 'J'
j_names = df[df['name'].str.startswith('J')]
print(j_names)
Output:
name age city salary
0 John 28 New York 50000
# Filter cities containing the letter 'o'
o_cities = df[df['city'].str.contains('o')]
print(o_cities)
Output:
name age city salary
0 John 28 New York 50000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
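str.contains() also accepts a case argument for case-insensitive matching; a brief sketch on the same df (regex=False treats the pattern as a literal substring):
# Case-insensitive match for cities containing 'york'
york_cities = df[df['city'].str.contains('york', case=False, regex=False)]
print(york_cities)
Output:
name age city salary
0 John 28 New York 50000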
Advanced .loc and .iloc
The .loc and .iloc indexers become even more powerful when combined with boolean indexing.
Advanced .loc Usage
.loc can be used with boolean masks for more complex selections:
# Select specific columns for rows with age > 30
result = df.loc[df['age'] > 30, ['name', 'city']]
print(result)
Output:
name city
1 Anna Paris
3 Linda London
4 Bob Tokyo
Combining Row and Column Selections
# Select specific rows and columns using .loc
result = df.loc[df['salary'] > 60000, 'name':'city']
print(result)
Output:
name age city
1 Anna 34 Paris
3 Linda 42 London
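.iloc is the positional counterpart. To combine it with a boolean condition, it is safest to convert the mask to a plain NumPy array first, since .iloc typically expects integer positions or plain boolean arrays rather than labeled Series. A short sketch:
# Convert the boolean Series to a NumPy array for positional indexing
mask = (df['salary'] > 60000).to_numpy()
result = df.iloc[mask, [0, 2]]  # columns 0 ('name') and 2 ('city')
print(result)
Output:
name city
1 Anna Paris
3 Linda London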
Hierarchical Indexing (MultiIndex)
Hierarchical indexing allows you to work with higher-dimensional data in a lower-dimensional representation.
Creating a MultiIndex DataFrame
# Create a DataFrame with MultiIndex
arrays = [
    ['California', 'California', 'New York', 'New York', 'Florida', 'Florida'],
    ['Los Angeles', 'San Francisco', 'New York City', 'Buffalo', 'Miami', 'Orlando']
]
index = pd.MultiIndex.from_arrays(arrays, names=['State', 'City'])
data = {
    'Population': [3970000, 870000, 8420000, 255000, 454000, 280000],
    'Year Founded': [1781, 1776, 1624, 1789, 1896, 1875]
}
multiindex_df = pd.DataFrame(data, index=index)
print(multiindex_df)
Output:
Population Year Founded
State City
California Los Angeles 3970000 1781
San Francisco 870000 1776
New York New York City 8420000 1624
Buffalo 255000 1789
Florida Miami 454000 1896
Orlando 280000 1875
Selecting with MultiIndex
# Select a specific state
california_data = multiindex_df.loc['California']
print(california_data)
Output:
Population Year Founded
City
Los Angeles 3970000 1781
San Francisco 870000 1776
# Select a specific state and city
specific_city = multiindex_df.loc[('New York', 'New York City')]
print(specific_city)
Output:
Population 8420000
Year Founded 1624
Name: (New York, New York City), dtype: int64
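You can also pass a list of first-level labels to .loc to select several states at once:
# Select all rows for two states
two_states = multiindex_df.loc[['California', 'Florida']]
print(two_states)
Output:
Population Year Founded
State City
California Los Angeles 3970000 1781
San Francisco 870000 1776
Florida Miami 454000 1896
Orlando 280000 1875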
Cross-section Selection
# Select cross-section with .xs()
miami_data = multiindex_df.xs('Miami', level='City')
print(miami_data)
Output:
Population Year Founded
State
Florida 454000 1896
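If you want to keep the City level in the result, xs() accepts drop_level=False:
# Keep the selected level in the index
miami_full = multiindex_df.xs('Miami', level='City', drop_level=False)
print(miami_full)
Output:
Population Year Founded
State City
Florida Miami 454000 1896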
Advanced Query Methods
Using query()
The query() method allows you to use string expressions for filtering:
# Filter with query
young_high_earners = df.query('age < 30 and salary >= 50000')
print(young_high_earners)
Output:
name age city salary
0 John 28 New York 50000
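query() can also reference local Python variables with the @ prefix, which keeps thresholds out of the query string itself. A small sketch (min_salary is just an illustrative variable name):
# Reference a local variable inside the query string with @
min_salary = 60000
high_earners = df.query('salary > @min_salary')
print(high_earners)
Output:
name age city salary
1 Anna 34 Paris 65000
3 Linda 42 London 85000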
Using eval()
The eval() method lets you create new columns from expressions:
# Create a new column using eval
df_with_bonus = df.eval('bonus = salary * 0.1')
print(df_with_bonus)
Output:
name age city salary bonus
0 John 28 New York 50000 5000.0
1 Anna 34 Paris 65000 6500.0
2 Peter 19 Berlin 30000 3000.0
3 Linda 42 London 85000 8500.0
4 Bob 31 Tokyo 60000 6000.0
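eval() can also evaluate several assignments in one call (supported in recent pandas versions); a brief sketch where the total column is just for illustration:
# Create two derived columns in a single eval() call
df_extended = df.eval('''
bonus = salary * 0.1
total = salary + bonus
''')
print(df_extended[['name', 'salary', 'bonus', 'total']])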
Handling Missing Values
Advanced selection often involves dealing with missing values.
# Create data with some NaN values
data_with_missing = df.copy()
data_with_missing.loc[1, 'salary'] = np.nan
data_with_missing.loc[3, 'city'] = np.nan
print(data_with_missing)
Output:
name age city salary
0 John 28 New York 50000.0
1 Anna 34 Paris NaN
2 Peter 19 Berlin 30000.0
3 Linda 42 NaN 85000.0
4 Bob 31 Tokyo 60000.0
Filtering Missing Values
# Drop rows with any missing value
cleaned_df = data_with_missing.dropna()
print(cleaned_df)
Output:
name age city salary
0 John 28 New York 50000.0
2 Peter 19 Berlin 30000.0
4 Bob 31 Tokyo 60000.0
Finding Missing Values
# Find rows with missing cities
missing_city = data_with_missing[data_with_missing['city'].isna()]
print(missing_city)
Output:
name age city salary
3 Linda 42 NaN 85000.0
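The inverse check, notna(), selects the rows where a value is present:
# Keep only rows where salary is present
has_salary = data_with_missing[data_with_missing['salary'].notna()]
print(has_salary)
Output:
name age city salary
0 John 28 New York 50000.0
2 Peter 19 Berlin 30000.0
3 Linda 42 NaN 85000.0
4 Bob 31 Tokyo 60000.0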
Real-world Example: Data Analysis
Let's apply these techniques to a realistic scenario using a larger dataset:
# Create a more realistic dataset
np.random.seed(42)
n = 100
data = {
    'customer_id': range(1, n+1),
    'name': [f"Customer_{i}" for i in range(1, n+1)],
    'age': np.random.randint(18, 70, n),
    'purchase_amount': np.random.randint(10, 500, n),
    'subscription_type': np.random.choice(['Basic', 'Premium', 'Gold'], n),
    'active': np.random.choice([True, False], n, p=[0.8, 0.2])
}
customers = pd.DataFrame(data)
print(customers.head())
Output:
customer_id name age purchase_amount subscription_type active
0 1 Customer_1 42 37 Basic True
1 2 Customer_2 52 373 Premium False
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
4 5 Customer_5 58 386 Basic True
Analysis Task: Finding Valuable Customers
Let's identify active premium or gold customers with high purchase amounts:
# Find high-value customers
high_value = customers[
    (customers['active'] == True) &
    (customers['subscription_type'].isin(['Premium', 'Gold'])) &
    (customers['purchase_amount'] > 300)
]
print(f"Number of high-value customers: {len(high_value)}")
print(high_value.head())
Output:
Number of high-value customers: 20
customer_id name age purchase_amount subscription_type active
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
9 10 Customer_10 27 493 Premium True
19 20 Customer_20 42 495 Premium True
25 26 Customer_26 46 456 Premium True
Creating Segments Based on Multiple Conditions
# Create customer segments based on age and purchase behavior
def assign_segment(row):
    if row['age'] < 30 and row['purchase_amount'] > 200:
        return 'Young Big Spender'
    elif row['age'] >= 30 and row['purchase_amount'] > 200:
        return 'Mature Big Spender'
    elif row['age'] < 30:
        return 'Young Regular'
    else:
        return 'Mature Regular'
customers['segment'] = customers.apply(assign_segment, axis=1)
# Count customers in each segment
segment_counts = customers.groupby('segment').size()
print(segment_counts)
Output:
segment
Mature Big Spender 27
Mature Regular 25
Young Big Spender 23
Young Regular 25
dtype: int64
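For larger datasets, a vectorized approach such as np.select is usually much faster than apply(). Here is a sketch that should produce the same labels (the segment_vectorized column name is just for illustration):
# Vectorized segmentation: conditions are checked in order, first match wins
conditions = [
    (customers['age'] < 30) & (customers['purchase_amount'] > 200),
    (customers['age'] >= 30) & (customers['purchase_amount'] > 200),
    (customers['age'] < 30),
]
choices = ['Young Big Spender', 'Mature Big Spender', 'Young Regular']
customers['segment_vectorized'] = np.select(conditions, choices, default='Mature Regular')
# Verify that it matches the apply() result
print((customers['segment'] == customers['segment_vectorized']).all())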
Summary
Advanced selection in pandas gives you powerful tools to extract exactly the data you need. Here's what we've covered:
- Boolean indexing for condition-based selection
- Combining multiple conditions with logical operators
- Using specialized methods like isin() and str.contains()
- Advanced usage of .loc and .iloc
- Working with hierarchical (multi) indexes
- Using query() and eval() for expressive selections
- Handling missing data in selections
- Applying these techniques to real-world analysis tasks
These methods allow you to write expressive, concise code to analyze and manipulate complex datasets effectively.
Exercises
To practice these concepts, try these exercises:
- Create a DataFrame with sales data and select all transactions above a certain value
- Filter customers by multiple demographic attributes
- Work with a MultiIndex DataFrame and select data at different levels
- Use query() to filter data based on complex conditions
- Create a custom analysis that combines at least three different selection techniques
Additional Resources
- Pandas Official Documentation on Indexing and Selecting
- Pandas Query Method Documentation
- Working with Missing Data in Pandas
Happy data wrangling!