Skip to main content

Pandas Advanced Selection

In this tutorial, we'll explore advanced techniques for selecting and filtering data in pandas. After mastering the basics of indexing and selection, these advanced methods will give you more power and flexibility when working with complex datasets.

Introduction

Pandas provides numerous ways to select and filter data beyond simple indexing. These advanced selection techniques allow you to:

  • Extract data based on complex conditions
  • Work with hierarchical indexes
  • Apply sophisticated filtering logic to your DataFrames
  • Combine multiple selection criteria

These skills are essential for effective data analysis and manipulation in real-world applications.

Prerequisites

Before diving into advanced selection techniques, make sure you're familiar with:

  • Basic pandas operations
  • DataFrame and Series objects
  • Basic indexing and selection with .loc and .iloc

Let's import pandas to get started:

python
import pandas as pd
import numpy as np

Boolean Indexing

Boolean indexing is one of the most powerful selection techniques in pandas. It lets you filter data based on conditions.

Basic Boolean Filtering

Let's create a sample DataFrame:

python
# Create a simple DataFrame
data = {
'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
'age': [28, 34, 19, 42, 31],
'city': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
'salary': [50000, 65000, 30000, 85000, 60000]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  age     city  salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000

To select rows based on a condition, use a boolean expression:

python
# Select rows where age is over 30
over_30 = df[df['age'] > 30]
print(over_30)

Output:

    name  age    city  salary
1 Anna 34 Paris 65000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000

Multiple Conditions

You can combine multiple conditions using logical operators & (and), | (or), and ~ (not):

python
# Select people over 30 years old AND with salary above 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print(filtered_df)

Output:

    name  age    city  salary
3 Linda 42 London 85000
python
# Select people from Paris OR Tokyo
cities_filter = df[(df['city'] == 'Paris') | (df['city'] == 'Tokyo')]
print(cities_filter)

Output:

   name  age   city  salary
1 Anna 34 Paris 65000
4 Bob 31 Tokyo 60000

Note: When combining multiple conditions, always wrap each condition in parentheses.

Using isin() for Multiple Values

To check if values are in a specific set, use isin():

python
# Select rows where city is in a list of cities
selected_cities = ['New York', 'London', 'Berlin']
city_filter = df[df['city'].isin(selected_cities)]
print(city_filter)

Output:

    name  age     city  salary
0 John 28 New York 50000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000

Filtering on Strings

Pandas provides string methods through .str accessor:

python
# Filter names that start with 'J'
j_names = df[df['name'].str.startswith('J')]
print(j_names)

Output:

   name  age     city  salary
0 John 28 New York 50000
python
# Filter cities containing the letter 'o'
o_cities = df[df['city'].str.contains('o')]
print(o_cities)

Output:

    name  age     city  salary
0 John 28 New York 50000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000

Advanced loc and iloc

The .loc and .iloc indexers become even more powerful when combined with boolean indexing.

Advanced .loc Usage

.loc can be used with boolean masks for more complex selections:

python
# Select specific columns for rows with age > 30
result = df.loc[df['age'] > 30, ['name', 'city']]
print(result)

Output:

    name    city
1 Anna Paris
3 Linda London
4 Bob Tokyo

Combining Row and Column Selections

python
# Select specific rows and columns using .loc
result = df.loc[df['salary'] > 60000, 'name':'city']
print(result)

Output:

    name  age    city
1 Anna 34 Paris
3 Linda 42 London

Hierarchical Indexing (MultiIndex)

Hierarchical indexing allows you to work with higher-dimensional data in a lower-dimensional representation.

Creating a MultiIndex DataFrame

python
# Create a DataFrame with MultiIndex
arrays = [
['California', 'California', 'New York', 'New York', 'Florida', 'Florida'],
['Los Angeles', 'San Francisco', 'New York City', 'Buffalo', 'Miami', 'Orlando']
]
index = pd.MultiIndex.from_arrays(arrays, names=['State', 'City'])
data = {
'Population': [3970000, 870000, 8420000, 255000, 454000, 280000],
'Year Founded': [1781, 1776, 1624, 1789, 1896, 1875]
}
multiindex_df = pd.DataFrame(data, index=index)
print(multiindex_df)

Output:

                        Population  Year Founded
State City
California Los Angeles 3970000 1781
San Francisco 870000 1776
New York New York City 8420000 1624
Buffalo 255000 1789
Florida Miami 454000 1896
Orlando 280000 1875

Selecting with MultiIndex

python
# Select a specific state
california_data = multiindex_df.loc['California']
print(california_data)

Output:

               Population  Year Founded
City
Los Angeles 3970000 1781
San Francisco 870000 1776
python
# Select a specific state and city
specific_city = multiindex_df.loc[('New York', 'New York City')]
print(specific_city)

Output:

Population      8420000
Year Founded 1624
Name: (New York, New York City), dtype: int64

Cross-section Selection

python
# Select cross-section with .xs()
miami_data = multiindex_df.xs('Miami', level='City')
print(miami_data)

Output:

          Population  Year Founded
State
Florida 454000 1896

Advanced Query Methods

Using query()

The query() method allows you to use string expressions for filtering:

python
# Filter with query
young_high_earners = df.query('age < 30 and salary >= 50000')
print(young_high_earners)

Output:

   name  age     city  salary
0 John 28 New York 50000

Using eval()

The eval() method lets you create new columns from expressions:

python
# Create a new column using eval
df_with_bonus = df.eval('bonus = salary * 0.1')
print(df_with_bonus)

Output:

    name  age     city  salary   bonus
0 John 28 New York 50000 5000.0
1 Anna 34 Paris 65000 6500.0
2 Peter 19 Berlin 30000 3000.0
3 Linda 42 London 85000 8500.0
4 Bob 31 Tokyo 60000 6000.0

Handling Missing Values

Advanced selection often involves dealing with missing values.

python
# Create data with some NaN values
data_with_missing = df.copy()
data_with_missing.loc[1, 'salary'] = np.nan
data_with_missing.loc[3, 'city'] = np.nan
print(data_with_missing)

Output:

    name  age     city  salary
0 John 28 New York 50000.0
1 Anna 34 Paris NaN
2 Peter 19 Berlin 30000.0
3 Linda 42 NaN 85000.0
4 Bob 31 Tokyo 60000.0

Filtering Missing Values

python
# Drop rows with any missing value
cleaned_df = data_with_missing.dropna()
print(cleaned_df)

Output:

    name  age     city  salary
0 John 28 New York 50000.0
2 Peter 19 Berlin 30000.0
4 Bob 31 Tokyo 60000.0

Finding Missing Values

python
# Find rows with missing cities
missing_city = data_with_missing[data_with_missing['city'].isna()]
print(missing_city)

Output:

    name  age  city  salary
3 Linda 42 NaN 85000.0

Real-world Example: Data Analysis

Let's apply these techniques to a realistic scenario using a larger dataset:

python
# Create a more realistic dataset
np.random.seed(42)
n = 100
data = {
'customer_id': range(1, n+1),
'name': [f"Customer_{i}" for i in range(1, n+1)],
'age': np.random.randint(18, 70, n),
'purchase_amount': np.random.randint(10, 500, n),
'subscription_type': np.random.choice(['Basic', 'Premium', 'Gold'], n),
'active': np.random.choice([True, False], n, p=[0.8, 0.2])
}

customers = pd.DataFrame(data)
print(customers.head())

Output:

   customer_id       name  age  purchase_amount subscription_type  active
0 1 Customer_1 42 37 Basic True
1 2 Customer_2 52 373 Premium False
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
4 5 Customer_5 58 386 Basic True

Analysis Task: Finding Valuable Customers

Let's identify active premium or gold customers with high purchase amounts:

python
# Find high-value customers
high_value = customers[
(customers['active'] == True) &
(customers['subscription_type'].isin(['Premium', 'Gold'])) &
(customers['purchase_amount'] > 300)
]

print(f"Number of high-value customers: {len(high_value)}")
print(high_value.head())

Output:

Number of high-value customers: 20
customer_id name age purchase_amount subscription_type active
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
9 10 Customer_10 27 493 Premium True
19 20 Customer_20 42 495 Premium True
25 26 Customer_26 46 456 Premium True

Creating Segments Based on Multiple Conditions

python
# Create customer segments based on age and purchase behavior
def assign_segment(row):
if row['age'] < 30 and row['purchase_amount'] > 200:
return 'Young Big Spender'
elif row['age'] >= 30 and row['purchase_amount'] > 200:
return 'Mature Big Spender'
elif row['age'] < 30:
return 'Young Regular'
else:
return 'Mature Regular'

customers['segment'] = customers.apply(assign_segment, axis=1)

# Count customers in each segment
segment_counts = customers.groupby('segment').size()
print(segment_counts)

Output:

segment
Mature Big Spender 27
Mature Regular 25
Young Big Spender 23
Young Regular 25
dtype: int64

Summary

Advanced selection in pandas gives you powerful tools to extract exactly the data you need. Here's what we've covered:

  • Boolean indexing for condition-based selection
  • Combining multiple conditions with logical operators
  • Using specialized methods like isin(), str.contains()
  • Advanced usage of .loc and .iloc
  • Working with hierarchical (multi) indexes
  • Using query() and eval() for expressive selections
  • Handling missing data in selections
  • Applied these techniques to real-world analysis tasks

These methods allow you to write expressive, concise code to analyze and manipulate complex datasets effectively.

Exercises

To practice these concepts, try these exercises:

  1. Create a DataFrame with sales data and select all transactions above a certain value
  2. Filter customers by multiple demographic attributes
  3. Work with a MultiIndex DataFrame and select data at different levels
  4. Use query() to filter data based on complex conditions
  5. Create a custom analysis that combines at least three different selection techniques

Additional Resources

Happy data wrangling!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)