Pandas Advanced Selection
In this tutorial, we'll explore advanced techniques for selecting and filtering data in pandas. Once you've mastered the basics of indexing and selection, these advanced methods give you more power and flexibility when working with complex datasets.
Introduction
Pandas provides numerous ways to select and filter data beyond simple indexing. These advanced selection techniques allow you to:
- Extract data based on complex conditions
- Work with hierarchical indexes
- Apply sophisticated filtering logic to your DataFrames
- Combine multiple selection criteria
These skills are essential for effective data analysis and manipulation in real-world applications.
Prerequisites
Before diving into advanced selection techniques, make sure you're familiar with:
- Basic pandas operations
- DataFrame and Series objects
- Basic indexing and selection with .loc and .iloc
Let's import pandas and NumPy to get started:
import pandas as pd
import numpy as np
Boolean Indexing
Boolean indexing is one of the most powerful selection techniques in pandas. It lets you filter data based on conditions.
Basic Boolean Filtering
Let's create a sample DataFrame:
# Create a simple DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'age': [28, 34, 19, 42, 31],
    'city': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'salary': [50000, 65000, 30000, 85000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
To select rows based on a condition, use a boolean expression:
# Select rows where age is over 30
over_30 = df[df['age'] > 30]
print(over_30)
Output:
name age city salary
1 Anna 34 Paris 65000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
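For range checks, Series.between() is a compact alternative to writing two comparisons; here is a quick sketch using the same df:
# Select rows where age falls between 25 and 35 (inclusive)
mid_age = df[df['age'].between(25, 35)]
print(mid_age)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
4 Bob 31 Tokyo 60000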
Multiple Conditions
You can combine multiple conditions using the logical operators & (and), | (or), and ~ (not):
# Select people over 30 years old AND with salary above 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print(filtered_df)
Output:
name age city salary
3 Linda 42 London 85000
# Select people from Paris OR Tokyo
cities_filter = df[(df['city'] == 'Paris') | (df['city'] == 'Tokyo')]
print(cities_filter)
Output:
name age city salary
1 Anna 34 Paris 65000
4 Bob 31 Tokyo 60000
Note: When combining multiple conditions, always wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.
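The ~ operator negates a condition. For example, to select everyone who is not based in Berlin:
# Select people who are NOT from Berlin
not_berlin = df[~(df['city'] == 'Berlin')]
print(not_berlin)
Output:
name age city salary
0 John 28 New York 50000
1 Anna 34 Paris 65000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000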
Using isin() for Multiple Values
To check whether values belong to a specific set, use isin():
# Select rows where city is in a list of cities
selected_cities = ['New York', 'London', 'Berlin']
city_filter = df[df['city'].isin(selected_cities)]
print(city_filter)
Output:
name age city salary
0 John 28 New York 50000
2 Peter 19 Berlin 30000
3 Linda 42 London 85000
Filtering on Strings
Pandas provides vectorized string methods through the .str accessor:
# Filter names that start with 'J'
j_names = df[df['name'].str.startswith('J')]
print(j_names)
Output:
name age city salary
0 John 28 New York 50000
# Filter cities containing the letter 'o'
o_cities = df[df['city'].str.contains('o')]
print(o_cities)
Output:
name age city salary
0 John 28 New York 50000
3 Linda 42 London 85000
4 Bob 31 Tokyo 60000
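str.contains() also accepts a case argument for case-insensitive matching; a brief sketch on the same df (regex=False treats the pattern as a literal substring):
# Case-insensitive match for cities containing 'york'
york_cities = df[df['city'].str.contains('york', case=False, regex=False)]
print(york_cities)
Output:
name age city salary
0 John 28 New York 50000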
Advanced .loc and .iloc
The .loc and .iloc indexers become even more powerful when combined with boolean indexing.
Advanced .loc Usage
.loc can be used with boolean masks for more complex selections:
# Select specific columns for rows with age > 30
result = df.loc[df['age'] > 30, ['name', 'city']]
print(result)
Output:
name city
1 Anna Paris
3 Linda London
4 Bob Tokyo
Combining Row and Column Selections
# Select specific rows and columns using .loc
result = df.loc[df['salary'] > 60000, 'name':'city']
print(result)
Output:
name age city
1 Anna 34 Paris
3 Linda 42 London
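.iloc is the positional counterpart. To combine it with a boolean condition, it is safest to convert the mask to a plain NumPy array first, since .iloc typically expects integer positions or plain boolean arrays rather than labeled Series. A short sketch:
# Convert the boolean Series to a NumPy array for positional indexing
mask = (df['salary'] > 60000).to_numpy()
result = df.iloc[mask, [0, 2]]  # columns 0 ('name') and 2 ('city')
print(result)
Output:
name city
1 Anna Paris
3 Linda London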
Hierarchical Indexing (MultiIndex)
Hierarchical indexing allows you to work with higher-dimensional data in a lower-dimensional representation.
Creating a MultiIndex DataFrame
# Create a DataFrame with MultiIndex
arrays = [
    ['California', 'California', 'New York', 'New York', 'Florida', 'Florida'],
    ['Los Angeles', 'San Francisco', 'New York City', 'Buffalo', 'Miami', 'Orlando']
]
index = pd.MultiIndex.from_arrays(arrays, names=['State', 'City'])
data = {
    'Population': [3970000, 870000, 8420000, 255000, 454000, 280000],
    'Year Founded': [1781, 1776, 1624, 1789, 1896, 1875]
}
multiindex_df = pd.DataFrame(data, index=index)
print(multiindex_df)
Output:
Population Year Founded
State City
California Los Angeles 3970000 1781
San Francisco 870000 1776
New York New York City 8420000 1624
Buffalo 255000 1789
Florida Miami 454000 1896
Orlando 280000 1875
Selecting with MultiIndex
# Select a specific state
california_data = multiindex_df.loc['California']
print(california_data)
Output:
Population Year Founded
City
Los Angeles 3970000 1781
San Francisco 870000 1776
# Select a specific state and city
specific_city = multiindex_df.loc[('New York', 'New York City')]
print(specific_city)
Output:
Population 8420000
Year Founded 1624
Name: (New York, New York City), dtype: int64
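You can also pass a list of first-level labels to .loc to select several states at once:
# Select all rows for two states
two_states = multiindex_df.loc[['California', 'Florida']]
print(two_states)
Output:
Population Year Founded
State City
California Los Angeles 3970000 1781
San Francisco 870000 1776
Florida Miami 454000 1896
Orlando 280000 1875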
Cross-section Selection
# Select cross-section with .xs()
miami_data = multiindex_df.xs('Miami', level='City')
print(miami_data)
Output:
Population Year Founded
State
Florida 454000 1896
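If you want to keep the City level in the result, xs() accepts drop_level=False:
# Keep the selected level in the index
miami_full = multiindex_df.xs('Miami', level='City', drop_level=False)
print(miami_full)
Output:
Population Year Founded
State City
Florida Miami 454000 1896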
Advanced Query Methods
Using query()
The query() method allows you to use string expressions for filtering:
# Filter with query
young_high_earners = df.query('age < 30 and salary >= 50000')
print(young_high_earners)
Output:
name age city salary
0 John 28 New York 50000
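query() can also reference local Python variables with the @ prefix, which keeps thresholds out of the query string itself. A small sketch (min_salary is just an illustrative variable name):
# Reference a local variable inside the query string with @
min_salary = 60000
high_earners = df.query('salary > @min_salary')
print(high_earners)
Output:
name age city salary
1 Anna 34 Paris 65000
3 Linda 42 London 85000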
Using eval()
The eval() method lets you create new columns from expressions:
# Create a new column using eval
df_with_bonus = df.eval('bonus = salary * 0.1')
print(df_with_bonus)
Output:
name age city salary bonus
0 John 28 New York 50000 5000.0
1 Anna 34 Paris 65000 6500.0
2 Peter 19 Berlin 30000 3000.0
3 Linda 42 London 85000 8500.0
4 Bob 31 Tokyo 60000 6000.0
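eval() can also evaluate several assignments in one call (supported in recent pandas versions); a brief sketch where the total column is just for illustration:
# Create two derived columns in a single eval() call
df_extended = df.eval('''
bonus = salary * 0.1
total = salary + bonus
''')
print(df_extended[['name', 'salary', 'bonus', 'total']])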
Handling Missing Values
Advanced selection often involves dealing with missing values.
# Create data with some NaN values
data_with_missing = df.copy()
data_with_missing.loc[1, 'salary'] = np.nan
data_with_missing.loc[3, 'city'] = np.nan
print(data_with_missing)
Output:
name age city salary
0 John 28 New York 50000.0
1 Anna 34 Paris NaN
2 Peter 19 Berlin 30000.0
3 Linda 42 NaN 85000.0
4 Bob 31 Tokyo 60000.0
Filtering Missing Values
# Drop rows with any missing value
cleaned_df = data_with_missing.dropna()
print(cleaned_df)
Output:
name age city salary
0 John 28 New York 50000.0
2 Peter 19 Berlin 30000.0
4 Bob 31 Tokyo 60000.0
Finding Missing Values
# Find rows with missing cities
missing_city = data_with_missing[data_with_missing['city'].isna()]
print(missing_city)
Output:
name age city salary
3 Linda 42 NaN 85000.0
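The inverse check, notna(), selects the rows where a value is present:
# Keep only rows where salary is present
has_salary = data_with_missing[data_with_missing['salary'].notna()]
print(has_salary)
Output:
name age city salary
0 John 28 New York 50000.0
2 Peter 19 Berlin 30000.0
3 Linda 42 NaN 85000.0
4 Bob 31 Tokyo 60000.0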
Real-world Example: Data Analysis
Let's apply these techniques to a realistic scenario using a larger dataset:
# Create a more realistic dataset
np.random.seed(42)
n = 100
data = {
    'customer_id': range(1, n+1),
    'name': [f"Customer_{i}" for i in range(1, n+1)],
    'age': np.random.randint(18, 70, n),
    'purchase_amount': np.random.randint(10, 500, n),
    'subscription_type': np.random.choice(['Basic', 'Premium', 'Gold'], n),
    'active': np.random.choice([True, False], n, p=[0.8, 0.2])
}
customers = pd.DataFrame(data)
print(customers.head())
Output:
customer_id name age purchase_amount subscription_type active
0 1 Customer_1 42 37 Basic True
1 2 Customer_2 52 373 Premium False
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
4 5 Customer_5 58 386 Basic True
Analysis Task: Finding Valuable Customers
Let's identify active premium or gold customers with high purchase amounts:
# Find high-value customers
high_value = customers[
    (customers['active'] == True) &
    (customers['subscription_type'].isin(['Premium', 'Gold'])) &
    (customers['purchase_amount'] > 300)
]
print(f"Number of high-value customers: {len(high_value)}")
print(high_value.head())
Output:
Number of high-value customers: 20
customer_id name age purchase_amount subscription_type active
2 3 Customer_3 54 358 Premium True
3 4 Customer_4 25 408 Gold True
9 10 Customer_10 27 493 Premium True
19 20 Customer_20 42 495 Premium True
25 26 Customer_26 46 456 Premium True
Creating Segments Based on Multiple Conditions
# Create customer segments based on age and purchase behavior
def assign_segment(row):
    if row['age'] < 30 and row['purchase_amount'] > 200:
        return 'Young Big Spender'
    elif row['age'] >= 30 and row['purchase_amount'] > 200:
        return 'Mature Big Spender'
    elif row['age'] < 30:
        return 'Young Regular'
    else:
        return 'Mature Regular'
customers['segment'] = customers.apply(assign_segment, axis=1)
# Count customers in each segment
segment_counts = customers.groupby('segment').size()
print(segment_counts)
Output:
segment
Mature Big Spender 27
Mature Regular 25
Young Big Spender 23
Young Regular 25
dtype: int64
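For larger datasets, a vectorized approach such as np.select is usually much faster than apply(). Here is a sketch that should produce the same labels (the segment_vectorized column name is just for illustration):
# Vectorized segmentation: conditions are checked in order, first match wins
conditions = [
    (customers['age'] < 30) & (customers['purchase_amount'] > 200),
    (customers['age'] >= 30) & (customers['purchase_amount'] > 200),
    (customers['age'] < 30),
]
choices = ['Young Big Spender', 'Mature Big Spender', 'Young Regular']
customers['segment_vectorized'] = np.select(conditions, choices, default='Mature Regular')
# Verify that it matches the apply() result
print((customers['segment'] == customers['segment_vectorized']).all())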
Summary
Advanced selection in pandas gives you powerful tools to extract exactly the data you need. Here's what we've covered:
- Boolean indexing for condition-based selection
- Combining multiple conditions with logical operators
- Using specialized methods like isin() and str.contains()
- Advanced usage of .loc and .iloc
- Working with hierarchical (multi) indexes
- Using query() and eval() for expressive selections
- Handling missing data in selections
- Applying these techniques to real-world analysis tasks
These methods allow you to write expressive, concise code to analyze and manipulate complex datasets effectively.
Exercises
To practice these concepts, try these exercises:
- Create a DataFrame with sales data and select all transactions above a certain value
- Filter customers by multiple demographic attributes
- Work with a MultiIndex DataFrame and select data at different levels
- Use query() to filter data based on complex conditions
- Create a custom analysis that combines at least three different selection techniques
Additional Resources
- Pandas Official Documentation on Indexing and Selecting
- Pandas Query Method Documentation
- Working with Missing Data in Pandas
Happy data wrangling!