Pandas Boolean Selection

Introduction

Boolean selection in pandas is a powerful technique that allows you to filter data based on specific conditions. Rather than selecting data by position or label, boolean selection lets you filter rows or columns that meet certain criteria. This approach is incredibly useful for data analysis tasks where you need to focus on specific subsets of your data that satisfy certain conditions.

In this guide, we'll explore how to use boolean indexing and masks in pandas to select data based on logical conditions. This is one of the most commonly used features in data analysis with pandas, so mastering it will significantly enhance your data manipulation skills.

Basic Boolean Selection

At its core, boolean selection in pandas works by using a Series of True or False values (also called a mask) to determine which elements to include in the result.

Let's start with a simple example:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Boston', 'New York'],
    'Salary': [50000, 60000, 70000, 80000, 65000]
})

print(df)

This will output:

      Name  Age      City  Salary
  Alice   25  New York   50000
    Bob   30    Boston   60000
Charlie   35   Chicago   70000
  David   40    Boston   80000
    Eva   28  New York   65000

Now, let's filter the DataFrame to show only people who are older than 30:

python
# Create a boolean mask
mask = df['Age'] > 30

# Print the mask
print(mask)

# Use the mask to filter the DataFrame
filtered_df = df[mask]
print("\nPeople older than 30:")
print(filtered_df)

Output:

  False
  False
   True
   True
  False
Name: Age, dtype: bool

People older than 30:
      Name  Age     City  Salary
Charlie   35  Chicago   70000
  David   40   Boston   80000

Notice how the mask is a Series of boolean values that corresponds to each row in our DataFrame. When we apply this mask to the DataFrame with df[mask], we get back only the rows where the mask is True.

Combining Multiple Conditions

You can combine multiple boolean conditions using operators like & (AND), | (OR), and ~ (NOT).

Important: When combining conditions, you must use & instead of and, | instead of or, and each condition must be enclosed in parentheses.

Example with multiple conditions:

python
# People who are older than 30 AND live in Boston
mask = (df['Age'] > 30) & (df['City'] == 'Boston')
print("People older than 30 who live in Boston:")
print(df[mask])

# People who live in New York OR have a salary greater than 70000
mask = (df['City'] == 'New York') | (df['Salary'] > 70000)
print("\nPeople who live in New York OR have a salary greater than 70000:")
print(df[mask])

# People who are NOT from Boston
mask = ~(df['City'] == 'Boston')
print("\nPeople who are NOT from Boston:")
print(df[mask])

Output:

People older than 30 who live in Boston:
     Name  Age    City  Salary
3  David   40  Boston   80000

People who live in New York OR have a salary greater than 70000:
      Name  Age      City  Salary
0    Alice   25  New York   50000
3    David   40    Boston   80000
4      Eva   28  New York   65000

People who are NOT from Boston:
      Name  Age      City  Salary
0    Alice   25  New York   50000
2  Charlie   35   Chicago   70000
4      Eva   28  New York   65000

The `isin()` Method

The isin() method is useful when you want to filter rows based on whether a value is in a list of values:

python
# Filter people who live in either New York or Chicago
cities_of_interest = ['New York', 'Chicago']
mask = df['City'].isin(cities_of_interest)
print("People from New York or Chicago:")
print(df[mask])

Output:

People from New York or Chicago:
      Name  Age      City  Salary
0    Alice   25  New York   50000
2  Charlie   35   Chicago   70000
4      Eva   28  New York   65000

You can also use the negation (~) to find values not in the list:

python
# People who don't live in New York or Chicago
mask = ~df['City'].isin(cities_of_interest)
print("People NOT from New York or Chicago:")
print(df[mask])

Output:

People NOT from New York or Chicago:
    Name  Age    City  Salary
1    Bob   30  Boston   60000
3  David   40  Boston   80000

String Methods with Boolean Selection

Pandas provides string methods through the .str accessor which can be combined with boolean selection:

python
# Find people whose names start with 'A'
mask = df['Name'].str.startswith('A')
print("People whose names start with 'A':")
print(df[mask])

# Find people whose cities contain the letter 'o'
mask = df['City'].str.contains('o')
print("\nPeople whose cities contain the letter 'o':")
print(df[mask])

Output:

People whose names start with 'A':
    Name  Age      City  Salary
0  Alice   25  New York   50000

People whose cities contain the letter 'o':
    Name  Age     City  Salary
1    Bob   30   Boston   60000
3  David   40   Boston   80000

Applying Boolean Selection to Columns

Boolean selection isn't limited to rows - you can also select specific columns based on conditions:

python
# Create a DataFrame with numeric columns
numeric_df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [0.5, 0.6, 0.7, 0.8, 0.9],
    'C': [10, 20, 30, 40, 50],
    'D': [-1, -2, -3, -4, -5]
})

# Select columns with a mean value greater than 5
mask = numeric_df.mean() > 5
print("Columns with mean > 5:")
print(numeric_df.loc[:, mask])

Output:

Columns with mean > 5:
    C
10
20
30
40
50

Real-world Example: Data Analysis

Let's use boolean selection for some real-world data analysis tasks. We'll work with a dataset representing sales data:

python
# Create a sample sales dataset
sales_data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 
                'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop'],
    'Region': ['North', 'South', 'East', 'West', 'North', 
               'South', 'East', 'West', 'North', 'South'],
    'Units_Sold': [10, 25, 15, 30, 20, 10, 15, 25, 30, 20],
    'Price': [1200, 800, 600, 1200, 800, 600, 1200, 800, 600, 1200]
})

# Add a Revenue column
sales_data['Revenue'] = sales_data['Units_Sold'] * sales_data['Price']

print(sales_data)

Output:

        Date Product Region  Units_Sold  Price  Revenue
2023-01-01  Laptop  North          10   1200    12000
2023-01-02   Phone  South          25    800    20000
2023-01-03  Tablet   East          15    600     9000
2023-01-04  Laptop   West          30   1200    36000
2023-01-05   Phone  North          20    800    16000
2023-01-06  Tablet  South          10    600     6000
2023-01-07  Laptop   East          15   1200    18000
2023-01-08   Phone   West          25    800    20000
2023-01-09  Tablet  North          30    600    18000
2023-01-10  Laptop  South          20   1200    24000

Now, let's use boolean selection to answer some business questions:

1. Find high-value transactions (revenue > 20000):

python
high_value = sales_data[sales_data['Revenue'] > 20000]
print("High-value transactions:")
print(high_value)

Output:

High-value transactions:
        Date Product Region  Units_Sold  Price  Revenue
3 2023-01-04  Laptop   West          30   1200    36000
9 2023-01-10  Laptop  South          20   1200    24000

2. Find laptop sales from North or South regions:

python
laptop_north_south = sales_data[(sales_data['Product'] == 'Laptop') & 
                               (sales_data['Region'].isin(['North', 'South']))]
print("\nLaptop sales from North or South regions:")
print(laptop_north_south)

Output:

Laptop sales from North or South regions:
        Date Product Region  Units_Sold  Price  Revenue
0 2023-01-01  Laptop  North          10   1200    12000
9 2023-01-10  Laptop  South          20   1200    24000

3. Find transactions in the first week with more than 15 units sold:

python
first_week = (sales_data['Date'] <= '2023-01-07') & (sales_data['Units_Sold'] > 15)
print("\nFirst week transactions with more than 15 units sold:")
print(sales_data[first_week])

Output:

First week transactions with more than 15 units sold:
        Date Product Region  Units_Sold  Price  Revenue
1 2023-01-02   Phone  South          25    800    20000
3 2023-01-04  Laptop   West          30   1200    36000
4 2023-01-05   Phone  North          20    800    16000

Using `query()` Method for Boolean Selection

For complex filtering conditions, the query() method can make your code more readable:

python
# Using query() instead of boolean masks
result = sales_data.query("Product == 'Phone' and Units_Sold > 20")
print("Phone transactions with more than 20 units sold (using query):")
print(result)

Output:

Phone transactions with more than 20 units sold (using query):
        Date Product Region  Units_Sold  Price  Revenue
1 2023-01-02   Phone  South          25    800    20000
7 2023-01-08   Phone   West          25    800    20000

Modifying Data Using Boolean Selection

Boolean selection is not just for reading data; you can also use it to modify values that meet certain conditions:

python
# Create a copy of the DataFrame to avoid modifying the original
df_modified = sales_data.copy()

# Apply a 10% discount to all Laptop prices
mask = df_modified['Product'] == 'Laptop'
df_modified.loc[mask, 'Price'] = df_modified.loc[mask, 'Price'] * 0.9

# Update the Revenue column to reflect the discounted prices
df_modified.loc[mask, 'Revenue'] = df_modified.loc[mask, 'Units_Sold'] * df_modified.loc[mask, 'Price']

print("DataFrame after applying discount to Laptops:")
print(df_modified)

Output:

        Date Product Region  Units_Sold  Price  Revenue
2023-01-01  Laptop  North          10  1080.0  10800.0
2023-01-02   Phone  South          25   800.0  20000.0
2023-01-03  Tablet   East          15   600.0   9000.0
2023-01-04  Laptop   West          30  1080.0  32400.0
2023-01-05   Phone  North          20   800.0  16000.0
2023-01-06  Tablet  South          10   600.0   6000.0
2023-01-07  Laptop   East          15  1080.0  16200.0
2023-01-08   Phone   West          25   800.0  20000.0
2023-01-09  Tablet  North          30   600.0  18000.0
2023-01-10  Laptop  South          20  1080.0  21600.0

Summary

In this guide, we've explored pandas boolean selection, a powerful feature for filtering and selecting data based on conditions:

Basic boolean selection using masks: df[df['column'] > value]
Combining conditions with & (AND), | (OR), and ~ (NOT)
Using the isin() method to check if values are in a list
Applying string methods with .str for text-based filtering
Using boolean selection to filter columns based on conditions
Real-world examples with sales data analysis
Using the query() method for more readable filtering
Modifying data that meets specific conditions

Boolean selection is a fundamental skill in pandas that will help you efficiently filter and manipulate data according to your analysis needs. With practice, you'll find that boolean indexing makes complex data filtering tasks much simpler.

Exercises

To strengthen your understanding of boolean selection in pandas, try these exercises:

Create a DataFrame with student records (name, age, grade, subject) and filter students who:
- Are older than 15 and have a grade higher than 85
- Take either Math or Science and have a grade lower than 70
For the sales data we worked with:
- Find regions where tablet sales exceeded 20 units
- Calculate the average revenue for each product
- Identify dates where total sales across all products exceeded 25 units
Create a mask to identify outliers in a dataset (values more than 2 standard deviations from the mean)

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Basic Boolean Selection​

Combining Multiple Conditions​

Example with multiple conditions:​

The isin() Method​

String Methods with Boolean Selection​

Applying Boolean Selection to Columns​

Real-world Example: Data Analysis​

1. Find high-value transactions (revenue > 20000):​

2. Find laptop sales from North or South regions:​

3. Find transactions in the first week with more than 15 units sold:​

Using query() Method for Boolean Selection​

Modifying Data Using Boolean Selection​

Summary​

Exercises​

Additional Resources​

Introduction

Basic Boolean Selection

Combining Multiple Conditions

Example with multiple conditions:

The `isin()` Method

String Methods with Boolean Selection

Applying Boolean Selection to Columns

Real-world Example: Data Analysis

1. Find high-value transactions (revenue > 20000):

2. Find laptop sales from North or South regions:

3. Find transactions in the first week with more than 15 units sold:

Using `query()` Method for Boolean Selection

Modifying Data Using Boolean Selection

Summary

Exercises

Additional Resources