Skip to main content

Pandas Boolean Selection

Introduction

Boolean selection in pandas is a powerful technique that allows you to filter data based on specific conditions. Rather than selecting data by position or label, boolean selection lets you filter rows or columns that meet certain criteria. This approach is incredibly useful for data analysis tasks where you need to focus on specific subsets of your data that satisfy certain conditions.

In this guide, we'll explore how to use boolean indexing and masks in pandas to select data based on logical conditions. This is one of the most commonly used features in data analysis with pandas, so mastering it will significantly enhance your data manipulation skills.

Basic Boolean Selection

At its core, boolean selection in pandas works by using a Series of True or False values (also called a mask) to determine which elements to include in the result.

Let's start with a simple example:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 28],
'City': ['New York', 'Boston', 'Chicago', 'Boston', 'New York'],
'Salary': [50000, 60000, 70000, 80000, 65000]
})

print(df)

This will output:

      Name  Age      City  Salary
0 Alice 25 New York 50000
1 Bob 30 Boston 60000
2 Charlie 35 Chicago 70000
3 David 40 Boston 80000
4 Eva 28 New York 65000

Now, let's filter the DataFrame to show only people who are older than 30:

python
# Create a boolean mask
mask = df['Age'] > 30

# Print the mask
print(mask)

# Use the mask to filter the DataFrame
filtered_df = df[mask]
print("\nPeople older than 30:")
print(filtered_df)

Output:

0    False
1 False
2 True
3 True
4 False
Name: Age, dtype: bool

People older than 30:
Name Age City Salary
2 Charlie 35 Chicago 70000
3 David 40 Boston 80000

Notice how the mask is a Series of boolean values that corresponds to each row in our DataFrame. When we apply this mask to the DataFrame with df[mask], we get back only the rows where the mask is True.

Combining Multiple Conditions

You can combine multiple boolean conditions using operators like & (AND), | (OR), and ~ (NOT).

Important: When combining conditions, you must use & instead of and, | instead of or, and each condition must be enclosed in parentheses.

Example with multiple conditions:

python
# People who are older than 30 AND live in Boston
mask = (df['Age'] > 30) & (df['City'] == 'Boston')
print("People older than 30 who live in Boston:")
print(df[mask])

# People who live in New York OR have a salary greater than 70000
mask = (df['City'] == 'New York') | (df['Salary'] > 70000)
print("\nPeople who live in New York OR have a salary greater than 70000:")
print(df[mask])

# People who are NOT from Boston
mask = ~(df['City'] == 'Boston')
print("\nPeople who are NOT from Boston:")
print(df[mask])

Output:

People older than 30 who live in Boston:
Name Age City Salary
3 David 40 Boston 80000

People who live in New York OR have a salary greater than 70000:
Name Age City Salary
0 Alice 25 New York 50000
3 David 40 Boston 80000
4 Eva 28 New York 65000

People who are NOT from Boston:
Name Age City Salary
0 Alice 25 New York 50000
2 Charlie 35 Chicago 70000
4 Eva 28 New York 65000

The isin() Method

The isin() method is useful when you want to filter rows based on whether a value is in a list of values:

python
# Filter people who live in either New York or Chicago
cities_of_interest = ['New York', 'Chicago']
mask = df['City'].isin(cities_of_interest)
print("People from New York or Chicago:")
print(df[mask])

Output:

People from New York or Chicago:
Name Age City Salary
0 Alice 25 New York 50000
2 Charlie 35 Chicago 70000
4 Eva 28 New York 65000

You can also use the negation (~) to find values not in the list:

python
# People who don't live in New York or Chicago
mask = ~df['City'].isin(cities_of_interest)
print("People NOT from New York or Chicago:")
print(df[mask])

Output:

People NOT from New York or Chicago:
Name Age City Salary
1 Bob 30 Boston 60000
3 David 40 Boston 80000

String Methods with Boolean Selection

Pandas provides string methods through the .str accessor which can be combined with boolean selection:

python
# Find people whose names start with 'A'
mask = df['Name'].str.startswith('A')
print("People whose names start with 'A':")
print(df[mask])

# Find people whose cities contain the letter 'o'
mask = df['City'].str.contains('o')
print("\nPeople whose cities contain the letter 'o':")
print(df[mask])

Output:

People whose names start with 'A':
Name Age City Salary
0 Alice 25 New York 50000

People whose cities contain the letter 'o':
Name Age City Salary
1 Bob 30 Boston 60000
3 David 40 Boston 80000

Applying Boolean Selection to Columns

Boolean selection isn't limited to rows - you can also select specific columns based on conditions:

python
# Create a DataFrame with numeric columns
numeric_df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [0.5, 0.6, 0.7, 0.8, 0.9],
'C': [10, 20, 30, 40, 50],
'D': [-1, -2, -3, -4, -5]
})

# Select columns with a mean value greater than 5
mask = numeric_df.mean() > 5
print("Columns with mean > 5:")
print(numeric_df.loc[:, mask])

Output:

Columns with mean > 5:
C
0 10
1 20
2 30
3 40
4 50

Real-world Example: Data Analysis

Let's use boolean selection for some real-world data analysis tasks. We'll work with a dataset representing sales data:

python
# Create a sample sales dataset
sales_data = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', periods=10),
'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone',
'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop'],
'Region': ['North', 'South', 'East', 'West', 'North',
'South', 'East', 'West', 'North', 'South'],
'Units_Sold': [10, 25, 15, 30, 20, 10, 15, 25, 30, 20],
'Price': [1200, 800, 600, 1200, 800, 600, 1200, 800, 600, 1200]
})

# Add a Revenue column
sales_data['Revenue'] = sales_data['Units_Sold'] * sales_data['Price']

print(sales_data)

Output:

        Date Product Region  Units_Sold  Price  Revenue
0 2023-01-01 Laptop North 10 1200 12000
1 2023-01-02 Phone South 25 800 20000
2 2023-01-03 Tablet East 15 600 9000
3 2023-01-04 Laptop West 30 1200 36000
4 2023-01-05 Phone North 20 800 16000
5 2023-01-06 Tablet South 10 600 6000
6 2023-01-07 Laptop East 15 1200 18000
7 2023-01-08 Phone West 25 800 20000
8 2023-01-09 Tablet North 30 600 18000
9 2023-01-10 Laptop South 20 1200 24000

Now, let's use boolean selection to answer some business questions:

1. Find high-value transactions (revenue > 20000):

python
high_value = sales_data[sales_data['Revenue'] > 20000]
print("High-value transactions:")
print(high_value)

Output:

High-value transactions:
Date Product Region Units_Sold Price Revenue
3 2023-01-04 Laptop West 30 1200 36000
9 2023-01-10 Laptop South 20 1200 24000

2. Find laptop sales from North or South regions:

python
laptop_north_south = sales_data[(sales_data['Product'] == 'Laptop') & 
(sales_data['Region'].isin(['North', 'South']))]
print("\nLaptop sales from North or South regions:")
print(laptop_north_south)

Output:

Laptop sales from North or South regions:
Date Product Region Units_Sold Price Revenue
0 2023-01-01 Laptop North 10 1200 12000
9 2023-01-10 Laptop South 20 1200 24000

3. Find transactions in the first week with more than 15 units sold:

python
first_week = (sales_data['Date'] <= '2023-01-07') & (sales_data['Units_Sold'] > 15)
print("\nFirst week transactions with more than 15 units sold:")
print(sales_data[first_week])

Output:

First week transactions with more than 15 units sold:
Date Product Region Units_Sold Price Revenue
1 2023-01-02 Phone South 25 800 20000
3 2023-01-04 Laptop West 30 1200 36000
4 2023-01-05 Phone North 20 800 16000

Using query() Method for Boolean Selection

For complex filtering conditions, the query() method can make your code more readable:

python
# Using query() instead of boolean masks
result = sales_data.query("Product == 'Phone' and Units_Sold > 20")
print("Phone transactions with more than 20 units sold (using query):")
print(result)

Output:

Phone transactions with more than 20 units sold (using query):
Date Product Region Units_Sold Price Revenue
1 2023-01-02 Phone South 25 800 20000
7 2023-01-08 Phone West 25 800 20000

Modifying Data Using Boolean Selection

Boolean selection is not just for reading data; you can also use it to modify values that meet certain conditions:

python
# Create a copy of the DataFrame to avoid modifying the original
df_modified = sales_data.copy()

# Apply a 10% discount to all Laptop prices
mask = df_modified['Product'] == 'Laptop'
df_modified.loc[mask, 'Price'] = df_modified.loc[mask, 'Price'] * 0.9

# Update the Revenue column to reflect the discounted prices
df_modified.loc[mask, 'Revenue'] = df_modified.loc[mask, 'Units_Sold'] * df_modified.loc[mask, 'Price']

print("DataFrame after applying discount to Laptops:")
print(df_modified)

Output:

        Date Product Region  Units_Sold  Price  Revenue
0 2023-01-01 Laptop North 10 1080.0 10800.0
1 2023-01-02 Phone South 25 800.0 20000.0
2 2023-01-03 Tablet East 15 600.0 9000.0
3 2023-01-04 Laptop West 30 1080.0 32400.0
4 2023-01-05 Phone North 20 800.0 16000.0
5 2023-01-06 Tablet South 10 600.0 6000.0
6 2023-01-07 Laptop East 15 1080.0 16200.0
7 2023-01-08 Phone West 25 800.0 20000.0
8 2023-01-09 Tablet North 30 600.0 18000.0
9 2023-01-10 Laptop South 20 1080.0 21600.0

Summary

In this guide, we've explored pandas boolean selection, a powerful feature for filtering and selecting data based on conditions:

  • Basic boolean selection using masks: df[df['column'] > value]
  • Combining conditions with & (AND), | (OR), and ~ (NOT)
  • Using the isin() method to check if values are in a list
  • Applying string methods with .str for text-based filtering
  • Using boolean selection to filter columns based on conditions
  • Real-world examples with sales data analysis
  • Using the query() method for more readable filtering
  • Modifying data that meets specific conditions

Boolean selection is a fundamental skill in pandas that will help you efficiently filter and manipulate data according to your analysis needs. With practice, you'll find that boolean indexing makes complex data filtering tasks much simpler.

Exercises

To strengthen your understanding of boolean selection in pandas, try these exercises:

  1. Create a DataFrame with student records (name, age, grade, subject) and filter students who:

    • Are older than 15 and have a grade higher than 85
    • Take either Math or Science and have a grade lower than 70
  2. For the sales data we worked with:

    • Find regions where tablet sales exceeded 20 units
    • Calculate the average revenue for each product
    • Identify dates where total sales across all products exceeded 25 units
  3. Create a mask to identify outliers in a dataset (values more than 2 standard deviations from the mean)

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)