
Pandas Filtering

Introduction

Filtering data is one of the most common and important operations in data analysis. When working with large datasets, you often need to focus on specific portions of your data that meet certain conditions. Pandas provides powerful and flexible ways to filter your DataFrames based on various criteria.

In this tutorial, we'll explore different techniques to filter Pandas DataFrames, from simple boolean conditions to more complex filtering operations. By the end, you'll be able to extract exactly the data you need from your datasets with confidence.

Basic Boolean Filtering

The most fundamental way to filter data in Pandas is using boolean indexing. This involves creating a boolean Series (containing True or False values) and using it to select rows from a DataFrame.

Let's start with a simple example:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'age': [28, 24, 35, 32, 45],
    'city': ['New York', 'Paris', 'Berlin', 'London', 'Sydney'],
    'salary': [50000, 60000, 72000, 80000, 65000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
    name  age      city  salary
0   John   28  New York   50000
1   Anna   24     Paris   60000
2  Peter   35    Berlin   72000
3  Linda   32    London   80000
4    Bob   45    Sydney   65000

Filtering with a Single Condition

To filter the DataFrame on a single condition, we use a comparison operator to create a boolean mask:

python
# Filter people older than 30
mask = df['age'] > 30
filtered_df = df[mask]

print("\nPeople older than 30:")
print(filtered_df)

Output:

People older than 30:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000
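
It helps to look at the mask itself: it is simply a boolean Series aligned with the DataFrame's index, and only the rows where it is True are kept.

python
# Inspect the boolean mask
print("\nBoolean mask:")
print(mask)

Output:

Boolean mask:
0    False
1    False
2     True
3     True
4     True
Name: age, dtype: bool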

The same filtering can be done in a single line:

python
# Same filter in one line
filtered_df = df[df['age'] > 30]
print("\nPeople older than 30 (one-line):")
print(filtered_df)

Output:

People older than 30 (one-line):
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

Filtering with Multiple Conditions

You can combine multiple conditions using logical operators:

  • & for AND
  • | for OR
  • ~ for NOT

Important: When combining multiple conditions, each condition must be enclosed in parentheses, because & and | bind more tightly than comparison operators such as > and ==.

python
# People older than 30 AND earning more than 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print("\nPeople older than 30 and earning more than 70000:")
print(filtered_df)

# People in New York OR Paris
filtered_df = df[(df['city'] == 'New York') | (df['city'] == 'Paris')]
print("\nPeople in New York or Paris:")
print(filtered_df)

# People NOT living in London
filtered_df = df[~(df['city'] == 'London')]
print("\nPeople not living in London:")
print(filtered_df)

Output:

People older than 30 and earning more than 70000:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000

People in New York or Paris:
   name  age      city  salary
0  John   28  New York   50000
1  Anna   24     Paris   60000

People not living in London:
    name  age      city  salary
0   John   28  New York   50000
1   Anna   24     Paris   60000
2  Peter   35    Berlin   72000
4    Bob   45    Sydney   65000
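
A common pitfall: Python's and, or, and not keywords do not work element-wise on Series and raise "ValueError: The truth value of a Series is ambiguous." That is why the bitwise operators &, |, and ~, together with parentheses around each condition, are required:

python
# This would raise ValueError: The truth value of a Series is ambiguous
# filtered_df = df[df['age'] > 30 and df['salary'] > 70000]

# Use & with parentheses around each condition instead
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]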

Advanced Filtering Techniques

Using isin() for Multiple Values

If you need to filter rows based on several possible values in a column, the isin() method is cleaner and less error-prone than chaining OR conditions with |:

python
# People in specific cities
cities_of_interest = ['New York', 'Paris', 'Berlin']
filtered_df = df[df['city'].isin(cities_of_interest)]
print("\nPeople in New York, Paris, or Berlin:")
print(filtered_df)

# People NOT in specific cities
filtered_df = df[~df['city'].isin(cities_of_interest)]
print("\nPeople NOT in New York, Paris, or Berlin:")
print(filtered_df)

Output:

People in New York, Paris, or Berlin:
    name  age      city  salary
0   John   28  New York   50000
1   Anna   24     Paris   60000
2  Peter   35    Berlin   72000

People NOT in New York, Paris, or Berlin:
    name  age    city  salary
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

Filtering with String Methods

Pandas provides string methods for filtering text data. Access them using the str accessor:

python
# Create a DataFrame with more text data
df_text = pd.DataFrame({
    'product': ['Laptop Pro', 'Tablet Mini', 'Smartphone X', 'Smartwatch Pro', 'Headphones Bass'],
    'description': ['Powerful laptop with SSD', 'Compact tablet for reading',
                    'Next-gen smartphone', 'Fitness tracking watch', 'Noise-cancelling headphones'],
    'price': [1200, 400, 800, 300, 150]
})

print("\nProducts DataFrame:")
print(df_text)

# Filter products containing "Pro" in their name
pro_products = df_text[df_text['product'].str.contains('Pro')]
print("\nProducts with 'Pro' in the name:")
print(pro_products)

# Filter products with descriptions mentioning "tablet" or "laptop" (case-insensitive)
computing_devices = df_text[df_text['description'].str.contains('tablet|laptop', case=False)]
print("\nTablets or Laptops:")
print(computing_devices)

Output:

Products DataFrame:
           product                  description  price
0       Laptop Pro     Powerful laptop with SSD   1200
1      Tablet Mini   Compact tablet for reading    400
2     Smartphone X          Next-gen smartphone    800
3   Smartwatch Pro       Fitness tracking watch    300
4  Headphones Bass  Noise-cancelling headphones    150

Products with 'Pro' in the name:
          product               description  price
0      Laptop Pro  Powerful laptop with SSD   1200
3  Smartwatch Pro    Fitness tracking watch    300

Tablets or Laptops:
       product                 description  price
0   Laptop Pro    Powerful laptop with SSD   1200
1  Tablet Mini  Compact tablet for reading    400
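
Besides str.contains(), the str accessor provides other matching methods such as str.startswith(), str.endswith(), and str.match(). For example, with the same df_text:

python
# Filter products whose name starts with "Smart"
smart_products = df_text[df_text['product'].str.startswith('Smart')]
print("\nProducts starting with 'Smart':")
print(smart_products)

Output:

Products starting with 'Smart':
          product             description  price
2    Smartphone X     Next-gen smartphone    800
3  Smartwatch Pro  Fitness tracking watch    300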

Filtering with query() Method

The query() method provides a concise way to filter DataFrames using string expressions:

python
# Filter using query() method
high_salary_young = df.query('salary > 60000 and age < 35')
print("\nPeople younger than 35 with salary over 60000:")
print(high_salary_young)

# Reference a Python variable inside query() with @
min_salary = 65000
high_earners = df.query('salary >= @min_salary')
print(f"\nPeople earning at least {min_salary}:")
print(high_earners)

Output:

People younger than 35 with salary over 60000:
    name  age    city  salary
3  Linda   32  London   80000

People earning at least 65000:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000
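
query() also supports in and not in, which mirrors isin(). Here is a small sketch reusing the cities_of_interest list defined earlier; it produces the same result as the isin() example above:

python
# query() equivalent of isin()
cities_of_interest = ['New York', 'Paris', 'Berlin']
in_cities = df.query('city in @cities_of_interest')
print("\nPeople in the cities of interest (query version):")
print(in_cities)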

Filtering with loc and iloc

You can also filter using the loc and iloc indexers:

python
# Filter with loc
london_residents = df.loc[df['city'] == 'London']
print("\nPeople in London (using loc):")
print(london_residents)

# Get specific columns for filtered data
name_age = df.loc[df['salary'] > 70000, ['name', 'age']]
print("\nNames and ages of high earners:")
print(name_age)

Output:

People in London (using loc):
    name  age    city  salary
3  Linda   32  London   80000

Names and ages of high earners:
    name  age
2  Peter   35
3  Linda   32
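
Unlike loc, iloc selects rows by integer position rather than by label or boolean mask, so it is better suited to positional slicing. If you want to combine it with a condition, one option is to convert the mask to positions with np.where (a minimal sketch using the same df):

python
# iloc selects by position, e.g. the first three rows
print("\nFirst three rows (using iloc):")
print(df.iloc[:3])

# Use a condition with iloc by converting the mask to integer positions
positions = np.where(df['salary'] > 70000)[0]
print("\nHigh earners (np.where + iloc):")
print(df.iloc[positions])

This selects the same two high earners (Peter and Linda) as the loc example above.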

Handling Missing Values

Filtering operations often involve handling missing values:

python
# Create a DataFrame with missing values
df_missing = pd.DataFrame({
    'name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'age': [28, np.nan, 35, 32, 45],
    'city': ['New York', 'Paris', None, 'London', 'Sydney'],
    'salary': [50000, 60000, np.nan, 80000, 65000]
})

print("\nDataFrame with missing values:")
print(df_missing)

# Filter rows with missing values in any column
rows_with_na = df_missing[df_missing.isna().any(axis=1)]
print("\nRows with any missing value:")
print(rows_with_na)

# Filter rows without missing values
complete_rows = df_missing.dropna()
print("\nRows without missing values:")
print(complete_rows)

# Filter rows where specific column is not null
valid_names = df_missing[df_missing['name'].notna()]
print("\nRows with non-null names:")
print(valid_names)

Output:

DataFrame with missing values:
    name   age      city   salary
0   John  28.0  New York  50000.0
1   Anna   NaN     Paris  60000.0
2   None  35.0      None      NaN
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0

Rows with any missing value:
   name   age   city   salary
1  Anna   NaN  Paris  60000.0
2  None  35.0   None      NaN

Rows without missing values:
    name   age      city   salary
0   John  28.0  New York  50000.0
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0

Rows with non-null names:
    name   age      city   salary
0   John  28.0  New York  50000.0
1   Anna   NaN     Paris  60000.0
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0
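
Null checks combine with other conditions like any other boolean mask. For example, to keep rows that have a name recorded and a salary above 60000:

python
# Rows with a known name and a salary above 60000
named_high_earners = df_missing[df_missing['name'].notna() & (df_missing['salary'] > 60000)]
print("\nRows with a known name and salary above 60000:")
print(named_high_earners)

Output:

Rows with a known name and salary above 60000:
    name   age    city   salary
3  Linda  32.0  London  80000.0
4    Bob  45.0  Sydney  65000.0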

Real-world Example: Data Analysis

Let's apply filtering techniques to a more realistic data analysis scenario using a dataset of sales transactions:

python
# Create a sales dataset
np.random.seed(42) # for reproducibility
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
products = ['Laptop', 'Monitor', 'Keyboard', 'Mouse', 'Headphones']
regions = ['North', 'South', 'East', 'West']

sales_data = []
for _ in range(200):
    date = np.random.choice(dates)
    product = np.random.choice(products)
    region = np.random.choice(regions)
    units = np.random.randint(1, 10)
    price = {'Laptop': 1200, 'Monitor': 300, 'Keyboard': 100, 'Mouse': 50, 'Headphones': 150}[product]
    sales_data.append([date, product, region, units, price, units * price])

sales_df = pd.DataFrame(sales_data, columns=['date', 'product', 'region', 'units', 'unit_price', 'total'])
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['month'] = sales_df['date'].dt.month_name()

print("Sales Dataset Sample:")
print(sales_df.head())

# 1. Find high-value transactions (over $1000)
high_value = sales_df[sales_df['total'] > 1000]
print(f"\nHigh-value transactions: {len(high_value)}")
print(high_value.head(3))

# 2. Find laptop sales in February in the West region
feb_west_laptops = sales_df[
    (sales_df['product'] == 'Laptop') &
    (sales_df['month'] == 'February') &
    (sales_df['region'] == 'West')
]
print(f"\nLaptop sales in February in West region: {len(feb_west_laptops)}")
print(feb_west_laptops)

# 3. Find most profitable product categories
product_performance = sales_df.groupby('product')['total'].sum().reset_index()
top_products = product_performance.sort_values('total', ascending=False)
print("\nProduct performance (by total sales):")
print(top_products)

# 4. Find days with exceptional sales (top 5% of daily sales)
daily_sales = sales_df.groupby(sales_df['date'].dt.date)['total'].sum().reset_index()
threshold = daily_sales['total'].quantile(0.95)
exceptional_days = daily_sales[daily_sales['total'] > threshold]
print("\nExceptional sales days (top 5%):")
print(exceptional_days)

Output (sample - actual values may vary):

Sales Dataset Sample:
         date   product  region  units  unit_price  total
0  2023-03-13    Laptop   North      7        1200   8400
1  2023-01-24     Mouse    East      5          50    250
2  2023-02-11  Keyboard    East      1         100    100
3  2023-02-19    Laptop    East      8        1200   9600
4  2023-01-21   Monitor    West      3         300    900

High-value transactions: 92
         date  product  region  units  unit_price  total     month
0  2023-03-13   Laptop   North      7        1200   8400     March
3  2023-02-19   Laptop    East      8        1200   9600  February
5  2023-03-05   Laptop   North      8        1200   9600     March

Laptop sales in February in West region: 3
          date  product  region  units  unit_price  total     month
26  2023-02-05   Laptop    West      2        1200   2400  February
43  2023-02-14   Laptop    West      7        1200   8400  February
91  2023-02-26   Laptop    West      3        1200   3600  February

Product performance (by total sales):
      product     total
0      Laptop  254400.0
3     Monitor   62100.0
4  Headphones   27300.0
2    Keyboard   10200.0
1       Mouse    8250.0

Exceptional sales days (top 5%):
          date    total
 9  2023-02-12  10200.0
17  2023-03-19  10500.0
30  2023-03-21  10800.0
40  2023-02-04  11850.0
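
Dates are another frequent filtering target. Because sales_df['date'] is a datetime column, comparison operators and Series.between() work directly (a small sketch; the exact count depends on the randomly generated data above):

python
# 5. Find all transactions in February using a date range
february_sales = sales_df[sales_df['date'].between('2023-02-01', '2023-02-28')]
print(f"\nTransactions in February: {len(february_sales)}")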

Summary

Filtering is a critical component of data analysis with Pandas. In this tutorial, we've covered:

  • Basic boolean filtering with single and multiple conditions
  • Using isin() for filtering with multiple possible values
  • String filtering with str.contains() and other string methods
  • The query() method for more readable filtering
  • Using loc and iloc for combined filtering and column selection
  • Handling missing values in filtering operations
  • Applying filtering techniques to real-world data analysis

Remember these key points:

  1. Always use parentheses around each condition when combining multiple conditions
  2. Choose the appropriate method based on your specific filtering needs
  3. Consider readability and performance when working with large datasets

Exercises

To practice your filtering skills, try these exercises:

  1. Create a DataFrame of students with columns for name, grade, subjects, and scores
  2. Filter students who got an A grade (90 or above)
  3. Find students who failed (below 60) in any subject
  4. Identify students who excel in science subjects but struggle in humanities
  5. Create a filtering operation that uses at least three conditions
  6. Use string filtering to find students whose names start with a specific letter
