Pandas Filtering
Introduction
Filtering data is one of the most common and important operations in data analysis. When working with large datasets, you often need to focus on specific portions of your data that meet certain conditions. Pandas provides powerful and flexible ways to filter your DataFrames based on various criteria.
In this tutorial, we'll explore different techniques to filter Pandas DataFrames, from simple boolean conditions to more complex filtering operations. By the end, you'll be able to extract exactly the data you need from your datasets with confidence.
Basic Boolean Filtering
The most fundamental way to filter data in Pandas is using boolean indexing. This involves creating a boolean Series (containing True
or False
values) and using it to select rows from a DataFrame.
Let's start with a simple example:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
'age': [28, 24, 35, 32, 45],
'city': ['New York', 'Paris', 'Berlin', 'London', 'Sydney'],
'salary': [50000, 60000, 72000, 80000, 65000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
name age city salary
0 John 28 New York 50000
1 Anna 24 Paris 60000
2 Peter 35 Berlin 72000
3 Linda 32 London 80000
4 Bob 45 Sydney 65000
Filtering with a Single Condition
To filter the DataFrame based on a single condition, we use the comparison operators to create a boolean mask:
# Filter people older than 30
mask = df['age'] > 30
filtered_df = df[mask]
print("\nPeople older than 30:")
print(filtered_df)
Output:
People older than 30:
name age city salary
2 Peter 35 Berlin 72000
3 Linda 32 London 80000
4 Bob 45 Sydney 65000
The same filtering can be done in a single line:
# Same filter in one line
filtered_df = df[df['age'] > 30]
print("\nPeople older than 30 (one-line):")
print(filtered_df)
Output:
People older than 30 (one-line):
name age city salary
2 Peter 35 Berlin 72000
3 Linda 32 London 80000
4 Bob 45 Sydney 65000
Filtering with Multiple Conditions
You can combine multiple conditions using logical operators:
&
for AND|
for OR~
for NOT
Important: When using multiple conditions, each condition must be enclosed in parentheses.
# People older than 30 AND earning more than 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print("\nPeople older than 30 and earning more than 70000:")
print(filtered_df)
# People in New York OR Paris
filtered_df = df[(df['city'] == 'New York') | (df['city'] == 'Paris')]
print("\nPeople in New York or Paris:")
print(filtered_df)
# People NOT living in London
filtered_df = df[~(df['city'] == 'London')]
print("\nPeople not living in London:")
print(filtered_df)
Output:
People older than 30 and earning more than 70000:
name age city salary
2 Peter 35 Berlin 72000
3 Linda 32 London 80000
People in New York or Paris:
name age city salary
0 John 28 New York 50000
1 Anna 24 Paris 60000
People not living in London:
name age city salary
0 John 28 New York 50000
1 Anna 24 Paris 60000
2 Peter 35 Berlin 72000
4 Bob 45 Sydney 65000
Advanced Filtering Techniques
Using isin()
for Multiple Values
If you need to filter rows based on multiple possible values in a column, the isin()
method is more efficient than chaining multiple OR conditions:
# People in specific cities
cities_of_interest = ['New York', 'Paris', 'Berlin']
filtered_df = df[df['city'].isin(cities_of_interest)]
print("\nPeople in New York, Paris, or Berlin:")
print(filtered_df)
# People NOT in specific cities
filtered_df = df[~df['city'].isin(cities_of_interest)]
print("\nPeople NOT in New York, Paris, or Berlin:")
print(filtered_df)
Output:
People in New York, Paris, or Berlin:
name age city salary
0 John 28 New York 50000
1 Anna 24 Paris 60000
2 Peter 35 Berlin 72000
People NOT in New York, Paris, or Berlin:
name age city salary
3 Linda 32 London 80000
4 Bob 45 Sydney 65000
Filtering with String Methods
Pandas provides string methods for filtering text data. Access them using the str
accessor:
# Create a DataFrame with more text data
df_text = pd.DataFrame({
'product': ['Laptop Pro', 'Tablet Mini', 'Smartphone X', 'Smartwatch Pro', 'Headphones Bass'],
'description': ['Powerful laptop with SSD', 'Compact tablet for reading',
'Next-gen smartphone', 'Fitness tracking watch', 'Noise-cancelling headphones'],
'price': [1200, 400, 800, 300, 150]
})
print("\nProducts DataFrame:")
print(df_text)
# Filter products containing "Pro" in their name
pro_products = df_text[df_text['product'].str.contains('Pro')]
print("\nProducts with 'Pro' in the name:")
print(pro_products)
# Filter products with descriptions mentioning "tablet" or "laptop" (case-insensitive)
computing_devices = df_text[df_text['description'].str.contains('tablet|laptop', case=False)]
print("\nTablets or Laptops:")
print(computing_devices)
Output:
Products DataFrame:
product description price
0 Laptop Pro Powerful laptop with SSD 1200
1 Tablet Mini Compact tablet for reading 400
2 Smartphone X Next-gen smartphone 800
3 Smartwatch Pro Fitness tracking watch 300
4 Headphones Bass Noise-cancelling headphones 150
Products with 'Pro' in the name:
product description price
0 Laptop Pro Powerful laptop with SSD 1200
3 Smartwatch Pro Fitness tracking watch 300
Tablets or Laptops:
product description price
0 Laptop Pro Powerful laptop with SSD 1200
1 Tablet Mini Compact tablet for reading 400
Filtering with query()
Method
The query()
method provides a concise way to filter DataFrames using string expressions:
# Filter using query() method
high_salary_young = df.query('salary > 60000 and age < 35')
print("\nPeople younger than 35 with salary over 60000:")
print(high_salary_young)
# Query with variable
min_salary = 65000
high_earners = df.query('salary >= @min_salary')
print(f"\nPeople earning at least {min_salary}:")
print(high_earners)
Output:
People younger than 35 with salary over 60000:
name age city salary
3 Linda 32 London 80000
People earning at least 65000:
name age city salary
2 Peter 35 Berlin 72000
3 Linda 32 London 80000
4 Bob 45 Sydney 65000
Filtering with loc
and iloc
You can also filter using the loc
and iloc
indexers:
# Filter with loc
london_residents = df.loc[df['city'] == 'London']
print("\nPeople in London (using loc):")
print(london_residents)
# Get specific columns for filtered data
name_age = df.loc[df['salary'] > 70000, ['name', 'age']]
print("\nNames and ages of high earners:")
print(name_age)
Output:
People in London (using loc):
name age city salary
3 Linda 32 London 80000
Names and ages of high earners:
name age
2 Peter 35
3 Linda 32
Handling Missing Values
Filtering operations often involve handling missing values:
# Create a DataFrame with missing values
df_missing = pd.DataFrame({
'name': ['John', 'Anna', None, 'Linda', 'Bob'],
'age': [28, np.nan, 35, 32, 45],
'city': ['New York', 'Paris', None, 'London', 'Sydney'],
'salary': [50000, 60000, np.nan, 80000, 65000]
})
print("\nDataFrame with missing values:")
print(df_missing)
# Filter rows with missing values in any column
rows_with_na = df_missing[df_missing.isna().any(axis=1)]
print("\nRows with any missing value:")
print(rows_with_na)
# Filter rows without missing values
complete_rows = df_missing.dropna()
print("\nRows without missing values:")
print(complete_rows)
# Filter rows where specific column is not null
valid_names = df_missing[df_missing['name'].notna()]
print("\nRows with non-null names:")
print(valid_names)
Output:
DataFrame with missing values:
name age city salary
0 John 28.0 New York 50000.0
1 Anna NaN Paris 60000.0
2 None 35.0 None NaN
3 Linda 32.0 London 80000.0
4 Bob 45.0 Sydney 65000.0
Rows with any missing value:
name age city salary
1 Anna NaN Paris 60000.0
2 None 35.0 None NaN
Rows without missing values:
name age city salary
0 John 28.0 New York 50000.0
3 Linda 32.0 London 80000.0
4 Bob 45.0 Sydney 65000.0
Rows with non-null names:
name age city salary
0 John 28.0 New York 50000.0
1 Anna NaN Paris 60000.0
3 Linda 32.0 London 80000.0
4 Bob 45.0 Sydney 65000.0
Real-world Example: Data Analysis
Let's apply filtering techniques to a more realistic data analysis scenario using a dataset of sales transactions:
# Create a sales dataset
np.random.seed(42) # for reproducibility
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
products = ['Laptop', 'Monitor', 'Keyboard', 'Mouse', 'Headphones']
regions = ['North', 'South', 'East', 'West']
sales_data = []
for _ in range(200):
date = np.random.choice(dates)
product = np.random.choice(products)
region = np.random.choice(regions)
units = np.random.randint(1, 10)
price = {'Laptop': 1200, 'Monitor': 300, 'Keyboard': 100, 'Mouse': 50, 'Headphones': 150}[product]
sales_data.append([date, product, region, units, price, units * price])
sales_df = pd.DataFrame(sales_data, columns=['date', 'product', 'region', 'units', 'unit_price', 'total'])
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['month'] = sales_df['date'].dt.month_name()
print("Sales Dataset Sample:")
print(sales_df.head())
# 1. Find high-value transactions (over $1000)
high_value = sales_df[sales_df['total'] > 1000]
print(f"\nHigh-value transactions: {len(high_value)}")
print(high_value.head(3))
# 2. Find laptop sales in February in the West region
feb_west_laptops = sales_df[
(sales_df['product'] == 'Laptop') &
(sales_df['month'] == 'February') &
(sales_df['region'] == 'West')
]
print(f"\nLaptop sales in February in West region: {len(feb_west_laptops)}")
print(feb_west_laptops)
# 3. Find most profitable product categories
product_performance = sales_df.groupby('product')['total'].sum().reset_index()
top_products = product_performance.sort_values('total', ascending=False)
print("\nProduct performance (by total sales):")
print(top_products)
# 4. Find days with exceptional sales (top 5% of daily sales)
daily_sales = sales_df.groupby(sales_df['date'].dt.date)['total'].sum().reset_index()
threshold = daily_sales['total'].quantile(0.95)
exceptional_days = daily_sales[daily_sales['total'] > threshold]
print("\nExceptional sales days (top 5%):")
print(exceptional_days)
Output (sample - actual values may vary):
Sales Dataset Sample:
date product region units unit_price total
0 2023-03-13 Laptop North 7 1200 8400
1 2023-01-24 Mouse East 5 50 250
2 2023-02-11 Keyboard East 1 100 100
3 2023-02-19 Laptop East 8 1200 9600
4 2023-01-21 Monitor West 3 300 900
High-value transactions: 92
date product region units unit_price total month
0 2023-03-13 Laptop North 7 1200 8400 March
3 2023-02-19 Laptop East 8 1200 9600 February
5 2023-03-05 Laptop North 8 1200 9600 March
Laptop sales in February in West region: 3
date product region units unit_price total month
26 2023-02-05 Laptop West 2 1200 2400 February
43 2023-02-14 Laptop West 7 1200 8400 February
91 2023-02-26 Laptop West 3 1200 3600 February
Product performance (by total sales):
product total
0 Laptop 254400.0
3 Monitor 62100.0
4 Headphones 27300.0
2 Keyboard 10200.0
1 Mouse 8250.0
Exceptional sales days (top 5%):
date total
9 2023-02-12 10200.0
17 2023-03-19 10500.0
30 2023-03-21 10800.0
40 2023-02-04 11850.0
Summary
Filtering is a critical component of data analysis with Pandas. In this tutorial, we've covered:
- Basic boolean filtering with single and multiple conditions
- Using
isin()
for filtering with multiple possible values - String filtering with
str.contains()
and other string methods - The
query()
method for more readable filtering - Using
loc
andiloc
for combined filtering and column selection - Handling missing values in filtering operations
- Applying filtering techniques to real-world data analysis
Remember these key points:
- Always use parentheses around each condition when combining multiple conditions
- Choose the appropriate method based on your specific filtering needs
- Consider readability and performance when working with large datasets
Exercises
To practice your filtering skills, try these exercises:
- Create a DataFrame of students with columns for name, grade, subjects, and scores
- Filter students who got an A grade (90 or above)
- Find students who failed (below 60) in any subject
- Identify students who excel in science subjects but struggle in humanities
- Create a filtering operation that uses at least three conditions
- Use string filtering to find students whose names start with a specific letter
Additional Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)