Pandas Filtering

Introduction

Filtering data is one of the most common and important operations in data analysis. When working with large datasets, you often need to focus on specific portions of your data that meet certain conditions. Pandas provides powerful and flexible ways to filter your DataFrames based on various criteria.

In this tutorial, we'll explore different techniques to filter Pandas DataFrames, from simple boolean conditions to more complex filtering operations. By the end, you'll be able to extract exactly the data you need from your datasets with confidence.

Basic Boolean Filtering

The most fundamental way to filter data in Pandas is using boolean indexing. This involves creating a boolean Series (containing True or False values) and using it to select rows from a DataFrame.

Let's start with a simple example:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'age': [28, 24, 35, 32, 45],
    'city': ['New York', 'Paris', 'Berlin', 'London', 'Sydney'],
    'salary': [50000, 60000, 72000, 80000, 65000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
    name  age     city  salary
 John   28  New York   50000
 Anna   24     Paris   60000
Peter   35    Berlin   72000
Linda   32    London   80000
  Bob   45    Sydney   65000

Filtering with a Single Condition

To filter the DataFrame based on a single condition, we use the comparison operators to create a boolean mask:

# Filter people older than 30
mask = df['age'] > 30
filtered_df = df[mask]

print("\nPeople older than 30:")
print(filtered_df)

Output:

People older than 30:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

The same filtering can be done in a single line:

# Same filter in one line
filtered_df = df[df['age'] > 30]
print("\nPeople older than 30 (one-line):")
print(filtered_df)

Output:

People older than 30 (one-line):
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

Filtering with Multiple Conditions

You can combine multiple conditions using logical operators:

& for AND
| for OR
~ for NOT

Important: When using multiple conditions, each condition must be enclosed in parentheses.

# People older than 30 AND earning more than 70000
filtered_df = df[(df['age'] > 30) & (df['salary'] > 70000)]
print("\nPeople older than 30 and earning more than 70000:")
print(filtered_df)

# People in New York OR Paris
filtered_df = df[(df['city'] == 'New York') | (df['city'] == 'Paris')]
print("\nPeople in New York or Paris:")
print(filtered_df)

# People NOT living in London
filtered_df = df[~(df['city'] == 'London')]
print("\nPeople not living in London:")
print(filtered_df)

Output:

People older than 30 and earning more than 70000:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000

People in New York or Paris:
   name  age      city  salary
0  John   28  New York   50000
1  Anna   24     Paris   60000

People not living in London:
    name  age      city  salary
0   John   28  New York   50000
1   Anna   24     Paris   60000
2  Peter   35    Berlin   72000
4    Bob   45    Sydney   65000

Advanced Filtering Techniques

Using `isin()` for Multiple Values

If you need to filter rows based on multiple possible values in a column, the isin() method is more efficient than chaining multiple OR conditions:

# People in specific cities
cities_of_interest = ['New York', 'Paris', 'Berlin']
filtered_df = df[df['city'].isin(cities_of_interest)]
print("\nPeople in New York, Paris, or Berlin:")
print(filtered_df)

# People NOT in specific cities
filtered_df = df[~df['city'].isin(cities_of_interest)]
print("\nPeople NOT in New York, Paris, or Berlin:")
print(filtered_df)

Output:

People in New York, Paris, or Berlin:
    name  age      city  salary
0   John   28  New York   50000
1   Anna   24     Paris   60000
2  Peter   35    Berlin   72000

People NOT in New York, Paris, or Berlin:
    name  age    city  salary
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

Filtering with String Methods

Pandas provides string methods for filtering text data. Access them using the str accessor:

# Create a DataFrame with more text data
df_text = pd.DataFrame({
    'product': ['Laptop Pro', 'Tablet Mini', 'Smartphone X', 'Smartwatch Pro', 'Headphones Bass'],
    'description': ['Powerful laptop with SSD', 'Compact tablet for reading', 
                   'Next-gen smartphone', 'Fitness tracking watch', 'Noise-cancelling headphones'],
    'price': [1200, 400, 800, 300, 150]
})

print("\nProducts DataFrame:")
print(df_text)

# Filter products containing "Pro" in their name
pro_products = df_text[df_text['product'].str.contains('Pro')]
print("\nProducts with 'Pro' in the name:")
print(pro_products)

# Filter products with descriptions mentioning "tablet" or "laptop" (case-insensitive)
computing_devices = df_text[df_text['description'].str.contains('tablet|laptop', case=False)]
print("\nTablets or Laptops:")
print(computing_devices)

Output:

Products DataFrame:
            product                    description  price
0        Laptop Pro      Powerful laptop with SSD   1200
1       Tablet Mini     Compact tablet for reading    400
2     Smartphone X           Next-gen smartphone     800
3    Smartwatch Pro        Fitness tracking watch    300
4  Headphones Bass  Noise-cancelling headphones     150

Products with 'Pro' in the name:
         product              description  price
0     Laptop Pro  Powerful laptop with SSD   1200
3  Smartwatch Pro    Fitness tracking watch    300

Tablets or Laptops:
       product                 description  price
0  Laptop Pro  Powerful laptop with SSD      1200
1  Tablet Mini Compact tablet for reading     400

Filtering with `query()` Method

The query() method provides a concise way to filter DataFrames using string expressions:

# Filter using query() method
high_salary_young = df.query('salary > 60000 and age < 35')
print("\nPeople younger than 35 with salary over 60000:")
print(high_salary_young)

# Query with variable
min_salary = 65000
high_earners = df.query('salary >= @min_salary')
print(f"\nPeople earning at least {min_salary}:")
print(high_earners)

Output:

People younger than 35 with salary over 60000:
    name  age    city  salary
3  Linda   32  London   80000

People earning at least 65000:
    name  age    city  salary
2  Peter   35  Berlin   72000
3  Linda   32  London   80000
4    Bob   45  Sydney   65000

Filtering with `loc` and `iloc`

You can also filter using the loc and iloc indexers:

# Filter with loc
london_residents = df.loc[df['city'] == 'London']
print("\nPeople in London (using loc):")
print(london_residents)

# Get specific columns for filtered data
name_age = df.loc[df['salary'] > 70000, ['name', 'age']]
print("\nNames and ages of high earners:")
print(name_age)

Output:

People in London (using loc):
    name  age    city  salary
3  Linda   32  London   80000

Names and ages of high earners:
    name  age
2  Peter   35
3  Linda   32

Handling Missing Values

Filtering operations often involve handling missing values:

# Create a DataFrame with missing values
df_missing = pd.DataFrame({
    'name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'age': [28, np.nan, 35, 32, 45],
    'city': ['New York', 'Paris', None, 'London', 'Sydney'],
    'salary': [50000, 60000, np.nan, 80000, 65000]
})

print("\nDataFrame with missing values:")
print(df_missing)

# Filter rows with missing values in any column
rows_with_na = df_missing[df_missing.isna().any(axis=1)]
print("\nRows with any missing value:")
print(rows_with_na)

# Filter rows without missing values
complete_rows = df_missing.dropna()
print("\nRows without missing values:")
print(complete_rows)

# Filter rows where specific column is not null
valid_names = df_missing[df_missing['name'].notna()]
print("\nRows with non-null names:")
print(valid_names)

Output:

DataFrame with missing values:
    name   age      city  salary
0   John  28.0  New York  50000.0
1   Anna   NaN     Paris  60000.0
2   None  35.0      None      NaN
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0

Rows with any missing value:
    name   age    city  salary
1   Anna   NaN   Paris  60000.0
2   None  35.0    None      NaN

Rows without missing values:
    name   age      city  salary
0   John  28.0  New York  50000.0
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0

Rows with non-null names:
    name   age      city  salary
0   John  28.0  New York  50000.0
1   Anna   NaN     Paris  60000.0
3  Linda  32.0    London  80000.0
4    Bob  45.0    Sydney  65000.0

Real-world Example: Data Analysis

Let's apply filtering techniques to a more realistic data analysis scenario using a dataset of sales transactions:

# Create a sales dataset
np.random.seed(42)  # for reproducibility
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
products = ['Laptop', 'Monitor', 'Keyboard', 'Mouse', 'Headphones']
regions = ['North', 'South', 'East', 'West']

sales_data = []
for _ in range(200):
    date = np.random.choice(dates)
    product = np.random.choice(products)
    region = np.random.choice(regions)
    units = np.random.randint(1, 10)
    price = {'Laptop': 1200, 'Monitor': 300, 'Keyboard': 100, 'Mouse': 50, 'Headphones': 150}[product]
    sales_data.append([date, product, region, units, price, units * price])

sales_df = pd.DataFrame(sales_data, columns=['date', 'product', 'region', 'units', 'unit_price', 'total'])
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['month'] = sales_df['date'].dt.month_name()

print("Sales Dataset Sample:")
print(sales_df.head())

# 1. Find high-value transactions (over $1000)
high_value = sales_df[sales_df['total'] > 1000]
print(f"\nHigh-value transactions: {len(high_value)}")
print(high_value.head(3))

# 2. Find laptop sales in February in the West region
feb_west_laptops = sales_df[
    (sales_df['product'] == 'Laptop') & 
    (sales_df['month'] == 'February') & 
    (sales_df['region'] == 'West')
]
print(f"\nLaptop sales in February in West region: {len(feb_west_laptops)}")
print(feb_west_laptops)

# 3. Find most profitable product categories
product_performance = sales_df.groupby('product')['total'].sum().reset_index()
top_products = product_performance.sort_values('total', ascending=False)
print("\nProduct performance (by total sales):")
print(top_products)

# 4. Find days with exceptional sales (top 5% of daily sales)
daily_sales = sales_df.groupby(sales_df['date'].dt.date)['total'].sum().reset_index()
threshold = daily_sales['total'].quantile(0.95)
exceptional_days = daily_sales[daily_sales['total'] > threshold]
print("\nExceptional sales days (top 5%):")
print(exceptional_days)

Output (sample - actual values may vary):

Sales Dataset Sample:
        date    product region  units  unit_price   total
0 2023-03-13    Laptop  North      7        1200    8400
1 2023-01-24     Mouse   East      5          50     250
2 2023-02-11  Keyboard   East      1         100     100
3 2023-02-19    Laptop   East      8        1200    9600
4 2023-01-21   Monitor   West      3         300     900

High-value transactions: 92
        date    product region  units  unit_price  total    month
0 2023-03-13    Laptop  North      7        1200   8400    March
3 2023-02-19    Laptop   East      8        1200   9600 February
5 2023-03-05    Laptop  North      8        1200   9600    March

Laptop sales in February in West region: 3
         date product region  units  unit_price  total     month
26 2023-02-05  Laptop   West      2        1200   2400  February
43 2023-02-14  Laptop   West      7        1200   8400  February
91 2023-02-26  Laptop   West      3        1200   3600  February

Product performance (by total sales):
       product    total
0      Laptop  254400.0
3      Monitor   62100.0
4   Headphones   27300.0
2     Keyboard   10200.0
1        Mouse    8250.0

Exceptional sales days (top 5%):
         date    total
9  2023-02-12  10200.0
17 2023-03-19  10500.0
30 2023-03-21  10800.0
40 2023-02-04  11850.0

Summary

Filtering is a critical component of data analysis with Pandas. In this tutorial, we've covered:

Basic boolean filtering with single and multiple conditions
Using isin() for filtering with multiple possible values
String filtering with str.contains() and other string methods
The query() method for more readable filtering
Using loc and iloc for combined filtering and column selection
Handling missing values in filtering operations
Applying filtering techniques to real-world data analysis

Remember these key points:

Always use parentheses around each condition when combining multiple conditions
Choose the appropriate method based on your specific filtering needs
Consider readability and performance when working with large datasets

Exercises

To practice your filtering skills, try these exercises:

Create a DataFrame of students with columns for name, grade, subjects, and scores
Filter students who got an A grade (90 or above)
Find students who failed (below 60) in any subject
Identify students who excel in science subjects but struggle in humanities
Create a filtering operation that uses at least three conditions
Use string filtering to find students whose names start with a specific letter

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Basic Boolean Filtering​

Filtering with a Single Condition​

Filtering with Multiple Conditions​

Advanced Filtering Techniques​

Using isin() for Multiple Values​

Filtering with String Methods​

Filtering with query() Method​

Filtering with loc and iloc​

Handling Missing Values​

Real-world Example: Data Analysis​

Summary​

Exercises​

Additional Resources​

Introduction

Basic Boolean Filtering

Filtering with a Single Condition

Filtering with Multiple Conditions

Advanced Filtering Techniques

Using `isin()` for Multiple Values

Filtering with String Methods

Filtering with `query()` Method

Filtering with `loc` and `iloc`

Handling Missing Values

Real-world Example: Data Analysis

Summary

Exercises

Additional Resources