Pandas Row Selection

Introduction

When working with data in pandas, one of the most common operations you'll perform is selecting specific rows from a DataFrame. Row selection allows you to filter your data based on conditions, access specific rows by position or label, and extract meaningful subsets of your data for analysis.

In this tutorial, we'll explore various techniques for selecting rows in pandas DataFrames. Whether you're looking to filter data based on specific criteria or access rows by their position, you'll learn the tools and methods to efficiently work with your data.

Prerequisites

Before diving into row selection, make sure you have:

Basic understanding of Python
Pandas installed (pip install pandas)
Basic familiarity with pandas DataFrames

Let's start by importing pandas and creating a sample DataFrame to work with throughout this tutorial:

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Sara', 'Mike', 'Emily', 'David'],
    'Age': [28, 35, 22, 41, 30],
    'City': ['New York', 'Boston', 'Chicago', 'New York', 'Boston'],
    'Salary': [65000, 72000, 55000, 85000, 69000]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City  Salary
 John   28  New York   65000
 Sara   35    Boston   72000
 Mike   22   Chicago   55000
Emily   41  New York   85000
David   30    Boston   69000

Row Selection Methods in Pandas

1. Selection Using Numerical Indexing with `iloc`

The iloc accessor lets you select rows (and columns) by their integer positions:

# Select the first row
first_row = df.iloc[0]
print("First row:")
print(first_row)

# Select multiple rows (rows 1 and 3)
multiple_rows = df.iloc[[1, 3]]
print("\nRows 1 and 3:")
print(multiple_rows)

# Select a range of rows (rows 1 through 3)
row_range = df.iloc[1:4]
print("\nRows 1 through 3:")
print(row_range)

Output:

First row:
Name       John
Age          28
City    New York
Salary     65000
Name: 0, dtype: object

Rows 1 and 3:
    Name  Age      City  Salary
1   Sara   35    Boston   72000
3  Emily   41  New York   85000

Rows 1 through 3:
    Name  Age      City  Salary
1   Sara   35    Boston   72000
2   Mike   22   Chicago   55000
3  Emily   41  New York   85000

2. Selection Using Labels with `loc`

The loc accessor allows you to select rows using their labels (index values):

# Select row with index 2
row_index_2 = df.loc[2]
print("Row with index 2:")
print(row_index_2)

# Select multiple rows by label
selected_rows = df.loc[[0, 2, 4]]
print("\nRows with indices 0, 2, and 4:")
print(selected_rows)

Output:

Row with index 2:
Name        Mike
Age           22
City     Chicago
Salary     55000
Name: 2, dtype: object

Rows with indices 0, 2, and 4:
    Name  Age      City  Salary
0   John   28  New York   65000
2   Mike   22   Chicago   55000
4  David   30    Boston   69000

3. Boolean Indexing (Filtering Rows with Conditions)

One of the most powerful ways to select rows is through boolean indexing, which allows you to filter rows based on conditions:

# Select rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)

# Select rows where City is 'New York'
new_yorkers = df[df['City'] == 'New York']
print("\nPeople from New York:")
print(new_yorkers)

# Multiple conditions using & (and) operator
high_earning_bostonians = df[(df['City'] == 'Boston') & (df['Salary'] > 70000)]
print("\nHigh-earning people from Boston:")
print(high_earning_bostonians)

# Multiple conditions using | (or) operator
young_or_rich = df[(df['Age'] < 25) | (df['Salary'] > 80000)]
print("\nPeople who are either young or high earners:")
print(young_or_rich)

Output:

People older than 30:
    Name  Age      City  Salary
1   Sara   35    Boston   72000
3  Emily   41  New York   85000

People from New York:
    Name  Age      City  Salary
0   John   28  New York   65000
3  Emily   41  New York   85000

High-earning people from Boston:
   Name  Age    City  Salary
1  Sara   35  Boston   72000

People who are either young or high earners:
    Name  Age      City  Salary
2   Mike   22   Chicago   55000
3  Emily   41  New York   85000

4. Using `query()` Method

The query() method provides a concise way to filter rows using string expressions:

# Select rows where Age is greater than 30
older_than_30_query = df.query('Age > 30')
print("People older than 30 (using query):")
print(older_than_30_query)

# Select rows with multiple conditions
complex_query = df.query('City == "Boston" and Salary > 65000')
print("\nBoston residents with salary > 65000:")
print(complex_query)

Output:

People older than 30 (using query):
    Name  Age      City  Salary
1   Sara   35    Boston   72000
3  Emily   41  New York   85000

Boston residents with salary > 65000:
    Name  Age    City  Salary
1   Sara   35  Boston   72000
4  David   30  Boston   69000

5. Using `isin()` for Membership Tests

You can use isin() to check if values in a column are present in a list:

# Select rows where City is either Boston or Chicago
selected_cities = df[df['City'].isin(['Boston', 'Chicago'])]
print("People from Boston or Chicago:")
print(selected_cities)

# Using ~ (not) with isin
not_selected_cities = df[~df['City'].isin(['Boston', 'Chicago'])]
print("\nPeople NOT from Boston or Chicago:")
print(not_selected_cities)

Output:

People from Boston or Chicago:
    Name  Age     City  Salary
1   Sara   35   Boston   72000
2   Mike   22  Chicago   55000
4  David   30   Boston   69000

People NOT from Boston or Chicago:
    Name  Age      City  Salary
0   John   28  New York   65000
3  Emily   41  New York   85000

Real-World Applications of Row Selection

Example 1: Data Cleaning - Removing Outliers

Let's see how to remove rows with extreme values using row selection:

# Create a new dataset with some outliers
data_with_outliers = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Price': [10, 15, 12, 8, 1000, 14],
    'Quantity': [100, 80, 90, 110, 2, 95]
})

print("Original data with outliers:")
print(data_with_outliers)

# Remove outliers (price > 100 or quantity < 10)
cleaned_data = data_with_outliers[(data_with_outliers['Price'] <= 100) & 
                                 (data_with_outliers['Quantity'] >= 10)]
print("\nData after removing outliers:")
print(cleaned_data)

Output:

Original data with outliers:
  Product  Price  Quantity
     A     10       100
     B     15        80
     C     12        90
     D      8       110
     E   1000         2
     F     14        95

Data after removing outliers:
  Product  Price  Quantity
     A     10       100
     B     15        80
     C     12        90
     D      8       110
     F     14        95

Example 2: Creating Analysis Groups

Here's how to select different subsets for comparison and analysis:

# Using our original dataset
print("Original data:")
print(df)

# Group by city and calculate average salary
city_avg_salary = df.groupby('City')['Salary'].mean().reset_index()
print("\nAverage salary by city:")
print(city_avg_salary)

# Separate data into age groups
young_group = df[df['Age'] < 30]
middle_group = df[(df['Age'] >= 30) & (df['Age'] < 40)]
older_group = df[df['Age'] >= 40]

print("\nAge group analysis:")
print(f"Young group (< 30): {young_group['Name'].tolist()}")
print(f"Middle group (30-39): {middle_group['Name'].tolist()}")
print(f"Older group (40+): {older_group['Name'].tolist()}")

Output:

Original data:
    Name  Age      City  Salary
0   John   28  New York   65000
1   Sara   35    Boston   72000
2   Mike   22   Chicago   55000
3  Emily   41  New York   85000
4  David   30    Boston   69000

Average salary by city:
       City     Salary
0    Boston  70500.00
1   Chicago  55000.00
2  New York  75000.00

Age group analysis:
Young group (< 30): ['John', 'Mike']
Middle group (30-39): ['Sara', 'David']
Older group (40+): ['Emily']

Example 3: Building a Custom Filter Function

You can create reusable functions for more complex filtering:

def filter_employees(df, min_age=None, max_age=None, cities=None, min_salary=None):
    """Filter employees based on multiple criteria."""
    result = df.copy()
    
    if min_age is not None:
        result = result[result['Age'] >= min_age]
        
    if max_age is not None:
        result = result[result['Age'] <= max_age]
        
    if cities is not None:
        result = result[result['City'].isin(cities)]
        
    if min_salary is not None:
        result = result[result['Salary'] >= min_salary]
        
    return result

# Use our custom filter function
filtered = filter_employees(
    df, 
    min_age=25, 
    max_age=40, 
    cities=['New York', 'Boston'], 
    min_salary=65000
)

print("Filtered employees (25-40 years old, from NY or Boston, earning 65k+):")
print(filtered)

Output:

Filtered employees (25-40 years old, from NY or Boston, earning 65k+):
    Name  Age      City  Salary
0   John   28  New York   65000
1   Sara   35    Boston   72000
4  David   30    Boston   69000

Efficient Row Selection for Large DataFrames

When working with large datasets, efficiency becomes crucial:

# Creating a larger dataset for demonstration
import numpy as np

# Generate a larger DataFrame (100,000 rows)
np.random.seed(42)
large_df = pd.DataFrame({
    'ID': range(100000),
    'Value1': np.random.randint(0, 100, 100000),
    'Value2': np.random.normal(50, 15, 100000),
    'Category': np.random.choice(['A', 'B', 'C', 'D'], 100000)
})

# Inefficient way - evaluates the condition for all rows
import time

start = time.time()
result1 = large_df[large_df['Value1'] > 90]
end = time.time()
print(f"Simple filtering time: {end - start:.5f} seconds")

# More efficient for large DataFrames - using query
start = time.time()
result2 = large_df.query('Value1 > 90')
end = time.time()
print(f"Query method time: {end - start:.5f} seconds")

# Using numexpr (pandas uses this internally with query)
print(f"Number of rows in result: {len(result1)}")

Output (times may vary):

Simple filtering time: 0.00327 seconds
Query method time: 0.00312 seconds
Number of rows in result: 9803

Summary

In this tutorial, we've covered multiple ways to select and filter rows in pandas DataFrames:

Index-based selection with iloc for integer-position based indexing
Label-based selection with loc for selecting by index labels
Boolean indexing for filtering rows based on conditions
The query() method for string expression-based filtering
isin() for membership-based filtering
Real-world applications including data cleaning, group analysis, and custom filters

Row selection is a fundamental skill in data analysis with pandas. Mastering these techniques will allow you to efficiently extract relevant subsets of data for analysis, visualization, and modeling.

Exercises

To practice your row selection skills, try these exercises:

Create a DataFrame with information about products (name, category, price, inventory) and select:
- Products that are low in inventory (less than 10 items)
- Products in a specific category with a price above average
- The 3 most expensive products
Using the employee dataset from this tutorial, write a function to find employees who:
- Live in a specific city AND earn above the average for that city
- Are the youngest employee in each city
Practice chaining multiple selection methods together:
- First filter rows based on a condition, then select specific columns
- Select rows with one condition, then further filter that subset with another condition

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

Row Selection Methods in Pandas​

1. Selection Using Numerical Indexing with iloc​

2. Selection Using Labels with loc​

3. Boolean Indexing (Filtering Rows with Conditions)​

4. Using query() Method​

5. Using isin() for Membership Tests​

Real-World Applications of Row Selection​

Example 1: Data Cleaning - Removing Outliers​

Example 2: Creating Analysis Groups​

Example 3: Building a Custom Filter Function​

Efficient Row Selection for Large DataFrames​

Summary​

Exercises​

Additional Resources​

Introduction

Prerequisites

Row Selection Methods in Pandas

1. Selection Using Numerical Indexing with `iloc`

2. Selection Using Labels with `loc`

3. Boolean Indexing (Filtering Rows with Conditions)

4. Using `query()` Method

5. Using `isin()` for Membership Tests

Real-World Applications of Row Selection

Example 1: Data Cleaning - Removing Outliers

Example 2: Creating Analysis Groups

Example 3: Building a Custom Filter Function

Efficient Row Selection for Large DataFrames

Summary

Exercises

Additional Resources