Skip to main content

Pandas Row Selection

Introduction

When working with data in pandas, one of the most common operations you'll perform is selecting specific rows from a DataFrame. Row selection allows you to filter your data based on conditions, access specific rows by position or label, and extract meaningful subsets of your data for analysis.

In this tutorial, we'll explore various techniques for selecting rows in pandas DataFrames. Whether you're looking to filter data based on specific criteria or access rows by their position, you'll learn the tools and methods to efficiently work with your data.

Prerequisites

Before diving into row selection, make sure you have:

  • Basic understanding of Python
  • Pandas installed (pip install pandas)
  • Basic familiarity with pandas DataFrames

Let's start by importing pandas and creating a sample DataFrame to work with throughout this tutorial:

python
import pandas as pd

# Create a sample DataFrame
data = {
'Name': ['John', 'Sara', 'Mike', 'Emily', 'David'],
'Age': [28, 35, 22, 41, 30],
'City': ['New York', 'Boston', 'Chicago', 'New York', 'Boston'],
'Salary': [65000, 72000, 55000, 85000, 69000]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City  Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
4 David 30 Boston 69000

Row Selection Methods in Pandas

1. Selection Using Numerical Indexing with iloc

The iloc accessor lets you select rows (and columns) by their integer positions:

python
# Select the first row
first_row = df.iloc[0]
print("First row:")
print(first_row)

# Select multiple rows (rows 1 and 3)
multiple_rows = df.iloc[[1, 3]]
print("\nRows 1 and 3:")
print(multiple_rows)

# Select a range of rows (rows 1 through 3)
row_range = df.iloc[1:4]
print("\nRows 1 through 3:")
print(row_range)

Output:

First row:
Name John
Age 28
City New York
Salary 65000
Name: 0, dtype: object

Rows 1 and 3:
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000

Rows 1 through 3:
Name Age City Salary
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000

2. Selection Using Labels with loc

The loc accessor allows you to select rows using their labels (index values):

python
# Select row with index 2
row_index_2 = df.loc[2]
print("Row with index 2:")
print(row_index_2)

# Select multiple rows by label
selected_rows = df.loc[[0, 2, 4]]
print("\nRows with indices 0, 2, and 4:")
print(selected_rows)

Output:

Row with index 2:
Name Mike
Age 22
City Chicago
Salary 55000
Name: 2, dtype: object

Rows with indices 0, 2, and 4:
Name Age City Salary
0 John 28 New York 65000
2 Mike 22 Chicago 55000
4 David 30 Boston 69000

3. Boolean Indexing (Filtering Rows with Conditions)

One of the most powerful ways to select rows is through boolean indexing, which allows you to filter rows based on conditions:

python
# Select rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)

# Select rows where City is 'New York'
new_yorkers = df[df['City'] == 'New York']
print("\nPeople from New York:")
print(new_yorkers)

# Multiple conditions using & (and) operator
high_earning_bostonians = df[(df['City'] == 'Boston') & (df['Salary'] > 70000)]
print("\nHigh-earning people from Boston:")
print(high_earning_bostonians)

# Multiple conditions using | (or) operator
young_or_rich = df[(df['Age'] < 25) | (df['Salary'] > 80000)]
print("\nPeople who are either young or high earners:")
print(young_or_rich)

Output:

People older than 30:
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000

People from New York:
Name Age City Salary
0 John 28 New York 65000
3 Emily 41 New York 85000

High-earning people from Boston:
Name Age City Salary
1 Sara 35 Boston 72000

People who are either young or high earners:
Name Age City Salary
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000

4. Using query() Method

The query() method provides a concise way to filter rows using string expressions:

python
# Select rows where Age is greater than 30
older_than_30_query = df.query('Age > 30')
print("People older than 30 (using query):")
print(older_than_30_query)

# Select rows with multiple conditions
complex_query = df.query('City == "Boston" and Salary > 65000')
print("\nBoston residents with salary > 65000:")
print(complex_query)

Output:

People older than 30 (using query):
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000

Boston residents with salary > 65000:
Name Age City Salary
1 Sara 35 Boston 72000
4 David 30 Boston 69000

5. Using isin() for Membership Tests

You can use isin() to check if values in a column are present in a list:

python
# Select rows where City is either Boston or Chicago
selected_cities = df[df['City'].isin(['Boston', 'Chicago'])]
print("People from Boston or Chicago:")
print(selected_cities)

# Using ~ (not) with isin
not_selected_cities = df[~df['City'].isin(['Boston', 'Chicago'])]
print("\nPeople NOT from Boston or Chicago:")
print(not_selected_cities)

Output:

People from Boston or Chicago:
Name Age City Salary
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
4 David 30 Boston 69000

People NOT from Boston or Chicago:
Name Age City Salary
0 John 28 New York 65000
3 Emily 41 New York 85000

Real-World Applications of Row Selection

Example 1: Data Cleaning - Removing Outliers

Let's see how to remove rows with extreme values using row selection:

python
# Create a new dataset with some outliers
data_with_outliers = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F'],
'Price': [10, 15, 12, 8, 1000, 14],
'Quantity': [100, 80, 90, 110, 2, 95]
})

print("Original data with outliers:")
print(data_with_outliers)

# Remove outliers (price > 100 or quantity < 10)
cleaned_data = data_with_outliers[(data_with_outliers['Price'] <= 100) &
(data_with_outliers['Quantity'] >= 10)]
print("\nData after removing outliers:")
print(cleaned_data)

Output:

Original data with outliers:
Product Price Quantity
0 A 10 100
1 B 15 80
2 C 12 90
3 D 8 110
4 E 1000 2
5 F 14 95

Data after removing outliers:
Product Price Quantity
0 A 10 100
1 B 15 80
2 C 12 90
3 D 8 110
5 F 14 95

Example 2: Creating Analysis Groups

Here's how to select different subsets for comparison and analysis:

python
# Using our original dataset
print("Original data:")
print(df)

# Group by city and calculate average salary
city_avg_salary = df.groupby('City')['Salary'].mean().reset_index()
print("\nAverage salary by city:")
print(city_avg_salary)

# Separate data into age groups
young_group = df[df['Age'] < 30]
middle_group = df[(df['Age'] >= 30) & (df['Age'] < 40)]
older_group = df[df['Age'] >= 40]

print("\nAge group analysis:")
print(f"Young group (< 30): {young_group['Name'].tolist()}")
print(f"Middle group (30-39): {middle_group['Name'].tolist()}")
print(f"Older group (40+): {older_group['Name'].tolist()}")

Output:

Original data:
Name Age City Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
4 David 30 Boston 69000

Average salary by city:
City Salary
0 Boston 70500.00
1 Chicago 55000.00
2 New York 75000.00

Age group analysis:
Young group (< 30): ['John', 'Mike']
Middle group (30-39): ['Sara', 'David']
Older group (40+): ['Emily']

Example 3: Building a Custom Filter Function

You can create reusable functions for more complex filtering:

python
def filter_employees(df, min_age=None, max_age=None, cities=None, min_salary=None):
"""Filter employees based on multiple criteria."""
result = df.copy()

if min_age is not None:
result = result[result['Age'] >= min_age]

if max_age is not None:
result = result[result['Age'] <= max_age]

if cities is not None:
result = result[result['City'].isin(cities)]

if min_salary is not None:
result = result[result['Salary'] >= min_salary]

return result

# Use our custom filter function
filtered = filter_employees(
df,
min_age=25,
max_age=40,
cities=['New York', 'Boston'],
min_salary=65000
)

print("Filtered employees (25-40 years old, from NY or Boston, earning 65k+):")
print(filtered)

Output:

Filtered employees (25-40 years old, from NY or Boston, earning 65k+):
Name Age City Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
4 David 30 Boston 69000

Efficient Row Selection for Large DataFrames

When working with large datasets, efficiency becomes crucial:

python
# Creating a larger dataset for demonstration
import numpy as np

# Generate a larger DataFrame (100,000 rows)
np.random.seed(42)
large_df = pd.DataFrame({
'ID': range(100000),
'Value1': np.random.randint(0, 100, 100000),
'Value2': np.random.normal(50, 15, 100000),
'Category': np.random.choice(['A', 'B', 'C', 'D'], 100000)
})

# Inefficient way - evaluates the condition for all rows
import time

start = time.time()
result1 = large_df[large_df['Value1'] > 90]
end = time.time()
print(f"Simple filtering time: {end - start:.5f} seconds")

# More efficient for large DataFrames - using query
start = time.time()
result2 = large_df.query('Value1 > 90')
end = time.time()
print(f"Query method time: {end - start:.5f} seconds")

# Using numexpr (pandas uses this internally with query)
print(f"Number of rows in result: {len(result1)}")

Output (times may vary):

Simple filtering time: 0.00327 seconds
Query method time: 0.00312 seconds
Number of rows in result: 9803

Summary

In this tutorial, we've covered multiple ways to select and filter rows in pandas DataFrames:

  1. Index-based selection with iloc for integer-position based indexing
  2. Label-based selection with loc for selecting by index labels
  3. Boolean indexing for filtering rows based on conditions
  4. The query() method for string expression-based filtering
  5. isin() for membership-based filtering
  6. Real-world applications including data cleaning, group analysis, and custom filters

Row selection is a fundamental skill in data analysis with pandas. Mastering these techniques will allow you to efficiently extract relevant subsets of data for analysis, visualization, and modeling.

Exercises

To practice your row selection skills, try these exercises:

  1. Create a DataFrame with information about products (name, category, price, inventory) and select:

    • Products that are low in inventory (less than 10 items)
    • Products in a specific category with a price above average
    • The 3 most expensive products
  2. Using the employee dataset from this tutorial, write a function to find employees who:

    • Live in a specific city AND earn above the average for that city
    • Are the youngest employee in each city
  3. Practice chaining multiple selection methods together:

    • First filter rows based on a condition, then select specific columns
    • Select rows with one condition, then further filter that subset with another condition

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)