Pandas Row Selection
Introduction
When working with data in pandas, one of the most common operations you'll perform is selecting specific rows from a DataFrame. Row selection allows you to filter your data based on conditions, access specific rows by position or label, and extract meaningful subsets of your data for analysis.
In this tutorial, we'll explore various techniques for selecting rows in pandas DataFrames. Whether you're looking to filter data based on specific criteria or access rows by their position, you'll learn the tools and methods to efficiently work with your data.
Prerequisites
Before diving into row selection, make sure you have:
- Basic understanding of Python
- Pandas installed (
pip install pandas
) - Basic familiarity with pandas DataFrames
Let's start by importing pandas and creating a sample DataFrame to work with throughout this tutorial:
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John', 'Sara', 'Mike', 'Emily', 'David'],
'Age': [28, 35, 22, 41, 30],
'City': ['New York', 'Boston', 'Chicago', 'New York', 'Boston'],
'Salary': [65000, 72000, 55000, 85000, 69000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
4 David 30 Boston 69000
Row Selection Methods in Pandas
1. Selection Using Numerical Indexing with iloc
The iloc
accessor lets you select rows (and columns) by their integer positions:
# Select the first row
first_row = df.iloc[0]
print("First row:")
print(first_row)
# Select multiple rows (rows 1 and 3)
multiple_rows = df.iloc[[1, 3]]
print("\nRows 1 and 3:")
print(multiple_rows)
# Select a range of rows (rows 1 through 3)
row_range = df.iloc[1:4]
print("\nRows 1 through 3:")
print(row_range)
Output:
First row:
Name John
Age 28
City New York
Salary 65000
Name: 0, dtype: object
Rows 1 and 3:
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000
Rows 1 through 3:
Name Age City Salary
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
2. Selection Using Labels with loc
The loc
accessor allows you to select rows using their labels (index values):
# Select row with index 2
row_index_2 = df.loc[2]
print("Row with index 2:")
print(row_index_2)
# Select multiple rows by label
selected_rows = df.loc[[0, 2, 4]]
print("\nRows with indices 0, 2, and 4:")
print(selected_rows)
Output:
Row with index 2:
Name Mike
Age 22
City Chicago
Salary 55000
Name: 2, dtype: object
Rows with indices 0, 2, and 4:
Name Age City Salary
0 John 28 New York 65000
2 Mike 22 Chicago 55000
4 David 30 Boston 69000
3. Boolean Indexing (Filtering Rows with Conditions)
One of the most powerful ways to select rows is through boolean indexing, which allows you to filter rows based on conditions:
# Select rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)
# Select rows where City is 'New York'
new_yorkers = df[df['City'] == 'New York']
print("\nPeople from New York:")
print(new_yorkers)
# Multiple conditions using & (and) operator
high_earning_bostonians = df[(df['City'] == 'Boston') & (df['Salary'] > 70000)]
print("\nHigh-earning people from Boston:")
print(high_earning_bostonians)
# Multiple conditions using | (or) operator
young_or_rich = df[(df['Age'] < 25) | (df['Salary'] > 80000)]
print("\nPeople who are either young or high earners:")
print(young_or_rich)
Output:
People older than 30:
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000
People from New York:
Name Age City Salary
0 John 28 New York 65000
3 Emily 41 New York 85000
High-earning people from Boston:
Name Age City Salary
1 Sara 35 Boston 72000
People who are either young or high earners:
Name Age City Salary
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
4. Using query()
Method
The query()
method provides a concise way to filter rows using string expressions:
# Select rows where Age is greater than 30
older_than_30_query = df.query('Age > 30')
print("People older than 30 (using query):")
print(older_than_30_query)
# Select rows with multiple conditions
complex_query = df.query('City == "Boston" and Salary > 65000')
print("\nBoston residents with salary > 65000:")
print(complex_query)
Output:
People older than 30 (using query):
Name Age City Salary
1 Sara 35 Boston 72000
3 Emily 41 New York 85000
Boston residents with salary > 65000:
Name Age City Salary
1 Sara 35 Boston 72000
4 David 30 Boston 69000
5. Using isin()
for Membership Tests
You can use isin()
to check if values in a column are present in a list:
# Select rows where City is either Boston or Chicago
selected_cities = df[df['City'].isin(['Boston', 'Chicago'])]
print("People from Boston or Chicago:")
print(selected_cities)
# Using ~ (not) with isin
not_selected_cities = df[~df['City'].isin(['Boston', 'Chicago'])]
print("\nPeople NOT from Boston or Chicago:")
print(not_selected_cities)
Output:
People from Boston or Chicago:
Name Age City Salary
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
4 David 30 Boston 69000
People NOT from Boston or Chicago:
Name Age City Salary
0 John 28 New York 65000
3 Emily 41 New York 85000
Real-World Applications of Row Selection
Example 1: Data Cleaning - Removing Outliers
Let's see how to remove rows with extreme values using row selection:
# Create a new dataset with some outliers
data_with_outliers = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D', 'E', 'F'],
'Price': [10, 15, 12, 8, 1000, 14],
'Quantity': [100, 80, 90, 110, 2, 95]
})
print("Original data with outliers:")
print(data_with_outliers)
# Remove outliers (price > 100 or quantity < 10)
cleaned_data = data_with_outliers[(data_with_outliers['Price'] <= 100) &
(data_with_outliers['Quantity'] >= 10)]
print("\nData after removing outliers:")
print(cleaned_data)
Output:
Original data with outliers:
Product Price Quantity
0 A 10 100
1 B 15 80
2 C 12 90
3 D 8 110
4 E 1000 2
5 F 14 95
Data after removing outliers:
Product Price Quantity
0 A 10 100
1 B 15 80
2 C 12 90
3 D 8 110
5 F 14 95
Example 2: Creating Analysis Groups
Here's how to select different subsets for comparison and analysis:
# Using our original dataset
print("Original data:")
print(df)
# Group by city and calculate average salary
city_avg_salary = df.groupby('City')['Salary'].mean().reset_index()
print("\nAverage salary by city:")
print(city_avg_salary)
# Separate data into age groups
young_group = df[df['Age'] < 30]
middle_group = df[(df['Age'] >= 30) & (df['Age'] < 40)]
older_group = df[df['Age'] >= 40]
print("\nAge group analysis:")
print(f"Young group (< 30): {young_group['Name'].tolist()}")
print(f"Middle group (30-39): {middle_group['Name'].tolist()}")
print(f"Older group (40+): {older_group['Name'].tolist()}")
Output:
Original data:
Name Age City Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
2 Mike 22 Chicago 55000
3 Emily 41 New York 85000
4 David 30 Boston 69000
Average salary by city:
City Salary
0 Boston 70500.00
1 Chicago 55000.00
2 New York 75000.00
Age group analysis:
Young group (< 30): ['John', 'Mike']
Middle group (30-39): ['Sara', 'David']
Older group (40+): ['Emily']
Example 3: Building a Custom Filter Function
You can create reusable functions for more complex filtering:
def filter_employees(df, min_age=None, max_age=None, cities=None, min_salary=None):
"""Filter employees based on multiple criteria."""
result = df.copy()
if min_age is not None:
result = result[result['Age'] >= min_age]
if max_age is not None:
result = result[result['Age'] <= max_age]
if cities is not None:
result = result[result['City'].isin(cities)]
if min_salary is not None:
result = result[result['Salary'] >= min_salary]
return result
# Use our custom filter function
filtered = filter_employees(
df,
min_age=25,
max_age=40,
cities=['New York', 'Boston'],
min_salary=65000
)
print("Filtered employees (25-40 years old, from NY or Boston, earning 65k+):")
print(filtered)
Output:
Filtered employees (25-40 years old, from NY or Boston, earning 65k+):
Name Age City Salary
0 John 28 New York 65000
1 Sara 35 Boston 72000
4 David 30 Boston 69000
Efficient Row Selection for Large DataFrames
When working with large datasets, efficiency becomes crucial:
# Creating a larger dataset for demonstration
import numpy as np
# Generate a larger DataFrame (100,000 rows)
np.random.seed(42)
large_df = pd.DataFrame({
'ID': range(100000),
'Value1': np.random.randint(0, 100, 100000),
'Value2': np.random.normal(50, 15, 100000),
'Category': np.random.choice(['A', 'B', 'C', 'D'], 100000)
})
# Inefficient way - evaluates the condition for all rows
import time
start = time.time()
result1 = large_df[large_df['Value1'] > 90]
end = time.time()
print(f"Simple filtering time: {end - start:.5f} seconds")
# More efficient for large DataFrames - using query
start = time.time()
result2 = large_df.query('Value1 > 90')
end = time.time()
print(f"Query method time: {end - start:.5f} seconds")
# Using numexpr (pandas uses this internally with query)
print(f"Number of rows in result: {len(result1)}")
Output (times may vary):
Simple filtering time: 0.00327 seconds
Query method time: 0.00312 seconds
Number of rows in result: 9803
Summary
In this tutorial, we've covered multiple ways to select and filter rows in pandas DataFrames:
- Index-based selection with
iloc
for integer-position based indexing - Label-based selection with
loc
for selecting by index labels - Boolean indexing for filtering rows based on conditions
- The
query()
method for string expression-based filtering isin()
for membership-based filtering- Real-world applications including data cleaning, group analysis, and custom filters
Row selection is a fundamental skill in data analysis with pandas. Mastering these techniques will allow you to efficiently extract relevant subsets of data for analysis, visualization, and modeling.
Exercises
To practice your row selection skills, try these exercises:
-
Create a DataFrame with information about products (name, category, price, inventory) and select:
- Products that are low in inventory (less than 10 items)
- Products in a specific category with a price above average
- The 3 most expensive products
-
Using the employee dataset from this tutorial, write a function to find employees who:
- Live in a specific city AND earn above the average for that city
- Are the youngest employee in each city
-
Practice chaining multiple selection methods together:
- First filter rows based on a condition, then select specific columns
- Select rows with one condition, then further filter that subset with another condition
Additional Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)