Pandas Row Operations
When working with data in Pandas, manipulating rows is one of the most common tasks you'll perform. Whether you need to add new records, remove unwanted entries, filter data, or transform values across rows, Pandas provides powerful tools to handle these operations efficiently.
Introduction to Row Operations
A DataFrame in Pandas is essentially a two-dimensional table with rows and columns. Each row represents an observation or record, while columns represent different variables or features. When performing data analysis or preparation, you'll often need to:
- Select specific rows based on conditions
- Add new rows to your dataset
- Delete rows that contain irrelevant or erroneous data
- Iterate through rows for custom operations
- Apply functions across rows
In this guide, we'll explore various techniques for handling these row operations in Pandas.
Selecting and Accessing Rows
Using Positional Indexing with iloc
The iloc accessor lets you select rows by their position (integer index):
import pandas as pd
# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
# Select the first row
first_row = df.iloc[0]
print("First row:")
print(first_row)
# Select multiple rows
rows_2_to_4 = df.iloc[1:4]
print("\nRows 2 to 4:")
print(rows_2_to_4)
Output:
First row:
Name Alice
Age 25
City New York
Name: 0, dtype: object
Rows 2 to 4:
Name Age City
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
Using Label-Based Indexing with loc
The loc accessor allows you to select rows by their labels:
# Set the 'Name' column as index
df = df.set_index('Name')
# Select row by label
alice_row = df.loc['Alice']
print("Row with index 'Alice':")
print(alice_row)
# Select multiple rows by label
selected_rows = df.loc[['Alice', 'Charlie', 'Emma']]
print("\nSelected rows:")
print(selected_rows)
Output:
Row with index 'Alice':
Age 25
City New York
Name: Alice, dtype: object
Selected rows:
Age City
Name
Alice 25 New York
Charlie 35 Chicago
Emma 45 Seattle
Filtering Rows Based on Conditions
One of the most powerful features of Pandas is the ability to filter rows based on conditions:
# Reset index back to default
df = df.reset_index()
# Filter rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)
# Filter rows with multiple conditions
older_than_30_from_chicago = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print("\nPeople older than 30 from Chicago:")
print(older_than_30_from_chicago)
Output:
People older than 30:
Name Age City
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
People older than 30 from Chicago:
Name Age City
2 Charlie 35 Chicago
Using the query() Method for Filtering
For more complex filtering, the query() method offers a cleaner syntax:
# Filter using query method
coastal_cities = df.query("City == 'New York' or City == 'Boston' or City == 'Seattle'")
print("People from coastal cities:")
print(coastal_cities)
Output:
People from coastal cities:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
4 Emma 45 Seattle
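The same filter can also be written as a boolean mask with isin(), which avoids chaining several or clauses; a short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
})

# isin() tests each value against a list, replacing the chained "or" comparisons
coastal_cities = df[df['City'].isin(['New York', 'Boston', 'Seattle'])]
print(coastal_cities)
```

This produces the same three rows as the query() version and is often more convenient when the list of allowed values comes from a variable.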
Adding Rows to a DataFrame
Using the append() Method (Legacy)
# Note: append() was deprecated in pandas 1.4 and removed in pandas 2.0; use concat() instead
new_person = pd.DataFrame({'Name': ['Frank'], 'Age': [50], 'City': ['Miami']})
df_extended = df.append(new_person, ignore_index=True)
print("DataFrame with new row:")
print(df_extended)
Output:
DataFrame with new row:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Frank 50 Miami
Using concat() (Recommended Approach)
# Adding a new row using concat (preferred method)
new_person = pd.DataFrame({'Name': ['Grace'], 'Age': [28], 'City': ['Austin']})
df_extended = pd.concat([df, new_person], ignore_index=True)
print("DataFrame with new row (using concat):")
print(df_extended)
Output:
DataFrame with new row (using concat):
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Grace 28 Austin
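For a single row, there is also an in-place alternative: assigning to df.loc[len(df)]. This sketch assumes the DataFrame has a default integer RangeIndex; with any other index, len(df) could collide with an existing label.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'City': ['New York', 'Boston']
})

# With a default RangeIndex, len(df) is the next unused label, so this
# assignment appends a row in place (unlike concat(), which returns a new frame)
df.loc[len(df)] = ['Grace', 28, 'Austin']
print(df)
```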
Adding Multiple Rows at Once
# Adding multiple rows at once
new_people = pd.DataFrame({
    'Name': ['Hannah', 'Ian'],
    'Age': [33, 27],
    'City': ['Portland', 'Philadelphia']
})
df_extended = pd.concat([df, new_people], ignore_index=True)
print("DataFrame with multiple new rows:")
print(df_extended)
Output:
DataFrame with multiple new rows:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Hannah 33 Portland
6 Ian 27 Philadelphia
Deleting and Dropping Rows
Dropping Rows by Index
# Drop a row by index position
df_without_row = df.drop(1) # Drops the row with index 1 (Bob)
print("DataFrame after dropping row with index 1:")
print(df_without_row)
# Drop multiple rows by index
df_without_multiple = df.drop([0, 2]) # Drops rows with indices 0 and 2 (Alice and Charlie)
print("\nDataFrame after dropping multiple rows:")
print(df_without_multiple)
Output:
DataFrame after dropping row with index 1:
Name Age City
0 Alice 25 New York
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
DataFrame after dropping multiple rows:
Name Age City
1 Bob 30 Boston
3 David 40 Denver
4 Emma 45 Seattle
Dropping Rows Based on Conditions
# Drop rows based on a condition
df_young_only = df[~(df['Age'] > 30)] # Keep only rows where Age is not > 30
print("DataFrame with only young people:")
print(df_young_only)
Output:
DataFrame with only young people:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
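The same result can be reached with drop() by passing it the index labels of the matching rows, which some readers find more explicit about what is being removed; a sketch with simplified data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Select the index labels of the rows that match the condition, then drop them
df_young_only = df.drop(df[df['Age'] > 30].index)
print(df_young_only)
```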
Modifying Row Values
Updating a Single Row
# Update values in a specific row
df.loc[1, 'City'] = 'Cambridge' # Change Bob's city
print("DataFrame after updating Bob's city:")
print(df)
Output:
DataFrame after updating Bob's city:
Name Age City
0 Alice 25 New York
1 Bob 30 Cambridge
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
Updating Based on Conditions
# Update multiple rows based on a condition
df.loc[df['Age'] > 40, 'Age'] = df.loc[df['Age'] > 40, 'Age'] - 5
print("DataFrame after reducing age for people over 40:")
print(df)
Output:
DataFrame after reducing age for people over 40:
Name Age City
0 Alice 25 New York
1 Bob 30 Cambridge
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 40 Seattle
Iterating Through Rows
Using iterrows()
# Iterate through rows
print("Iterating through rows:")
for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['Name']}, Age: {row['Age']}, City: {row['City']}")
Output:
Iterating through rows:
Index: 0, Name: Alice, Age: 25, City: New York
Index: 1, Name: Bob, Age: 30, City: Cambridge
Index: 2, Name: Charlie, Age: 35, City: Chicago
Index: 3, Name: David, Age: 40, City: Denver
Index: 4, Name: Emma, Age: 40, City: Seattle
Using itertuples() for Better Performance
# Iterate using itertuples (more efficient than iterrows)
print("\nIterating with itertuples:")
for row in df.itertuples():
    print(f"Index: {row.Index}, Name: {row.Name}, Age: {row.Age}, City: {row.City}")
Output:
Iterating with itertuples:
Index: 0, Name: Alice, Age: 25, City: New York
Index: 1, Name: Bob, Age: 30, City: Cambridge
Index: 2, Name: Charlie, Age: 35, City: Chicago
Index: 3, Name: David, Age: 40, City: Denver
Index: 4, Name: Emma, Age: 40, City: Seattle
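Before reaching for either iterator, keep in mind that most row loops can be replaced by vectorized column operations, which are usually much faster because pandas processes whole columns in optimized code. A minimal illustration with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'Bonus': [1, 2, 3]})

# Instead of `for _, row in df.iterrows(): ...`, operate on whole columns:
# the addition is applied to every row at once
df['AgePlusBonus'] = df['Age'] + df['Bonus']
print(df)
```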
Applying Functions to Rows
Using apply() with axis=1
# Apply a function to each row
def format_person(row):
    return f"{row['Name']} is {row['Age']} years old and lives in {row['City']}"
descriptions = df.apply(format_person, axis=1)
print("Applied function to each row:")
print(descriptions)
Output:
Applied function to each row:
0 Alice is 25 years old and lives in New York
1 Bob is 30 years old and lives in Cambridge
2 Charlie is 35 years old and lives in Chicago
3 David is 40 years old and lives in Denver
4 Emma is 40 years old and lives in Seattle
dtype: object
Transforming Multiple Columns at Once
# Transform multiple columns at once
def adjust_data(row):
    row = row.copy()  # work on a copy so apply() doesn't mutate the original DataFrame
    row['Age'] = row['Age'] + 1  # Increment age by 1
    row['City'] = row['City'].upper()  # Convert city to uppercase
    return row
df_transformed = df.apply(adjust_data, axis=1)
print("\nTransformed data:")
print(df_transformed)
Output:
Transformed data:
Name Age City
0 Alice 26 NEW YORK
1 Bob 31 CAMBRIDGE
2 Charlie 36 CHICAGO
3 David 41 DENVER
4 Emma 41 SEATTLE
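The same transformation can be written without apply(), using column-level operations and assign(); this is typically faster on large frames. A sketch mirroring adjust_data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'City': ['New York', 'Boston']
})

# assign() returns a new DataFrame with the given columns recomputed
df_transformed = df.assign(
    Age=df['Age'] + 1,            # increment every age at once
    City=df['City'].str.upper(),  # vectorized string method
)
print(df_transformed)
```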
Real-World Example: Data Cleaning Pipeline
Let's put these skills together in a practical example of cleaning customer data:
# Create a sample customer dataset with some issues
customer_data = {
    'CustomerID': [1001, 1002, None, 1004, 1005],
    'Name': ['John Smith', 'Jane Doe', 'Mike Johnson', 'Sarah Williams', None],
    'Email': ['[email protected]', 'invalid_email', '[email protected]', '[email protected]', '[email protected]'],
    'Age': [25, -30, 45, 'unknown', 32],
    'Purchases': [3, 5, 0, 7, 2]
}
customers_df = pd.DataFrame(customer_data)
print("Original customer data:")
print(customers_df)
# Data cleaning pipeline for rows
def clean_customer_data(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    # Step 1: Remove rows with missing critical information (CustomerID or Name)
    df = df.dropna(subset=['CustomerID', 'Name'])
    # Step 2: Filter out invalid emails (must contain @)
    df = df[df['Email'].str.contains('@', na=False)]
    # Step 3: Fix numeric data types and filter out invalid values
    df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
    df = df[df['Age'] > 0]
    # Step 4: Reset index after all the filtering
    df = df.reset_index(drop=True)
    return df
cleaned_df = clean_customer_data(customers_df)
print("\nCleaned customer data:")
print(cleaned_df)
Output:
Original customer data:
CustomerID Name Email Age Purchases
0 1001.0 John Smith [email protected] 25 3
1 1002.0 Jane Doe invalid_email -30 5
2 NaN Mike Johnson [email protected] 45 0
3 1004.0 Sarah Williams [email protected] unknown 7
4 1005.0 None [email protected] 32 2
Cleaned customer data:
CustomerID Name Email Age Purchases
0 1001.0 John Smith [email protected] 25.0 3
Summary
In this guide, we've explored various ways to work with rows in Pandas DataFrames:
- Selecting rows using iloc (position-based) and loc (label-based) indexing
- Filtering rows based on conditions using boolean masks and the query() method
- Adding new rows to DataFrames using concat()
- Removing rows using the drop() method or conditional filtering
- Modifying row values either directly or based on conditions
- Iterating through rows using iterrows() and itertuples()
- Applying functions to transform row data using apply()
These row operations are fundamental to data manipulation in Pandas and provide the building blocks for more complex data transformations and analysis.
Exercises
To practice your skills with Pandas row operations, try these exercises:
- Create a DataFrame with 10 rows of fictional employee data, then:
  - Filter employees based on salary and department
  - Add two new employees to the DataFrame
  - Remove employees who have been with the company less than a year
- Load a real dataset (like Titanic from seaborn) and perform the following:
  - Filter passengers who survived and were under 18 years old
  - Update the fare column to show fare per family member
  - Add a new column that categorizes passengers based on their age
- Create a function that processes a DataFrame of sales data to:
  - Remove rows with missing customer information
  - Filter transactions to a specific date range
  - Add a "total value" column by multiplying quantity by price
Remember that effective row operations are key to data manipulation in Pandas, and mastering these techniques will significantly enhance your data analysis capabilities.