Pandas Row Operations
When working with data in Pandas, manipulating rows is one of the most common tasks you'll perform. Whether you need to add new records, remove unwanted entries, filter data, or transform values across rows, Pandas provides powerful tools to handle these operations efficiently.
Introduction to Row Operations
A DataFrame in Pandas is essentially a two-dimensional table with rows and columns. Each row represents an observation or record, while columns represent different variables or features. When performing data analysis or preparation, you'll often need to:
- Select specific rows based on conditions
- Add new rows to your dataset
- Delete rows that contain irrelevant or erroneous data
- Iterate through rows for custom operations
- Apply functions across rows
In this guide, we'll explore various techniques for handling these row operations in Pandas.
Selecting and Accessing Rows
Using Positional Indexing with iloc
The iloc accessor lets you select rows by their position (integer index):
import pandas as pd
# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
}
df = pd.DataFrame(data)
# Select the first row
first_row = df.iloc[0]
print("First row:")
print(first_row)
# Select multiple rows
rows_2_to_4 = df.iloc[1:4]
print("\nRows 2 to 4:")
print(rows_2_to_4)
Output:
First row:
Name Alice
Age 25
City New York
Name: 0, dtype: object
Rows 2 to 4:
Name Age City
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
Using Label-Based Indexing with loc
The loc accessor allows you to select rows by their labels:
# Set the 'Name' column as index
df = df.set_index('Name')
# Select row by label
alice_row = df.loc['Alice']
print("Row with index 'Alice':")
print(alice_row)
# Select multiple rows by label
selected_rows = df.loc[['Alice', 'Charlie', 'Emma']]
print("\nSelected rows:")
print(selected_rows)
Output:
Row with index 'Alice':
Age 25
City New York
Name: Alice, dtype: object
Selected rows:
Age City
Name
Alice 25 New York
Charlie 35 Chicago
Emma 45 Seattle
Filtering Rows Based on Conditions
One of the most powerful features of Pandas is the ability to filter rows based on conditions:
# Reset index back to default
df = df.reset_index()
# Filter rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)
# Filter rows with multiple conditions
older_than_30_from_chicago = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print("\nPeople older than 30 from Chicago:")
print(older_than_30_from_chicago)
Output:
People older than 30:
Name Age City
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
People older than 30 from Chicago:
Name Age City
2 Charlie 35 Chicago
Using the query() Method for Filtering
For more complex filtering, the query() method offers a cleaner syntax:
# Filter using query method
coastal_cities = df.query("City == 'New York' or City == 'Boston' or City == 'Seattle'")
print("People from coastal cities:")
print(coastal_cities)
Output:
People from coastal cities:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
4 Emma 45 Seattle
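The same filter can also be written as a boolean mask with isin(), which avoids chaining several or clauses; a short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
})

# isin() tests each value against a list, replacing the chained "or" comparisons
coastal_cities = df[df['City'].isin(['New York', 'Boston', 'Seattle'])]
print(coastal_cities)
```

This produces the same three rows as the query() version and is often more convenient when the list of allowed values comes from a variable.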
Adding Rows to a DataFrame
Using the append() Method (Legacy)
# Note: append() was deprecated in pandas 1.4 and removed in pandas 2.0; use concat() instead
new_person = pd.DataFrame({'Name': ['Frank'], 'Age': [50], 'City': ['Miami']})
df_extended = df.append(new_person, ignore_index=True)
print("DataFrame with new row:")
print(df_extended)
Output:
DataFrame with new row:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Frank 50 Miami
Using concat() (Recommended Approach)
# Adding a new row using concat (preferred method)
new_person = pd.DataFrame({'Name': ['Grace'], 'Age': [28], 'City': ['Austin']})
df_extended = pd.concat([df, new_person], ignore_index=True)
print("DataFrame with new row (using concat):")
print(df_extended)
Output:
DataFrame with new row (using concat):
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Grace 28 Austin
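For a single row, there is also an in-place alternative: assigning to df.loc[len(df)]. This sketch assumes the DataFrame has a default integer RangeIndex; with any other index, len(df) could collide with an existing label.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'City': ['New York', 'Boston']
})

# With a default RangeIndex, len(df) is the next unused label, so this
# assignment appends a row in place (unlike concat(), which returns a new frame)
df.loc[len(df)] = ['Grace', 28, 'Austin']
print(df)
```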
Adding Multiple Rows at Once
# Adding multiple rows at once
new_people = pd.DataFrame({
    'Name': ['Hannah', 'Ian'],
    'Age': [33, 27],
    'City': ['Portland', 'Philadelphia']
})
df_extended = pd.concat([df, new_people], ignore_index=True)
print("DataFrame with multiple new rows:")
print(df_extended)
Output:
DataFrame with multiple new rows:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
5 Hannah 33 Portland
6 Ian 27 Philadelphia
Deleting and Dropping Rows
Dropping Rows by Index
# Drop a row by index position
df_without_row = df.drop(1) # Drops the row with index 1 (Bob)
print("DataFrame after dropping row with index 1:")
print(df_without_row)
# Drop multiple rows by index
df_without_multiple = df.drop([0, 2]) # Drops rows with indices 0 and 2 (Alice and Charlie)
print("\nDataFrame after dropping multiple rows:")
print(df_without_multiple)
Output:
DataFrame after dropping row with index 1:
Name Age City
0 Alice 25 New York
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
DataFrame after dropping multiple rows:
Name Age City
1 Bob 30 Boston
3 David 40 Denver
4 Emma 45 Seattle
Dropping Rows Based on Conditions
# Drop rows based on a condition
df_young_only = df[~(df['Age'] > 30)] # Keep only rows where Age is not > 30
print("DataFrame with only young people:")
print(df_young_only)
Output:
DataFrame with only young people:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
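The same result can be reached with drop() by passing it the index labels of the matching rows, which some readers find more explicit about what is being removed; a sketch with simplified data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Select the index labels of the rows that match the condition, then drop them
df_young_only = df.drop(df[df['Age'] > 30].index)
print(df_young_only)
```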
Modifying Row Values
Updating a Single Row
# Update values in a specific row
df.loc[1, 'City'] = 'Cambridge' # Change Bob's city
print("DataFrame after updating Bob's city:")
print(df)
Output:
DataFrame after updating Bob's city:
Name Age City
0 Alice 25 New York
1 Bob 30 Cambridge
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
Updating Based on Conditions
# Update multiple rows based on a condition
df.loc[df['Age'] > 40, 'Age'] = df.loc[df['Age'] > 40, 'Age'] - 5
print("DataFrame after reducing age for people over 40:")
print(df)
Output:
DataFrame after reducing age for people over 40:
Name Age City
0 Alice 25 New York
1 Bob 30 Cambridge
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 40 Seattle
Iterating Through Rows
Using iterrows()
# Iterate through rows
print("Iterating through rows:")
for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['Name']}, Age: {row['Age']}, City: {row['City']}")
Output:
Iterating through rows:
Index: 0, Name: Alice, Age: 25, City: New York
Index: 1, Name: Bob, Age: 30, City: Cambridge
Index: 2, Name: Charlie, Age: 35, City: Chicago
Index: 3, Name: David, Age: 40, City: Denver
Index: 4, Name: Emma, Age: 40, City: Seattle
Using itertuples() for Better Performance
# Iterate using itertuples (more efficient than iterrows)
print("\nIterating with itertuples:")
for row in df.itertuples():
    print(f"Index: {row.Index}, Name: {row.Name}, Age: {row.Age}, City: {row.City}")
Output:
Iterating with itertuples:
Index: 0, Name: Alice, Age: 25, City: New York
Index: 1, Name: Bob, Age: 30, City: Cambridge
Index: 2, Name: Charlie, Age: 35, City: Chicago
Index: 3, Name: David, Age: 40, City: Denver
Index: 4, Name: Emma, Age: 40, City: Seattle
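Before reaching for either iterator, keep in mind that most row loops can be replaced by vectorized column operations, which are usually much faster because pandas processes whole columns in optimized code. A minimal illustration with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'Bonus': [1, 2, 3]})

# Instead of `for _, row in df.iterrows(): ...`, operate on whole columns:
# the addition is applied to every row at once
df['AgePlusBonus'] = df['Age'] + df['Bonus']
print(df)
```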
Applying Functions to Rows
Using apply() with axis=1
# Apply a function to each row
def format_person(row):
    return f"{row['Name']} is {row['Age']} years old and lives in {row['City']}"
descriptions = df.apply(format_person, axis=1)
print("Applied function to each row:")
print(descriptions)
Output:
Applied function to each row:
0 Alice is 25 years old and lives in New York
1 Bob is 30 years old and lives in Cambridge
2 Charlie is 35 years old and lives in Chicago
3 David is 40 years old and lives in Denver
4 Emma is 40 years old and lives in Seattle
dtype: object
Transforming Multiple Columns at Once
# Transform multiple columns at once
def adjust_data(row):
    row = row.copy()  # work on a copy so apply() doesn't mutate the original DataFrame
    row['Age'] = row['Age'] + 1  # Increment age by 1
    row['City'] = row['City'].upper()  # Convert city to uppercase
    return row
df_transformed = df.apply(adjust_data, axis=1)
print("\nTransformed data:")
print(df_transformed)
Output:
Transformed data:
Name Age City
0 Alice 26 NEW YORK
1 Bob 31 CAMBRIDGE
2 Charlie 36 CHICAGO
3 David 41 DENVER
4 Emma 41 SEATTLE
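The same transformation can be written without apply(), using column-level operations and assign(); this is typically faster on large frames. A sketch mirroring adjust_data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'City': ['New York', 'Boston']
})

# assign() returns a new DataFrame with the given columns recomputed
df_transformed = df.assign(
    Age=df['Age'] + 1,            # increment every age at once
    City=df['City'].str.upper(),  # vectorized string method
)
print(df_transformed)
```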
Real-World Example: Data Cleaning Pipeline
Let's put these skills together in a practical example of cleaning customer data:
# Create a sample customer dataset with some issues
customer_data = {
    'CustomerID': [1001, 1002, None, 1004, 1005],
    'Name': ['John Smith', 'Jane Doe', 'Mike Johnson', 'Sarah Williams', None],
    'Email': ['[email protected]', 'invalid_email', '[email protected]', '[email protected]', '[email protected]'],
    'Age': [25, -30, 45, 'unknown', 32],
    'Purchases': [3, 5, 0, 7, 2]
}
customers_df = pd.DataFrame(customer_data)
print("Original customer data:")
print(customers_df)
# Data cleaning pipeline for rows
def clean_customer_data(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    # Step 1: Remove rows with missing critical information (CustomerID or Name)
    df = df.dropna(subset=['CustomerID', 'Name'])
    # Step 2: Filter out invalid emails (must contain @)
    df = df[df['Email'].str.contains('@', na=False)]
    # Step 3: Fix numeric data types and filter out invalid values
    df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
    df = df[df['Age'] > 0]
    # Step 4: Reset index after all the filtering
    df = df.reset_index(drop=True)
    return df
cleaned_df = clean_customer_data(customers_df)
print("\nCleaned customer data:")
print(cleaned_df)
Output:
Original customer data:
CustomerID Name Email Age Purchases
0 1001.0 John Smith [email protected] 25 3
1 1002.0 Jane Doe invalid_email -30 5
2 NaN Mike Johnson [email protected] 45 0
3 1004.0 Sarah Williams [email protected] unknown 7
4 1005.0 None [email protected] 32 2
Cleaned customer data:
CustomerID Name Email Age Purchases
0 1001.0 John Smith [email protected] 25.0 3
Summary
In this guide, we've explored various ways to work with rows in Pandas DataFrames:
- Selecting rows using iloc (position-based) and loc (label-based) indexing
- Filtering rows based on conditions using boolean masks and the query() method
- Adding new rows to DataFrames using concat()
- Removing rows using the drop() method or conditional filtering
- Modifying row values either directly or based on conditions
- Iterating through rows using iterrows() and itertuples()
- Applying functions to transform row data using apply()
These row operations are fundamental to data manipulation in Pandas and provide the building blocks for more complex data transformations and analysis.
Exercises
To practice your skills with Pandas row operations, try these exercises:
- Create a DataFrame with 10 rows of fictional employee data, then:
  - Filter employees based on salary and department
  - Add two new employees to the DataFrame
  - Remove employees who have been with the company less than a year
- Load a real dataset (like Titanic from seaborn) and perform the following:
  - Filter passengers who survived and were under 18 years old
  - Update the fare column to show fare per family member
  - Add a new column that categorizes passengers based on their age
- Create a function that processes a DataFrame of sales data to:
  - Remove rows with missing customer information
  - Filter transactions to a specific date range
  - Add a "total value" column by multiplying quantity by price
Remember that effective row operations are key to data manipulation in Pandas, and mastering these techniques will significantly enhance your data analysis capabilities.