Pandas Duplicates
When working with real-world datasets, duplicate data is a common issue that can skew your analysis and lead to incorrect conclusions. Pandas provides powerful tools to identify and handle duplicate entries in your DataFrames. This guide will walk you through the process of detecting and managing duplicates effectively.
Introduction to Duplicates in Data
Duplicate data occurs when the same information appears multiple times in your dataset. This can happen due to:
- Data entry errors
- System glitches during data collection
- Improper merging of multiple data sources
- Repeated submissions in forms or surveys
Cleaning duplicates is an essential step in the data preparation process that helps maintain data integrity and improve the accuracy of your analysis.
Identifying Duplicates
Using the duplicated() Method
The duplicated() method helps you identify rows that are duplicates of previously occurring rows in your DataFrame.
import pandas as pd
# Sample DataFrame with duplicates
data = {
'Name': ['John', 'Anna', 'Peter', 'John', 'Linda'],
'Age': [28, 24, 35, 28, 32],
'City': ['New York', 'Paris', 'Berlin', 'New York', 'Rome']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Check for duplicate rows
duplicates = df.duplicated()
print("\nDuplicate rows (True indicates duplicate):")
print(duplicates)
Output:
Original DataFrame:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
3 John 28 New York
4 Linda 32 Rome
Duplicate rows (True indicates duplicate):
0 False
1 False
2 False
3 True
4 False
dtype: bool
By default, duplicated() marks the second and subsequent occurrences of a row as True. The first occurrence is marked as False.
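Because duplicated() returns a boolean Series, you can count duplicate rows directly with sum(). A minimal sketch using the df defined above:
# Count duplicate rows (True values count as 1 when summed)
num_dupes = df.duplicated().sum()
print(f"Number of duplicate rows: {num_dupes}")  # prints 1 for this sample data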
Customizing Duplicate Detection
You can customize how duplicates are identified with these parameters:
- subset: Check for duplicates in specific columns only
- keep: Control which duplicates are marked ('first', 'last', or False)
# Check duplicates based only on the 'Name' column
name_dupes = df.duplicated(subset=['Name'])
print("\nDuplicates based on 'Name' column only:")
print(name_dupes)
# Mark all duplicates including first occurrence
all_dupes = df.duplicated(keep=False)
print("\nAll duplicates (including first occurrence):")
print(all_dupes)
# Mark all but the last occurrence as duplicates
last_dupes = df.duplicated(keep='last')
print("\nDuplicates (keeping last occurrence):")
print(last_dupes)
Output:
Duplicates based on 'Name' column only:
0 False
1 False
2 False
3 True
4 False
dtype: bool
All duplicates (including first occurrence):
0 True
1 False
2 False
3 True
4 False
dtype: bool
Duplicates (keeping last occurrence):
0 True
1 False
2 False
3 False
4 False
dtype: bool
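The subset and keep parameters can also be combined. As a quick sketch using the same df, this flags every row whose 'Name' value appears more than once:
# Flag all rows sharing a 'Name' with any other row
name_dupes_all = df.duplicated(subset=['Name'], keep=False)
print(df[name_dupes_all])  # shows both 'John' rows (index 0 and 3)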
Finding and Displaying Duplicate Rows
To see the actual duplicate records, you can combine duplicated() with boolean indexing:
# Display all duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print("\nAll duplicate rows:")
print(duplicate_rows)
# Display only the first occurrences of duplicates
first_dupes = df[df.duplicated(keep='last')]
print("\nFirst occurrences of duplicates:")
print(first_dupes)
# Display only the second+ occurrences of duplicates
second_dupes = df[df.duplicated(keep='first')]
print("\nSecond+ occurrences of duplicates:")
print(second_dupes)
Output:
All duplicate rows:
Name Age City
0 John 28 New York
3 John 28 New York
First occurrences of duplicates:
Name Age City
0 John 28 New York
Second+ occurrences of duplicates:
Name Age City
3 John 28 New York
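On larger DataFrames, it helps to sort the flagged rows so that each group of duplicates sits together for inspection. A minimal sketch:
# Group duplicate rows next to each other for easier visual comparison
duplicate_rows = df[df.duplicated(keep=False)].sort_values(by=list(df.columns))
print(duplicate_rows)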
Removing Duplicates with drop_duplicates()
The drop_duplicates() method removes duplicate rows from your DataFrame:
# Remove duplicate rows (keeps first occurrence by default)
df_no_dupes = df.drop_duplicates()
print("\nDataFrame after removing duplicates (keeping first occurrence):")
print(df_no_dupes)
# Remove duplicates keeping last occurrence
df_last = df.drop_duplicates(keep='last')
print("\nDataFrame after removing duplicates (keeping last occurrence):")
print(df_last)
# Remove duplicates based on specific columns
df_subset = df.drop_duplicates(subset=['Name'])
print("\nDataFrame after removing duplicates based on 'Name' column:")
print(df_subset)
Output:
DataFrame after removing duplicates (keeping first occurrence):
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
DataFrame after removing duplicates (keeping last occurrence):
Name Age City
1 Anna 24 Paris
2 Peter 35 Berlin
3 John 28 New York
4 Linda 32 Rome
DataFrame after removing duplicates based on 'Name' column:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
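Notice that drop_duplicates() preserves the original index, which is why index 3 is missing above. If you want a clean sequential index, pandas 1.0+ accepts ignore_index; a quick sketch:
# Drop duplicates and renumber the index from 0
df_reset = df.drop_duplicates(ignore_index=True)
print(df_reset)
# Equivalent for older pandas versions:
# df_reset = df.drop_duplicates().reset_index(drop=True)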
In-place Modification
To modify the original DataFrame rather than creating a new one, use the inplace parameter:
# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("\nOriginal DataFrame after in-place duplicate removal:")
print(df)
Output:
Original DataFrame after in-place duplicate removal:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
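An equivalent pattern, often preferred in modern pandas code because it works with method chaining, is to reassign the result instead of using inplace=True:
# Reassignment achieves the same effect as inplace=True
df = df.drop_duplicates()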
Real-World Example: Cleaning Customer Data
Let's walk through a more practical example of handling duplicates in a customer database:
# Customer data with potential duplicates
customer_data = {
'customer_id': [101, 102, 101, 103, 104, 105, 103],
'name': ['Alice Smith', 'Bob Jones', 'Alice Smith', 'Charlie Brown', 'Dana White', 'Eve Black', 'Charlie Brown'],
'email': ['[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'purchase_amount': [150.50, 200.75, 150.50, 50.25, 300.00, 125.75, 75.50]
}
customers_df = pd.DataFrame(customer_data)
print("Customer Database:")
print(customers_df)
# Count total duplicates
total_dupes = customers_df.duplicated().sum()
print(f"\nTotal number of duplicate rows: {total_dupes}")
# Find duplicates based on customer_id
id_dupes = customers_df.duplicated(subset=['customer_id'], keep=False)
print("\nDuplicates based on customer_id:")
print(customers_df[id_dupes])
# Find exact duplicates (all columns are the same)
exact_dupes = customers_df.duplicated(keep=False)
print("\nExact duplicates (all columns):")
print(customers_df[exact_dupes])
# Remove duplicates while keeping the first occurrence for each customer_id
clean_customers = customers_df.drop_duplicates(subset=['customer_id'])
print("\nCleaned customer database (unique customer_ids):")
print(clean_customers)
Output:
Customer Database:
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
1 102 Bob Jones [email protected] 200.75
2 101 Alice Smith [email protected] 150.50
3 103 Charlie Brown [email protected] 50.25
4 104 Dana White [email protected] 300.00
5 105 Eve Black [email protected] 125.75
6 103 Charlie Brown [email protected] 75.50
Total number of duplicate rows: 1
Duplicates based on customer_id:
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
2 101 Alice Smith [email protected] 150.50
3 103 Charlie Brown [email protected] 50.25
6 103 Charlie Brown [email protected] 75.50
Exact duplicates (all columns):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
2 101 Alice Smith [email protected] 150.50
Cleaned customer database (unique customer_ids):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
1 102 Bob Jones [email protected] 200.75
3 103 Charlie Brown [email protected] 50.25
4 104 Dana White [email protected] 300.00
5 105 Eve Black [email protected] 125.75
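Another quick way to spot duplicated IDs is value_counts(), which shows how many times each customer_id appears. A minimal sketch against the same customers_df:
# Count occurrences of each customer_id; counts above 1 indicate duplicates
id_counts = customers_df['customer_id'].value_counts()
print(id_counts[id_counts > 1])  # 101 and 103 each appear twice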
Handling Special Cases
In the real world, you might need more nuanced approaches to duplicate handling:
# Keep row with maximum purchase amount when duplicates exist by customer_id
best_customers = customers_df.sort_values('purchase_amount', ascending=False).drop_duplicates(subset=['customer_id'])
print("\nCustomers with highest purchase amounts:")
print(best_customers)
# Aggregate data for duplicate customers (sum purchase amounts)
# Group by customer_id and take first value for name and email while summing purchase_amount
aggregated_customers = customers_df.groupby('customer_id').agg({
'name': 'first',
'email': 'first',
'purchase_amount': 'sum'
}).reset_index()
print("\nAggregated customer data (summed purchases):")
print(aggregated_customers)
Output:
Customers with highest purchase amounts:
customer_id name email purchase_amount
4 104 Dana White [email protected] 300.00
1 102 Bob Jones [email protected] 200.75
0 101 Alice Smith [email protected] 150.50
5 105 Eve Black [email protected] 125.75
6 103 Charlie Brown [email protected] 75.50
Aggregated customer data (summed purchases):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 301.00
1 102 Bob Jones [email protected] 200.75
2 103 Charlie Brown [email protected] 125.75
3 104 Dana White [email protected] 300.00
4 105 Eve Black [email protected] 125.75
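As an alternative to sorting and then dropping, you can select the row with the maximum purchase_amount per customer_id directly with idxmax(). This sketch produces the same rows as best_customers above:
# For each customer_id, locate the index of the row with the largest purchase
best_idx = customers_df.groupby('customer_id')['purchase_amount'].idxmax()
best_customers_alt = customers_df.loc[best_idx]
print(best_customers_alt)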
Summary
Handling duplicates is a crucial aspect of data cleaning with pandas. In this guide, we covered:
- How to identify duplicates using duplicated()
- Customizing duplicate detection with parameters like subset and keep
- Removing duplicates with drop_duplicates()
- Practical examples of handling duplicates in real-world data
By properly managing duplicates in your datasets, you can ensure more accurate analyses and reliable results.
Exercises
- Create a DataFrame with student records that contain some duplicate entries. Remove duplicates based on student ID while keeping the record with the highest score.
- Import a CSV file of your choice and identify what percentage of the data consists of duplicates.
- Given a DataFrame with customer transactions, identify customers who have made multiple purchases on the same day and consolidate their transactions.
- Create a function that takes a DataFrame and returns a report of duplicate counts by different columns.
- Use the subset parameter to find partial duplicates in a complex dataset, such as records with the same name but different contact information.