Pandas Merge Indicators

When merging DataFrames in pandas, it's often useful to know where each row in the result came from. Did it appear in the left DataFrame only? The right one only? Or was it present in both? The merge indicator feature in pandas helps answer these questions, making it an invaluable tool for data analysis and debugging merge operations.

What is a Merge Indicator?

A merge indicator is an additional column that pandas can add to your merged DataFrame that shows the source of each row. This feature helps you:

Track which rows matched across DataFrames
Identify rows that only appeared in one of the DataFrames
Debug and validate merge operations
Better understand your data's structure and relationships

Basic Usage of Merge Indicators

Let's start with a simple example to demonstrate how merge indicators work:

python
import pandas as pd

# Create sample DataFrames
left_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

right_df = pd.DataFrame({
    'id': [1, 2, 5, 6],
    'score': [90, 85, 95, 88]
})

# Perform a merge with indicator
merged_df = pd.merge(
    left_df, 
    right_df, 
    on='id', 
    how='outer',
    indicator=True
)

print("Left DataFrame:")
print(left_df)
print("\nRight DataFrame:")
print(right_df)
print("\nMerged DataFrame with indicator:")
print(merged_df)

Output:

Left DataFrame:
   id     name
 1    Alice
 2      Bob
 3  Charlie
 4    David

Right DataFrame:
   id  score
 1     90
 2     85
 5     95
 6     88

Merged DataFrame with indicator:
   id     name  score     _merge
 1    Alice   90.0       both
 2      Bob   85.0       both
 3  Charlie    NaN  left_only
 4    David    NaN  left_only
 5     None   95.0 right_only
 6     None   88.0 right_only

In this example, the _merge column shows:

both: Rows with id 1 and 2 appeared in both DataFrames
left_only: Rows with id 3 and 4 appeared only in the left DataFrame
right_only: Rows with id 5 and 6 appeared only in the right DataFrame

Customizing the Indicator Column

If you want to use a name other than the default _merge for your indicator column, you can specify it as a string:

python
# Custom indicator column name
merged_df = pd.merge(
    left_df, 
    right_df, 
    on='id', 
    how='outer',
    indicator='source'  # Custom name for the indicator column
)

print(merged_df)

Output:

   id     name  score      source
 1    Alice   90.0        both
 2      Bob   85.0        both
 3  Charlie    NaN   left_only
 4    David    NaN   left_only
 5     None   95.0  right_only
 6     None   88.0  right_only

Filtering Based on Merge Indicators

One common use case for merge indicators is to filter rows based on their source. For example, you might want to find records that exist in one DataFrame but not the other:

python
# Find records that appear only in the left DataFrame
left_only = merged_df[merged_df['_merge'] == 'left_only']
print("Records only in left DataFrame:")
print(left_only)

# Find records that appear only in the right DataFrame
right_only = merged_df[merged_df['_merge'] == 'right_only']
print("\nRecords only in right DataFrame:")
print(right_only)

# Find records that appear in both DataFrames
both = merged_df[merged_df['_merge'] == 'both']
print("\nRecords in both DataFrames:")
print(both)

Output:

Records only in left DataFrame:
   id     name  score     _merge
2   3  Charlie    NaN  left_only
3   4    David    NaN  left_only

Records only in right DataFrame:
   id  name  score      _merge
4   5  None   95.0  right_only
5   6  None   88.0  right_only

Records in both DataFrames:
   id   name  score _merge
0   1  Alice   90.0   both
1   2    Bob   85.0   both

Real-World Example: Data Reconciliation

Let's look at a more practical example where merge indicators help reconcile customer data from two different systems:

python
# System A: Customer database with contact information
system_a = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['John Smith', 'Emma Wilson', 'Michael Brown', 'Sophia Lee', 'Robert Chen'],
    'email': ['[email protected]', '[email protected]', '[email protected]', 
              '[email protected]', '[email protected]']
})

# System B: Order database with purchase information
system_b = pd.DataFrame({
    'customer_id': [102, 103, 105, 106, 107],
    'total_purchases': [250.50, 175.25, 340.00, 120.75, 90.30],
    'last_purchase_date': ['2023-01-15', '2023-02-20', '2023-01-30', 
                          '2023-03-05', '2023-02-10']
})

# Reconcile the data with an indicator
reconciled = pd.merge(
    system_a, 
    system_b, 
    on='customer_id',
    how='outer',
    indicator='data_source'
)

print("Reconciled customer data:")
print(reconciled)

# Find customers with contact info but no orders
no_orders = reconciled[reconciled['data_source'] == 'left_only']
print("\nCustomers with contact info but no purchase history:")
print(no_orders[['customer_id', 'name', 'email']])

# Find orders without customer contact info
no_contact = reconciled[reconciled['data_source'] == 'right_only']
print("\nCustomers with orders but missing contact info:")
print(no_contact[['customer_id', 'total_purchases', 'last_purchase_date']])

Output:

Reconciled customer data:
   customer_id         name               email  total_purchases last_purchase_date    data_source
0         101   John Smith     [email protected]              NaN                NaN      left_only
1         102  Emma Wilson     [email protected]           250.50         2023-01-15           both
2         103 Michael Brown  [email protected]           175.25         2023-02-20           both
3         104   Sophia Lee   [email protected]              NaN                NaN      left_only
4         105  Robert Chen   [email protected]           340.00         2023-01-30           both
5         106         None                 None           120.75         2023-03-05     right_only
6         107         None                 None            90.30         2023-02-10     right_only

Customers with contact info but no purchase history:
   customer_id       name               email
0         101 John Smith     [email protected]
3         104 Sophia Lee   [email protected]

Customers with orders but missing contact info:
   customer_id  total_purchases last_purchase_date
5         106           120.75         2023-03-05
6         107            90.30         2023-02-10

In this example, we've used merge indicators to:

Identify customers with contact information but no purchase history
Find orders belonging to customers whose contact information is missing

This type of analysis is extremely valuable for data quality checks and ensuring all systems have consistent data.

Advanced Usage: Indicators with Different Join Types

The indicator feature works with all types of joins (inner, left, right, outer), but the results will vary:

python
# Inner join with indicator
inner_merge = pd.merge(
    left_df, 
    right_df, 
    on='id', 
    how='inner', 
    indicator=True
)

# Left join with indicator
left_merge = pd.merge(
    left_df, 
    right_df, 
    on='id', 
    how='left', 
    indicator=True
)

print("Inner join with indicator:")
print(inner_merge)
print("\nLeft join with indicator:")
print(left_merge)

Output:

Inner join with indicator:
   id   name  score _merge
0   1  Alice     90   both
1   2    Bob     85   both

Left join with indicator:
   id     name  score     _merge
0   1    Alice   90.0       both
1   2      Bob   85.0       both
2   3  Charlie    NaN  left_only
3   4    David    NaN  left_only

With an inner join, you'll only see rows with _merge == 'both' because inner joins only keep matching records.

Summary

Pandas merge indicators provide a powerful way to:

Track the source of rows in a merged DataFrame
Identify which records matched across DataFrames and which didn't
Filter records based on their presence in one or both DataFrames
Perform data reconciliation and quality checks

By using the indicator=True parameter in pandas merge operations, you gain valuable insights that help diagnose merge issues and better understand your data relationships.

Exercises

To practice using merge indicators, try these exercises:

Create two DataFrames representing products and sales, then use merge indicators to find products with no sales.
Use merge indicators to identify duplicate records between two datasets.
Create a function that takes two DataFrames and returns three separate DataFrames: records in both, records only in the first, and records only in the second.
Use merge indicators to perform data validation between production and test datasets.

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

What is a Merge Indicator?​

Basic Usage of Merge Indicators​

Customizing the Indicator Column​

Filtering Based on Merge Indicators​

Real-World Example: Data Reconciliation​

Advanced Usage: Indicators with Different Join Types​

Summary​

Exercises​

Additional Resources​

What is a Merge Indicator?

Basic Usage of Merge Indicators

Customizing the Indicator Column

Filtering Based on Merge Indicators

Real-World Example: Data Reconciliation

Advanced Usage: Indicators with Different Join Types

Summary

Exercises

Additional Resources