Skip to main content

Pandas Merge Indicators

When merging DataFrames in pandas, it's often useful to know where each row in the result came from. Did it appear in the left DataFrame only? The right one only? Or was it present in both? The merge indicator feature in pandas helps answer these questions, making it an invaluable tool for data analysis and debugging merge operations.

What is a Merge Indicator?

A merge indicator is an additional column that pandas can add to your merged DataFrame that shows the source of each row. This feature helps you:

  • Track which rows matched across DataFrames
  • Identify rows that only appeared in one of the DataFrames
  • Debug and validate merge operations
  • Better understand your data's structure and relationships

Basic Usage of Merge Indicators

Let's start with a simple example to demonstrate how merge indicators work:

python
import pandas as pd

# Create sample DataFrames
left_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})

right_df = pd.DataFrame({
'id': [1, 2, 5, 6],
'score': [90, 85, 95, 88]
})

# Perform a merge with indicator
merged_df = pd.merge(
left_df,
right_df,
on='id',
how='outer',
indicator=True
)

print("Left DataFrame:")
print(left_df)
print("\nRight DataFrame:")
print(right_df)
print("\nMerged DataFrame with indicator:")
print(merged_df)

Output:

Left DataFrame:
id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 David

Right DataFrame:
id score
0 1 90
1 2 85
2 5 95
3 6 88

Merged DataFrame with indicator:
id name score _merge
0 1 Alice 90.0 both
1 2 Bob 85.0 both
2 3 Charlie NaN left_only
3 4 David NaN left_only
4 5 None 95.0 right_only
5 6 None 88.0 right_only

In this example, the _merge column shows:

  • both: Rows with id 1 and 2 appeared in both DataFrames
  • left_only: Rows with id 3 and 4 appeared only in the left DataFrame
  • right_only: Rows with id 5 and 6 appeared only in the right DataFrame

Customizing the Indicator Column

If you want to use a name other than the default _merge for your indicator column, you can specify it as a string:

python
# Custom indicator column name
merged_df = pd.merge(
left_df,
right_df,
on='id',
how='outer',
indicator='source' # Custom name for the indicator column
)

print(merged_df)

Output:

   id     name  score      source
0 1 Alice 90.0 both
1 2 Bob 85.0 both
2 3 Charlie NaN left_only
3 4 David NaN left_only
4 5 None 95.0 right_only
5 6 None 88.0 right_only

Filtering Based on Merge Indicators

One common use case for merge indicators is to filter rows based on their source. For example, you might want to find records that exist in one DataFrame but not the other:

python
# Find records that appear only in the left DataFrame
left_only = merged_df[merged_df['_merge'] == 'left_only']
print("Records only in left DataFrame:")
print(left_only)

# Find records that appear only in the right DataFrame
right_only = merged_df[merged_df['_merge'] == 'right_only']
print("\nRecords only in right DataFrame:")
print(right_only)

# Find records that appear in both DataFrames
both = merged_df[merged_df['_merge'] == 'both']
print("\nRecords in both DataFrames:")
print(both)

Output:

Records only in left DataFrame:
id name score _merge
2 3 Charlie NaN left_only
3 4 David NaN left_only

Records only in right DataFrame:
id name score _merge
4 5 None 95.0 right_only
5 6 None 88.0 right_only

Records in both DataFrames:
id name score _merge
0 1 Alice 90.0 both
1 2 Bob 85.0 both

Real-World Example: Data Reconciliation

Let's look at a more practical example where merge indicators help reconcile customer data from two different systems:

python
# System A: Customer database with contact information
system_a = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['John Smith', 'Emma Wilson', 'Michael Brown', 'Sophia Lee', 'Robert Chen'],
'email': ['[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]']
})

# System B: Order database with purchase information
system_b = pd.DataFrame({
'customer_id': [102, 103, 105, 106, 107],
'total_purchases': [250.50, 175.25, 340.00, 120.75, 90.30],
'last_purchase_date': ['2023-01-15', '2023-02-20', '2023-01-30',
'2023-03-05', '2023-02-10']
})

# Reconcile the data with an indicator
reconciled = pd.merge(
system_a,
system_b,
on='customer_id',
how='outer',
indicator='data_source'
)

print("Reconciled customer data:")
print(reconciled)

# Find customers with contact info but no orders
no_orders = reconciled[reconciled['data_source'] == 'left_only']
print("\nCustomers with contact info but no purchase history:")
print(no_orders[['customer_id', 'name', 'email']])

# Find orders without customer contact info
no_contact = reconciled[reconciled['data_source'] == 'right_only']
print("\nCustomers with orders but missing contact info:")
print(no_contact[['customer_id', 'total_purchases', 'last_purchase_date']])

Output:

Reconciled customer data:
customer_id name email total_purchases last_purchase_date data_source
0 101 John Smith [email protected] NaN NaN left_only
1 102 Emma Wilson [email protected] 250.50 2023-01-15 both
2 103 Michael Brown [email protected] 175.25 2023-02-20 both
3 104 Sophia Lee [email protected] NaN NaN left_only
4 105 Robert Chen [email protected] 340.00 2023-01-30 both
5 106 None None 120.75 2023-03-05 right_only
6 107 None None 90.30 2023-02-10 right_only

Customers with contact info but no purchase history:
customer_id name email
0 101 John Smith [email protected]
3 104 Sophia Lee [email protected]

Customers with orders but missing contact info:
customer_id total_purchases last_purchase_date
5 106 120.75 2023-03-05
6 107 90.30 2023-02-10

In this example, we've used merge indicators to:

  1. Identify customers with contact information but no purchase history
  2. Find orders belonging to customers whose contact information is missing

This type of analysis is extremely valuable for data quality checks and ensuring all systems have consistent data.

Advanced Usage: Indicators with Different Join Types

The indicator feature works with all types of joins (inner, left, right, outer), but the results will vary:

python
# Inner join with indicator
inner_merge = pd.merge(
left_df,
right_df,
on='id',
how='inner',
indicator=True
)

# Left join with indicator
left_merge = pd.merge(
left_df,
right_df,
on='id',
how='left',
indicator=True
)

print("Inner join with indicator:")
print(inner_merge)
print("\nLeft join with indicator:")
print(left_merge)

Output:

Inner join with indicator:
id name score _merge
0 1 Alice 90 both
1 2 Bob 85 both

Left join with indicator:
id name score _merge
0 1 Alice 90.0 both
1 2 Bob 85.0 both
2 3 Charlie NaN left_only
3 4 David NaN left_only

With an inner join, you'll only see rows with _merge == 'both' because inner joins only keep matching records.

Summary

Pandas merge indicators provide a powerful way to:

  1. Track the source of rows in a merged DataFrame
  2. Identify which records matched across DataFrames and which didn't
  3. Filter records based on their presence in one or both DataFrames
  4. Perform data reconciliation and quality checks

By using the indicator=True parameter in pandas merge operations, you gain valuable insights that help diagnose merge issues and better understand your data relationships.

Exercises

To practice using merge indicators, try these exercises:

  1. Create two DataFrames representing products and sales, then use merge indicators to find products with no sales.
  2. Use merge indicators to identify duplicate records between two datasets.
  3. Create a function that takes two DataFrames and returns three separate DataFrames: records in both, records only in the first, and records only in the second.
  4. Use merge indicators to perform data validation between production and test datasets.

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)