Pandas Merge Basics

Introduction

When working with data in the real world, information is often spread across multiple tables or DataFrames. Pandas provides powerful tools to combine these DataFrames, with the merge() function being one of the most important. Understanding how to merge DataFrames is essential for data cleaning, transformation, and analysis.

In this tutorial, you'll learn:

What merging is and why it's important
How to use the pandas.merge() function
Different types of joins (inner, outer, left, right)
How to merge on different types of keys
Common issues and how to resolve them

What is Merging?

Merging is the process of combining two DataFrames based on common columns or indices. If you're familiar with SQL, pandas merging is equivalent to SQL joins. Merging allows you to combine information from different sources into a single, unified DataFrame.

Let's start by importing pandas and creating some example DataFrames:

python
import pandas as pd
import numpy as np

# Create example DataFrames
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'department_id': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'department_id': [101, 102, 103, 104],
    'department_name': ['HR', 'Engineering', 'Marketing', 'Sales']
})

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

Output:

Employees DataFrame:
   employee_id     name  department_id
0           1    Alice            101
1           2      Bob            102
2           3  Charlie            101
3           4    David            103
4           5      Eva            102

Departments DataFrame:
   department_id department_name
0           101             HR
1           102     Engineering
2           103       Marketing
3           104           Sales

Basic Merge Operation

The basic syntax for merging DataFrames in pandas is:

python
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

Let's merge our employees and departments DataFrames to get each employee with their department name:

python
# Merge employees and departments
employee_departments = pd.merge(
    employees, 
    departments, 
    on='department_id'
)

print(employee_departments)

Output:

   employee_id     name  department_id department_name
         1    Alice            101             HR
         3  Charlie            101             HR
         2      Bob            102     Engineering
         5      Eva            102     Engineering
         4    David            103       Marketing

In this example, we merged the DataFrames on the department_id column which is present in both DataFrames.

Types of Joins

Pandas supports different types of joins through the how parameter:

Inner join (how='inner'): Returns only the rows where there is a match in both DataFrames (default)
Left join (how='left'): Returns all rows from the left DataFrame and matching rows from the right DataFrame
Right join (how='right'): Returns all rows from the right DataFrame and matching rows from the left DataFrame
Outer join (how='outer'): Returns all rows when there is a match in either the left or right DataFrame

Inner Join (Default)

The inner join returns only the rows where there is a match in both DataFrames:

python
inner_join = pd.merge(employees, departments, on='department_id', how='inner')
print("Inner Join:")
print(inner_join)

Output:

Inner Join:
   employee_id     name  department_id department_name
         1    Alice            101             HR
         3  Charlie            101             HR
         2      Bob            102     Engineering
         5      Eva            102     Engineering
         4    David            103       Marketing

Left Join

A left join returns all rows from the left DataFrame and matching rows from the right DataFrame:

python
# Create an employee with a department_id that doesn't exist in departments
employees_extended = employees.copy()
employees_extended.loc[5] = [6, 'Frank', 105]

left_join = pd.merge(
    employees_extended, 
    departments, 
    on='department_id', 
    how='left'
)
print("Left Join:")
print(left_join)

Output:

Left Join:
   employee_id     name  department_id department_name
         1    Alice            101             HR
         3  Charlie            101             HR
         2      Bob            102     Engineering
         5      Eva            102     Engineering
         4    David            103       Marketing
         6    Frank            105            NaN

Notice that Frank appears in the result with NaN for the department name because department 105 doesn't exist in the departments DataFrame.

Right Join

A right join returns all rows from the right DataFrame and matching rows from the left DataFrame:

python
right_join = pd.merge(
    employees, 
    departments, 
    on='department_id', 
    how='right'
)
print("Right Join:")
print(right_join)

Output:

Right Join:
   employee_id     name  department_id department_name
       1.0    Alice            101             HR
       3.0  Charlie            101             HR
       2.0      Bob            102     Engineering
       5.0      Eva            102     Engineering
       4.0    David            103       Marketing
       NaN      NaN            104           Sales

The Sales department (department_id 104) appears in the result with NaN values for employee information because no employee is assigned to that department.

Outer Join

An outer join returns all rows when there is a match in either the left or right DataFrame:

python
# Using employees_extended which has Frank with department_id 105
outer_join = pd.merge(
    employees_extended, 
    departments, 
    on='department_id', 
    how='outer'
)
print("Outer Join:")
print(outer_join)

Output:

Outer Join:
   employee_id     name  department_id department_name
       1.0    Alice            101             HR
       3.0  Charlie            101             HR
       2.0      Bob            102     Engineering
       5.0      Eva            102     Engineering
       4.0    David            103       Marketing
       NaN      NaN            104           Sales
       6.0    Frank            105            NaN

The outer join includes all employees and all departments, with NaN values where there's no match.

Merging on Different Columns

Sometimes the columns you want to merge on have different names in each DataFrame. You can use left_on and right_on to specify the column names:

python
# Create DataFrames with different column names
employees2 = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'dept_id': [101, 102, 101, 103, 102]
})

departments2 = pd.DataFrame({
    'department_id': [101, 102, 103, 104],
    'department_name': ['HR', 'Engineering', 'Marketing', 'Sales']
})

# Merge on different column names
merged_diff_cols = pd.merge(
    employees2,
    departments2,
    left_on='dept_id',
    right_on='department_id'
)

print(merged_diff_cols)

Output:

   emp_id     name  dept_id  department_id department_name
     1    Alice      101            101             HR
     3  Charlie      101            101             HR
     2      Bob      102            102     Engineering
     5      Eva      102            102     Engineering
     4    David      103            103       Marketing

Notice that the resulting DataFrame contains both columns used for merging.

Merging on Multiple Columns

You can merge DataFrames on multiple columns by passing a list of column names:

python
# Create DataFrames with multiple keys
employee_projects = pd.DataFrame({
    'employee_id': [1, 1, 2, 3, 4, 5],
    'project_id': [1001, 1002, 1001, 1003, 1002, 1001],
    'role': ['Developer', 'Tester', 'Manager', 'Developer', 'Developer', 'Tester']
})

project_details = pd.DataFrame({
    'employee_id': [1, 1, 2, 3, 4, 6],
    'project_id': [1001, 1002, 1001, 1003, 1002, 1004],
    'hours_worked': [20, 15, 40, 30, 25, 10]
})

# Merge on multiple columns
multi_key_merge = pd.merge(
    employee_projects,
    project_details,
    on=['employee_id', 'project_id']
)

print("Multiple Key Merge:")
print(multi_key_merge)

Output:

Multiple Key Merge:
   employee_id  project_id        role  hours_worked
         1        1001   Developer            20
         1        1002      Tester            15
         2        1001     Manager            40
         3        1003   Developer            30
         4        1002   Developer            25

Employee 5 and project 1004 don't appear in the result because there's no match for those particular employee-project combinations in both DataFrames.

Handling Duplicate Columns

When merging DataFrames that have columns with the same name (other than the ones you're merging on), pandas adds suffixes to distinguish them:

python
# Create DataFrames with duplicate column names
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4],
    'data': ['a1', 'b1', 'c1', 'd1']
})

df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [20, 40, 50, 60],
    'data': ['b2', 'd2', 'e2', 'f2']
})

# Merge with suffixes
merged_with_suffixes = pd.merge(
    df1,
    df2,
    on='key',
    suffixes=('_left', '_right')
)

print("Merged with Suffixes:")
print(merged_with_suffixes)

Output:

Merged with Suffixes:
  key  value_left data_left  value_right data_right
0   B          2        b1           20         b2
1   D          4        d1           40         d2

By default, pandas uses _x and _y as suffixes, but we've specified _left and _right for clarity.

Real-World Example: Sales Data Analysis

Let's walk through a more practical example. Imagine we have sales data and customer information in separate DataFrames, and we want to analyze sales by customer category:

python
# Create sales data
sales = pd.DataFrame({
    'sale_id': range(1, 11),
    'customer_id': [101, 102, 101, 103, 104, 105, 103, 102, 101, 105],
    'amount': [120.50, 300.75, 95.25, 440.00, 200.00, 150.25, 300.50, 250.75, 180.00, 320.50],
    'date': pd.date_range(start='2023-01-01', periods=10, freq='D')
})

# Create customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 106],
    'name': ['ABC Corp', 'XYZ Inc', '123 LLC', 'Tech Giant', 'Small Shop', 'New Venture'],
    'category': ['Enterprise', 'SMB', 'Enterprise', 'Enterprise', 'SMB', 'Startup']
})

# Merge sales with customer information
sales_with_customer_info = pd.merge(
    sales,
    customers,
    on='customer_id',
    how='left'
)

print("Sales with Customer Info:")
print(sales_with_customer_info.head())

# Analyze sales by customer category
sales_by_category = sales_with_customer_info.groupby('category')['amount'].agg(['sum', 'mean', 'count'])
print("\nSales by Customer Category:")
print(sales_by_category)

Output:

Sales with Customer Info:
   sale_id  customer_id  amount       date      name    category
0        1         101  120.50 2023-01-01  ABC Corp  Enterprise
1        2         102  300.75 2023-01-02   XYZ Inc         SMB
2        3         101   95.25 2023-01-03  ABC Corp  Enterprise
3        4         103  440.00 2023-01-04   123 LLC  Enterprise
4        5         104  200.00 2023-01-05  Tech Giant  Enterprise

Sales by Customer Category:
                 sum       mean  count
category                             
Enterprise  1036.00  259.00000      4
SMB         1022.25  255.56250      4

In this example, we merged sales data with customer information to analyze sales by customer category. The merged data allows us to see which customer categories generate the most sales.

Common Issues and Solutions

1. Memory Usage with Large DataFrames

When working with large DataFrames, merges can consume a lot of memory. You can use the copy=False parameter to avoid making copies of data:

python
# More memory-efficient merge
efficient_merge = pd.merge(
    large_df1, 
    large_df2, 
    on='key', 
    copy=False
)

2. Handling Duplicate Keys

If you have duplicate keys in both DataFrames, the merge will create a row for each combination of duplicates:

python
# DataFrames with duplicate keys
df_a = pd.DataFrame({
    'key': ['A', 'A', 'B', 'C'],
    'value': [1, 2, 3, 4]
})

df_b = pd.DataFrame({
    'key': ['A', 'A', 'B', 'D'],
    'other_value': [10, 20, 30, 40]
})

# Merge with duplicate keys
duplicate_merge = pd.merge(df_a, df_b, on='key')
print("Merge with Duplicate Keys:")
print(duplicate_merge)

Output:

Merge with Duplicate Keys:
  key  value  other_value
 A      1           10
 A      1           20
 A      2           10
 A      2           20
 B      3           30

Be careful with duplicate keys, as they can explode the size of your merged DataFrame.

3. Tracking Merge Results

To understand which rows were matched in a merge, you can use the indicator=True parameter:

python
# Merge with indicator
merge_with_indicator = pd.merge(
    employees,
    departments,
    on='department_id',
    how='outer',
    indicator=True
)

print("Merge with Indicator:")
print(merge_with_indicator)

Output:

Merge with Indicator:
   employee_id     name  department_id department_name      _merge
       1.0    Alice            101             HR        both
       3.0  Charlie            101             HR        both
       2.0      Bob            102     Engineering        both
       5.0      Eva            102     Engineering        both
       4.0    David            103       Marketing        both
       NaN      NaN            104           Sales  right_only

The _merge column shows where each row came from: 'both', 'left_only', or 'right_only'.

Summary

In this tutorial, you've learned the basics of merging DataFrames in pandas:

How to use the pd.merge() function to combine DataFrames
Different types of joins: inner, left, right, and outer
Merging on different columns with left_on and right_on
Merging on multiple columns
Handling duplicate columns with suffixes
Working with real-world data through a practical example
Common issues and solutions for merging DataFrames

Merging is a fundamental operation in data analysis that allows you to combine information from different sources. As you continue your pandas journey, you'll find yourself using these techniques frequently to prepare data for analysis and visualization.

Additional Resources

Exercises

Create two DataFrames: one with student information (student_id, name, major_id) and another with major information (major_id, major_name, department). Merge them to create a complete student profile.
Using the sales and customers DataFrames from the real-world example:
- Find the customer who made the largest single purchase
- Calculate the average purchase amount by customer category
- Identify customers who haven't made any purchases
Create a DataFrame with product information and another with inventory levels. Perform an outer merge to identify products that are either not in inventory or not in the product catalog.
Take two DataFrames with a common date column but in different formats. Convert the dates to a common format and merge the DataFrames based on the date.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

What is Merging?​

Basic Merge Operation​

Types of Joins​

Inner Join (Default)​

Left Join​

Right Join​

Outer Join​

Merging on Different Columns​

Merging on Multiple Columns​

Handling Duplicate Columns​

Real-World Example: Sales Data Analysis​

Common Issues and Solutions​

1. Memory Usage with Large DataFrames​

2. Handling Duplicate Keys​

3. Tracking Merge Results​

Summary​

Additional Resources​

Exercises​

Introduction

What is Merging?

Basic Merge Operation

Types of Joins

Inner Join (Default)

Left Join

Right Join

Outer Join

Merging on Different Columns

Merging on Multiple Columns

Handling Duplicate Columns

Real-World Example: Sales Data Analysis

Common Issues and Solutions

1. Memory Usage with Large DataFrames

2. Handling Duplicate Keys

3. Tracking Merge Results

Summary

Additional Resources

Exercises