Skip to main content

Pandas Merge Basics

Introduction

When working with data in the real world, information is often spread across multiple tables or DataFrames. Pandas provides powerful tools to combine these DataFrames, with the merge() function being one of the most important. Understanding how to merge DataFrames is essential for data cleaning, transformation, and analysis.

In this tutorial, you'll learn:

  • What merging is and why it's important
  • How to use the pandas.merge() function
  • Different types of joins (inner, outer, left, right)
  • How to merge on different types of keys
  • Common issues and how to resolve them

What is Merging?

Merging is the process of combining two DataFrames based on common columns or indices. If you're familiar with SQL, pandas merging is equivalent to SQL joins. Merging allows you to combine information from different sources into a single, unified DataFrame.

Let's start by importing pandas and creating some example DataFrames:

python
import pandas as pd
import numpy as np

# Create example DataFrames
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'department_id': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
'department_id': [101, 102, 103, 104],
'department_name': ['HR', 'Engineering', 'Marketing', 'Sales']
})

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

Output:

Employees DataFrame:
employee_id name department_id
0 1 Alice 101
1 2 Bob 102
2 3 Charlie 101
3 4 David 103
4 5 Eva 102

Departments DataFrame:
department_id department_name
0 101 HR
1 102 Engineering
2 103 Marketing
3 104 Sales

Basic Merge Operation

The basic syntax for merging DataFrames in pandas is:

python
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

Let's merge our employees and departments DataFrames to get each employee with their department name:

python
# Merge employees and departments
employee_departments = pd.merge(
employees,
departments,
on='department_id'
)

print(employee_departments)

Output:

   employee_id     name  department_id department_name
0 1 Alice 101 HR
1 3 Charlie 101 HR
2 2 Bob 102 Engineering
3 5 Eva 102 Engineering
4 4 David 103 Marketing

In this example, we merged the DataFrames on the department_id column which is present in both DataFrames.

Types of Joins

Pandas supports different types of joins through the how parameter:

  1. Inner join (how='inner'): Returns only the rows where there is a match in both DataFrames (default)
  2. Left join (how='left'): Returns all rows from the left DataFrame and matching rows from the right DataFrame
  3. Right join (how='right'): Returns all rows from the right DataFrame and matching rows from the left DataFrame
  4. Outer join (how='outer'): Returns all rows when there is a match in either the left or right DataFrame

Inner Join (Default)

The inner join returns only the rows where there is a match in both DataFrames:

python
inner_join = pd.merge(employees, departments, on='department_id', how='inner')
print("Inner Join:")
print(inner_join)

Output:

Inner Join:
employee_id name department_id department_name
0 1 Alice 101 HR
1 3 Charlie 101 HR
2 2 Bob 102 Engineering
3 5 Eva 102 Engineering
4 4 David 103 Marketing

Left Join

A left join returns all rows from the left DataFrame and matching rows from the right DataFrame:

python
# Create an employee with a department_id that doesn't exist in departments
employees_extended = employees.copy()
employees_extended.loc[5] = [6, 'Frank', 105]

left_join = pd.merge(
employees_extended,
departments,
on='department_id',
how='left'
)
print("Left Join:")
print(left_join)

Output:

Left Join:
employee_id name department_id department_name
0 1 Alice 101 HR
1 3 Charlie 101 HR
2 2 Bob 102 Engineering
3 5 Eva 102 Engineering
4 4 David 103 Marketing
5 6 Frank 105 NaN

Notice that Frank appears in the result with NaN for the department name because department 105 doesn't exist in the departments DataFrame.

Right Join

A right join returns all rows from the right DataFrame and matching rows from the left DataFrame:

python
right_join = pd.merge(
employees,
departments,
on='department_id',
how='right'
)
print("Right Join:")
print(right_join)

Output:

Right Join:
employee_id name department_id department_name
0 1.0 Alice 101 HR
1 3.0 Charlie 101 HR
2 2.0 Bob 102 Engineering
3 5.0 Eva 102 Engineering
4 4.0 David 103 Marketing
5 NaN NaN 104 Sales

The Sales department (department_id 104) appears in the result with NaN values for employee information because no employee is assigned to that department.

Outer Join

An outer join returns all rows when there is a match in either the left or right DataFrame:

python
# Using employees_extended which has Frank with department_id 105
outer_join = pd.merge(
employees_extended,
departments,
on='department_id',
how='outer'
)
print("Outer Join:")
print(outer_join)

Output:

Outer Join:
employee_id name department_id department_name
0 1.0 Alice 101 HR
1 3.0 Charlie 101 HR
2 2.0 Bob 102 Engineering
3 5.0 Eva 102 Engineering
4 4.0 David 103 Marketing
5 NaN NaN 104 Sales
6 6.0 Frank 105 NaN

The outer join includes all employees and all departments, with NaN values where there's no match.

Merging on Different Columns

Sometimes the columns you want to merge on have different names in each DataFrame. You can use left_on and right_on to specify the column names:

python
# Create DataFrames with different column names
employees2 = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'dept_id': [101, 102, 101, 103, 102]
})

departments2 = pd.DataFrame({
'department_id': [101, 102, 103, 104],
'department_name': ['HR', 'Engineering', 'Marketing', 'Sales']
})

# Merge on different column names
merged_diff_cols = pd.merge(
employees2,
departments2,
left_on='dept_id',
right_on='department_id'
)

print(merged_diff_cols)

Output:

   emp_id     name  dept_id  department_id department_name
0 1 Alice 101 101 HR
1 3 Charlie 101 101 HR
2 2 Bob 102 102 Engineering
3 5 Eva 102 102 Engineering
4 4 David 103 103 Marketing

Notice that the resulting DataFrame contains both columns used for merging.

Merging on Multiple Columns

You can merge DataFrames on multiple columns by passing a list of column names:

python
# Create DataFrames with multiple keys
employee_projects = pd.DataFrame({
'employee_id': [1, 1, 2, 3, 4, 5],
'project_id': [1001, 1002, 1001, 1003, 1002, 1001],
'role': ['Developer', 'Tester', 'Manager', 'Developer', 'Developer', 'Tester']
})

project_details = pd.DataFrame({
'employee_id': [1, 1, 2, 3, 4, 6],
'project_id': [1001, 1002, 1001, 1003, 1002, 1004],
'hours_worked': [20, 15, 40, 30, 25, 10]
})

# Merge on multiple columns
multi_key_merge = pd.merge(
employee_projects,
project_details,
on=['employee_id', 'project_id']
)

print("Multiple Key Merge:")
print(multi_key_merge)

Output:

Multiple Key Merge:
employee_id project_id role hours_worked
0 1 1001 Developer 20
1 1 1002 Tester 15
2 2 1001 Manager 40
3 3 1003 Developer 30
4 4 1002 Developer 25

Employee 5 and project 1004 don't appear in the result because there's no match for those particular employee-project combinations in both DataFrames.

Handling Duplicate Columns

When merging DataFrames that have columns with the same name (other than the ones you're merging on), pandas adds suffixes to distinguish them:

python
# Create DataFrames with duplicate column names
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value': [1, 2, 3, 4],
'data': ['a1', 'b1', 'c1', 'd1']
})

df2 = pd.DataFrame({
'key': ['B', 'D', 'E', 'F'],
'value': [20, 40, 50, 60],
'data': ['b2', 'd2', 'e2', 'f2']
})

# Merge with suffixes
merged_with_suffixes = pd.merge(
df1,
df2,
on='key',
suffixes=('_left', '_right')
)

print("Merged with Suffixes:")
print(merged_with_suffixes)

Output:

Merged with Suffixes:
key value_left data_left value_right data_right
0 B 2 b1 20 b2
1 D 4 d1 40 d2

By default, pandas uses _x and _y as suffixes, but we've specified _left and _right for clarity.

Real-World Example: Sales Data Analysis

Let's walk through a more practical example. Imagine we have sales data and customer information in separate DataFrames, and we want to analyze sales by customer category:

python
# Create sales data
sales = pd.DataFrame({
'sale_id': range(1, 11),
'customer_id': [101, 102, 101, 103, 104, 105, 103, 102, 101, 105],
'amount': [120.50, 300.75, 95.25, 440.00, 200.00, 150.25, 300.50, 250.75, 180.00, 320.50],
'date': pd.date_range(start='2023-01-01', periods=10, freq='D')
})

# Create customer data
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105, 106],
'name': ['ABC Corp', 'XYZ Inc', '123 LLC', 'Tech Giant', 'Small Shop', 'New Venture'],
'category': ['Enterprise', 'SMB', 'Enterprise', 'Enterprise', 'SMB', 'Startup']
})

# Merge sales with customer information
sales_with_customer_info = pd.merge(
sales,
customers,
on='customer_id',
how='left'
)

print("Sales with Customer Info:")
print(sales_with_customer_info.head())

# Analyze sales by customer category
sales_by_category = sales_with_customer_info.groupby('category')['amount'].agg(['sum', 'mean', 'count'])
print("\nSales by Customer Category:")
print(sales_by_category)

Output:

Sales with Customer Info:
sale_id customer_id amount date name category
0 1 101 120.50 2023-01-01 ABC Corp Enterprise
1 2 102 300.75 2023-01-02 XYZ Inc SMB
2 3 101 95.25 2023-01-03 ABC Corp Enterprise
3 4 103 440.00 2023-01-04 123 LLC Enterprise
4 5 104 200.00 2023-01-05 Tech Giant Enterprise

Sales by Customer Category:
sum mean count
category
Enterprise 1036.00 259.00000 4
SMB 1022.25 255.56250 4

In this example, we merged sales data with customer information to analyze sales by customer category. The merged data allows us to see which customer categories generate the most sales.

Common Issues and Solutions

1. Memory Usage with Large DataFrames

When working with large DataFrames, merges can consume a lot of memory. You can use the copy=False parameter to avoid making copies of data:

python
# More memory-efficient merge
efficient_merge = pd.merge(
large_df1,
large_df2,
on='key',
copy=False
)

2. Handling Duplicate Keys

If you have duplicate keys in both DataFrames, the merge will create a row for each combination of duplicates:

python
# DataFrames with duplicate keys
df_a = pd.DataFrame({
'key': ['A', 'A', 'B', 'C'],
'value': [1, 2, 3, 4]
})

df_b = pd.DataFrame({
'key': ['A', 'A', 'B', 'D'],
'other_value': [10, 20, 30, 40]
})

# Merge with duplicate keys
duplicate_merge = pd.merge(df_a, df_b, on='key')
print("Merge with Duplicate Keys:")
print(duplicate_merge)

Output:

Merge with Duplicate Keys:
key value other_value
0 A 1 10
1 A 1 20
2 A 2 10
3 A 2 20
4 B 3 30

Be careful with duplicate keys, as they can explode the size of your merged DataFrame.

3. Tracking Merge Results

To understand which rows were matched in a merge, you can use the indicator=True parameter:

python
# Merge with indicator
merge_with_indicator = pd.merge(
employees,
departments,
on='department_id',
how='outer',
indicator=True
)

print("Merge with Indicator:")
print(merge_with_indicator)

Output:

Merge with Indicator:
employee_id name department_id department_name _merge
0 1.0 Alice 101 HR both
1 3.0 Charlie 101 HR both
2 2.0 Bob 102 Engineering both
3 5.0 Eva 102 Engineering both
4 4.0 David 103 Marketing both
5 NaN NaN 104 Sales right_only

The _merge column shows where each row came from: 'both', 'left_only', or 'right_only'.

Summary

In this tutorial, you've learned the basics of merging DataFrames in pandas:

  • How to use the pd.merge() function to combine DataFrames
  • Different types of joins: inner, left, right, and outer
  • Merging on different columns with left_on and right_on
  • Merging on multiple columns
  • Handling duplicate columns with suffixes
  • Working with real-world data through a practical example
  • Common issues and solutions for merging DataFrames

Merging is a fundamental operation in data analysis that allows you to combine information from different sources. As you continue your pandas journey, you'll find yourself using these techniques frequently to prepare data for analysis and visualization.

Additional Resources

Exercises

  1. Create two DataFrames: one with student information (student_id, name, major_id) and another with major information (major_id, major_name, department). Merge them to create a complete student profile.

  2. Using the sales and customers DataFrames from the real-world example:

    • Find the customer who made the largest single purchase
    • Calculate the average purchase amount by customer category
    • Identify customers who haven't made any purchases
  3. Create a DataFrame with product information and another with inventory levels. Perform an outer merge to identify products that are either not in inventory or not in the product catalog.

  4. Take two DataFrames with a common date column but in different formats. Convert the dates to a common format and merge the DataFrames based on the date.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)