Skip to main content

Pandas Outer Join

When working with multiple datasets in data analysis, you often need to combine them based on common columns. Pandas provides powerful merging capabilities, including the outer join method, which allows you to combine DataFrames while keeping all rows from both inputs.

What is an Outer Join?

An outer join combines two DataFrames and keeps all rows from both DataFrames. If a row in one DataFrame doesn't have a matching row in the other DataFrame, the missing values in the result are filled with NaN (Not a Number).

Think of an outer join as saying: "Give me everything from both tables, even if there's no match."

Basic Syntax

The basic syntax for performing an outer join in Pandas is:

python
pd.merge(left_df, right_df, on='key_column', how='outer')

Or using the DataFrame method:

python
left_df.merge(right_df, on='key_column', how='outer')

Simple Outer Join Example

Let's understand with a basic example:

python
import pandas as pd

# Create two sample DataFrames
employees = pd.DataFrame({
'emp_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'dept_id': [1, 2, 1, 3, 2]
})

departments = pd.DataFrame({
'dept_id': [1, 2, 4, 5],
'dept_name': ['HR', 'Engineering', 'Marketing', 'Sales']
})

# Perform an outer join
result = pd.merge(employees, departments, on='dept_id', how='outer')

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
print("\nOuter Join Result:")
print(result)

This will output:

Employees DataFrame:
emp_id name dept_id
0 101 Alice 1
1 102 Bob 2
2 103 Charlie 1
3 104 David 3
4 105 Eva 2

Departments DataFrame:
dept_id dept_name
0 1 HR
1 2 Engineering
2 4 Marketing
3 5 Sales

Outer Join Result:
emp_id name dept_id dept_name
0 101.0 Alice 1 HR
1 103.0 Charlie 1 HR
2 102.0 Bob 2 Engineering
3 105.0 Eva 2 Engineering
4 104.0 David 3 NaN
5 NaN NaN 4 Marketing
6 NaN NaN 5 Sales

Notice that:

  • Employees with dept_id 1 and 2 have matching department information
  • Employee David (dept_id 3) has no matching department, so dept_name is NaN
  • Departments 4 (Marketing) and 5 (Sales) have no matching employees, so emp_id and name are NaN

Outer Join with Multiple Keys

You can join DataFrames on multiple columns by passing a list of column names:

python
pd.merge(left_df, right_df, on=['key1', 'key2'], how='outer')

Joining on Different Column Names

If the key columns have different names in each DataFrame, you can use left_on and right_on:

python
students = pd.DataFrame({
'student_id': [1, 2, 3, 4, 5],
'name': ['Alex', 'Brian', 'Claire', 'Diana', 'Eric'],
'course_code': ['CS101', 'MATH101', 'CS102', 'ENG101', 'PHYS101']
})

courses = pd.DataFrame({
'code': ['CS101', 'MATH101', 'CS102', 'BIO101', 'CHEM101'],
'course_name': ['Intro to Programming', 'Calculus I', 'Data Structures', 'Biology', 'Chemistry']
})

# Outer join with different column names
result = pd.merge(students, courses, left_on='course_code', right_on='code', how='outer')

print(result)

This outputs:

   student_id   name course_code     code           course_name
0 1.0 Alex CS101 CS101 Intro to Programming
1 2.0 Brian MATH101 MATH101 Calculus I
2 3.0 Claire CS102 CS102 Data Structures
3 4.0 Diana ENG101 NaN NaN
4 5.0 Eric PHYS101 NaN NaN
5 NaN NaN NaN BIO101 Biology
6 NaN NaN NaN CHEM101 Chemistry

Handling Index-Based Joins

You can also join on the index instead of columns using left_index and right_index:

python
# Set indices
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C3'], 'D': ['D0', 'D1', 'D3']}, index=['K0', 'K2', 'K3'])

# Outer join on indices
result = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
print("\nOuter Join Result:")
print(result)

Output:

DataFrame 1:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2

DataFrame 2:
C D
K0 C0 D0
K2 C1 D1
K3 C3 D3

Outer Join Result:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C1 D1
K3 NaN NaN C3 D3

Practical Example: Combining Sales Data

Let's see a real-world example where we might use an outer join to combine sales data with customer information:

python
# Sales data from two different regions
sales_north = pd.DataFrame({
'customer_id': [1001, 1002, 1003, 1004],
'item': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
'amount': [1200, 800, 300, 450]
})

sales_south = pd.DataFrame({
'customer_id': [1002, 1004, 1005, 1006],
'item': ['Keyboard', 'Headphones', 'Mouse', 'Camera'],
'amount': [100, 150, 50, 350]
})

# Customer information
customers = pd.DataFrame({
'customer_id': [1001, 1002, 1003, 1006, 1007],
'name': ['John Smith', 'Jane Doe', 'Robert Brown', 'Lisa Wong', 'Michael Lee'],
'segment': ['Corporate', 'Consumer', 'Corporate', 'Consumer', 'Corporate']
})

# First, combine all sales data using outer join
all_sales = pd.merge(sales_north, sales_south, how='outer', on='customer_id',
suffixes=('_north', '_south'))

# Then, join with customer information
complete_data = pd.merge(all_sales, customers, on='customer_id', how='outer')

print("Complete Sales Data:")
print(complete_data)

Output:

Complete Sales Data:
customer_id item_north amount_north item_south amount_south name segment
0 1001 Laptop 1200.0 NaN NaN John Smith Corporate
1 1002 Phone 800.0 Keyboard 100.0 Jane Doe Consumer
2 1003 Tablet 300.0 NaN NaN Robert Brown Corporate
3 1004 Monitor 450.0 Headphones 150.0 NaN NaN
4 1005 NaN NaN Mouse 50.0 NaN NaN
5 1006 NaN NaN Camera 350.0 Lisa Wong Consumer
6 1007 NaN NaN NaN NaN Michael Lee Corporate

This example shows:

  1. How to combine sales from two regions (keeping all sales records)
  2. How to add customer information to the combined sales data
  3. How to handle suffixes when columns have the same name in both DataFrames

Handling Missing Values After an Outer Join

After performing an outer join, you'll often have NaN values where there was no match. You have several options for handling these:

python
# Fill NaN values with a specific value
result.fillna({'amount_north': 0, 'amount_south': 0})

# Or fill all NaN values with a single value
result.fillna(0)

# Drop rows with any NaN values (this would defeat the purpose of an outer join in most cases)
result.dropna()

# Fill with different values based on columns
result['segment'] = result['segment'].fillna('Unknown')

When to Use Outer Joins

Outer joins are particularly useful when:

  1. You want to see all data from multiple sources, even if there's no correspondence
  2. You need to identify missing records between datasets
  3. You're creating a comprehensive view of all available data
  4. You're performing data quality checks to identify mismatches

Summary

Pandas outer join is a powerful way to combine DataFrames while preserving all rows from both sources. The key points to remember are:

  • Outer joins keep all rows from both DataFrames
  • Missing values are filled with NaN
  • You can join on columns, indices, or a combination
  • You can join on columns with different names using left_on and right_on
  • After joining, you'll often need to handle the NaN values appropriately

Exercises

  1. Create two DataFrames representing products and categories, then perform an outer join to see all products and all categories.
  2. Join three DataFrames using consecutive outer joins: customers, orders, and products.
  3. Compare the results of an outer join vs. an inner join on the same DataFrames to understand the difference.
  4. Create a visualization showing the proportion of matched vs. unmatched rows after an outer join.

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)