Skip to main content

Pandas Left Join

Introduction

When working with real-world data, you'll often need to combine multiple datasets for comprehensive analysis. Pandas' left join is a crucial operation that allows you to merge two dataframes while preserving all rows from the left dataframe, regardless of whether they have matching values in the right dataframe. This tutorial will walk you through left joining dataframes with pandas, a common operation in data cleaning and preparation.

Understanding Pandas Left Join

A left join returns all records from the left dataframe and only the matched records from the right dataframe. If there's no match from the right dataframe, the result will contain NaN (Not a Number) values for columns from the right dataframe.

Let's visualize how left join works:

Left DF    Right DF       Result (Left Join)
A B B C A B C
1 X X 100 1 X 100
2 Y Y 200 2 Y 200
3 Z 3 Z NaN

Basic Left Join Syntax

In pandas, you can perform a left join using the merge() function with the how='left' parameter:

python
pd.merge(left_df, right_df, on='key_column', how='left')

Or using the dataframe's merge() method:

python
left_df.merge(right_df, on='key_column', how='left')

Simple Left Join Example

Let's start with a basic example of left join:

python
import pandas as pd

# Create sample dataframes
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'department_id': [101, 102, 103, 104, None]
})

departments = pd.DataFrame({
'department_id': [101, 102, 105],
'department_name': ['HR', 'Engineering', 'Finance']
})

# Display the dataframes
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

# Perform left join
result = employees.merge(departments, on='department_id', how='left')

# Display the result
print("\nResult of Left Join:")
print(result)

Output:

Employees DataFrame:
employee_id name department_id
0 1 Alice 101.0
1 2 Bob 102.0
2 3 Charlie 103.0
3 4 David 104.0
4 5 Eva NaN

Departments DataFrame:
department_id department_name
0 101 HR
1 102 Engineering
2 105 Finance

Result of Left Join:
employee_id name department_id department_name
0 1 Alice 101.0 HR
1 2 Bob 102.0 Engineering
2 3 Charlie 103.0 NaN
3 4 David 104.0 NaN
4 5 Eva NaN NaN

Notice that all employees are included in the result, but Charlie, David, and Eva have NaN for department_name because their department IDs don't exist in the departments dataframe or are missing.

Left Join with Different Column Names

Sometimes, the column used for joining has different names in the two dataframes. In such cases, you can use left_on and right_on parameters:

python
import pandas as pd

# Create sample dataframes with different key column names
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['John', 'Sarah', 'Mike', 'Lisa'],
'city': ['New York', 'Boston', 'Chicago', 'Los Angeles']
})

orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'cust_id': [1, 3, 5, 6],
'amount': [100, 200, 150, 300]
})

# Display the dataframes
print("Customers DataFrame:")
print(customers)
print("\nOrders DataFrame:")
print(orders)

# Perform left join with different column names
result = customers.merge(orders, left_on='customer_id', right_on='cust_id', how='left')

# Display the result
print("\nResult of Left Join:")
print(result)

Output:

Customers DataFrame:
customer_id name city
0 1 John New York
1 2 Sarah Boston
2 3 Mike Chicago
3 4 Lisa Los Angeles

Orders DataFrame:
order_id cust_id amount
0 101 1 100
1 102 3 200
2 103 5 150
3 104 6 300

Result of Left Join:
customer_id name city order_id cust_id amount
0 1 John New York 101.0 1.0 100.0
1 2 Sarah Boston NaN NaN NaN
2 3 Mike Chicago 102.0 3.0 200.0
3 4 Lisa Los Angeles NaN NaN NaN

In this example, Sarah and Lisa didn't place any orders, so their order details are filled with NaN.

Left Join on Multiple Columns

You can also join dataframes based on multiple columns:

python
import pandas as pd

# Create sample dataframes
sales = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product_id': [1, 2, 1, 3],
'quantity': [10, 5, 8, 12]
})

products = pd.DataFrame({
'product_id': [1, 2, 3, 4],
'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
})

inventory = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product_id': [1, 2, 1, 4],
'stock': [100, 200, 90, 50]
})

# Display the dataframes
print("Sales DataFrame:")
print(sales)
print("\nProducts DataFrame:")
print(products)
print("\nInventory DataFrame:")
print(inventory)

# First merge sales with products
sales_with_products = sales.merge(products, on='product_id', how='left')

# Then merge with inventory based on date and product_id
result = sales_with_products.merge(
inventory,
on=['date', 'product_id'],
how='left'
)

# Display the result
print("\nFinal Result after Both Joins:")
print(result)

Output:

Sales DataFrame:
date product_id quantity
0 2023-01-01 1 10
1 2023-01-01 2 5
2 2023-01-02 1 8
3 2023-01-02 3 12

Products DataFrame:
product_id name category
0 1 Laptop Electronics
1 2 Mouse Accessories
2 3 Keyboard Accessories
3 4 Monitor Electronics

Inventory DataFrame:
date product_id stock
0 2023-01-01 1 100
1 2023-01-01 2 200
2 2023-01-02 1 90
3 2023-01-02 4 50

Final Result after Both Joins:
date product_id quantity name category stock
0 2023-01-01 1 10 Laptop Electronics 100.0
1 2023-01-01 2 5 Mouse Accessories 200.0
2 2023-01-02 1 8 Laptop Electronics 90.0
3 2023-01-02 3 12 Keyboard Accessories NaN

Notice that the last row has NaN for stock because there's no inventory record for product_id 3 on 2023-01-02.

Real-World Application: Analyzing Customer Orders

Let's look at a more practical example where we analyze customer orders and enhance our dataset with customer demographic information:

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
orders = pd.DataFrame({
'order_id': list(range(1, 11)),
'customer_id': [1, 2, 1, 3, 2, 4, 1, 5, 2, 3],
'order_date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
'amount': [120, 45, 33, 180, 67, 43, 88, 25, 91, 55]
})

customers = pd.DataFrame({
'customer_id': [1, 2, 3, 5, 6],
'name': ['Alice', 'Bob', 'Charlie', 'Eve', 'Frank'],
'age': [32, 45, 27, 35, 52],
'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Premium']
})

# Merge orders with customer information
complete_orders = orders.merge(customers, on='customer_id', how='left')

print("Complete Orders Dataset:")
print(complete_orders)

# Check if we have any orders without customer information
missing_customer_info = complete_orders[complete_orders['name'].isna()]
print("\nOrders with Missing Customer Information:")
print(missing_customer_info)

# Analyze orders by customer segment
segment_analysis = complete_orders.groupby('segment')['amount'].agg(['sum', 'mean', 'count']).reset_index()
print("\nOrder Analysis by Customer Segment:")
print(segment_analysis)

# Plot total order amounts by customer
customer_totals = complete_orders.groupby(['customer_id', 'name'])['amount'].sum().reset_index()
customer_totals = customer_totals.sort_values(by='amount', ascending=False)

plt.figure(figsize=(10, 5))
plt.bar(customer_totals['name'], customer_totals['amount'])
plt.title('Total Order Amount by Customer')
plt.xlabel('Customer')
plt.ylabel('Total Amount')
plt.xticks(rotation=45)
# Display chart code (in a real environment this would show the chart)
print("\nChart would display total order amount by customer")

Output:

Complete Orders Dataset:
order_id customer_id order_date amount name age segment
0 1 1 2023-01-01 120 Alice 32.0 Premium
1 2 2 2023-01-02 45 Bob 45.0 Standard
2 3 1 2023-01-03 33 Alice 32.0 Premium
3 4 3 2023-01-04 180 Charlie 27.0 Premium
4 5 2 2023-01-05 67 Bob 45.0 Standard
5 6 4 2023-01-06 43 NaN NaN NaN
6 7 1 2023-01-07 88 Alice 32.0 Premium
7 8 5 2023-01-08 25 Eve 35.0 Standard
8 9 2 2023-01-09 91 Bob 45.0 Standard
9 10 3 2023-01-10 55 Charlie 27.0 Premium

Orders with Missing Customer Information:
order_id customer_id order_date amount name age segment
5 6 4 2023-01-06 43 NaN NaN NaN

Order Analysis by Customer Segment:
segment sum mean count
0 Premium 476 119.00000 4
1 Standard 228 57.00000 4

In this example, we merged order data with customer information. Notice that order #6 has missing customer information because customer_id 4 doesn't exist in our customers dataframe. The analysis shows the value of having complete customer data to segment our analysis.

Handling Missing Values After a Left Join

After performing a left join, you may need to handle missing values in the merged dataframe:

python
import pandas as pd

# Create sample dataframes
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'value': ['A', 'B', 'C', 'D']
})

df2 = pd.DataFrame({
'id': [1, 2, 5],
'status': ['Active', 'Inactive', 'Active']
})

# Perform left join
result = df1.merge(df2, on='id', how='left')
print("Result after left join:")
print(result)

# 1. Fill missing values with a specific value
result_fill = result.copy()
result_fill['status'] = result_fill['status'].fillna('Unknown')
print("\nAfter filling NaN values:")
print(result_fill)

# 2. Drop rows with missing values
result_drop = result.copy()
result_drop = result_drop.dropna()
print("\nAfter dropping rows with NaN values:")
print(result_drop)

# 3. Create a new column indicating if the row had a match
result_flag = result.copy()
result_flag['has_status'] = result_flag['status'].notna()
print("\nWith flag for matched rows:")
print(result_flag)

Output:

Result after left join:
id value status
0 1 A Active
1 2 B Inactive
2 3 C NaN
3 4 D NaN

After filling NaN values:
id value status
0 1 A Active
1 2 B Inactive
2 3 C Unknown
3 4 D Unknown

After dropping rows with NaN values:
id value status
0 1 A Active
1 2 B Inactive

With flag for matched rows:
id value status has_status
0 1 A Active True
1 2 B Inactive True
2 3 C NaN False
3 4 D NaN False

When to Use Left Join

A left join is particularly useful when:

  1. You want to keep all records from your primary dataset, even if there are no matches in the secondary dataset
  2. You're enriching your main data with additional information
  3. You need to identify missing relationships (e.g., customers without orders)
  4. You're performing data analysis where all original records must be preserved

Left Join vs. Other Join Types

To better understand left join, here's a brief comparison with other join types:

  • Left Join: Returns all rows from the left table and matching rows from the right table
  • Right Join: Returns all rows from the right table and matching rows from the left table
  • Inner Join: Returns only matching rows from both tables
  • Outer Join: Returns all rows from both tables
python
import pandas as pd

# Create sample dataframes
df_left = pd.DataFrame({
'key': ['A', 'B', 'C'],
'value_left': [1, 2, 3]
})

df_right = pd.DataFrame({
'key': ['B', 'C', 'D'],
'value_right': [4, 5, 6]
})

# Compare different join types
left_join = pd.merge(df_left, df_right, on='key', how='left')
right_join = pd.merge(df_left, df_right, on='key', how='right')
inner_join = pd.merge(df_left, df_right, on='key', how='inner')
outer_join = pd.merge(df_left, df_right, on='key', how='outer')

print("Left Join Result:")
print(left_join)
print("\nRight Join Result:")
print(right_join)
print("\nInner Join Result:")
print(inner_join)
print("\nOuter Join Result:")
print(outer_join)

Output:

Left Join Result:
key value_left value_right
0 A 1 NaN
1 B 2 4.0
2 C 3 5.0

Right Join Result:
key value_left value_right
0 B 2.0 4
1 C 3.0 5
2 D NaN 6

Inner Join Result:
key value_left value_right
0 B 2 4
1 C 3 5

Outer Join Result:
key value_left value_right
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0

Summary

In this tutorial, you've learned:

  1. What a left join is and when to use it
  2. How to perform a left join using pandas' merge() function
  3. Joining dataframes with different key column names
  4. Merging based on multiple columns
  5. Handling missing values after a left join
  6. Real-world applications of left joins
  7. How left join compares to other join types

Left join is a fundamental operation in data analysis, allowing you to combine datasets while preserving all records from your primary dataset. It's particularly valuable for data enrichment and identifying missing relationships.

Additional Resources

Here are some resources to further expand your knowledge:

  1. Pandas Documentation on Merge, Join, and Concat
  2. SQL JOIN Types Explained (for understanding the concepts from a SQL perspective)

Exercises

To practice what you've learned:

  1. Create two dataframes representing students and their grades. Perform a left join to display all students and their grades, even if some students don't have grades.

  2. Using the customers and orders sample dataframes from earlier examples, find all customers who haven't placed any orders.

  3. Create a dataset with employee and department information. Use left join to find how many employees are in each department and identify departments with no employees.

  4. Download a real-world dataset from Kaggle and practice joining multiple dataframes to answer an interesting question about the data.

Happy coding!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)