Pandas Left Join
Introduction
When working with real-world data, you'll often need to combine multiple datasets for comprehensive analysis. Pandas' left join is a crucial operation that allows you to merge two dataframes while preserving all rows from the left dataframe, regardless of whether they have matching values in the right dataframe. This tutorial will walk you through left joining dataframes with pandas, a common operation in data cleaning and preparation.
Understanding Pandas Left Join
A left join returns all records from the left dataframe and only the matched records from the right dataframe. If there's no match from the right dataframe, the result will contain NaN
(Not a Number) values for columns from the right dataframe.
Let's visualize how left join works:
Left DF Right DF Result (Left Join)
A B B C A B C
1 X X 100 1 X 100
2 Y Y 200 2 Y 200
3 Z 3 Z NaN
Basic Left Join Syntax
In pandas, you can perform a left join using the merge()
function with the how='left'
parameter:
pd.merge(left_df, right_df, on='key_column', how='left')
Or using the dataframe's merge()
method:
left_df.merge(right_df, on='key_column', how='left')
Simple Left Join Example
Let's start with a basic example of left join:
import pandas as pd
# Create sample dataframes
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'department_id': [101, 102, 103, 104, None]
})
departments = pd.DataFrame({
'department_id': [101, 102, 105],
'department_name': ['HR', 'Engineering', 'Finance']
})
# Display the dataframes
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
# Perform left join
result = employees.merge(departments, on='department_id', how='left')
# Display the result
print("\nResult of Left Join:")
print(result)
Output:
Employees DataFrame:
employee_id name department_id
0 1 Alice 101.0
1 2 Bob 102.0
2 3 Charlie 103.0
3 4 David 104.0
4 5 Eva NaN
Departments DataFrame:
department_id department_name
0 101 HR
1 102 Engineering
2 105 Finance
Result of Left Join:
employee_id name department_id department_name
0 1 Alice 101.0 HR
1 2 Bob 102.0 Engineering
2 3 Charlie 103.0 NaN
3 4 David 104.0 NaN
4 5 Eva NaN NaN
Notice that all employees are included in the result, but Charlie, David, and Eva have NaN
for department_name
because their department IDs don't exist in the departments dataframe or are missing.
Left Join with Different Column Names
Sometimes, the column used for joining has different names in the two dataframes. In such cases, you can use left_on
and right_on
parameters:
import pandas as pd
# Create sample dataframes with different key column names
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['John', 'Sarah', 'Mike', 'Lisa'],
'city': ['New York', 'Boston', 'Chicago', 'Los Angeles']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'cust_id': [1, 3, 5, 6],
'amount': [100, 200, 150, 300]
})
# Display the dataframes
print("Customers DataFrame:")
print(customers)
print("\nOrders DataFrame:")
print(orders)
# Perform left join with different column names
result = customers.merge(orders, left_on='customer_id', right_on='cust_id', how='left')
# Display the result
print("\nResult of Left Join:")
print(result)
Output:
Customers DataFrame:
customer_id name city
0 1 John New York
1 2 Sarah Boston
2 3 Mike Chicago
3 4 Lisa Los Angeles
Orders DataFrame:
order_id cust_id amount
0 101 1 100
1 102 3 200
2 103 5 150
3 104 6 300
Result of Left Join:
customer_id name city order_id cust_id amount
0 1 John New York 101.0 1.0 100.0
1 2 Sarah Boston NaN NaN NaN
2 3 Mike Chicago 102.0 3.0 200.0
3 4 Lisa Los Angeles NaN NaN NaN
In this example, Sarah and Lisa didn't place any orders, so their order details are filled with NaN
.
Left Join on Multiple Columns
You can also join dataframes based on multiple columns:
import pandas as pd
# Create sample dataframes
sales = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product_id': [1, 2, 1, 3],
'quantity': [10, 5, 8, 12]
})
products = pd.DataFrame({
'product_id': [1, 2, 3, 4],
'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
})
inventory = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'product_id': [1, 2, 1, 4],
'stock': [100, 200, 90, 50]
})
# Display the dataframes
print("Sales DataFrame:")
print(sales)
print("\nProducts DataFrame:")
print(products)
print("\nInventory DataFrame:")
print(inventory)
# First merge sales with products
sales_with_products = sales.merge(products, on='product_id', how='left')
# Then merge with inventory based on date and product_id
result = sales_with_products.merge(
inventory,
on=['date', 'product_id'],
how='left'
)
# Display the result
print("\nFinal Result after Both Joins:")
print(result)
Output:
Sales DataFrame:
date product_id quantity
0 2023-01-01 1 10
1 2023-01-01 2 5
2 2023-01-02 1 8
3 2023-01-02 3 12
Products DataFrame:
product_id name category
0 1 Laptop Electronics
1 2 Mouse Accessories
2 3 Keyboard Accessories
3 4 Monitor Electronics
Inventory DataFrame:
date product_id stock
0 2023-01-01 1 100
1 2023-01-01 2 200
2 2023-01-02 1 90
3 2023-01-02 4 50
Final Result after Both Joins:
date product_id quantity name category stock
0 2023-01-01 1 10 Laptop Electronics 100.0
1 2023-01-01 2 5 Mouse Accessories 200.0
2 2023-01-02 1 8 Laptop Electronics 90.0
3 2023-01-02 3 12 Keyboard Accessories NaN
Notice that the last row has NaN
for stock because there's no inventory record for product_id 3 on 2023-01-02.
Real-World Application: Analyzing Customer Orders
Let's look at a more practical example where we analyze customer orders and enhance our dataset with customer demographic information:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample data
orders = pd.DataFrame({
'order_id': list(range(1, 11)),
'customer_id': [1, 2, 1, 3, 2, 4, 1, 5, 2, 3],
'order_date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
'amount': [120, 45, 33, 180, 67, 43, 88, 25, 91, 55]
})
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 5, 6],
'name': ['Alice', 'Bob', 'Charlie', 'Eve', 'Frank'],
'age': [32, 45, 27, 35, 52],
'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Premium']
})
# Merge orders with customer information
complete_orders = orders.merge(customers, on='customer_id', how='left')
print("Complete Orders Dataset:")
print(complete_orders)
# Check if we have any orders without customer information
missing_customer_info = complete_orders[complete_orders['name'].isna()]
print("\nOrders with Missing Customer Information:")
print(missing_customer_info)
# Analyze orders by customer segment
segment_analysis = complete_orders.groupby('segment')['amount'].agg(['sum', 'mean', 'count']).reset_index()
print("\nOrder Analysis by Customer Segment:")
print(segment_analysis)
# Plot total order amounts by customer
customer_totals = complete_orders.groupby(['customer_id', 'name'])['amount'].sum().reset_index()
customer_totals = customer_totals.sort_values(by='amount', ascending=False)
plt.figure(figsize=(10, 5))
plt.bar(customer_totals['name'], customer_totals['amount'])
plt.title('Total Order Amount by Customer')
plt.xlabel('Customer')
plt.ylabel('Total Amount')
plt.xticks(rotation=45)
# Display chart code (in a real environment this would show the chart)
print("\nChart would display total order amount by customer")
Output:
Complete Orders Dataset:
order_id customer_id order_date amount name age segment
0 1 1 2023-01-01 120 Alice 32.0 Premium
1 2 2 2023-01-02 45 Bob 45.0 Standard
2 3 1 2023-01-03 33 Alice 32.0 Premium
3 4 3 2023-01-04 180 Charlie 27.0 Premium
4 5 2 2023-01-05 67 Bob 45.0 Standard
5 6 4 2023-01-06 43 NaN NaN NaN
6 7 1 2023-01-07 88 Alice 32.0 Premium
7 8 5 2023-01-08 25 Eve 35.0 Standard
8 9 2 2023-01-09 91 Bob 45.0 Standard
9 10 3 2023-01-10 55 Charlie 27.0 Premium
Orders with Missing Customer Information:
order_id customer_id order_date amount name age segment
5 6 4 2023-01-06 43 NaN NaN NaN
Order Analysis by Customer Segment:
segment sum mean count
0 Premium 476 119.00000 4
1 Standard 228 57.00000 4
In this example, we merged order data with customer information. Notice that order #6 has missing customer information because customer_id 4 doesn't exist in our customers dataframe. The analysis shows the value of having complete customer data to segment our analysis.
Handling Missing Values After a Left Join
After performing a left join, you may need to handle missing values in the merged dataframe:
import pandas as pd
# Create sample dataframes
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'value': ['A', 'B', 'C', 'D']
})
df2 = pd.DataFrame({
'id': [1, 2, 5],
'status': ['Active', 'Inactive', 'Active']
})
# Perform left join
result = df1.merge(df2, on='id', how='left')
print("Result after left join:")
print(result)
# 1. Fill missing values with a specific value
result_fill = result.copy()
result_fill['status'] = result_fill['status'].fillna('Unknown')
print("\nAfter filling NaN values:")
print(result_fill)
# 2. Drop rows with missing values
result_drop = result.copy()
result_drop = result_drop.dropna()
print("\nAfter dropping rows with NaN values:")
print(result_drop)
# 3. Create a new column indicating if the row had a match
result_flag = result.copy()
result_flag['has_status'] = result_flag['status'].notna()
print("\nWith flag for matched rows:")
print(result_flag)
Output:
Result after left join:
id value status
0 1 A Active
1 2 B Inactive
2 3 C NaN
3 4 D NaN
After filling NaN values:
id value status
0 1 A Active
1 2 B Inactive
2 3 C Unknown
3 4 D Unknown
After dropping rows with NaN values:
id value status
0 1 A Active
1 2 B Inactive
With flag for matched rows:
id value status has_status
0 1 A Active True
1 2 B Inactive True
2 3 C NaN False
3 4 D NaN False
When to Use Left Join
A left join is particularly useful when:
- You want to keep all records from your primary dataset, even if there are no matches in the secondary dataset
- You're enriching your main data with additional information
- You need to identify missing relationships (e.g., customers without orders)
- You're performing data analysis where all original records must be preserved
Left Join vs. Other Join Types
To better understand left join, here's a brief comparison with other join types:
- Left Join: Returns all rows from the left table and matching rows from the right table
- Right Join: Returns all rows from the right table and matching rows from the left table
- Inner Join: Returns only matching rows from both tables
- Outer Join: Returns all rows from both tables
import pandas as pd
# Create sample dataframes
df_left = pd.DataFrame({
'key': ['A', 'B', 'C'],
'value_left': [1, 2, 3]
})
df_right = pd.DataFrame({
'key': ['B', 'C', 'D'],
'value_right': [4, 5, 6]
})
# Compare different join types
left_join = pd.merge(df_left, df_right, on='key', how='left')
right_join = pd.merge(df_left, df_right, on='key', how='right')
inner_join = pd.merge(df_left, df_right, on='key', how='inner')
outer_join = pd.merge(df_left, df_right, on='key', how='outer')
print("Left Join Result:")
print(left_join)
print("\nRight Join Result:")
print(right_join)
print("\nInner Join Result:")
print(inner_join)
print("\nOuter Join Result:")
print(outer_join)
Output:
Left Join Result:
key value_left value_right
0 A 1 NaN
1 B 2 4.0
2 C 3 5.0
Right Join Result:
key value_left value_right
0 B 2.0 4
1 C 3.0 5
2 D NaN 6
Inner Join Result:
key value_left value_right
0 B 2 4
1 C 3 5
Outer Join Result:
key value_left value_right
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0
Summary
In this tutorial, you've learned:
- What a left join is and when to use it
- How to perform a left join using pandas'
merge()
function - Joining dataframes with different key column names
- Merging based on multiple columns
- Handling missing values after a left join
- Real-world applications of left joins
- How left join compares to other join types
Left join is a fundamental operation in data analysis, allowing you to combine datasets while preserving all records from your primary dataset. It's particularly valuable for data enrichment and identifying missing relationships.
Additional Resources
Here are some resources to further expand your knowledge:
- Pandas Documentation on Merge, Join, and Concat
- SQL JOIN Types Explained (for understanding the concepts from a SQL perspective)
Exercises
To practice what you've learned:
-
Create two dataframes representing students and their grades. Perform a left join to display all students and their grades, even if some students don't have grades.
-
Using the customers and orders sample dataframes from earlier examples, find all customers who haven't placed any orders.
-
Create a dataset with employee and department information. Use left join to find how many employees are in each department and identify departments with no employees.
-
Download a real-world dataset from Kaggle and practice joining multiple dataframes to answer an interesting question about the data.
Happy coding!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)