Pandas Left Join

Introduction

When working with real-world data, you'll often need to combine multiple datasets for comprehensive analysis. Pandas' left join is a crucial operation that allows you to merge two dataframes while preserving all rows from the left dataframe, regardless of whether they have matching values in the right dataframe. This tutorial will walk you through left joining dataframes with pandas, a common operation in data cleaning and preparation.

Understanding Pandas Left Join

A left join returns all records from the left dataframe and only the matched records from the right dataframe. If there's no match from the right dataframe, the result will contain NaN (Not a Number) values for columns from the right dataframe.

Let's visualize how left join works:

Left DF    Right DF       Result (Left Join)
A  B       B  C          A  B    C
1  X       X  100        1  X    100
2  Y       Y  200        2  Y    200
3  Z                     3  Z    NaN

Basic Left Join Syntax

In pandas, you can perform a left join using the merge() function with the how='left' parameter:

pd.merge(left_df, right_df, on='key_column', how='left')

Or using the dataframe's merge() method:

left_df.merge(right_df, on='key_column', how='left')

Simple Left Join Example

Let's start with a basic example of left join:

import pandas as pd

# Create sample dataframes
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'department_id': [101, 102, 103, 104, None]
})

departments = pd.DataFrame({
    'department_id': [101, 102, 105],
    'department_name': ['HR', 'Engineering', 'Finance']
})

# Display the dataframes
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

# Perform left join
result = employees.merge(departments, on='department_id', how='left')

# Display the result
print("\nResult of Left Join:")
print(result)

Output:

Employees DataFrame:
   employee_id    name  department_id
0           1   Alice          101.0
1           2     Bob          102.0
2           3 Charlie          103.0
3           4   David          104.0
4           5     Eva            NaN

Departments DataFrame:
   department_id department_name
0           101              HR
1           102      Engineering
2           105          Finance

Result of Left Join:
   employee_id    name  department_id department_name
0           1   Alice          101.0              HR
1           2     Bob          102.0      Engineering
2           3 Charlie          103.0             NaN
3           4   David          104.0             NaN
4           5     Eva            NaN             NaN

Notice that all employees are included in the result, but Charlie, David, and Eva have NaN for department_name because their department IDs don't exist in the departments dataframe or are missing.

Left Join with Different Column Names

Sometimes, the column used for joining has different names in the two dataframes. In such cases, you can use left_on and right_on parameters:

import pandas as pd

# Create sample dataframes with different key column names
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John', 'Sarah', 'Mike', 'Lisa'],
    'city': ['New York', 'Boston', 'Chicago', 'Los Angeles']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'cust_id': [1, 3, 5, 6],
    'amount': [100, 200, 150, 300]
})

# Display the dataframes
print("Customers DataFrame:")
print(customers)
print("\nOrders DataFrame:")
print(orders)

# Perform left join with different column names
result = customers.merge(orders, left_on='customer_id', right_on='cust_id', how='left')

# Display the result
print("\nResult of Left Join:")
print(result)

Output:

Customers DataFrame:
   customer_id   name         city
0           1   John     New York
1           2  Sarah       Boston
2           3   Mike      Chicago
3           4   Lisa  Los Angeles

Orders DataFrame:
   order_id  cust_id  amount
0       101        1     100
1       102        3     200
2       103        5     150
3       104        6     300

Result of Left Join:
   customer_id   name         city  order_id  cust_id  amount
0           1   John     New York     101.0      1.0   100.0
1           2  Sarah       Boston       NaN      NaN     NaN
2           3   Mike      Chicago     102.0      3.0   200.0
3           4   Lisa  Los Angeles       NaN      NaN     NaN

In this example, Sarah and Lisa didn't place any orders, so their order details are filled with NaN.

Left Join on Multiple Columns

You can also join dataframes based on multiple columns:

import pandas as pd

# Create sample dataframes
sales = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product_id': [1, 2, 1, 3],
    'quantity': [10, 5, 8, 12]
})

products = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
})

inventory = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'product_id': [1, 2, 1, 4],
    'stock': [100, 200, 90, 50]
})

# Display the dataframes
print("Sales DataFrame:")
print(sales)
print("\nProducts DataFrame:")
print(products)
print("\nInventory DataFrame:")
print(inventory)

# First merge sales with products
sales_with_products = sales.merge(products, on='product_id', how='left')

# Then merge with inventory based on date and product_id
result = sales_with_products.merge(
    inventory, 
    on=['date', 'product_id'], 
    how='left'
)

# Display the result
print("\nFinal Result after Both Joins:")
print(result)

Output:

Sales DataFrame:
         date  product_id  quantity
0  2023-01-01           1        10
1  2023-01-01           2         5
2  2023-01-02           1         8
3  2023-01-02           3        12

Products DataFrame:
   product_id      name      category
0           1    Laptop   Electronics
1           2     Mouse  Accessories
2           3  Keyboard  Accessories
3           4   Monitor   Electronics

Inventory DataFrame:
         date  product_id  stock
0  2023-01-01           1    100
1  2023-01-01           2    200
2  2023-01-02           1     90
3  2023-01-02           4     50

Final Result after Both Joins:
         date  product_id  quantity      name      category  stock
0  2023-01-01           1        10    Laptop   Electronics  100.0
1  2023-01-01           2         5     Mouse  Accessories  200.0
2  2023-01-02           1         8    Laptop   Electronics   90.0
3  2023-01-02           3        12  Keyboard  Accessories    NaN

Notice that the last row has NaN for stock because there's no inventory record for product_id 3 on 2023-01-02.

Real-World Application: Analyzing Customer Orders

Let's look at a more practical example where we analyze customer orders and enhance our dataset with customer demographic information:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
orders = pd.DataFrame({
    'order_id': list(range(1, 11)),
    'customer_id': [1, 2, 1, 3, 2, 4, 1, 5, 2, 3],
    'order_date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'amount': [120, 45, 33, 180, 67, 43, 88, 25, 91, 55]
})

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 5, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'Eve', 'Frank'],
    'age': [32, 45, 27, 35, 52],
    'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Premium']
})

# Merge orders with customer information
complete_orders = orders.merge(customers, on='customer_id', how='left')

print("Complete Orders Dataset:")
print(complete_orders)

# Check if we have any orders without customer information
missing_customer_info = complete_orders[complete_orders['name'].isna()]
print("\nOrders with Missing Customer Information:")
print(missing_customer_info)

# Analyze orders by customer segment
segment_analysis = complete_orders.groupby('segment')['amount'].agg(['sum', 'mean', 'count']).reset_index()
print("\nOrder Analysis by Customer Segment:")
print(segment_analysis)

# Plot total order amounts by customer
customer_totals = complete_orders.groupby(['customer_id', 'name'])['amount'].sum().reset_index()
customer_totals = customer_totals.sort_values(by='amount', ascending=False)

plt.figure(figsize=(10, 5))
plt.bar(customer_totals['name'], customer_totals['amount'])
plt.title('Total Order Amount by Customer')
plt.xlabel('Customer')
plt.ylabel('Total Amount')
plt.xticks(rotation=45)
# Display chart code (in a real environment this would show the chart)
print("\nChart would display total order amount by customer")

Output:

Complete Orders Dataset:
   order_id  customer_id order_date  amount     name   age    segment
0         1            1 2023-01-01     120    Alice  32.0    Premium
1         2            2 2023-01-02      45      Bob  45.0   Standard
2         3            1 2023-01-03      33    Alice  32.0    Premium
3         4            3 2023-01-04     180  Charlie  27.0    Premium
4         5            2 2023-01-05      67      Bob  45.0   Standard
5         6            4 2023-01-06      43      NaN   NaN        NaN
6         7            1 2023-01-07      88    Alice  32.0    Premium
7         8            5 2023-01-08      25      Eve  35.0   Standard
8         9            2 2023-01-09      91      Bob  45.0   Standard
9        10            3 2023-01-10      55  Charlie  27.0    Premium

Orders with Missing Customer Information:
   order_id  customer_id order_date  amount  name  age segment
5         6            4 2023-01-06      43   NaN  NaN     NaN

Order Analysis by Customer Segment:
     segment   sum       mean  count
0    Premium  476  119.00000      4
1   Standard  228   57.00000      4

In this example, we merged order data with customer information. Notice that order #6 has missing customer information because customer_id 4 doesn't exist in our customers dataframe. The analysis shows the value of having complete customer data to segment our analysis.

Handling Missing Values After a Left Join

After performing a left join, you may need to handle missing values in the merged dataframe:

import pandas as pd

# Create sample dataframes
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': ['A', 'B', 'C', 'D']
})

df2 = pd.DataFrame({
    'id': [1, 2, 5],
    'status': ['Active', 'Inactive', 'Active']
})

# Perform left join
result = df1.merge(df2, on='id', how='left')
print("Result after left join:")
print(result)

# 1. Fill missing values with a specific value
result_fill = result.copy()
result_fill['status'] = result_fill['status'].fillna('Unknown')
print("\nAfter filling NaN values:")
print(result_fill)

# 2. Drop rows with missing values
result_drop = result.copy()
result_drop = result_drop.dropna()
print("\nAfter dropping rows with NaN values:")
print(result_drop)

# 3. Create a new column indicating if the row had a match
result_flag = result.copy()
result_flag['has_status'] = result_flag['status'].notna()
print("\nWith flag for matched rows:")
print(result_flag)

Output:

Result after left join:
   id value    status
0   1     A    Active
1   2     B  Inactive
2   3     C       NaN
3   4     D       NaN

After filling NaN values:
   id value    status
0   1     A    Active
1   2     B  Inactive
2   3     C   Unknown
3   4     D   Unknown

After dropping rows with NaN values:
   id value    status
0   1     A    Active
1   2     B  Inactive

With flag for matched rows:
   id value    status has_status
0   1     A    Active       True
1   2     B  Inactive       True
2   3     C       NaN      False
3   4     D       NaN      False

When to Use Left Join

A left join is particularly useful when:

You want to keep all records from your primary dataset, even if there are no matches in the secondary dataset
You're enriching your main data with additional information
You need to identify missing relationships (e.g., customers without orders)
You're performing data analysis where all original records must be preserved

Left Join vs. Other Join Types

To better understand left join, here's a brief comparison with other join types:

Left Join: Returns all rows from the left table and matching rows from the right table
Right Join: Returns all rows from the right table and matching rows from the left table
Inner Join: Returns only matching rows from both tables
Outer Join: Returns all rows from both tables

import pandas as pd

# Create sample dataframes
df_left = pd.DataFrame({
    'key': ['A', 'B', 'C'],
    'value_left': [1, 2, 3]
})

df_right = pd.DataFrame({
    'key': ['B', 'C', 'D'],
    'value_right': [4, 5, 6]
})

# Compare different join types
left_join = pd.merge(df_left, df_right, on='key', how='left')
right_join = pd.merge(df_left, df_right, on='key', how='right')
inner_join = pd.merge(df_left, df_right, on='key', how='inner')
outer_join = pd.merge(df_left, df_right, on='key', how='outer')

print("Left Join Result:")
print(left_join)
print("\nRight Join Result:")
print(right_join)
print("\nInner Join Result:")
print(inner_join)
print("\nOuter Join Result:")
print(outer_join)

Output:

Left Join Result:
  key  value_left  value_right
0   A           1         NaN
1   B           2         4.0
2   C           3         5.0

Right Join Result:
  key  value_left  value_right
0   B         2.0           4
1   C         3.0           5
2   D         NaN           6

Inner Join Result:
  key  value_left  value_right
0   B           2           4
1   C           3           5

Outer Join Result:
  key  value_left  value_right
0   A         1.0         NaN
1   B         2.0         4.0
2   C         3.0         5.0
3   D         NaN         6.0

Summary

In this tutorial, you've learned:

What a left join is and when to use it
How to perform a left join using pandas' merge() function
Joining dataframes with different key column names
Merging based on multiple columns
Handling missing values after a left join
Real-world applications of left joins
How left join compares to other join types

Left join is a fundamental operation in data analysis, allowing you to combine datasets while preserving all records from your primary dataset. It's particularly valuable for data enrichment and identifying missing relationships.

Additional Resources

Here are some resources to further expand your knowledge:

Pandas Documentation on Merge, Join, and Concat
SQL JOIN Types Explained (for understanding the concepts from a SQL perspective)

Exercises

To practice what you've learned:

Create two dataframes representing students and their grades. Perform a left join to display all students and their grades, even if some students don't have grades.
Using the customers and orders sample dataframes from earlier examples, find all customers who haven't placed any orders.
Create a dataset with employee and department information. Use left join to find how many employees are in each department and identify departments with no employees.
Download a real-world dataset from Kaggle and practice joining multiple dataframes to answer an interesting question about the data.

Happy coding!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Pandas Left Join​

Basic Left Join Syntax​

Simple Left Join Example​

Left Join with Different Column Names​

Left Join on Multiple Columns​

Real-World Application: Analyzing Customer Orders​

Handling Missing Values After a Left Join​

When to Use Left Join​

Left Join vs. Other Join Types​

Summary​

Additional Resources​

Exercises​

Introduction

Understanding Pandas Left Join

Basic Left Join Syntax

Simple Left Join Example

Left Join with Different Column Names

Left Join on Multiple Columns

Real-World Application: Analyzing Customer Orders

Handling Missing Values After a Left Join

When to Use Left Join

Left Join vs. Other Join Types

Summary

Additional Resources

Exercises