Skip to main content

Pandas Inner Join

In data analysis, you'll often need to combine data from multiple sources. Pandas provides powerful functionality for this through its merge operations, with inner join being one of the most commonly used approaches. This tutorial will help you understand and master inner joins in pandas.

What is an Inner Join?

An inner join combines data from two DataFrames based on matching values in specified columns, keeping only the rows where there is a match in both DataFrames. It's like finding the intersection between two sets.

Inner Join Visualization

Basic Inner Join Syntax

The basic syntax for performing an inner join in pandas is:

python
pd.merge(left_df, right_df, on='key_column', how='inner')

Where:

  • left_df and right_df are the DataFrames you want to join
  • on specifies the column(s) to join on
  • how='inner' specifies that you want an inner join (this is actually the default)

Simple Inner Join Example

Let's start with a simple example to understand the concept:

python
import pandas as pd

# Create sample DataFrames
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'department_id': [1, 2, 1, 3, 2]
})

departments = pd.DataFrame({
'department_id': [1, 2, 4],
'department_name': ['HR', 'Engineering', 'Marketing']
})

# Display the original DataFrames
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

# Perform inner join
result = pd.merge(employees, departments, on='department_id', how='inner')

print("\nResult after inner join:")
print(result)

Output:

Employees DataFrame:
employee_id name department_id
0 101 Alice 1
1 102 Bob 2
2 103 Charlie 1
3 104 David 3
4 105 Eva 2

Departments DataFrame:
department_id department_name
0 1 HR
1 2 Engineering
2 4 Marketing

Result after inner join:
employee_id name department_id department_name
0 101 Alice 1 HR
1 103 Charlie 1 HR
2 102 Bob 2 Engineering
3 105 Eva 2 Engineering

In this example:

  • We have two DataFrames: employees and departments.
  • We perform an inner join on the department_id column.
  • Note that employee David (with department_id=3) is missing from the result because there's no matching department in the departments DataFrame.
  • Similarly, the Marketing department (department_id=4) is missing because no employee belongs to it.

Inner Join with Different Column Names

Sometimes, the columns you want to join on have different names in each DataFrame. In such cases, you can use left_on and right_on instead of on:

python
import pandas as pd

# Create sample DataFrames
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'customer_name': ['Alex', 'Brian', 'Catherine', 'Diana', 'Edward'],
'city': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

orders = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005],
'cust_id': [2, 3, 3, 6, 7],
'amount': [100, 150, 200, 50, 300]
})

# Display the original DataFrames
print("Customers DataFrame:")
print(customers)
print("\nOrders DataFrame:")
print(orders)

# Perform inner join with different column names
result = pd.merge(customers, orders, left_on='customer_id', right_on='cust_id', how='inner')

print("\nResult after inner join:")
print(result)

Output:

Customers DataFrame:
customer_id customer_name city
0 1 Alex New York
1 2 Brian Chicago
2 3 Catherine Los Angeles
3 4 Diana Houston
4 5 Edward Chicago

Orders DataFrame:
order_id cust_id amount
0 1001 2 100
1 1002 3 150
2 1003 3 200
3 1004 6 50
4 1005 7 300

Result after inner join:
customer_id customer_name city order_id cust_id amount
0 2 Brian Chicago 1001 2 100
1 3 Catherine Los Angeles 1002 3 150
2 3 Catherine Los Angeles 1003 3 200

Notice that:

  • Only customers with IDs 2 and 3 appear in the result because they're the only ones who placed orders.
  • Catherine (customer_id=3) appears twice because she placed two orders.

Inner Join on Multiple Columns

You can join on multiple columns by passing a list of column names:

python
import pandas as pd

# Create sample DataFrames
sales_2022 = pd.DataFrame({
'product_id': [101, 102, 103, 104],
'category': ['A', 'B', 'A', 'C'],
'sales_2022': [1000, 1500, 800, 2000]
})

sales_2023 = pd.DataFrame({
'product_id': [101, 102, 105, 106],
'category': ['A', 'B', 'B', 'C'],
'sales_2023': [1200, 1400, 900, 2200]
})

# Display the original DataFrames
print("2022 Sales DataFrame:")
print(sales_2022)
print("\n2023 Sales DataFrame:")
print(sales_2023)

# Perform inner join on multiple columns
result = pd.merge(sales_2022, sales_2023, on=['product_id', 'category'], how='inner')

print("\nResult after inner join on multiple columns:")
print(result)

Output:

2022 Sales DataFrame:
product_id category sales_2022
0 101 A 1000
1 102 B 1500
2 103 A 800
3 104 C 2000

2023 Sales DataFrame:
product_id category sales_2023
0 101 A 1200
1 102 B 1400
2 105 B 900
3 106 C 2200

Result after inner join on multiple columns:
product_id category sales_2022 sales_2023
0 101 A 1000 1200
1 102 B 1500 1400

In this example:

  • We join on both product_id AND category.
  • Only products 101 and 102 appear in the result because they're the only ones that have the same product_id AND category in both DataFrames.

Using the DataFrame Method

Instead of using pd.merge(), you can also use the merge() method directly on a DataFrame:

python
# This is equivalent to pd.merge(employees, departments, ...)
result = employees.merge(departments, on='department_id', how='inner')

Both approaches produce the same result.

Real-World Application: Sales Analysis

Here's a more practical example showing how inner joins can be used for sales analysis:

python
import pandas as pd
import matplotlib.pyplot as plt

# Create sample DataFrames
sales = pd.DataFrame({
'date': ['2023-01-15', '2023-01-20', '2023-02-05', '2023-02-10', '2023-03-01'],
'product_id': [101, 102, 101, 103, 102],
'units_sold': [5, 3, 7, 2, 4]
})

products = pd.DataFrame({
'product_id': [101, 102, 103, 104],
'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor'],
'price': [1200, 800, 400, 300]
})

# Convert date to datetime
sales['date'] = pd.to_datetime(sales['date'])

# Perform inner join
combined_data = pd.merge(sales, products, on='product_id', how='inner')

# Add revenue column
combined_data['revenue'] = combined_data['units_sold'] * combined_data['price']

print("Combined sales data with product information:")
print(combined_data)

# Monthly sales analysis
combined_data['month'] = combined_data['date'].dt.strftime('%Y-%m')
monthly_sales = combined_data.groupby(['month', 'product_name'])['revenue'].sum().reset_index()

print("\nMonthly sales by product:")
print(monthly_sales)

# Plotting (optional but practical)
# pivot_data = monthly_sales.pivot(index='month', columns='product_name', values='revenue')
# pivot_data.plot(kind='bar', figsize=(10, 6))
# plt.title('Monthly Revenue by Product')
# plt.ylabel('Revenue ($)')
# plt.tight_layout()
# plt.show()

Output:

Combined sales data with product information:
date product_id units_sold product_name price revenue
0 2023-01-15 101 5 Laptop 1200 6000
1 2023-01-20 102 3 Smartphone 800 2400
2 2023-02-05 101 7 Laptop 1200 8400
3 2023-02-10 103 2 Tablet 400 800
4 2023-03-01 102 4 Smartphone 800 3200

Monthly sales by product:
month product_name revenue
0 2023-01 Laptop 6000
1 2023-01 Smartphone 2400
2 2023-02 Laptop 8400
3 2023-02 Tablet 800
4 2023-03 Smartphone 3200

In this example, we:

  1. Join sales data with product information using inner join
  2. Calculate revenue for each sale
  3. Perform monthly sales analysis by product
  4. Could generate visualizations of the data (commented out code)

Common Pitfalls and Tips

  1. Missing matches: Remember that inner joins only keep rows that have matches in both DataFrames. If you're getting fewer rows than expected, this might be the reason.

  2. Duplicate column names: After a join, you might end up with duplicate column names if both DataFrames have columns with the same name (besides the join columns). To handle this situation, you can use the suffixes parameter:

python
pd.merge(df1, df2, on='common_column', how='inner', suffixes=('_left', '_right'))
  1. Checking results: Always check the shape (dimensions) of your result to make sure the join operation worked as expected:
python
print(f"Left DataFrame shape: {left_df.shape}")
print(f"Right DataFrame shape: {right_df.shape}")
print(f"Result DataFrame shape: {result.shape}")
  1. Memory usage: Joins can create large DataFrames that consume significant memory. If working with big data, consider using filters before joining.

Summary

Inner joins in pandas are a powerful way to combine related data from different DataFrames based on matching values in one or more columns. Key points to remember:

  • Inner joins keep only rows with matching values in both DataFrames
  • You can join on one or multiple columns
  • You can join on columns with different names using left_on and right_on
  • The default join type for pd.merge() is already 'inner'

Inner joins are fundamental for data analysis tasks involving multiple data sources, enabling you to create comprehensive datasets for further analysis and visualization.

Additional Resources

  1. Pandas Merge Documentation
  2. Pandas Merge, Join and Concatenate Guide

Practice Exercises

  1. Create two DataFrames, one with student information and another with exam results, and join them to analyze which students took which exams.

  2. Use the pd.merge() function to join two DataFrames of weather data and population data by city and analyze if there's a correlation between population and average temperature.

  3. Join three DataFrames (customers, orders, products) to generate a complete sales report.

  4. Compare the results of inner join, left join, right join, and outer join on the same pair of DataFrames to understand the differences.

Happy data wrangling with pandas!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)