Pandas Inner Join

In data analysis, you'll often need to combine data from multiple sources. Pandas provides powerful functionality for this through its merge operations, with inner join being one of the most commonly used approaches. This tutorial will help you understand and master inner joins in pandas.

What is an Inner Join?

An inner join combines data from two DataFrames based on matching values in specified columns, keeping only the rows where there is a match in both DataFrames. It's like finding the intersection between two sets.

Inner Join Visualization

Basic Inner Join Syntax

The basic syntax for performing an inner join in pandas is:

python
pd.merge(left_df, right_df, on='key_column', how='inner')

Where:

left_df and right_df are the DataFrames you want to join
on specifies the column(s) to join on
how='inner' specifies that you want an inner join (this is actually the default)

Simple Inner Join Example

Let's start with a simple example to understand the concept:

python
import pandas as pd

# Create sample DataFrames
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'department_id': [1, 2, 1, 3, 2]
})

departments = pd.DataFrame({
    'department_id': [1, 2, 4],
    'department_name': ['HR', 'Engineering', 'Marketing']
})

# Display the original DataFrames
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

# Perform inner join
result = pd.merge(employees, departments, on='department_id', how='inner')

print("\nResult after inner join:")
print(result)

Output:

Employees DataFrame:
   employee_id     name  department_id
0         101    Alice              1
1         102      Bob              2
2         103  Charlie              1
3         104    David              3
4         105      Eva              2

Departments DataFrame:
   department_id department_name
0             1             HR
1             2     Engineering
2             4       Marketing

Result after inner join:
   employee_id     name  department_id department_name
0         101    Alice              1             HR
1         103  Charlie              1             HR
2         102      Bob              2     Engineering
3         105      Eva              2     Engineering

In this example:

We have two DataFrames: employees and departments.
We perform an inner join on the department_id column.
Note that employee David (with department_id=3) is missing from the result because there's no matching department in the departments DataFrame.
Similarly, the Marketing department (department_id=4) is missing because no employee belongs to it.

Inner Join with Different Column Names

Sometimes, the columns you want to join on have different names in each DataFrame. In such cases, you can use left_on and right_on instead of on:

python
import pandas as pd

# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'customer_name': ['Alex', 'Brian', 'Catherine', 'Diana', 'Edward'],
    'city': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'cust_id': [2, 3, 3, 6, 7],
    'amount': [100, 150, 200, 50, 300]
})

# Display the original DataFrames
print("Customers DataFrame:")
print(customers)
print("\nOrders DataFrame:")
print(orders)

# Perform inner join with different column names
result = pd.merge(customers, orders, left_on='customer_id', right_on='cust_id', how='inner')

print("\nResult after inner join:")
print(result)

Output:

Customers DataFrame:
   customer_id customer_name         city
0            1          Alex     New York
1            2         Brian      Chicago
2            3     Catherine  Los Angeles
3            4         Diana      Houston
4            5        Edward      Chicago

Orders DataFrame:
   order_id  cust_id  amount
0      1001        2     100
1      1002        3     150
2      1003        3     200
3      1004        6      50
4      1005        7     300

Result after inner join:
   customer_id customer_name         city  order_id  cust_id  amount
0            2         Brian      Chicago      1001        2     100
1            3     Catherine  Los Angeles      1002        3     150
2            3     Catherine  Los Angeles      1003        3     200

Notice that:

Only customers with IDs 2 and 3 appear in the result because they're the only ones who placed orders.
Catherine (customer_id=3) appears twice because she placed two orders.

Inner Join on Multiple Columns

You can join on multiple columns by passing a list of column names:

python
import pandas as pd

# Create sample DataFrames
sales_2022 = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'category': ['A', 'B', 'A', 'C'],
    'sales_2022': [1000, 1500, 800, 2000]
})

sales_2023 = pd.DataFrame({
    'product_id': [101, 102, 105, 106],
    'category': ['A', 'B', 'B', 'C'],
    'sales_2023': [1200, 1400, 900, 2200]
})

# Display the original DataFrames
print("2022 Sales DataFrame:")
print(sales_2022)
print("\n2023 Sales DataFrame:")
print(sales_2023)

# Perform inner join on multiple columns
result = pd.merge(sales_2022, sales_2023, on=['product_id', 'category'], how='inner')

print("\nResult after inner join on multiple columns:")
print(result)

Output:

Sales DataFrame:
   product_id category  sales_2022
       101        A        1000
       102        B        1500
       103        A         800
       104        C        2000

Sales DataFrame:
   product_id category  sales_2023
       101        A        1200
       102        B        1400
       105        B         900
       106        C        2200

Result after inner join on multiple columns:
   product_id category  sales_2022  sales_2023
       101        A        1000        1200
       102        B        1500        1400

In this example:

We join on both product_id AND category.
Only products 101 and 102 appear in the result because they're the only ones that have the same product_id AND category in both DataFrames.

Using the DataFrame Method

Instead of using pd.merge(), you can also use the merge() method directly on a DataFrame:

python
# This is equivalent to pd.merge(employees, departments, ...)
result = employees.merge(departments, on='department_id', how='inner')

Both approaches produce the same result.

Real-World Application: Sales Analysis

Here's a more practical example showing how inner joins can be used for sales analysis:

python
import pandas as pd
import matplotlib.pyplot as plt

# Create sample DataFrames
sales = pd.DataFrame({
    'date': ['2023-01-15', '2023-01-20', '2023-02-05', '2023-02-10', '2023-03-01'],
    'product_id': [101, 102, 101, 103, 102],
    'units_sold': [5, 3, 7, 2, 4]
})

products = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Monitor'],
    'price': [1200, 800, 400, 300]
})

# Convert date to datetime
sales['date'] = pd.to_datetime(sales['date'])

# Perform inner join
combined_data = pd.merge(sales, products, on='product_id', how='inner')

# Add revenue column
combined_data['revenue'] = combined_data['units_sold'] * combined_data['price']

print("Combined sales data with product information:")
print(combined_data)

# Monthly sales analysis
combined_data['month'] = combined_data['date'].dt.strftime('%Y-%m')
monthly_sales = combined_data.groupby(['month', 'product_name'])['revenue'].sum().reset_index()

print("\nMonthly sales by product:")
print(monthly_sales)

# Plotting (optional but practical)
# pivot_data = monthly_sales.pivot(index='month', columns='product_name', values='revenue')
# pivot_data.plot(kind='bar', figsize=(10, 6))
# plt.title('Monthly Revenue by Product')
# plt.ylabel('Revenue ($)')
# plt.tight_layout()
# plt.show()

Output:

Combined sales data with product information:
        date  product_id  units_sold product_name  price  revenue
2023-01-15         101           5      Laptop   1200     6000
2023-01-20         102           3  Smartphone    800     2400
2023-02-05         101           7      Laptop   1200     8400
2023-02-10         103           2      Tablet    400      800
2023-03-01         102           4  Smartphone    800     3200

Monthly sales by product:
     month product_name  revenue
2023-01      Laptop     6000
2023-01  Smartphone     2400
2023-02      Laptop     8400
2023-02      Tablet      800
2023-03  Smartphone     3200

In this example, we:

Join sales data with product information using inner join
Calculate revenue for each sale
Perform monthly sales analysis by product
Could generate visualizations of the data (commented out code)

Common Pitfalls and Tips

Missing matches: Remember that inner joins only keep rows that have matches in both DataFrames. If you're getting fewer rows than expected, this might be the reason.
Duplicate column names: After a join, you might end up with duplicate column names if both DataFrames have columns with the same name (besides the join columns). To handle this situation, you can use the suffixes parameter:

python
pd.merge(df1, df2, on='common_column', how='inner', suffixes=('_left', '_right'))

Checking results: Always check the shape (dimensions) of your result to make sure the join operation worked as expected:

python
print(f"Left DataFrame shape: {left_df.shape}")
print(f"Right DataFrame shape: {right_df.shape}")
print(f"Result DataFrame shape: {result.shape}")

Memory usage: Joins can create large DataFrames that consume significant memory. If working with big data, consider using filters before joining.

Summary

Inner joins in pandas are a powerful way to combine related data from different DataFrames based on matching values in one or more columns. Key points to remember:

Inner joins keep only rows with matching values in both DataFrames
You can join on one or multiple columns
You can join on columns with different names using left_on and right_on
The default join type for pd.merge() is already 'inner'

Inner joins are fundamental for data analysis tasks involving multiple data sources, enabling you to create comprehensive datasets for further analysis and visualization.

Additional Resources

Practice Exercises

Create two DataFrames, one with student information and another with exam results, and join them to analyze which students took which exams.
Use the pd.merge() function to join two DataFrames of weather data and population data by city and analyze if there's a correlation between population and average temperature.
Join three DataFrames (customers, orders, products) to generate a complete sales report.
Compare the results of inner join, left join, right join, and outer join on the same pair of DataFrames to understand the differences.

Happy data wrangling with pandas!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

What is an Inner Join?​

Basic Inner Join Syntax​

Simple Inner Join Example​

Inner Join with Different Column Names​

Inner Join on Multiple Columns​

Using the DataFrame Method​

Real-World Application: Sales Analysis​

Common Pitfalls and Tips​

Summary​

Additional Resources​

Practice Exercises​

What is an Inner Join?

Basic Inner Join Syntax

Simple Inner Join Example

Inner Join with Different Column Names

Inner Join on Multiple Columns

Using the DataFrame Method

Real-World Application: Sales Analysis

Common Pitfalls and Tips

Summary

Additional Resources

Practice Exercises