Pandas Concat

In data analysis, you'll often need to combine multiple datasets into a single DataFrame. Pandas provides several methods for this, and one of the most versatile is the concat() function. This powerful tool lets you join DataFrames along either rows or columns, providing flexibility for various data manipulation needs.

Introduction to Pandas Concat

The pandas.concat() function allows you to glue together DataFrames along a particular axis. It's similar to the SQL UNION operation or stacking pieces of paper on top of each other (for row-wise concatenation) or placing them side by side (for column-wise concatenation).

The basic syntax is:

python
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None)

Where:

objs: A sequence or mapping of Series or DataFrame objects
axis: The axis to concatenate along (0 for rows, 1 for columns)
join: How to handle indexes on other axis ('inner' or 'outer')
ignore_index: Whether to discard the index and reset it
keys: Create a hierarchical index (MultiIndex) using the passed keys

Basic Concatenation Along Rows

Let's start with the most common use case: stacking DataFrames on top of each other.

python
import pandas as pd
import numpy as np

# Create sample DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate along rows (axis=0)
result = pd.concat([df1, df2])
print("DataFrames concatenated along rows:")
print(result)

Output:

DataFrames concatenated along rows:
    A   B
A0  B0
A1  B1
A2  B2
A3  B3
A4  B4
A5  B5

Notice that the index values are duplicated. If you want to reset the index, use ignore_index=True:

python
# Concatenate with reset index
result = pd.concat([df1, df2], ignore_index=True)
print("DataFrames concatenated with reset index:")
print(result)

Output:

DataFrames concatenated with reset index:
    A   B
A0  B0
A1  B1
A2  B2
A3  B3
A4  B4
A5  B5

Concatenating Along Columns

To concatenate DataFrames horizontally (side by side), use axis=1:

python
# Create sample DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
    'C': ['C0', 'C1', 'C2'],
    'D': ['D0', 'D1', 'D2']
})

# Concatenate along columns (axis=1)
result = pd.concat([df1, df2], axis=1)
print("DataFrames concatenated along columns:")
print(result)

Output:

DataFrames concatenated along columns:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2

Handling Different Indexes

When DataFrames have different indexes, the default behavior is to include all indexes (join='outer'):

python
# Create DataFrames with different indexes
df1 = pd.DataFrame({
    'A': ['A0', 'A1'],
    'B': ['B0', 'B1']
}, index=[0, 1])

df2 = pd.DataFrame({
    'A': ['A2', 'A3'],
    'B': ['B2', 'B3']
}, index=[1, 2])

# Concatenate with outer join (default)
result = pd.concat([df1, df2])
print("Concatenation with outer join (default):")
print(result)

Output:

Concatenation with outer join (default):
    A   B
0  A0  B0
1  A1  B1
1  A2  B2
2  A3  B3

If you only want to keep indexes that appear in all DataFrames, use join='inner':

python
# Concatenate with inner join
result = pd.concat([df1, df2], join='inner')
print("Concatenation with inner join:")
print(result)

Output:

Concatenation with inner join:
    A   B
0  A0  B0
1  A1  B1
1  A2  B2
2  A3  B3

Note that in this case, the output looks the same because both DataFrames have all the same columns. Let's try with different columns:

python
# Create DataFrames with different columns
df1 = pd.DataFrame({
    'A': ['A0', 'A1'],
    'B': ['B0', 'B1']
}, index=[0, 1])

df2 = pd.DataFrame({
    'B': ['B2', 'B3'],
    'C': ['C2', 'C3']
}, index=[1, 2])

# Concatenate with outer join
outer_result = pd.concat([df1, df2])
print("Concatenation with outer join:")
print(outer_result)

# Concatenate with inner join
inner_result = pd.concat([df1, df2], join='inner')
print("\nConcatenation with inner join:")
print(inner_result)

Output:

Concatenation with outer join:
     A    B    C
0   A0   B0  NaN
1   A1   B1  NaN
1  NaN   B2   C2
2  NaN   B3   C3

Concatenation with inner join:
     B
0   B0
1   B1
1   B2
2   B3

With join='outer', all columns from both DataFrames are included, with NaN values where data is missing. With join='inner', only columns that appear in all DataFrames are kept.

Using Keys to Create Hierarchical Indexes

The keys parameter lets you create a hierarchical index to track which DataFrame each row came from:

python
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate with keys
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print("Concatenation with keys:")
print(result)

Output:

Concatenation with keys:
        A   B
df1 0  A0  B0
    1  A1  B1
    2  A2  B2
df2 0  A3  B3
    1  A4  B4
    2  A5  B5

This creates a MultiIndex, which is useful for identifying the source of each row and enables powerful indexing capabilities:

python
# Access data from the first DataFrame
print("\nData from df1:")
print(result.loc['df1'])

Output:

Data from df1:
    A   B
0  A0  B0
1  A1  B1
2  A2  B2

Handling Missing Columns

When concatenating DataFrames with different columns, you may want to fill missing values:

python
df1 = pd.DataFrame({
    'A': ['A0', 'A1'],
    'B': ['B0', 'B1']
})

df2 = pd.DataFrame({
    'B': ['B2', 'B3'],
    'C': ['C2', 'C3']
})

# Concatenate and fill missing values
result = pd.concat([df1, df2], ignore_index=True).fillna('Missing')
print("Concatenated DataFrame with filled missing values:")
print(result)

Output:

Concatenated DataFrame with filled missing values:
      A   B       C
0    A0  B0  Missing
1    A1  B1  Missing
2  Missing  B2      C2
3  Missing  B3      C3

Real-World Example: Combining Multiple CSV Files

A common use case is combining multiple CSV files with the same structure into a single DataFrame:

python
# Let's create some example CSV files first
df1 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Sales': [100, 120, 90],
    'Region': ['North', 'North', 'North']
})

df2 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Sales': [80, 95, 110],
    'Region': ['South', 'South', 'South']
})

df3 = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Sales': [150, 160, 140],
    'Region': ['East', 'East', 'East']
})

# Save as CSV files (just for demonstration)
df1.to_csv('north_sales.csv', index=False)
df2.to_csv('south_sales.csv', index=False)
df3.to_csv('east_sales.csv', index=False)

# Function to combine multiple CSV files
def combine_csv_files(file_list):
    # Create an empty list to store DataFrames
    dfs = []
    
    # Read each CSV file and append to the list
    for file in file_list:
        df = pd.read_csv(file)
        dfs.append(df)
    
    # Concatenate all DataFrames in the list
    combined_df = pd.concat(dfs, ignore_index=True)
    return combined_df

# Combine the CSV files
files = ['north_sales.csv', 'south_sales.csv', 'east_sales.csv']
combined_sales = combine_csv_files(files)

print("Combined sales data:")
print(combined_sales)

Output:

Combined sales data:
         Date  Sales Region
2023-01-01    100  North
2023-01-02    120  North
2023-01-03     90  North
2023-01-01     80  South
2023-01-02     95  South
2023-01-03    110  South
2023-01-01    150   East
2023-01-02    160   East
2023-01-03    140   East

Now we can perform analysis on the combined dataset:

python
# Total sales by region
region_sales = combined_sales.groupby('Region')['Sales'].sum().reset_index()
print("\nTotal sales by region:")
print(region_sales)

# Average sales by date
date_avg_sales = combined_sales.groupby('Date')['Sales'].mean().reset_index()
print("\nAverage sales by date:")
print(date_avg_sales)

Output:

Total sales by region:
  Region  Sales
0   East    450
1  North    310
2  South    285

Average sales by date:
         Date  Sales
0  2023-01-01  110.0
1  2023-01-02  125.0
2  2023-01-03  113.3

Concatenating Different Types of Objects

Pandas concat() can work with both Series and DataFrames:

python
# Create a Series and a DataFrame
s1 = pd.Series(['X0', 'X1', 'X2'], name='X')
df = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

# Concatenate Series and DataFrame
result = pd.concat([s1, df], axis=1)
print("Series and DataFrame concatenated:")
print(result)

Output:

Series and DataFrame concatenated:
    X   A   B
0  X0  A0  B0
1  X1  A1  B1
2  X2  A2  B2

Summary

The pandas.concat() function is a flexible tool for combining DataFrames:

Use axis=0 (default) to stack DataFrames vertically (row-wise)
Use axis=1 to join DataFrames horizontally (column-wise)
Control how indexes are handled with join='outer' (default) or join='inner'
Reset indexes with ignore_index=True to avoid duplicates
Track the source of rows with the keys parameter
Handle missing values with .fillna() after concatenation

This function is essential for data preprocessing tasks like combining multiple data sources, merging time series data, or restructuring datasets for analysis.

Practice Exercises

Create three DataFrames with different columns and concatenate them vertically. How would you handle the missing values?
Load the built-in datasets iris and tips from the seaborn library. Create subsets of these datasets and practice concatenating them in different ways.
Challenge: Download monthly sales data files from an open data portal and write a script to combine them into a single DataFrame for analysis.

Additional Resources

Remember that concat() is just one of several data combining methods in pandas. For key-based joining and merging operations, explore the merge() and join() functions as well.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Pandas Concat​

Basic Concatenation Along Rows​

Concatenating Along Columns​

Handling Different Indexes​

Using Keys to Create Hierarchical Indexes​

Handling Missing Columns​

Real-World Example: Combining Multiple CSV Files​

Concatenating Different Types of Objects​

Summary​

Practice Exercises​

Additional Resources​

Introduction to Pandas Concat

Basic Concatenation Along Rows

Concatenating Along Columns

Handling Different Indexes

Using Keys to Create Hierarchical Indexes

Handling Missing Columns

Real-World Example: Combining Multiple CSV Files

Concatenating Different Types of Objects

Summary

Practice Exercises

Additional Resources