Skip to main content

Pandas Concat

In data analysis, you'll often need to combine multiple datasets into a single DataFrame. Pandas provides several methods for this, and one of the most versatile is the concat() function. This powerful tool lets you join DataFrames along either rows or columns, providing flexibility for various data manipulation needs.

Introduction to Pandas Concat

The pandas.concat() function allows you to glue together DataFrames along a particular axis. It's similar to the SQL UNION operation or stacking pieces of paper on top of each other (for row-wise concatenation) or placing them side by side (for column-wise concatenation).

The basic syntax is:

python
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None)

Where:

  • objs: A sequence or mapping of Series or DataFrame objects
  • axis: The axis to concatenate along (0 for rows, 1 for columns)
  • join: How to handle indexes on other axis ('inner' or 'outer')
  • ignore_index: Whether to discard the index and reset it
  • keys: Create a hierarchical index (MultiIndex) using the passed keys

Basic Concatenation Along Rows

Let's start with the most common use case: stacking DataFrames on top of each other.

python
import pandas as pd
import numpy as np

# Create sample DataFrames
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})

# Concatenate along rows (axis=0)
result = pd.concat([df1, df2])
print("DataFrames concatenated along rows:")
print(result)

Output:

DataFrames concatenated along rows:
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5

Notice that the index values are duplicated. If you want to reset the index, use ignore_index=True:

python
# Concatenate with reset index
result = pd.concat([df1, df2], ignore_index=True)
print("DataFrames concatenated with reset index:")
print(result)

Output:

DataFrames concatenated with reset index:
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5

Concatenating Along Columns

To concatenate DataFrames horizontally (side by side), use axis=1:

python
# Create sample DataFrames
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']
})

# Concatenate along columns (axis=1)
result = pd.concat([df1, df2], axis=1)
print("DataFrames concatenated along columns:")
print(result)

Output:

DataFrames concatenated along columns:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2

Handling Different Indexes

When DataFrames have different indexes, the default behavior is to include all indexes (join='outer'):

python
# Create DataFrames with different indexes
df1 = pd.DataFrame({
'A': ['A0', 'A1'],
'B': ['B0', 'B1']
}, index=[0, 1])

df2 = pd.DataFrame({
'A': ['A2', 'A3'],
'B': ['B2', 'B3']
}, index=[1, 2])

# Concatenate with outer join (default)
result = pd.concat([df1, df2])
print("Concatenation with outer join (default):")
print(result)

Output:

Concatenation with outer join (default):
A B
0 A0 B0
1 A1 B1
1 A2 B2
2 A3 B3

If you only want to keep indexes that appear in all DataFrames, use join='inner':

python
# Concatenate with inner join
result = pd.concat([df1, df2], join='inner')
print("Concatenation with inner join:")
print(result)

Output:

Concatenation with inner join:
A B
0 A0 B0
1 A1 B1
1 A2 B2
2 A3 B3

Note that in this case, the output looks the same because both DataFrames have all the same columns. Let's try with different columns:

python
# Create DataFrames with different columns
df1 = pd.DataFrame({
'A': ['A0', 'A1'],
'B': ['B0', 'B1']
}, index=[0, 1])

df2 = pd.DataFrame({
'B': ['B2', 'B3'],
'C': ['C2', 'C3']
}, index=[1, 2])

# Concatenate with outer join
outer_result = pd.concat([df1, df2])
print("Concatenation with outer join:")
print(outer_result)

# Concatenate with inner join
inner_result = pd.concat([df1, df2], join='inner')
print("\nConcatenation with inner join:")
print(inner_result)

Output:

Concatenation with outer join:
A B C
0 A0 B0 NaN
1 A1 B1 NaN
1 NaN B2 C2
2 NaN B3 C3

Concatenation with inner join:
B
0 B0
1 B1
1 B2
2 B3

With join='outer', all columns from both DataFrames are included, with NaN values where data is missing. With join='inner', only columns that appear in all DataFrames are kept.

Using Keys to Create Hierarchical Indexes

The keys parameter lets you create a hierarchical index to track which DataFrame each row came from:

python
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})

# Concatenate with keys
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print("Concatenation with keys:")
print(result)

Output:

Concatenation with keys:
A B
df1 0 A0 B0
1 A1 B1
2 A2 B2
df2 0 A3 B3
1 A4 B4
2 A5 B5

This creates a MultiIndex, which is useful for identifying the source of each row and enables powerful indexing capabilities:

python
# Access data from the first DataFrame
print("\nData from df1:")
print(result.loc['df1'])

Output:

Data from df1:
A B
0 A0 B0
1 A1 B1
2 A2 B2

Handling Missing Columns

When concatenating DataFrames with different columns, you may want to fill missing values:

python
df1 = pd.DataFrame({
'A': ['A0', 'A1'],
'B': ['B0', 'B1']
})

df2 = pd.DataFrame({
'B': ['B2', 'B3'],
'C': ['C2', 'C3']
})

# Concatenate and fill missing values
result = pd.concat([df1, df2], ignore_index=True).fillna('Missing')
print("Concatenated DataFrame with filled missing values:")
print(result)

Output:

Concatenated DataFrame with filled missing values:
A B C
0 A0 B0 Missing
1 A1 B1 Missing
2 Missing B2 C2
3 Missing B3 C3

Real-World Example: Combining Multiple CSV Files

A common use case is combining multiple CSV files with the same structure into a single DataFrame:

python
# Let's create some example CSV files first
df1 = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'Sales': [100, 120, 90],
'Region': ['North', 'North', 'North']
})

df2 = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'Sales': [80, 95, 110],
'Region': ['South', 'South', 'South']
})

df3 = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'Sales': [150, 160, 140],
'Region': ['East', 'East', 'East']
})

# Save as CSV files (just for demonstration)
df1.to_csv('north_sales.csv', index=False)
df2.to_csv('south_sales.csv', index=False)
df3.to_csv('east_sales.csv', index=False)

# Function to combine multiple CSV files
def combine_csv_files(file_list):
# Create an empty list to store DataFrames
dfs = []

# Read each CSV file and append to the list
for file in file_list:
df = pd.read_csv(file)
dfs.append(df)

# Concatenate all DataFrames in the list
combined_df = pd.concat(dfs, ignore_index=True)
return combined_df

# Combine the CSV files
files = ['north_sales.csv', 'south_sales.csv', 'east_sales.csv']
combined_sales = combine_csv_files(files)

print("Combined sales data:")
print(combined_sales)

Output:

Combined sales data:
Date Sales Region
0 2023-01-01 100 North
1 2023-01-02 120 North
2 2023-01-03 90 North
3 2023-01-01 80 South
4 2023-01-02 95 South
5 2023-01-03 110 South
6 2023-01-01 150 East
7 2023-01-02 160 East
8 2023-01-03 140 East

Now we can perform analysis on the combined dataset:

python
# Total sales by region
region_sales = combined_sales.groupby('Region')['Sales'].sum().reset_index()
print("\nTotal sales by region:")
print(region_sales)

# Average sales by date
date_avg_sales = combined_sales.groupby('Date')['Sales'].mean().reset_index()
print("\nAverage sales by date:")
print(date_avg_sales)

Output:

Total sales by region:
Region Sales
0 East 450
1 North 310
2 South 285

Average sales by date:
Date Sales
0 2023-01-01 110.0
1 2023-01-02 125.0
2 2023-01-03 113.3

Concatenating Different Types of Objects

Pandas concat() can work with both Series and DataFrames:

python
# Create a Series and a DataFrame
s1 = pd.Series(['X0', 'X1', 'X2'], name='X')
df = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

# Concatenate Series and DataFrame
result = pd.concat([s1, df], axis=1)
print("Series and DataFrame concatenated:")
print(result)

Output:

Series and DataFrame concatenated:
X A B
0 X0 A0 B0
1 X1 A1 B1
2 X2 A2 B2

Summary

The pandas.concat() function is a flexible tool for combining DataFrames:

  • Use axis=0 (default) to stack DataFrames vertically (row-wise)
  • Use axis=1 to join DataFrames horizontally (column-wise)
  • Control how indexes are handled with join='outer' (default) or join='inner'
  • Reset indexes with ignore_index=True to avoid duplicates
  • Track the source of rows with the keys parameter
  • Handle missing values with .fillna() after concatenation

This function is essential for data preprocessing tasks like combining multiple data sources, merging time series data, or restructuring datasets for analysis.

Practice Exercises

  1. Create three DataFrames with different columns and concatenate them vertically. How would you handle the missing values?

  2. Load the built-in datasets iris and tips from the seaborn library. Create subsets of these datasets and practice concatenating them in different ways.

  3. Challenge: Download monthly sales data files from an open data portal and write a script to combine them into a single DataFrame for analysis.

Additional Resources

Remember that concat() is just one of several data combining methods in pandas. For key-based joining and merging operations, explore the merge() and join() functions as well.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)