Skip to main content

Pandas Excel Import

Introduction

Excel is one of the most widely used tools for data storage and analysis in business and research. As a data analyst or scientist, you'll frequently need to work with data stored in Excel files. Pandas, Python's premier data analysis library, provides robust functionality for reading, manipulating, and analyzing data from Excel files.

In this tutorial, you'll learn how to import data from Excel files into pandas DataFrames, handle different Excel formats, work with multiple sheets, and apply various configuration options to customize your data import process.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python 3.x
  • pandas
  • openpyxl (for xlsx files)
  • xlrd (for older xls files)

You can install these packages using pip:

python
pip install pandas openpyxl xlrd

Basic Excel Import

Let's start with the most basic way to import an Excel file into pandas:

python
import pandas as pd

# Read an Excel file
df = pd.read_excel('data.xlsx')

# Display the first 5 rows
print(df.head())

Output:

   ID       Name  Age  Salary Department
0 1 John D 28 65000 IT
1 2 Sarah M. 34 71000 HR
2 3 Michael T 41 85000 Finance
3 4 Lisa R. 26 62000 Marketing
4 5 James K. 32 68000 IT

In this simple example, pandas automatically:

  • Uses the first row as column headers
  • Imports the first sheet of the Excel file
  • Assigns appropriate data types to columns

Specifying Sheet Names or Numbers

Excel workbooks often contain multiple sheets. You can specify which sheet to import by name or index:

python
# Import by sheet name
df_sales = pd.read_excel('company_data.xlsx', sheet_name='Sales')

# Import by sheet index (0-based)
df_expenses = pd.read_excel('company_data.xlsx', sheet_name=1)

# Import multiple sheets
sheets = pd.read_excel('company_data.xlsx', sheet_name=['Sales', 'Expenses'])

# Import all sheets
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)

# Access a specific sheet when importing multiple
print(sheets['Sales'].head())

Output:

  Product     Date  Amount  Sales Rep
0 Desk 1/5/2022 12000 John D
1 Chair 1/7/2022 5000 Sarah M
2 Shelf 1/8/2022 3500 Michael T
3 Lamp 1/12/2022 800 Lisa R
4 Desk 1/15/2022 12000 James K

Customizing Header Rows

Sometimes your Excel file might have metadata or titles before the actual header row, or you may want to specify custom column names:

python
# Skip rows at the top of the file
df = pd.read_excel('report.xlsx', skiprows=3)

# Use a specific row as the header (0-indexed)
df = pd.read_excel('report.xlsx', header=2)

# Provide custom column names
df = pd.read_excel('data.xlsx', names=['employee_id', 'employee_name', 'age', 'compensation', 'team'])

print(df.head())

Output:

   employee_id employee_name  age  compensation    team
0 1 John D 28 65000 IT
1 2 Sarah M. 34 71000 HR
2 3 Michael T 41 85000 Finance
3 4 Lisa R. 26 62000 Marketing
4 5 James K. 32 68000 IT

Handling Selected Columns and Rows

For large Excel files, you might only need specific columns or rows:

python
# Read only specific columns
df = pd.read_excel('employee_data.xlsx', usecols=['Name', 'Department', 'Salary'])

# Read only specific columns by index
df = pd.read_excel('employee_data.xlsx', usecols=[0, 1, 4])

# Limit the number of rows to import
df = pd.read_excel('large_file.xlsx', nrows=100)

print(df.head())

Output:

        Name Department  Salary
0 John D IT 65000
1 Sarah M. HR 71000
2 Michael T Finance 85000
3 Lisa R. Marketing 62000
4 James K. IT 68000

Setting Data Types

You can specify data types for columns during import to ensure proper parsing:

python
# Define data types for specific columns
dtype_dict = {
'ID': int,
'Name': str,
'Age': int,
'Salary': float,
'Hire Date': str # We'll convert this to datetime later
}

df = pd.read_excel('employee_data.xlsx', dtype=dtype_dict)

# Convert string date to datetime after import
df['Hire Date'] = pd.to_datetime(df['Hire Date'])

print(df.dtypes)

Output:

ID                   int64
Name object
Age int64
Salary float64
Hire Date datetime64[ns]
dtype: object

Handling Missing Values

Excel files often contain empty cells or special values that represent missing data:

python
# Specify values to treat as NaN
df = pd.read_excel(
'data_with_gaps.xlsx',
na_values=['N/A', 'Not Available', '-'],
keep_default_na=True
)

# Count missing values in each column
print(df.isna().sum())

# Fill missing values
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Department'].fillna('Not Assigned', inplace=True)

print(df.head())

Output:

ID             0
Name 0
Age 2
Salary 1
Department 3
dtype: int64

ID Name Age Salary Department
0 1 John D 28.0 65000 IT
1 2 Sarah M. 34.0 71000 HR
2 3 Michael T 41.0 85000 Finance
3 4 Lisa R. 26.0 NaN Not Assigned
4 5 James K. 32.0 68000 IT

Real-World Example: Sales Data Analysis

Let's put everything together in a real-world example. We'll import sales data from an Excel file, clean it, and perform some basic analysis:

python
import pandas as pd
import matplotlib.pyplot as plt

# Import the sales data
sales_df = pd.read_excel(
'quarterly_sales.xlsx',
sheet_name='2023',
skiprows=2, # Skip the title and description rows
usecols='B:F', # Only columns B through F contain data
na_values=['N/A', 'Not Reported'],
parse_dates=['Date'] # Parse the Date column as datetime
)

# Clean column names
sales_df.columns = [col.strip().lower().replace(' ', '_') for col in sales_df.columns]

# Filter for complete records only
clean_sales = sales_df.dropna(subset=['amount', 'product'])

# Group by product and calculate total sales
product_sales = clean_sales.groupby('product')['amount'].sum().sort_values(ascending=False)

# Display top selling products
print("Top Selling Products:")
print(product_sales.head())

# Calculate monthly sales
sales_df['month'] = sales_df['date'].dt.month_name()
monthly_sales = sales_df.groupby('month')['amount'].sum()

# Create a simple bar chart of monthly sales
plt.figure(figsize=(10, 6))
monthly_sales.plot(kind='bar')
plt.title('Monthly Sales - 2023')
plt.xlabel('Month')
plt.ylabel('Sales Amount ($)')
plt.tight_layout()
plt.savefig('monthly_sales.png')

Output:

Top Selling Products:
product
Laptop 542000
Smartphone 356000
Monitor 198000
Tablet 175000
Printer 124000
Name: amount, dtype: int64

Handling Excel File Formats and Engines

Pandas supports different Excel formats through various engines:

python
# For .xlsx files (Excel 2007+)
df_new = pd.read_excel('data.xlsx', engine='openpyxl')

# For .xls files (Excel 2003 and earlier)
df_old = pd.read_excel('legacy_data.xls', engine='xlrd')

# For writing to Excel formats
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
df.to_excel(writer, sheet_name='Processed Data')
df2.to_excel(writer, sheet_name='Summary', index=False)

Common Challenges and Solutions

Large Excel Files

For very large Excel files, you might encounter memory issues. Here are some solutions:

python
# Read the file in chunks
chunks = []
for chunk in pd.read_excel('huge_data.xlsx', chunksize=10000):
# Process each chunk
chunks.append(chunk)

# Concatenate all chunks
df = pd.concat(chunks)

Password Protected Files

For password-protected Excel files:

python
import pandas as pd
from openpyxl import load_workbook

# Load the workbook with password
wb = load_workbook('protected.xlsx', password='your_password')
sheet = wb.active

# Convert to pandas DataFrame
data = sheet.values
columns = next(data)
df = pd.DataFrame(data, columns=columns)

Excel Formulas

By default, pandas reads the results of Excel formulas, not the formulas themselves:

python
# Get calculated values (default)
df_values = pd.read_excel('calculations.xlsx')

# Get formulas instead of values
from openpyxl import load_workbook
wb = load_workbook('calculations.xlsx', data_only=False)
sheet = wb.active

Summary

In this tutorial, you've learned how to:

  • Import Excel files into pandas DataFrames
  • Work with specific sheets in multi-sheet workbooks
  • Customize header rows and column selection
  • Handle missing values and specify data types
  • Process real-world Excel data for analysis
  • Work with different Excel formats and engines
  • Address common challenges when importing Excel files

Excel files are a common data source in real-world data analysis projects, and pandas provides a flexible and powerful interface for working with this data. With the techniques covered in this tutorial, you should be well-equipped to handle a wide range of Excel import scenarios in your data analysis projects.

Additional Resources

Exercises

  1. Import an Excel file with multiple sheets and create a summary DataFrame that includes key statistics from each sheet.
  2. Create a function that cleans an Excel import by handling missing values, standardizing column names, and converting data types.
  3. Import an Excel file containing date information and create a time series visualization of the data.
  4. Combine data from multiple Excel files into a single DataFrame, handling potential mismatches in column names and data types.
  5. Create a script that reads an Excel file, processes the data, and writes the results back to a new Excel file with formatting and multiple worksheets.


If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)