Pandas Excel Import

Introduction

Excel is one of the most widely used tools for data storage and analysis in business and research. As a data analyst or scientist, you'll frequently need to work with data stored in Excel files. Pandas, Python's premier data analysis library, provides robust functionality for reading, manipulating, and analyzing data from Excel files.

In this tutorial, you'll learn how to import data from Excel files into pandas DataFrames, handle different Excel formats, work with multiple sheets, and apply various configuration options to customize your data import process.

Prerequisites

Before we begin, make sure you have the following installed:

Python 3.x
pandas
openpyxl (for xlsx files)
xlrd (for older xls files)

You can install these packages using pip:

pip install pandas openpyxl xlrd

Basic Excel Import

Let's start with the most basic way to import an Excel file into pandas:

import pandas as pd

# Read an Excel file
df = pd.read_excel('data.xlsx')

# Display the first 5 rows
print(df.head())

Output:

   ID       Name  Age  Salary Department
 1     John D   28   65000         IT
 2   Sarah M.   34   71000         HR
 3  Michael T   41   85000   Finance
 4    Lisa R.   26   62000  Marketing
 5   James K.   32   68000         IT

In this simple example, pandas automatically:

Uses the first row as column headers
Imports the first sheet of the Excel file
Assigns appropriate data types to columns

Specifying Sheet Names or Numbers

Excel workbooks often contain multiple sheets. You can specify which sheet to import by name or index:

# Import by sheet name
df_sales = pd.read_excel('company_data.xlsx', sheet_name='Sales')

# Import by sheet index (0-based)
df_expenses = pd.read_excel('company_data.xlsx', sheet_name=1)

# Import multiple sheets
sheets = pd.read_excel('company_data.xlsx', sheet_name=['Sales', 'Expenses'])

# Import all sheets
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)

# Access a specific sheet when importing multiple
print(sheets['Sales'].head())

Output:

  Product     Date  Amount  Sales Rep
  Desk  1/5/2022   12000    John D
 Chair  1/7/2022    5000   Sarah M
 Shelf  1/8/2022    3500  Michael T
  Lamp  1/12/2022    800    Lisa R
  Desk  1/15/2022   12000   James K

Customizing Header Rows

Sometimes your Excel file might have metadata or titles before the actual header row, or you may want to specify custom column names:

# Skip rows at the top of the file
df = pd.read_excel('report.xlsx', skiprows=3)

# Use a specific row as the header (0-indexed)
df = pd.read_excel('report.xlsx', header=2)

# Provide custom column names
df = pd.read_excel('data.xlsx', names=['employee_id', 'employee_name', 'age', 'compensation', 'team'])

print(df.head())

Output:

   employee_id employee_name  age  compensation    team
          1       John D   28         65000      IT
          2     Sarah M.   34         71000      HR
          3    Michael T   41         85000  Finance
          4      Lisa R.   26         62000  Marketing
          5     James K.   32         68000      IT

Handling Selected Columns and Rows

For large Excel files, you might only need specific columns or rows:

# Read only specific columns
df = pd.read_excel('employee_data.xlsx', usecols=['Name', 'Department', 'Salary'])

# Read only specific columns by index
df = pd.read_excel('employee_data.xlsx', usecols=[0, 1, 4])

# Limit the number of rows to import
df = pd.read_excel('large_file.xlsx', nrows=100)

print(df.head())

Output:

        Name Department  Salary
   John D         IT   65000
 Sarah M.         HR   71000
Michael T    Finance   85000
  Lisa R.  Marketing   62000
 James K.         IT   68000

Setting Data Types

You can specify data types for columns during import to ensure proper parsing:

# Define data types for specific columns
dtype_dict = {
    'ID': int,
    'Name': str,
    'Age': int,
    'Salary': float,
    'Hire Date': str  # We'll convert this to datetime later
}

df = pd.read_excel('employee_data.xlsx', dtype=dtype_dict)

# Convert string date to datetime after import
df['Hire Date'] = pd.to_datetime(df['Hire Date'])

print(df.dtypes)

Output:

ID                   int64
Name                object
Age                  int64
Salary             float64
Hire Date    datetime64[ns]
dtype: object

Handling Missing Values

Excel files often contain empty cells or special values that represent missing data:

# Specify values to treat as NaN
df = pd.read_excel(
    'data_with_gaps.xlsx',
    na_values=['N/A', 'Not Available', '-'],
    keep_default_na=True
)

# Count missing values in each column
print(df.isna().sum())

# Fill missing values
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Department'].fillna('Not Assigned', inplace=True)

print(df.head())

Output:

ID             0
Name           0
Age            2
Salary         1
Department     3
dtype: int64

   ID       Name   Age  Salary    Department
0   1     John D  28.0   65000            IT
1   2   Sarah M.  34.0   71000            HR
2   3  Michael T  41.0   85000       Finance
3   4    Lisa R.  26.0     NaN  Not Assigned
4   5   James K.  32.0   68000            IT

Real-World Example: Sales Data Analysis

Let's put everything together in a real-world example. We'll import sales data from an Excel file, clean it, and perform some basic analysis:

import pandas as pd
import matplotlib.pyplot as plt

# Import the sales data
sales_df = pd.read_excel(
    'quarterly_sales.xlsx',
    sheet_name='2023',
    skiprows=2,  # Skip the title and description rows
    usecols='B:F',  # Only columns B through F contain data
    na_values=['N/A', 'Not Reported'],
    parse_dates=['Date']  # Parse the Date column as datetime
)

# Clean column names
sales_df.columns = [col.strip().lower().replace(' ', '_') for col in sales_df.columns]

# Filter for complete records only
clean_sales = sales_df.dropna(subset=['amount', 'product'])

# Group by product and calculate total sales
product_sales = clean_sales.groupby('product')['amount'].sum().sort_values(ascending=False)

# Display top selling products
print("Top Selling Products:")
print(product_sales.head())

# Calculate monthly sales
sales_df['month'] = sales_df['date'].dt.month_name()
monthly_sales = sales_df.groupby('month')['amount'].sum()

# Create a simple bar chart of monthly sales
plt.figure(figsize=(10, 6))
monthly_sales.plot(kind='bar')
plt.title('Monthly Sales - 2023')
plt.xlabel('Month')
plt.ylabel('Sales Amount ($)')
plt.tight_layout()
plt.savefig('monthly_sales.png')

Output:

Top Selling Products:
product
Laptop        542000
Smartphone    356000
Monitor       198000
Tablet        175000
Printer       124000
Name: amount, dtype: int64

Handling Excel File Formats and Engines

Pandas supports different Excel formats through various engines:

# For .xlsx files (Excel 2007+)
df_new = pd.read_excel('data.xlsx', engine='openpyxl')

# For .xls files (Excel 2003 and earlier)
df_old = pd.read_excel('legacy_data.xls', engine='xlrd')

# For writing to Excel formats
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Processed Data')
    df2.to_excel(writer, sheet_name='Summary', index=False)

Common Challenges and Solutions

Large Excel Files

For very large Excel files, you might encounter memory issues. Here are some solutions:

# Read the file in chunks
chunks = []
for chunk in pd.read_excel('huge_data.xlsx', chunksize=10000):
    # Process each chunk
    chunks.append(chunk)

# Concatenate all chunks
df = pd.concat(chunks)

Password Protected Files

For password-protected Excel files:

import pandas as pd
from openpyxl import load_workbook

# Load the workbook with password
wb = load_workbook('protected.xlsx', password='your_password')
sheet = wb.active

# Convert to pandas DataFrame
data = sheet.values
columns = next(data)
df = pd.DataFrame(data, columns=columns)

Excel Formulas

By default, pandas reads the results of Excel formulas, not the formulas themselves:

# Get calculated values (default)
df_values = pd.read_excel('calculations.xlsx')

# Get formulas instead of values
from openpyxl import load_workbook
wb = load_workbook('calculations.xlsx', data_only=False)
sheet = wb.active

Summary

In this tutorial, you've learned how to:

Import Excel files into pandas DataFrames
Work with specific sheets in multi-sheet workbooks
Customize header rows and column selection
Handle missing values and specify data types
Process real-world Excel data for analysis
Work with different Excel formats and engines
Address common challenges when importing Excel files

Excel files are a common data source in real-world data analysis projects, and pandas provides a flexible and powerful interface for working with this data. With the techniques covered in this tutorial, you should be well-equipped to handle a wide range of Excel import scenarios in your data analysis projects.

Additional Resources

Exercises

Import an Excel file with multiple sheets and create a summary DataFrame that includes key statistics from each sheet.
Create a function that cleans an Excel import by handling missing values, standardizing column names, and converting data types.
Import an Excel file containing date information and create a time series visualization of the data.
Combine data from multiple Excel files into a single DataFrame, handling potential mismatches in column names and data types.
Create a script that reads an Excel file, processes the data, and writes the results back to a new Excel file with formatting and multiple worksheets.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

Basic Excel Import​

Specifying Sheet Names or Numbers​

Customizing Header Rows​

Handling Selected Columns and Rows​

Setting Data Types​

Handling Missing Values​

Real-World Example: Sales Data Analysis​

Handling Excel File Formats and Engines​

Common Challenges and Solutions​

Large Excel Files​

Password Protected Files​

Excel Formulas​

Summary​

Additional Resources​

Exercises​

Introduction

Prerequisites

Basic Excel Import

Specifying Sheet Names or Numbers

Customizing Header Rows

Handling Selected Columns and Rows

Setting Data Types

Handling Missing Values

Real-World Example: Sales Data Analysis

Handling Excel File Formats and Engines

Common Challenges and Solutions

Large Excel Files

Password Protected Files

Excel Formulas

Summary

Additional Resources

Exercises