Pandas Excel Advanced

Introduction

In the previous section, we learned the basics of reading and writing Excel files with pandas. Now, we'll dive into more advanced Excel operations that can streamline your data workflow. Pandas offers powerful tools for working with Excel files beyond simple import/export, allowing you to handle complex formatting, formulas, multiple worksheets, and large datasets efficiently.

In this tutorial, we'll explore advanced techniques for manipulating Excel files using pandas, focusing on practical applications that data analysts and scientists commonly encounter in real-world scenarios.

Prerequisites

Before diving into the advanced content, ensure you have the following libraries installed:

pip install pandas openpyxl xlsxwriter

You should also have a basic understanding of pandas DataFrames and how to read/write Excel files. Let's get started!

Working with Multiple Sheets

Reading Multiple Sheets

Excel workbooks often contain multiple sheets. Pandas makes it easy to read them all or select specific ones:

import pandas as pd

# Read all sheets into a dictionary of DataFrames
excel_file = "sales_data.xlsx"
all_sheets = pd.read_excel(excel_file, sheet_name=None)

# Access individual sheets from the dictionary
north_sales = all_sheets['North']
south_sales = all_sheets['South']

print(f"Sheets in the workbook: {list(all_sheets.keys())}")
print("\nPreview of North region data:")
print(north_sales.head())

Output:

Sheets in the workbook: ['North', 'South', 'East', 'West']

Preview of North region data:
   Product ID  Product Name  Quantity  Price  Total Sales
0         101      Keyboard         5  29.99       149.95
1         102         Mouse         8  19.99       159.92
2         103      Monitor         3  99.99       299.97
3         104      Headset         6  49.99       299.94
4         105      Webcam          4  39.99       159.96

Writing to Multiple Sheets

To create an Excel file with multiple sheets:

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'X': ['a', 'b', 'c'], 'Y': ['d', 'e', 'f']})

# Create a Pandas Excel writer using XlsxWriter as the engine
with pd.ExcelWriter('multiple_sheets.xlsx', engine='xlsxwriter') as writer:
    df1.to_excel(writer, sheet_name='First_Sheet', index=False)
    df2.to_excel(writer, sheet_name='Second_Sheet', index=False)
    
print("Excel file with multiple sheets created successfully!")

Advanced Excel Formatting

While pandas' basic to_excel() function is convenient, you might want more control over the appearance of your Excel files. We can leverage the underlying Excel engines (like XlsxWriter) for advanced formatting:

Styling Excel Output

import pandas as pd
import numpy as np

# Create sample sales data
np.random.seed(42)
data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'Units Sold': np.random.randint(50, 500, 5),
    'Unit Price': [1200, 25, 45, 350, 85],
    'Revenue': None  # We'll calculate this
}

df = pd.DataFrame(data)
df['Revenue'] = df['Units Sold'] * df['Unit Price']

# Create a Pandas Excel writer using XlsxWriter
with pd.ExcelWriter('styled_sales_report.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sales', index=False)
    
    # Get the workbook and worksheet objects
    workbook = writer.book
    worksheet = writer.sheets['Sales']
    
    # Add formats
    header_format = workbook.add_format({
        'bold': True,
        'text_wrap': True,
        'valign': 'top',
        'fg_color': '#D7E4BC',
        'border': 1
    })
    
    # Add a currency format
    currency_format = workbook.add_format({'num_format': '$#,##0'})
    
    # Apply the header format to the first row
    for col_num, value in enumerate(df.columns.values):
        worksheet.write(0, col_num, value, header_format)
    
    # Apply currency format to Revenue and Unit Price columns
    worksheet.set_column('C:C', 10, currency_format)  # Unit Price
    worksheet.set_column('D:D', 12, currency_format)  # Revenue
    
    # Add conditional formatting to highlight high revenue products
    worksheet.conditional_format('D2:D6', {'type': 'cell',
                                          'criteria': '>',
                                          'value': 10000,
                                          'format': workbook.add_format({'bg_color': '#FFC7CE'})})

print("Styled Excel file created successfully!")

This creates an Excel file with formatted headers, currency formatting, and conditional formatting to highlight high-revenue products.

Adding Charts

XlsxWriter allows us to add charts to our Excel files:

import pandas as pd
import numpy as np

# Create sample monthly sales data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_a = np.random.randint(500, 1000, 6)
product_b = np.random.randint(400, 800, 6)

df = pd.DataFrame({
    'Month': months,
    'Product A': product_a,
    'Product B': product_b
})

# Create Excel file with chart
with pd.ExcelWriter('sales_chart.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sales', index=False)
    
    workbook = writer.book
    worksheet = writer.sheets['Sales']
    
    # Create a chart object
    chart = workbook.add_chart({'type': 'line'})
    
    # Configure the series for Product A
    chart.add_series({
        'name':       '=Sales!$B$1',
        'categories': '=Sales!$A$2:$A$7',
        'values':     '=Sales!$B$2:$B$7',
        'line':       {'color': 'blue'}
    })
    
    # Configure the series for Product B
    chart.add_series({
        'name':       '=Sales!$C$1',
        'categories': '=Sales!$A$2:$A$7',
        'values':     '=Sales!$C$2:$C$7',
        'line':       {'color': 'red'}
    })
    
    # Add chart title and axis labels
    chart.set_title({'name': 'Monthly Sales Comparison'})
    chart.set_x_axis({'name': 'Month'})
    chart.set_y_axis({'name': 'Units Sold'})
    
    # Insert the chart into the worksheet
    worksheet.insert_chart('E2', chart)

print("Excel file with chart created successfully!")

Working with Excel Formulas

You can also include Excel formulas in your output:

import pandas as pd

# Create sample data
data = {
    'Item': ['Item 1', 'Item 2', 'Item 3'],
    'Quantity': [5, 8, 3],
    'Price': [10, 25, 50]
}

df = pd.DataFrame(data)

# Write to Excel with formulas
with pd.ExcelWriter('excel_with_formulas.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Invoice', index=False)
    
    workbook = writer.book
    worksheet = writer.sheets['Invoice']
    
    # Add a total formula in column D
    worksheet.write('D1', 'Total')
    for i in range(len(df)):
        # Excel row is 1-indexed and has a header row
        row = i + 2
        worksheet.write_formula(f'D{row}', f'=B{row}*C{row}')
    
    # Add a sum formula at the bottom
    sum_row = len(df) + 2
    worksheet.write(sum_row, 0, 'Total')
    worksheet.write_formula(f'D{sum_row}', f'=SUM(D2:D{sum_row-1})')

print("Excel file with formulas created successfully!")

Handling Large Excel Files

When working with large Excel files, memory usage can become a concern. Here's how to handle them efficiently:

Reading Large Files in Chunks

import pandas as pd

# Define the chunk size
chunk_size = 1000

# Create an empty list to store the chunks
chunks = []

# Read the Excel file in chunks
for chunk in pd.read_excel('large_dataset.xlsx', chunksize=chunk_size):
    # Process each chunk (for example, filter rows)
    processed_chunk = chunk[chunk['Value'] > 100]
    chunks.append(processed_chunk)

# Combine all chunks into a single DataFrame
combined_data = pd.concat(chunks)
print(f"Processed {len(combined_data)} rows efficiently!")

Reading Specific Columns

To save memory, you can also specify which columns to read:

import pandas as pd

# Read only specific columns
selected_columns = ['Date', 'Customer', 'Total']
df = pd.read_excel('large_sales_data.xlsx', usecols=selected_columns)

print(f"Selected columns: {df.columns.tolist()}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1048576:.2f} MB")

Using the optimum_dtype Parameter

Recent versions of pandas allow automatic detection of the most memory-efficient data types:

import pandas as pd

# Read with optimized data types
df = pd.read_excel('large_dataset.xlsx', dtype_backend='numpy_nullable')

print("Memory-optimized DataFrame created")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1048576:.2f} MB")

Practical Real-World Example: Sales Dashboard

Let's combine several advanced Excel techniques to create a comprehensive sales dashboard:

import pandas as pd
import numpy as np
import datetime as dt

# Generate sample sales data
np.random.seed(42)
n_records = 1000

# Create date range
start_date = dt.datetime(2023, 1, 1)
end_date = dt.datetime(2023, 12, 31)
days = (end_date - start_date).days
dates = [start_date + dt.timedelta(days=np.random.randint(0, days)) for _ in range(n_records)]

# Create other columns
products = ['Laptop', 'Desktop', 'Monitor', 'Keyboard', 'Mouse', 'Headphones', 'Speakers']
regions = ['North', 'South', 'East', 'West']

data = {
    'Date': dates,
    'Product': np.random.choice(products, n_records),
    'Region': np.random.choice(regions, n_records),
    'Units': np.random.randint(1, 10, n_records),
    'Price': np.random.uniform(50, 1500, n_records)
}

# Create DataFrame
sales_df = pd.DataFrame(data)
sales_df['Revenue'] = sales_df['Units'] * sales_df['Price']

# Create monthly summary
sales_df['Month'] = sales_df['Date'].dt.strftime('%Y-%m')
monthly_sales = sales_df.groupby(['Month', 'Product'])['Revenue'].sum().reset_index()

# Create regional summary
regional_sales = sales_df.groupby('Region')['Revenue'].sum().reset_index()

# Create product summary
product_sales = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False).reset_index()

# Now create the dashboard Excel file
with pd.ExcelWriter('sales_dashboard.xlsx', engine='xlsxwriter') as writer:
    # Write each DataFrame to a different sheet
    sales_df.to_excel(writer, sheet_name='Raw_Data', index=False)
    monthly_sales.to_excel(writer, sheet_name='Monthly_Summary', index=False)
    regional_sales.to_excel(writer, sheet_name='Regional_Summary', index=False)
    product_sales.to_excel(writer, sheet_name='Product_Summary', index=False)
    
    workbook = writer.book
    
    # Get worksheet objects
    raw_sheet = writer.sheets['Raw_Data']
    monthly_sheet = writer.sheets['Monthly_Summary']
    regional_sheet = writer.sheets['Regional_Summary']
    product_sheet = writer.sheets['Product_Summary']
    
    # Add formats
    header_format = workbook.add_format({
        'bold': True,
        'bg_color': '#C6EFCE',
        'border': 1
    })
    
    currency_format = workbook.add_format({'num_format': '$#,##0.00'})
    
    # Format Raw Data sheet
    for col_num, value in enumerate(sales_df.columns.values):
        raw_sheet.write(0, col_num, value, header_format)
    raw_sheet.set_column('F:F', 12, currency_format)  # Revenue column
    
    # Add regional sales chart
    chart = workbook.add_chart({'type': 'pie'})
    num_regions = len(regional_sales)
    
    chart.add_series({
        'name': 'Regional Sales',
        'categories': f'=Regional_Summary!$A$2:$A${num_regions+1}',
        'values': f'=Regional_Summary!$B$2:$B${num_regions+1}',
    })
    
    chart.set_title({'name': 'Sales by Region'})
    regional_sheet.insert_chart('D2', chart)
    
    # Add product sales chart
    bar_chart = workbook.add_chart({'type': 'column'})
    num_products = len(product_sales)
    
    bar_chart.add_series({
        'name': 'Product Revenue',
        'categories': f'=Product_Summary!$A$2:$A${num_products+1}',
        'values': f'=Product_Summary!$B$2:$B${num_products+1}',
        'fill': {'color': '#5B9BD5'}
    })
    
    bar_chart.set_title({'name': 'Revenue by Product'})
    bar_chart.set_y_axis({'name': 'Revenue ($)'})
    product_sheet.insert_chart('D2', bar_chart)

print("Sales dashboard Excel file created successfully!")

This example demonstrates how to create a comprehensive Excel-based dashboard with multiple sheets, charts, and formatted data.

Excel Pivot Tables with Pandas

Pandas can also help us create Excel pivot tables:

import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Laptop', 'Tablet', 'Phone', 'Monitor']
regions = ['North', 'South', 'East', 'West']

data = []
for _ in range(1000):
    date = np.random.choice(dates)
    product = np.random.choice(products)
    region = np.random.choice(regions)
    units = np.random.randint(1, 10)
    price = np.random.uniform(100, 1000)
    revenue = units * price
    data.append([date, product, region, units, price, revenue])

df = pd.DataFrame(data, columns=['Date', 'Product', 'Region', 'Units', 'Price', 'Revenue'])

# Create a pivot table
pivot = pd.pivot_table(
    df, 
    values='Revenue',
    index=['Region', 'Product'],
    columns=pd.Grouper(key='Date', freq='M'),
    aggfunc='sum',
    fill_value=0
)

# Format the date columns
pivot.columns = pivot.columns.strftime('%b-%Y')

# Write to Excel
with pd.ExcelWriter('sales_pivot.xlsx', engine='xlsxwriter') as writer:
    pivot.to_excel(writer, sheet_name='Pivot')
    
    # Get workbook and worksheet
    workbook = writer.book
    worksheet = writer.sheets['Pivot']
    
    # Add formats
    header_format = workbook.add_format({
        'bold': True,
        'text_wrap': True,
        'valign': 'top',
        'bg_color': '#D7E4BC',
        'border': 1
    })
    
    currency_format = workbook.add_format({'num_format': '$#,##0'})
    
    # Apply header format
    for col_num, value in enumerate(pivot.columns.values):
        worksheet.write(0, col_num + 1, value, header_format)
    
    # Apply currency format to all data cells
    for col in range(1, len(pivot.columns) + 1):
        worksheet.set_column(col, col, 12, currency_format)

print("Excel pivot table created successfully!")

Summary

In this tutorial, we've explored advanced techniques for working with Excel files using pandas:

Working with multiple sheets in a workbook
Applying advanced formatting to Excel output
Adding charts and visualizations
Including Excel formulas in your output
Efficiently handling large Excel files
Creating complete dashboards with multiple elements
Generating pivot tables directly from pandas

These techniques allow you to create professional, informative Excel documents directly from your pandas workflow, making data distribution and reporting much more efficient.

Additional Resources

Here are some resources to further explore the topic:

Exercises

Create an Excel file with sales data for multiple products across different regions, including charts to visualize the top-performing products.
Modify the dashboard example to include conditional formatting that highlights the best and worst performing products.
Create an Excel file that includes dynamic formulas for calculating month-over-month growth percentages.
Read a large Excel file in chunks, process each chunk to filter for specific criteria, and write the results to a new Excel file with proper formatting.
Create a function that takes a DataFrame and automatically generates an Excel report with a summary sheet, detailed data sheet, and appropriate charts based on the data types in the DataFrame.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

Working with Multiple Sheets​

Reading Multiple Sheets​

Writing to Multiple Sheets​

Advanced Excel Formatting​

Styling Excel Output​

Adding Charts​

Working with Excel Formulas​

Handling Large Excel Files​

Reading Large Files in Chunks​

Reading Specific Columns​

Using the optimum_dtype Parameter​

Practical Real-World Example: Sales Dashboard​

Excel Pivot Tables with Pandas​

Summary​

Additional Resources​

Exercises​