Pandas Excel Advanced
Introduction
In the previous section, we learned the basics of reading and writing Excel files with pandas. Now, we'll dive into more advanced Excel operations that can streamline your data workflow. Pandas offers powerful tools for working with Excel files beyond simple import/export, allowing you to handle complex formatting, formulas, multiple worksheets, and large datasets efficiently.
In this tutorial, we'll explore advanced techniques for manipulating Excel files using pandas, focusing on practical applications that data analysts and scientists commonly encounter in real-world scenarios.
Prerequisites
Before diving into the advanced content, ensure you have the following libraries installed:
pip install pandas openpyxl xlsxwriter
You should also have a basic understanding of pandas DataFrames and how to read/write Excel files. Let's get started!
Working with Multiple Sheets
Reading Multiple Sheets
Excel workbooks often contain multiple sheets. Pandas makes it easy to read them all or select specific ones:
import pandas as pd
# Read all sheets into a dictionary of DataFrames
excel_file = "sales_data.xlsx"
all_sheets = pd.read_excel(excel_file, sheet_name=None)
# Access individual sheets from the dictionary
north_sales = all_sheets['North']
south_sales = all_sheets['South']
print(f"Sheets in the workbook: {list(all_sheets.keys())}")
print("\nPreview of North region data:")
print(north_sales.head())
Output:
Sheets in the workbook: ['North', 'South', 'East', 'West']
Preview of North region data:
Product ID Product Name Quantity Price Total Sales
0 101 Keyboard 5 29.99 149.95
1 102 Mouse 8 19.99 159.92
2 103 Monitor 3 99.99 299.97
3 104 Headset 6 49.99 299.94
4 105 Webcam 4 39.99 159.96
Writing to Multiple Sheets
To create an Excel file with multiple sheets:
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'X': ['a', 'b', 'c'], 'Y': ['d', 'e', 'f']})
# Create a Pandas Excel writer using XlsxWriter as the engine
with pd.ExcelWriter('multiple_sheets.xlsx', engine='xlsxwriter') as writer:
df1.to_excel(writer, sheet_name='First_Sheet', index=False)
df2.to_excel(writer, sheet_name='Second_Sheet', index=False)
print("Excel file with multiple sheets created successfully!")
Advanced Excel Formatting
While pandas' basic to_excel()
function is convenient, you might want more control over the appearance of your Excel files. We can leverage the underlying Excel engines (like XlsxWriter) for advanced formatting:
Styling Excel Output
import pandas as pd
import numpy as np
# Create sample sales data
np.random.seed(42)
data = {
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
'Units Sold': np.random.randint(50, 500, 5),
'Unit Price': [1200, 25, 45, 350, 85],
'Revenue': None # We'll calculate this
}
df = pd.DataFrame(data)
df['Revenue'] = df['Units Sold'] * df['Unit Price']
# Create a Pandas Excel writer using XlsxWriter
with pd.ExcelWriter('styled_sales_report.xlsx', engine='xlsxwriter') as writer:
df.to_excel(writer, sheet_name='Sales', index=False)
# Get the workbook and worksheet objects
workbook = writer.book
worksheet = writer.sheets['Sales']
# Add formats
header_format = workbook.add_format({
'bold': True,
'text_wrap': True,
'valign': 'top',
'fg_color': '#D7E4BC',
'border': 1
})
# Add a currency format
currency_format = workbook.add_format({'num_format': '$#,##0'})
# Apply the header format to the first row
for col_num, value in enumerate(df.columns.values):
worksheet.write(0, col_num, value, header_format)
# Apply currency format to Revenue and Unit Price columns
worksheet.set_column('C:C', 10, currency_format) # Unit Price
worksheet.set_column('D:D', 12, currency_format) # Revenue
# Add conditional formatting to highlight high revenue products
worksheet.conditional_format('D2:D6', {'type': 'cell',
'criteria': '>',
'value': 10000,
'format': workbook.add_format({'bg_color': '#FFC7CE'})})
print("Styled Excel file created successfully!")
This creates an Excel file with formatted headers, currency formatting, and conditional formatting to highlight high-revenue products.
Adding Charts
XlsxWriter allows us to add charts to our Excel files:
import pandas as pd
import numpy as np
# Create sample monthly sales data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_a = np.random.randint(500, 1000, 6)
product_b = np.random.randint(400, 800, 6)
df = pd.DataFrame({
'Month': months,
'Product A': product_a,
'Product B': product_b
})
# Create Excel file with chart
with pd.ExcelWriter('sales_chart.xlsx', engine='xlsxwriter') as writer:
df.to_excel(writer, sheet_name='Sales', index=False)
workbook = writer.book
worksheet = writer.sheets['Sales']
# Create a chart object
chart = workbook.add_chart({'type': 'line'})
# Configure the series for Product A
chart.add_series({
'name': '=Sales!$B$1',
'categories': '=Sales!$A$2:$A$7',
'values': '=Sales!$B$2:$B$7',
'line': {'color': 'blue'}
})
# Configure the series for Product B
chart.add_series({
'name': '=Sales!$C$1',
'categories': '=Sales!$A$2:$A$7',
'values': '=Sales!$C$2:$C$7',
'line': {'color': 'red'}
})
# Add chart title and axis labels
chart.set_title({'name': 'Monthly Sales Comparison'})
chart.set_x_axis({'name': 'Month'})
chart.set_y_axis({'name': 'Units Sold'})
# Insert the chart into the worksheet
worksheet.insert_chart('E2', chart)
print("Excel file with chart created successfully!")
Working with Excel Formulas
You can also include Excel formulas in your output:
import pandas as pd
# Create sample data
data = {
'Item': ['Item 1', 'Item 2', 'Item 3'],
'Quantity': [5, 8, 3],
'Price': [10, 25, 50]
}
df = pd.DataFrame(data)
# Write to Excel with formulas
with pd.ExcelWriter('excel_with_formulas.xlsx', engine='xlsxwriter') as writer:
df.to_excel(writer, sheet_name='Invoice', index=False)
workbook = writer.book
worksheet = writer.sheets['Invoice']
# Add a total formula in column D
worksheet.write('D1', 'Total')
for i in range(len(df)):
# Excel row is 1-indexed and has a header row
row = i + 2
worksheet.write_formula(f'D{row}', f'=B{row}*C{row}')
# Add a sum formula at the bottom
sum_row = len(df) + 2
worksheet.write(sum_row, 0, 'Total')
worksheet.write_formula(f'D{sum_row}', f'=SUM(D2:D{sum_row-1})')
print("Excel file with formulas created successfully!")
Handling Large Excel Files
When working with large Excel files, memory usage can become a concern. Here's how to handle them efficiently:
Reading Large Files in Chunks
import pandas as pd
# Define the chunk size
chunk_size = 1000
# Create an empty list to store the chunks
chunks = []
# Read the Excel file in chunks
for chunk in pd.read_excel('large_dataset.xlsx', chunksize=chunk_size):
# Process each chunk (for example, filter rows)
processed_chunk = chunk[chunk['Value'] > 100]
chunks.append(processed_chunk)
# Combine all chunks into a single DataFrame
combined_data = pd.concat(chunks)
print(f"Processed {len(combined_data)} rows efficiently!")
Reading Specific Columns
To save memory, you can also specify which columns to read:
import pandas as pd
# Read only specific columns
selected_columns = ['Date', 'Customer', 'Total']
df = pd.read_excel('large_sales_data.xlsx', usecols=selected_columns)
print(f"Selected columns: {df.columns.tolist()}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1048576:.2f} MB")
Using the optimum_dtype Parameter
Recent versions of pandas allow automatic detection of the most memory-efficient data types:
import pandas as pd
# Read with optimized data types
df = pd.read_excel('large_dataset.xlsx', dtype_backend='numpy_nullable')
print("Memory-optimized DataFrame created")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1048576:.2f} MB")
Practical Real-World Example: Sales Dashboard
Let's combine several advanced Excel techniques to create a comprehensive sales dashboard:
import pandas as pd
import numpy as np
import datetime as dt
# Generate sample sales data
np.random.seed(42)
n_records = 1000
# Create date range
start_date = dt.datetime(2023, 1, 1)
end_date = dt.datetime(2023, 12, 31)
days = (end_date - start_date).days
dates = [start_date + dt.timedelta(days=np.random.randint(0, days)) for _ in range(n_records)]
# Create other columns
products = ['Laptop', 'Desktop', 'Monitor', 'Keyboard', 'Mouse', 'Headphones', 'Speakers']
regions = ['North', 'South', 'East', 'West']
data = {
'Date': dates,
'Product': np.random.choice(products, n_records),
'Region': np.random.choice(regions, n_records),
'Units': np.random.randint(1, 10, n_records),
'Price': np.random.uniform(50, 1500, n_records)
}
# Create DataFrame
sales_df = pd.DataFrame(data)
sales_df['Revenue'] = sales_df['Units'] * sales_df['Price']
# Create monthly summary
sales_df['Month'] = sales_df['Date'].dt.strftime('%Y-%m')
monthly_sales = sales_df.groupby(['Month', 'Product'])['Revenue'].sum().reset_index()
# Create regional summary
regional_sales = sales_df.groupby('Region')['Revenue'].sum().reset_index()
# Create product summary
product_sales = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False).reset_index()
# Now create the dashboard Excel file
with pd.ExcelWriter('sales_dashboard.xlsx', engine='xlsxwriter') as writer:
# Write each DataFrame to a different sheet
sales_df.to_excel(writer, sheet_name='Raw_Data', index=False)
monthly_sales.to_excel(writer, sheet_name='Monthly_Summary', index=False)
regional_sales.to_excel(writer, sheet_name='Regional_Summary', index=False)
product_sales.to_excel(writer, sheet_name='Product_Summary', index=False)
workbook = writer.book
# Get worksheet objects
raw_sheet = writer.sheets['Raw_Data']
monthly_sheet = writer.sheets['Monthly_Summary']
regional_sheet = writer.sheets['Regional_Summary']
product_sheet = writer.sheets['Product_Summary']
# Add formats
header_format = workbook.add_format({
'bold': True,
'bg_color': '#C6EFCE',
'border': 1
})
currency_format = workbook.add_format({'num_format': '$#,##0.00'})
# Format Raw Data sheet
for col_num, value in enumerate(sales_df.columns.values):
raw_sheet.write(0, col_num, value, header_format)
raw_sheet.set_column('F:F', 12, currency_format) # Revenue column
# Add regional sales chart
chart = workbook.add_chart({'type': 'pie'})
num_regions = len(regional_sales)
chart.add_series({
'name': 'Regional Sales',
'categories': f'=Regional_Summary!$A$2:$A${num_regions+1}',
'values': f'=Regional_Summary!$B$2:$B${num_regions+1}',
})
chart.set_title({'name': 'Sales by Region'})
regional_sheet.insert_chart('D2', chart)
# Add product sales chart
bar_chart = workbook.add_chart({'type': 'column'})
num_products = len(product_sales)
bar_chart.add_series({
'name': 'Product Revenue',
'categories': f'=Product_Summary!$A$2:$A${num_products+1}',
'values': f'=Product_Summary!$B$2:$B${num_products+1}',
'fill': {'color': '#5B9BD5'}
})
bar_chart.set_title({'name': 'Revenue by Product'})
bar_chart.set_y_axis({'name': 'Revenue ($)'})
product_sheet.insert_chart('D2', bar_chart)
print("Sales dashboard Excel file created successfully!")
This example demonstrates how to create a comprehensive Excel-based dashboard with multiple sheets, charts, and formatted data.
Excel Pivot Tables with Pandas
Pandas can also help us create Excel pivot tables:
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Laptop', 'Tablet', 'Phone', 'Monitor']
regions = ['North', 'South', 'East', 'West']
data = []
for _ in range(1000):
date = np.random.choice(dates)
product = np.random.choice(products)
region = np.random.choice(regions)
units = np.random.randint(1, 10)
price = np.random.uniform(100, 1000)
revenue = units * price
data.append([date, product, region, units, price, revenue])
df = pd.DataFrame(data, columns=['Date', 'Product', 'Region', 'Units', 'Price', 'Revenue'])
# Create a pivot table
pivot = pd.pivot_table(
df,
values='Revenue',
index=['Region', 'Product'],
columns=pd.Grouper(key='Date', freq='M'),
aggfunc='sum',
fill_value=0
)
# Format the date columns
pivot.columns = pivot.columns.strftime('%b-%Y')
# Write to Excel
with pd.ExcelWriter('sales_pivot.xlsx', engine='xlsxwriter') as writer:
pivot.to_excel(writer, sheet_name='Pivot')
# Get workbook and worksheet
workbook = writer.book
worksheet = writer.sheets['Pivot']
# Add formats
header_format = workbook.add_format({
'bold': True,
'text_wrap': True,
'valign': 'top',
'bg_color': '#D7E4BC',
'border': 1
})
currency_format = workbook.add_format({'num_format': '$#,##0'})
# Apply header format
for col_num, value in enumerate(pivot.columns.values):
worksheet.write(0, col_num + 1, value, header_format)
# Apply currency format to all data cells
for col in range(1, len(pivot.columns) + 1):
worksheet.set_column(col, col, 12, currency_format)
print("Excel pivot table created successfully!")
Summary
In this tutorial, we've explored advanced techniques for working with Excel files using pandas:
- Working with multiple sheets in a workbook
- Applying advanced formatting to Excel output
- Adding charts and visualizations
- Including Excel formulas in your output
- Efficiently handling large Excel files
- Creating complete dashboards with multiple elements
- Generating pivot tables directly from pandas
These techniques allow you to create professional, informative Excel documents directly from your pandas workflow, making data distribution and reporting much more efficient.
Additional Resources
Here are some resources to further explore the topic:
Exercises
-
Create an Excel file with sales data for multiple products across different regions, including charts to visualize the top-performing products.
-
Modify the dashboard example to include conditional formatting that highlights the best and worst performing products.
-
Create an Excel file that includes dynamic formulas for calculating month-over-month growth percentages.
-
Read a large Excel file in chunks, process each chunk to filter for specific criteria, and write the results to a new Excel file with proper formatting.
-
Create a function that takes a DataFrame and automatically generates an Excel report with a summary sheet, detailed data sheet, and appropriate charts based on the data types in the DataFrame.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)