Pandas HTML Import
Introduction
HTML tables are a common way to present structured data on the web. Whether you're analyzing data from a website, extracting information from HTML reports, or scraping web content, Pandas offers a powerful tool called read_html()
to help you import this data directly into DataFrames.
In this tutorial, we'll explore how to use Pandas to extract data from HTML tables, handle various HTML sources, and process the extracted data for analysis.
Prerequisites
Before diving in, make sure you have the necessary libraries installed:
pip install pandas lxml html5lib beautifulsoup4
Pandas requires at least one of these parsers (lxml, html5lib, or beautifulsoup4) to read HTML content.
Basic HTML Table Import
Importing from a URL
The most common way to import HTML tables is directly from a URL:
import pandas as pd
# URL containing HTML tables
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
# Read all tables from the URL
tables = pd.read_html(url)
# Print the number of tables found
print(f"Number of tables found: {len(tables)}")
# Display the first table
print(tables[0])
Output:
Number of tables found: 5
Country or area ... Date Source
0 China[a] ... May 10, 2023 Official figure
1 India ... Mar 22, 2023 Official projection
2 United States ... Mar 13, 2023 Official estimate
3 Indonesia ... Jul 1, 2022 Official projection
4 Pakistan ... Jul 1, 2022 UN projection
... ... ... ... ...
Importing from HTML String
You can also import from an HTML string:
import pandas as pd
# HTML string containing a table
html = """
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>24</td>
<td>New York</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>Los Angeles</td>
</tr>
<tr>
<td>Charlie</td>
<td>22</td>
<td>Chicago</td>
</tr>
</tbody>
</table>
"""
# Read the table from the HTML string
tables = pd.read_html(html)
# Display the table
print(tables[0])
Output:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
Importing from a Local HTML File
If you have a local HTML file:
import pandas as pd
# Path to local HTML file
file_path = "sample_table.html"
# Read tables from the file
tables = pd.read_html(file_path)
# Display the first table
print(tables[0])
Advanced Options with read_html()
The read_html()
function comes with several parameters to customize the import process:
Selecting Specific Tables
You can filter tables by their attributes or position:
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
# Using match parameter to find tables containing specific text
gdp_tables = pd.read_html(url, match="Population")
print(f"Number of matching tables: {len(gdp_tables)}")
print(gdp_tables[0].head())
Specifying Table Headers
You can specify which row should be used as the header:
import pandas as pd
html = """
<table>
<tr>
<td>ID</td>
<td>Product</td>
<td>Price</td>
</tr>
<tr>
<td>1</td>
<td>Laptop</td>
<td>1200</td>
</tr>
<tr>
<td>2</td>
<td>Phone</td>
<td>800</td>
</tr>
</table>
"""
# Read the table with specified header
tables = pd.read_html(html, header=0)
print(tables[0])
Output:
ID Product Price
0 1 Laptop 1200
1 2 Phone 800
Handling Complex Tables
For tables with merged cells or complex structures:
import pandas as pd
url = "https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table"
# Import tables and handle complex structures
tables = pd.read_html(url, header=0)
# Display the medal table (usually the main table)
medal_table = None
for table in tables:
if "Gold" in table.columns and "Total" in table.columns:
medal_table = table
break
if medal_table is not None:
print(medal_table.head())
else:
print("Medal table not found")
Working with Real-World Examples
Example 1: Analyzing Stock Data
Let's extract and analyze stock data from a financial website:
import pandas as pd
import matplotlib.pyplot as plt
# URL with stock market indices
url = "https://finance.yahoo.com/world-indices/"
try:
# Read all tables
tables = pd.read_html(url)
# The main table is typically the first one
stock_data = tables[0]
# Clean up the data
stock_data = stock_data.dropna()
# Convert percentage strings to floats
stock_data['% Change'] = stock_data['% Change'].str.rstrip('%').astype('float') / 100.0
# Sort by percentage change
stock_data = stock_data.sort_values('% Change', ascending=False)
print(stock_data[['Name', 'Last Price', '% Change']].head())
# Create a simple bar chart of top performers
plt.figure(figsize=(10, 6))
plt.bar(stock_data['Name'].head(5), stock_data['% Change'].head(5))
plt.title('Top Performing Stock Indices')
plt.xlabel('Index')
plt.ylabel('Percentage Change')
plt.xticks(rotation=45)
plt.tight_layout()
# plt.show() # Uncomment to display the plot
except Exception as e:
print(f"An error occurred: {e}")
print("This example may not work if the website structure has changed.")
Example 2: Extracting COVID-19 Data
Let's extract COVID-19 statistics from a public health website:
import pandas as pd
# URL with COVID-19 data
url = "https://www.worldometers.info/coronavirus/"
try:
# Extract tables
tables = pd.read_html(url)
# Find the main COVID data table (usually has 'Country' and 'Total Cases' columns)
covid_data = None
for table in tables:
if 'Country' in table.columns and 'Total Cases' in table.columns:
covid_data = table
break
if covid_data is not None:
# Clean and process the data
covid_data = covid_data.iloc[8:20] # Focus on top countries
covid_data = covid_data.sort_values('Total Cases', ascending=False)
# Display top countries by case count
print("Top Countries by COVID-19 Cases:")
print(covid_data[['Country', 'Total Cases', 'Total Deaths']].head(10))
else:
print("COVID-19 data table not found")
except Exception as e:
print(f"An error occurred: {e}")
print("This example may not work if the website structure has changed.")
Common Issues and Solutions
1. No tables found
If read_html()
doesn't find any tables, check if:
- The HTML actually contains
<table>
elements - The tables are generated by JavaScript (Pandas can't parse dynamically generated content)
- You need a parser that can handle the specific HTML structure
# Try with a different parser
tables = pd.read_html(url, flavor='html5lib')
2. Authentication or Blocked Access
Some websites block web scraping:
import pandas as pd
import requests
# Create a session with headers that mimic a browser
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
session = requests.Session()
response = session.get(url, headers=headers)
# Now use the HTML content from the response
tables = pd.read_html(response.text)
3. Handling Missing Values
Sometimes HTML tables have inconsistent data:
# Handle missing values during import
tables = pd.read_html(url, na_values=['', 'N/A', '--'])
# Or clean up after import
df = tables[0]
df = df.fillna(0) # Replace NaN with 0
Best Practices
- Always inspect the result -
read_html()
returns a list of DataFrames, so check how many tables were found - Check for data types - Convert columns to appropriate types after import
- Clean up the data - Handle missing values, remove unwanted characters, etc.
- Be respectful - Check a website's robots.txt and terms of service before scraping
- Implement error handling - Websites change, so your code should handle failing gracefully
Summary
Pandas' read_html()
function provides a powerful and convenient way to extract tabular data from HTML sources. With just a few lines of code, you can import data from websites, local HTML files, or HTML strings into DataFrame objects ready for analysis.
Key points to remember:
- You can extract tables from URLs, HTML strings, or local files
- Multiple tables can be extracted from a single source
- Additional parameters help you filter and customize the extraction
- Real-world applications include financial data analysis, public health statistics, and more
- Error handling is important due to the dynamic nature of web content
Additional Resources and Exercises
Resources
- Official Pandas read_html documentation
- Beautiful Soup documentation for more advanced web scraping
Exercises
-
Basic: Import the table of Olympic medal counts from Wikipedia and create a bar chart of the top 10 countries by total medals.
-
Intermediate: Extract stock price data for major technology companies and calculate their average price-to-earnings ratio.
-
Advanced: Create a function that monitors a specific website table daily and alerts you if values change by more than a defined threshold.
-
Challenge: Combine
read_html()
with other web scraping techniques to extract both tabular and non-tabular data from a news website, then perform sentiment analysis on the results.
By mastering Pandas' HTML import capabilities, you'll be able to efficiently collect and analyze web data, opening up countless possibilities for data-driven projects and insights.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)