Skip to main content

Pandas HTML Import

Introduction

HTML tables are a common way to present structured data on the web. Whether you're analyzing data from a website, extracting information from HTML reports, or scraping web content, Pandas offers a powerful tool called read_html() to help you import this data directly into DataFrames.

In this tutorial, we'll explore how to use Pandas to extract data from HTML tables, handle various HTML sources, and process the extracted data for analysis.

Prerequisites

Before diving in, make sure you have the necessary libraries installed:

bash
pip install pandas lxml html5lib beautifulsoup4

Pandas requires at least one of these parsers (lxml, html5lib, or beautifulsoup4) to read HTML content.

Basic HTML Table Import

Importing from a URL

The most common way to import HTML tables is directly from a URL:

python
import pandas as pd

# URL containing HTML tables
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Read all tables from the URL
tables = pd.read_html(url)

# Print the number of tables found
print(f"Number of tables found: {len(tables)}")

# Display the first table
print(tables[0])

Output:

Number of tables found: 5
Country or area ... Date Source
0 China[a] ... May 10, 2023 Official figure
1 India ... Mar 22, 2023 Official projection
2 United States ... Mar 13, 2023 Official estimate
3 Indonesia ... Jul 1, 2022 Official projection
4 Pakistan ... Jul 1, 2022 UN projection
... ... ... ... ...

Importing from HTML String

You can also import from an HTML string:

python
import pandas as pd

# HTML string containing a table
html = """
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>24</td>
<td>New York</td>
</tr>
<tr>
<td>Bob</td>
<td>27</td>
<td>Los Angeles</td>
</tr>
<tr>
<td>Charlie</td>
<td>22</td>
<td>Chicago</td>
</tr>
</tbody>
</table>
"""

# Read the table from the HTML string
tables = pd.read_html(html)

# Display the table
print(tables[0])

Output:

      Name  Age         City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago

Importing from a Local HTML File

If you have a local HTML file:

python
import pandas as pd

# Path to local HTML file
file_path = "sample_table.html"

# Read tables from the file
tables = pd.read_html(file_path)

# Display the first table
print(tables[0])

Advanced Options with read_html()

The read_html() function comes with several parameters to customize the import process:

Selecting Specific Tables

You can filter tables by their attributes or position:

python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Using match parameter to find tables containing specific text
gdp_tables = pd.read_html(url, match="Population")

print(f"Number of matching tables: {len(gdp_tables)}")
print(gdp_tables[0].head())

Specifying Table Headers

You can specify which row should be used as the header:

python
import pandas as pd

html = """
<table>
<tr>
<td>ID</td>
<td>Product</td>
<td>Price</td>
</tr>
<tr>
<td>1</td>
<td>Laptop</td>
<td>1200</td>
</tr>
<tr>
<td>2</td>
<td>Phone</td>
<td>800</td>
</tr>
</table>
"""

# Read the table with specified header
tables = pd.read_html(html, header=0)
print(tables[0])

Output:

   ID Product  Price
0 1 Laptop 1200
1 2 Phone 800

Handling Complex Tables

For tables with merged cells or complex structures:

python
import pandas as pd

url = "https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table"

# Import tables and handle complex structures
tables = pd.read_html(url, header=0)

# Display the medal table (usually the main table)
medal_table = None
for table in tables:
if "Gold" in table.columns and "Total" in table.columns:
medal_table = table
break

if medal_table is not None:
print(medal_table.head())
else:
print("Medal table not found")

Working with Real-World Examples

Example 1: Analyzing Stock Data

Let's extract and analyze stock data from a financial website:

python
import pandas as pd
import matplotlib.pyplot as plt

# URL with stock market indices
url = "https://finance.yahoo.com/world-indices/"

try:
# Read all tables
tables = pd.read_html(url)

# The main table is typically the first one
stock_data = tables[0]

# Clean up the data
stock_data = stock_data.dropna()

# Convert percentage strings to floats
stock_data['% Change'] = stock_data['% Change'].str.rstrip('%').astype('float') / 100.0

# Sort by percentage change
stock_data = stock_data.sort_values('% Change', ascending=False)

print(stock_data[['Name', 'Last Price', '% Change']].head())

# Create a simple bar chart of top performers
plt.figure(figsize=(10, 6))
plt.bar(stock_data['Name'].head(5), stock_data['% Change'].head(5))
plt.title('Top Performing Stock Indices')
plt.xlabel('Index')
plt.ylabel('Percentage Change')
plt.xticks(rotation=45)
plt.tight_layout()
# plt.show() # Uncomment to display the plot

except Exception as e:
print(f"An error occurred: {e}")
print("This example may not work if the website structure has changed.")

Example 2: Extracting COVID-19 Data

Let's extract COVID-19 statistics from a public health website:

python
import pandas as pd

# URL with COVID-19 data
url = "https://www.worldometers.info/coronavirus/"

try:
# Extract tables
tables = pd.read_html(url)

# Find the main COVID data table (usually has 'Country' and 'Total Cases' columns)
covid_data = None
for table in tables:
if 'Country' in table.columns and 'Total Cases' in table.columns:
covid_data = table
break

if covid_data is not None:
# Clean and process the data
covid_data = covid_data.iloc[8:20] # Focus on top countries
covid_data = covid_data.sort_values('Total Cases', ascending=False)

# Display top countries by case count
print("Top Countries by COVID-19 Cases:")
print(covid_data[['Country', 'Total Cases', 'Total Deaths']].head(10))
else:
print("COVID-19 data table not found")

except Exception as e:
print(f"An error occurred: {e}")
print("This example may not work if the website structure has changed.")

Common Issues and Solutions

1. No tables found

If read_html() doesn't find any tables, check if:

  • The HTML actually contains <table> elements
  • The tables are generated by JavaScript (Pandas can't parse dynamically generated content)
  • You need a parser that can handle the specific HTML structure
python
# Try with a different parser
tables = pd.read_html(url, flavor='html5lib')

2. Authentication or Blocked Access

Some websites block web scraping:

python
import pandas as pd
import requests

# Create a session with headers that mimic a browser
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

session = requests.Session()
response = session.get(url, headers=headers)

# Now use the HTML content from the response
tables = pd.read_html(response.text)

3. Handling Missing Values

Sometimes HTML tables have inconsistent data:

python
# Handle missing values during import
tables = pd.read_html(url, na_values=['', 'N/A', '--'])

# Or clean up after import
df = tables[0]
df = df.fillna(0) # Replace NaN with 0

Best Practices

  1. Always inspect the result - read_html() returns a list of DataFrames, so check how many tables were found
  2. Check for data types - Convert columns to appropriate types after import
  3. Clean up the data - Handle missing values, remove unwanted characters, etc.
  4. Be respectful - Check a website's robots.txt and terms of service before scraping
  5. Implement error handling - Websites change, so your code should handle failing gracefully

Summary

Pandas' read_html() function provides a powerful and convenient way to extract tabular data from HTML sources. With just a few lines of code, you can import data from websites, local HTML files, or HTML strings into DataFrame objects ready for analysis.

Key points to remember:

  • You can extract tables from URLs, HTML strings, or local files
  • Multiple tables can be extracted from a single source
  • Additional parameters help you filter and customize the extraction
  • Real-world applications include financial data analysis, public health statistics, and more
  • Error handling is important due to the dynamic nature of web content

Additional Resources and Exercises

Resources

Exercises

  1. Basic: Import the table of Olympic medal counts from Wikipedia and create a bar chart of the top 10 countries by total medals.

  2. Intermediate: Extract stock price data for major technology companies and calculate their average price-to-earnings ratio.

  3. Advanced: Create a function that monitors a specific website table daily and alerts you if values change by more than a defined threshold.

  4. Challenge: Combine read_html() with other web scraping techniques to extract both tabular and non-tabular data from a news website, then perform sentiment analysis on the results.

By mastering Pandas' HTML import capabilities, you'll be able to efficiently collect and analyze web data, opening up countless possibilities for data-driven projects and insights.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)