Pandas HTML Import

Introduction

HTML tables are a common way to present structured data on the web. Whether you're analyzing data from a website, extracting information from HTML reports, or scraping web content, Pandas offers a powerful tool called read_html() to help you import this data directly into DataFrames.

In this tutorial, we'll explore how to use Pandas to extract data from HTML tables, handle various HTML sources, and process the extracted data for analysis.

Prerequisites

Before diving in, make sure you have the necessary libraries installed:

bash
pip install pandas lxml html5lib beautifulsoup4

Pandas requires at least one of these parsers (lxml, html5lib, or beautifulsoup4) to read HTML content.

Basic HTML Table Import

Importing from a URL

The most common way to import HTML tables is directly from a URL:

python
import pandas as pd

# URL containing HTML tables
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Read all tables from the URL
tables = pd.read_html(url)

# Print the number of tables found
print(f"Number of tables found: {len(tables)}")

# Display the first table
print(tables[0])

Output:

Number of tables found: 5
                              Country or area  ...     Date       Source
0                             China[a]  ...  May 10, 2023  Official figure
1                            India  ...  Mar 22, 2023  Official projection
2                     United States  ...  Mar 13, 2023  Official estimate
3                        Indonesia  ...  Jul 1, 2022  Official projection
4                         Pakistan  ...  Jul 1, 2022  UN projection
...                                ...  ...           ...              ...

Importing from HTML String

You can also import from an HTML string:

python
import pandas as pd

# HTML string containing a table
html = """
<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Age</th>
      <th>City</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alice</td>
      <td>24</td>
      <td>New York</td>
    </tr>
    <tr>
      <td>Bob</td>
      <td>27</td>
      <td>Los Angeles</td>
    </tr>
    <tr>
      <td>Charlie</td>
      <td>22</td>
      <td>Chicago</td>
    </tr>
  </tbody>
</table>
"""

# Read the table from the HTML string
tables = pd.read_html(html)

# Display the table
print(tables[0])

Output:

      Name  Age         City
  Alice   24     New York
    Bob   27  Los Angeles
Charlie   22      Chicago

Importing from a Local HTML File

If you have a local HTML file:

python
import pandas as pd

# Path to local HTML file
file_path = "sample_table.html"

# Read tables from the file
tables = pd.read_html(file_path)

# Display the first table
print(tables[0])

Advanced Options with `read_html()`

The read_html() function comes with several parameters to customize the import process:

Selecting Specific Tables

You can filter tables by their attributes or position:

python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Using match parameter to find tables containing specific text
gdp_tables = pd.read_html(url, match="Population")

print(f"Number of matching tables: {len(gdp_tables)}")
print(gdp_tables[0].head())

Specifying Table Headers

You can specify which row should be used as the header:

python
import pandas as pd

html = """
<table>
  <tr>
    <td>ID</td>
    <td>Product</td>
    <td>Price</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Laptop</td>
    <td>1200</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Phone</td>
    <td>800</td>
  </tr>
</table>
"""

# Read the table with specified header
tables = pd.read_html(html, header=0)
print(tables[0])

Output:

   ID Product  Price
0   1  Laptop   1200
1   2   Phone    800

Handling Complex Tables

For tables with merged cells or complex structures:

python
import pandas as pd

url = "https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table"

# Import tables and handle complex structures
tables = pd.read_html(url, header=0)

# Display the medal table (usually the main table)
medal_table = None
for table in tables:
    if "Gold" in table.columns and "Total" in table.columns:
        medal_table = table
        break

if medal_table is not None:
    print(medal_table.head())
else:
    print("Medal table not found")

Working with Real-World Examples

Example 1: Analyzing Stock Data

Let's extract and analyze stock data from a financial website:

python
import pandas as pd
import matplotlib.pyplot as plt

# URL with stock market indices
url = "https://finance.yahoo.com/world-indices/"

try:
    # Read all tables
    tables = pd.read_html(url)
    
    # The main table is typically the first one
    stock_data = tables[0]
    
    # Clean up the data
    stock_data = stock_data.dropna()
    
    # Convert percentage strings to floats
    stock_data['% Change'] = stock_data['% Change'].str.rstrip('%').astype('float') / 100.0
    
    # Sort by percentage change
    stock_data = stock_data.sort_values('% Change', ascending=False)
    
    print(stock_data[['Name', 'Last Price', '% Change']].head())
    
    # Create a simple bar chart of top performers
    plt.figure(figsize=(10, 6))
    plt.bar(stock_data['Name'].head(5), stock_data['% Change'].head(5))
    plt.title('Top Performing Stock Indices')
    plt.xlabel('Index')
    plt.ylabel('Percentage Change')
    plt.xticks(rotation=45)
    plt.tight_layout()
    # plt.show()  # Uncomment to display the plot
    
except Exception as e:
    print(f"An error occurred: {e}")
    print("This example may not work if the website structure has changed.")

Example 2: Extracting COVID-19 Data

Let's extract COVID-19 statistics from a public health website:

python
import pandas as pd

# URL with COVID-19 data
url = "https://www.worldometers.info/coronavirus/"

try:
    # Extract tables
    tables = pd.read_html(url)
    
    # Find the main COVID data table (usually has 'Country' and 'Total Cases' columns)
    covid_data = None
    for table in tables:
        if 'Country' in table.columns and 'Total Cases' in table.columns:
            covid_data = table
            break
    
    if covid_data is not None:
        # Clean and process the data
        covid_data = covid_data.iloc[8:20]  # Focus on top countries
        covid_data = covid_data.sort_values('Total Cases', ascending=False)
        
        # Display top countries by case count
        print("Top Countries by COVID-19 Cases:")
        print(covid_data[['Country', 'Total Cases', 'Total Deaths']].head(10))
    else:
        print("COVID-19 data table not found")
        
except Exception as e:
    print(f"An error occurred: {e}")
    print("This example may not work if the website structure has changed.")

Common Issues and Solutions

1. No tables found

If read_html() doesn't find any tables, check if:

The HTML actually contains <table> elements
The tables are generated by JavaScript (Pandas can't parse dynamically generated content)
You need a parser that can handle the specific HTML structure

python
# Try with a different parser
tables = pd.read_html(url, flavor='html5lib')

2. Authentication or Blocked Access

Some websites block web scraping:

python
import pandas as pd
import requests

# Create a session with headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

session = requests.Session()
response = session.get(url, headers=headers)

# Now use the HTML content from the response
tables = pd.read_html(response.text)

3. Handling Missing Values

Sometimes HTML tables have inconsistent data:

python
# Handle missing values during import
tables = pd.read_html(url, na_values=['', 'N/A', '--'])

# Or clean up after import
df = tables[0]
df = df.fillna(0)  # Replace NaN with 0

Best Practices

Always inspect the result - read_html() returns a list of DataFrames, so check how many tables were found
Check for data types - Convert columns to appropriate types after import
Clean up the data - Handle missing values, remove unwanted characters, etc.
Be respectful - Check a website's robots.txt and terms of service before scraping
Implement error handling - Websites change, so your code should handle failing gracefully

Summary

Pandas' read_html() function provides a powerful and convenient way to extract tabular data from HTML sources. With just a few lines of code, you can import data from websites, local HTML files, or HTML strings into DataFrame objects ready for analysis.

Key points to remember:

You can extract tables from URLs, HTML strings, or local files
Multiple tables can be extracted from a single source
Additional parameters help you filter and customize the extraction
Real-world applications include financial data analysis, public health statistics, and more
Error handling is important due to the dynamic nature of web content

Additional Resources and Exercises

Resources

Official Pandas read_html documentation
Beautiful Soup documentation for more advanced web scraping

Exercises

Basic: Import the table of Olympic medal counts from Wikipedia and create a bar chart of the top 10 countries by total medals.
Intermediate: Extract stock price data for major technology companies and calculate their average price-to-earnings ratio.
Advanced: Create a function that monitors a specific website table daily and alerts you if values change by more than a defined threshold.
Challenge: Combine read_html() with other web scraping techniques to extract both tabular and non-tabular data from a news website, then perform sentiment analysis on the results.

By mastering Pandas' HTML import capabilities, you'll be able to efficiently collect and analyze web data, opening up countless possibilities for data-driven projects and insights.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

Basic HTML Table Import​

Importing from a URL​

Importing from HTML String​

Importing from a Local HTML File​

Advanced Options with read_html()​

Selecting Specific Tables​

Specifying Table Headers​

Handling Complex Tables​

Working with Real-World Examples​

Example 1: Analyzing Stock Data​

Example 2: Extracting COVID-19 Data​

Common Issues and Solutions​

1. No tables found​

2. Authentication or Blocked Access​

3. Handling Missing Values​

Best Practices​

Summary​

Additional Resources and Exercises​

Resources​

Exercises​

Introduction

Prerequisites

Basic HTML Table Import

Importing from a URL

Importing from HTML String

Importing from a Local HTML File

Advanced Options with `read_html()`

Selecting Specific Tables

Specifying Table Headers

Handling Complex Tables

Working with Real-World Examples

Example 1: Analyzing Stock Data

Example 2: Extracting COVID-19 Data

Common Issues and Solutions

1. No tables found

2. Authentication or Blocked Access

3. Handling Missing Values

Best Practices

Summary

Additional Resources and Exercises

Resources

Exercises