Pandas Web Scraping
Introduction
Web scraping is a powerful technique that allows you to extract data from websites when no convenient API or download option is available. When combined with pandas' data manipulation capabilities, web scraping becomes an essential tool in a data scientist's toolkit. This guide will show you how to scrape web data and seamlessly integrate it with pandas DataFrames for analysis.
In this tutorial, you'll learn:
- The basics of web scraping and its ethical considerations
- How to use libraries like BeautifulSoup and requests with pandas
- Techniques for parsing HTML tables directly into pandas DataFrames
- How to clean and process scraped data for analysis
Prerequisites
Before we begin, make sure you have the following libraries installed:
pip install pandas requests beautifulsoup4 lxml html5lib
Understanding Web Scraping Basics
Web scraping involves programmatically extracting information from websites. The process typically follows these steps (a short sketch follows the list):
- Send an HTTP request to the URL of the webpage you want to access
- Parse the HTML content
- Extract the data you need
- Store the data in the desired format (in our case, a pandas DataFrame)
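To make those four steps concrete before we dig into each one, here is a minimal sketch of the whole loop, using the quotes.toscrape.com practice site that reappears later in this guide (extracting just the author names is an arbitrary choice for illustration):
import pandas as pd
import requests
from bs4 import BeautifulSoup
# Step 1: send an HTTP request to the page we want
response = requests.get('http://quotes.toscrape.com/')
# Step 2: parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: extract the data we need (here, just the author names)
authors = [tag.text for tag in soup.find_all('small', class_='author')]
# Step 4: store the data in a pandas DataFrame
authors_df = pd.DataFrame({'Author': authors})
print(authors_df.head())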
Ethical Considerations
Before scraping any website, always:
- Check the website's robots.txt file and terms of service (a robots.txt check is sketched after this list)
- Respect rate limits by adding delays between requests
- Only scrape public data that doesn't require authentication
- Consider using available APIs before resorting to scraping
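As a concrete example of the first point, Python's built-in urllib.robotparser can check a site's robots.txt before you send any scraping requests. This is a minimal sketch; the quotes.toscrape.com URLs are only an illustration:
from urllib.robotparser import RobotFileParser
# Point the parser at the site's robots.txt file
rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()
# Ask whether a generic crawler ('*') may fetch a given path
if rp.can_fetch('*', 'http://quotes.toscrape.com/page/2/'):
    print('Allowed to scrape this page')
else:
    print('robots.txt disallows this page - skip it')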
Method 1: Using pandas' Built-in HTML Table Parser
The simplest way to scrape tables from websites is to use pandas' built-in read_html() function, which can automatically extract HTML tables into DataFrames.
Example: Scraping Wikipedia Tables
Let's scrape a table listing countries by population from Wikipedia:
import pandas as pd
# URL of the page containing the table
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# Read all tables from the webpage
tables = pd.read_html(url)
# The tables variable is now a list of all tables found on the page
# Let's select the first table (which contains the population data)
population_df = tables[0]
# Display the first 5 rows
print(population_df.head())
Output:
Rank Country/Territory Population Date % of world Source
0 1 China 1,412,600,000 2023-01-17 17.7% Official estimate
1 2 India 1,375,586,000 2023-03-01 17.2% Official projection
2 3 United States[a][b] 333,287,557 2022-07-01 4.2% Official estimate
3 4 Indonesia 275,773,800 2022-07-01 3.5% Official estimate
4 5 Pakistan 235,825,000 2023-01-01 3.0% Official estimate
How It Works
The read_html() function:
- Downloads the HTML content from the specified URL
- Searches for <table> tags in the HTML
- Attempts to parse each table into a pandas DataFrame
- Returns a list of DataFrames (one for each table found)
Handling Multiple Tables
If a page has multiple tables, you need to identify which table you want:
# Check how many tables were found
print(f"Number of tables found: {len(tables)}")
# You can iterate through all tables to find the one you need
for i, table in enumerate(tables[:3]):  # Show first 3 tables
    print(f"\nTable {i} shape: {table.shape}")
    print(table.head(2))  # Print first 2 rows of each table
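If you already know something about the table you want, an alternative is to let read_html() do the filtering for you: its match parameter keeps only tables whose text matches a string or regex, and attrs restricts the search to <table> tags with specific HTML attributes. A hedged sketch (the "wikitable" class used below is common on Wikipedia, but check the page source first):
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# Only return tables that mention "Population" and carry the "wikitable" class
tables = pd.read_html(url, match="Population", attrs={"class": "wikitable"})
print(f"Tables matching the filters: {len(tables)}")
print(tables[0].head())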
Method 2: Using BeautifulSoup with pandas
When websites have complex structures or when tables aren't properly formatted, you might need a more flexible approach using BeautifulSoup.
Example: Scraping Data and Creating a DataFrame
import requests
from bs4 import BeautifulSoup
import pandas as pd
# URL to scrape - example with quotes from quotes.toscrape.com
url = "http://quotes.toscrape.com/"
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all quote divs
quotes = soup.find_all('div', class_='quote')
# Create lists to store our data
authors = []
texts = []
tags = []
# Extract the data we want
for quote in quotes:
    # Extract the quote text
    text = quote.find('span', class_='text').text
    texts.append(text)
    # Extract the author
    author = quote.find('small', class_='author').text
    authors.append(author)
    # Extract the tags (and join them with commas)
    quote_tags = quote.find('div', class_='tags').find_all('a', class_='tag')
    quote_tags_text = ', '.join([tag.text for tag in quote_tags])
    tags.append(quote_tags_text)
# Create a DataFrame
quotes_df = pd.DataFrame({
    'Quote': texts,
    'Author': authors,
    'Tags': tags
})
# Show the result
print(quotes_df.head())
Output:
Quote Author Tags
0 "The world as we have created it is a process ... Albert Einstein change, deep-thoughts, thinking, world
1 "It is our choices, Harry, that show what we t... J.K. Rowling abilities, choices, difficulty, harry-potter
2 "There are only two ways to live your life. On... Albert Einstein inspirational, life, live, miracle, miracles
3 "The person, be it gentleman or lady, who has ... Jane Austen aliteracy, books, classic, humor
4 "Imperfection is beauty, madness is genius and... Marilyn Monroe be-yourself, inspirational
Method 3: Scraping Paginated Content
Many websites split content across multiple pages. Here's how to handle pagination:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
def scrape_quotes_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    data = []
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        quote_tags = quote.find('div', class_='tags').find_all('a', class_='tag')
        tags = ', '.join([tag.text for tag in quote_tags])
        data.append({
            'Quote': text,
            'Author': author,
            'Tags': tags
        })
    # Check if there's a next page
    next_page = soup.find('li', class_='next')
    next_url = None
    if next_page:
        next_url = 'http://quotes.toscrape.com' + next_page.find('a')['href']
    return data, next_url
# Start with the first page
current_url = 'http://quotes.toscrape.com/'
all_quotes = []
# Limit to 3 pages for this example (remove this limit for all pages)
page_count = 0
max_pages = 3
while current_url and page_count < max_pages:
    print(f"Scraping page: {current_url}")
    page_quotes, current_url = scrape_quotes_page(current_url)
    all_quotes.extend(page_quotes)
    page_count += 1
    # Be nice to the website by adding a delay between requests
    time.sleep(1)
# Create the final DataFrame
quotes_df = pd.DataFrame(all_quotes)
# Show stats
print(f"\nTotal quotes collected: {len(quotes_df)}")
print(f"Number of unique authors: {quotes_df['Author'].nunique()}")
print("\nQuotes per author:")
print(quotes_df['Author'].value_counts().head())
Method 4: Scraping Data from Dynamic Websites
Some modern websites load content dynamically using JavaScript. For these sites, you'll need to use a browser automation tool like Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
# Set up the driver (this will download the appropriate ChromeDriver if needed)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Example with a JavaScript-heavy website
driver.get('https://quotes.toscrape.com/js/')
# Wait for the content to load
time.sleep(3)  # You can use more sophisticated waits (see the explicit-wait sketch after this example)
# Now extract the data
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
data = []
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    # Get all tag elements
    tag_elements = quote.find_elements(By.CLASS_NAME, 'tag')
    tags = ', '.join([tag.text for tag in tag_elements])
    data.append({
        'Quote': text,
        'Author': author,
        'Tags': tags
    })
# Close the browser
driver.quit()
# Create a DataFrame
dynamic_quotes_df = pd.DataFrame(data)
print(dynamic_quotes_df.head())
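The fixed time.sleep(3) above is the simplest option, but Selenium also offers explicit waits that block only until the content actually appears. A minimal sketch, assuming the driver from the example above is still open (run it before driver.quit()) and By is imported as before:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one element with class "quote" to appear
wait = WebDriverWait(driver, 10)
quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote')))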
Real-World Application: Financial Data Analysis
Let's build a practical application that scrapes stock data and uses pandas for analysis:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
# Example: Getting S&P 500 companies from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500_companies = tables[0]
# Display first 5 companies
print("S&P 500 Companies:")
print(sp500_companies[['Symbol', 'Security', 'GICS Sector']].head())
# Let's get historical stock data for a specific company using Yahoo Finance
# Note: This is a simplified example. Yahoo Finance structure may change.
def get_stock_data(ticker, period='1y'):
    url = f"https://query1.finance.yahoo.com/v8/finance/chart/{ticker}?interval=1d&range={period}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    data = response.json()
    # Extract timestamp and close price
    timestamps = data['chart']['result'][0]['timestamp']
    close_prices = data['chart']['result'][0]['indicators']['quote'][0]['close']
    # Convert to DataFrame
    df = pd.DataFrame({
        'Date': pd.to_datetime(timestamps, unit='s'),
        'Close': close_prices
    })
    df.set_index('Date', inplace=True)
    return df
# Example usage for Apple (AAPL)
try:
    apple_data = get_stock_data('AAPL')
    print("\nApple Stock Data (Last 5 days):")
    print(apple_data.tail())
    # Create a simple plot
    plt.figure(figsize=(10, 6))
    apple_data['Close'].plot(title='Apple Stock Price (1 Year)')
    plt.grid(True)
    plt.xlabel('Date')
    plt.ylabel('Close Price ($)')
    plt.savefig('apple_stock.png')  # Save the figure
    print("\nStock chart saved as 'apple_stock.png'")
except Exception as e:
    print(f"Error fetching stock data: {str(e)}")
    print("This example might not work if Yahoo Finance has changed their API structure.")
Challenges and Best Practices
Handling Errors
When web scraping, you'll encounter various issues. Here's how to handle them:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import random
from io import StringIO
def robust_scraper(url, max_retries=3):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    for attempt in range(max_retries):
        try:
            # Add a random delay to be respectful
            time.sleep(random.uniform(1, 3))
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raise an exception for HTTP errors
            # If we get here, the request was successful
            return response.text
        except requests.exceptions.HTTPError as e:
            print(f"HTTP Error: {e} (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.ConnectionError:
            print(f"Connection Error (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.Timeout:
            print(f"Timeout Error (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Error: {e} (Attempt {attempt + 1} of {max_retries})")
        # If we get here, the request failed
        if attempt < max_retries - 1:
            # Wait longer before each retry
            time.sleep(random.uniform(2, 5) * (attempt + 1))
    # If we get here, all attempts failed
    raise Exception(f"Failed to retrieve data after {max_retries} attempts")
# Example usage
try:
    html_content = robust_scraper('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
    # Newer pandas versions expect literal HTML to be wrapped in a file-like object
    tables = pd.read_html(StringIO(html_content))
    gdp_data = tables[2]  # The third table has the IMF data
    print(gdp_data.head())
except Exception as e:
    print(f"Scraping failed: {e}")
Cleaning Scraped Data
Scraped data often requires cleaning. Here's a typical workflow:
# Example: Cleaning a scraped table with messy data
import pandas as pd
import re
# Assume we've scraped a table with messy currency values and dates
data = {
    'Country': ['United States', 'China', 'Japan', 'Germany', 'India'],
    'GDP (USD)': ['$21,433.2 billion', '$16,642.3 billion', '$5,378.1 billion',
                  '$4,319.3 billion', '$3,176.3 billion'],
    'Last Updated': ['Jan 2022', 'Dec 2021', 'Jan 2022', 'Nov 2021', 'Jan 2022']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Clean the GDP column by removing $ and "billion", and converting to float
df['GDP (USD)'] = df['GDP (USD)'].apply(
    lambda x: float(re.sub(r'[^\d.]', '', x.replace('billion', '')))
)
# Rename the column to make the units (billions of USD) explicit
df.rename(columns={'GDP (USD)': 'GDP (Billions USD)'}, inplace=True)
# Parse the dates
df['Last Updated'] = pd.to_datetime(df['Last Updated'], format='%b %Y')
print("\nCleaned DataFrame:")
print(df)
# Add a new calculated column
df['GDP Per Capita (USD)'] = [65000, 11900, 42900, 51800, 2300] # Example values
print("\nWith additional data:")
print(df)
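Scraped tables also often contain thousands separators, footnote markers like the "[a][b]" seen in the Wikipedia output earlier, and placeholder strings such as "N/A". A hedged sketch of handling those with str.replace and pd.to_numeric (the sample values below are made up for illustration):
import pandas as pd
# Hypothetical messy column, similar to what read_html often returns
messy = pd.Series(['1,234', '2,345[a]', 'N/A', '3,456'])
# Strip everything except digits and dots, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(messy.str.replace(r'[^\d.]', '', regex=True), errors='coerce')
print(cleaned)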
Summary
In this tutorial, you learned how to:
- Use pandas' built-in functions to scrape HTML tables
- Work with BeautifulSoup to scrape more complex websites
- Handle pagination to collect data from multiple pages
- Scrape dynamic websites using Selenium
- Apply these techniques to real-world data analysis tasks
- Handle errors and clean scraped data
Web scraping combined with pandas provides a powerful workflow for collecting and analyzing data that is not readily available through APIs or downloads. By following ethical guidelines and implementing robust error handling, you can reliably gather data for your analysis needs.
Additional Resources
- pandas documentation on read_html()
- BeautifulSoup documentation
- Selenium with Python documentation
- Web Scraping Ethics and Legal Considerations
Exercises
- Basic Exercise: Scrape a Wikipedia table of your choice and perform basic data cleaning.
- Intermediate Exercise: Create a script that scrapes weather data from a weather website for multiple cities and compares their average temperatures.
- Advanced Exercise: Build a dashboard that automatically scrapes financial data for a list of stocks, analyzes their performance, and generates visualizations.
- Challenge: Create a web scraper that extracts product information from an e-commerce site (that permits scraping) and identifies price trends or product availability patterns.
Remember to always respect websites' terms of service and robots.txt files when scraping, and consider using official APIs when available.