Pandas Web Scraping
Introduction
Web scraping is a powerful technique that allows you to extract data from websites when no convenient API or download option is available. When combined with pandas' data manipulation capabilities, web scraping becomes an essential tool in a data scientist's toolkit. This guide will show you how to scrape web data and seamlessly integrate it with pandas DataFrames for analysis.
In this tutorial, you'll learn:
- The basics of web scraping and its ethical considerations
- How to use libraries like BeautifulSoup and requests with pandas
- Techniques for parsing HTML tables directly into pandas DataFrames
- How to clean and process scraped data for analysis
Prerequisites
Before we begin, make sure you have the following libraries installed:
pip install pandas requests beautifulsoup4 lxml html5lib
Understanding Web Scraping Basics
Web scraping involves programmatically extracting information from websites. The process typically follows these steps (a short sketch follows the list):
- Send an HTTP request to the URL of the webpage you want to access
- Parse the HTML content
- Extract the data you need
- Store the data in the desired format (in our case, a pandas DataFrame)
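To make those four steps concrete before we dig into each one, here is a minimal sketch of the whole loop, using the quotes.toscrape.com practice site that reappears later in this guide (extracting just the author names is an arbitrary choice for illustration):
import pandas as pd
import requests
from bs4 import BeautifulSoup
# Step 1: send an HTTP request to the page we want
response = requests.get('http://quotes.toscrape.com/')
# Step 2: parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: extract the data we need (here, just the author names)
authors = [tag.text for tag in soup.find_all('small', class_='author')]
# Step 4: store the data in a pandas DataFrame
authors_df = pd.DataFrame({'Author': authors})
print(authors_df.head())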
Ethical Considerations
Before scraping any website, always:
- Check the website's robots.txt file and terms of service (a robots.txt check is sketched after this list)
- Respect rate limits by adding delays between requests
- Only scrape public data that doesn't require authentication
- Consider using available APIs before resorting to scraping
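As a concrete example of the first point, Python's built-in urllib.robotparser can check a site's robots.txt before you send any scraping requests. This is a minimal sketch; the quotes.toscrape.com URLs are only an illustration:
from urllib.robotparser import RobotFileParser
# Point the parser at the site's robots.txt file
rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()
# Ask whether a generic crawler ('*') may fetch a given path
if rp.can_fetch('*', 'http://quotes.toscrape.com/page/2/'):
    print('Allowed to scrape this page')
else:
    print('robots.txt disallows this page - skip it')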
Method 1: Using pandas' Built-in HTML Table Parser
The simplest way to scrape tables from websites is to use pandas' built-in read_html() function, which can automatically extract HTML tables into DataFrames.
Example: Scraping Wikipedia Tables
Let's scrape a table listing countries by population from Wikipedia:
import pandas as pd
# URL of the page containing the table
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# Read all tables from the webpage
tables = pd.read_html(url)
# The tables variable is now a list of all tables found on the page
# Let's select the first table (which contains the population data)
population_df = tables[0]
# Display the first 5 rows
print(population_df.head())
Output:
Rank Country/Territory Population Date % of world Source
0 1 China 1,412,600,000 2023-01-17 17.7% Official estimate
1 2 India 1,375,586,000 2023-03-01 17.2% Official projection
2 3 United States[a][b] 333,287,557 2022-07-01 4.2% Official estimate
3 4 Indonesia 275,773,800 2022-07-01 3.5% Official estimate
4 5 Pakistan 235,825,000 2023-01-01 3.0% Official estimate
How It Works
The read_html() function:
- Downloads the HTML content from the specified URL
- Searches for <table> tags in the HTML
- Attempts to parse each table into a pandas DataFrame
- Returns a list of DataFrames (one for each table found)
Handling Multiple Tables
If a page has multiple tables, you need to identify which table you want:
# Check how many tables were found
print(f"Number of tables found: {len(tables)}")
# You can iterate through all tables to find the one you need
for i, table in enumerate(tables[:3]):  # Show first 3 tables
    print(f"\nTable {i} shape: {table.shape}")
    print(table.head(2))  # Print first 2 rows of each table
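If you already know something about the table you want, an alternative is to let read_html() do the filtering for you: its match parameter keeps only tables whose text matches a string or regex, and attrs restricts the search to <table> tags with specific HTML attributes. A hedged sketch (the "wikitable" class used below is common on Wikipedia, but check the page source first):
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# Only return tables that mention "Population" and carry the "wikitable" class
tables = pd.read_html(url, match="Population", attrs={"class": "wikitable"})
print(f"Tables matching the filters: {len(tables)}")
print(tables[0].head())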
Method 2: Using BeautifulSoup with pandas
When websites have complex structures or when tables aren't properly formatted, you might need a more flexible approach using BeautifulSoup.
Example: Scraping Data and Creating a DataFrame
import requests
from bs4 import BeautifulSoup
import pandas as pd
# URL to scrape - example with quotes from quotes.toscrape.com
url = "http://quotes.toscrape.com/"
response = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all quote divs
quotes = soup.find_all('div', class_='quote')
# Create lists to store our data
authors = []
texts = []
tags = []
# Extract the data we want
for quote in quotes:
    # Extract the quote text
    text = quote.find('span', class_='text').text
    texts.append(text)
    # Extract the author
    author = quote.find('small', class_='author').text
    authors.append(author)
    # Extract the tags (and join them with commas)
    quote_tags = quote.find('div', class_='tags').find_all('a', class_='tag')
    quote_tags_text = ', '.join([tag.text for tag in quote_tags])
    tags.append(quote_tags_text)
# Create a DataFrame
quotes_df = pd.DataFrame({
    'Quote': texts,
    'Author': authors,
    'Tags': tags
})
# Show the result
print(quotes_df.head())
Output:
Quote Author Tags
0 "The world as we have created it is a process ... Albert Einstein change, deep-thoughts, thinking, world
1 "It is our choices, Harry, that show what we t... J.K. Rowling abilities, choices, difficulty, harry-potter
2 "There are only two ways to live your life. On... Albert Einstein inspirational, life, live, miracle, miracles
3 "The person, be it gentleman or lady, who has ... Jane Austen aliteracy, books, classic, humor
4 "Imperfection is beauty, madness is genius and... Marilyn Monroe be-yourself, inspirational
Method 3: Scraping Paginated Content
Many websites split content across multiple pages. Here's how to handle pagination:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
def scrape_quotes_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    data = []
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        quote_tags = quote.find('div', class_='tags').find_all('a', class_='tag')
        tags = ', '.join([tag.text for tag in quote_tags])
        data.append({
            'Quote': text,
            'Author': author,
            'Tags': tags
        })
    # Check if there's a next page
    next_page = soup.find('li', class_='next')
    next_url = None
    if next_page:
        next_url = 'http://quotes.toscrape.com' + next_page.find('a')['href']
    return data, next_url
# Start with the first page
current_url = 'http://quotes.toscrape.com/'
all_quotes = []
# Limit to 3 pages for this example (remove this limit for all pages)
page_count = 0
max_pages = 3
while current_url and page_count < max_pages:
    print(f"Scraping page: {current_url}")
    page_quotes, current_url = scrape_quotes_page(current_url)
    all_quotes.extend(page_quotes)
    page_count += 1
    # Be nice to the website by adding a delay between requests
    time.sleep(1)
# Create the final DataFrame
quotes_df = pd.DataFrame(all_quotes)
# Show stats
print(f"\nTotal quotes collected: {len(quotes_df)}")
print(f"Number of unique authors: {quotes_df['Author'].nunique()}")
print("\nQuotes per author:")
print(quotes_df['Author'].value_counts().head())
Method 4: Scraping Data from Dynamic Websites
Some modern websites load content dynamically using JavaScript. For these sites, you'll need to use a browser automation tool like Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
# Set up the driver (this will download the appropriate ChromeDriver if needed)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Example with a JavaScript-heavy website
driver.get('https://quotes.toscrape.com/js/')
# Wait for the content to load
time.sleep(3)  # You can use more sophisticated waits (see the explicit-wait sketch after this example)
# Now extract the data
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
data = []
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    # Get all tag elements
    tag_elements = quote.find_elements(By.CLASS_NAME, 'tag')
    tags = ', '.join([tag.text for tag in tag_elements])
    data.append({
        'Quote': text,
        'Author': author,
        'Tags': tags
    })
# Close the browser
driver.quit()
# Create a DataFrame
dynamic_quotes_df = pd.DataFrame(data)
print(dynamic_quotes_df.head())
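The fixed time.sleep(3) above is the simplest option, but Selenium also offers explicit waits that block only until the content actually appears. A minimal sketch, assuming the driver from the example above is still open (run it before driver.quit()) and By is imported as before:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one element with class "quote" to appear
wait = WebDriverWait(driver, 10)
quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote')))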
Real-World Application: Financial Data Analysis
Let's build a practical application that scrapes stock data and uses pandas for analysis:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
# Example: Getting S&P 500 companies from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
sp500_companies = tables[0]
# Display first 5 companies
print("S&P 500 Companies:")
print(sp500_companies[['Symbol', 'Security', 'GICS Sector']].head())
# Let's get historical stock data for a specific company using Yahoo Finance
# Note: This is a simplified example. Yahoo Finance structure may change.
def get_stock_data(ticker, period='1y'):
    url = f"https://query1.finance.yahoo.com/v8/finance/chart/{ticker}?interval=1d&range={period}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    data = response.json()
    # Extract timestamp and close price
    timestamps = data['chart']['result'][0]['timestamp']
    close_prices = data['chart']['result'][0]['indicators']['quote'][0]['close']
    # Convert to DataFrame
    df = pd.DataFrame({
        'Date': pd.to_datetime(timestamps, unit='s'),
        'Close': close_prices
    })
    df.set_index('Date', inplace=True)
    return df
# Example usage for Apple (AAPL)
try:
    apple_data = get_stock_data('AAPL')
    print("\nApple Stock Data (Last 5 days):")
    print(apple_data.tail())
    # Create a simple plot
    plt.figure(figsize=(10, 6))
    apple_data['Close'].plot(title='Apple Stock Price (1 Year)')
    plt.grid(True)
    plt.xlabel('Date')
    plt.ylabel('Close Price ($)')
    plt.savefig('apple_stock.png')  # Save the figure
    print("\nStock chart saved as 'apple_stock.png'")
except Exception as e:
    print(f"Error fetching stock data: {str(e)}")
    print("This example might not work if Yahoo Finance has changed their API structure.")
Challenges and Best Practices
Handling Errors
When web scraping, you'll encounter various issues. Here's how to handle them:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import random
from io import StringIO
def robust_scraper(url, max_retries=3):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    for attempt in range(max_retries):
        try:
            # Add a random delay to be respectful
            time.sleep(random.uniform(1, 3))
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raise an exception for HTTP errors
            # If we get here, the request was successful
            return response.text
        except requests.exceptions.HTTPError as e:
            print(f"HTTP Error: {e} (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.ConnectionError:
            print(f"Connection Error (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.Timeout:
            print(f"Timeout Error (Attempt {attempt + 1} of {max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Error: {e} (Attempt {attempt + 1} of {max_retries})")
        # If we get here, the request failed
        if attempt < max_retries - 1:
            # Wait longer before each retry
            time.sleep(random.uniform(2, 5) * (attempt + 1))
    # If we get here, all attempts failed
    raise Exception(f"Failed to retrieve data after {max_retries} attempts")
# Example usage
try:
    html_content = robust_scraper('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
    # Newer pandas versions expect literal HTML to be wrapped in a file-like object
    tables = pd.read_html(StringIO(html_content))
    gdp_data = tables[2]  # The third table has the IMF data
    print(gdp_data.head())
except Exception as e:
    print(f"Scraping failed: {e}")
Cleaning Scraped Data
Scraped data often requires cleaning. Here's a typical workflow:
# Example: Cleaning a scraped table with messy data
import pandas as pd
import re
# Assume we've scraped a table with messy currency values and dates
data = {
    'Country': ['United States', 'China', 'Japan', 'Germany', 'India'],
    'GDP (USD)': ['$21,433.2 billion', '$16,642.3 billion', '$5,378.1 billion',
                  '$4,319.3 billion', '$3,176.3 billion'],
    'Last Updated': ['Jan 2022', 'Dec 2021', 'Jan 2022', 'Nov 2021', 'Jan 2022']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Clean the GDP column by removing $ and "billion", and converting to float
df['GDP (USD)'] = df['GDP (USD)'].apply(
    lambda x: float(re.sub(r'[^\d.]', '', x.replace('billion', '')))
)
# Rename the column to make the units (billions of USD) explicit
df.rename(columns={'GDP (USD)': 'GDP (Billions USD)'}, inplace=True)
# Parse the dates
df['Last Updated'] = pd.to_datetime(df['Last Updated'], format='%b %Y')
print("\nCleaned DataFrame:")
print(df)
# Add a new calculated column
df['GDP Per Capita (USD)'] = [65000, 11900, 42900, 51800, 2300] # Example values
print("\nWith additional data:")
print(df)
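Scraped tables also often contain thousands separators, footnote markers like the "[a][b]" seen in the Wikipedia output earlier, and placeholder strings such as "N/A". A hedged sketch of handling those with str.replace and pd.to_numeric (the sample values below are made up for illustration):
import pandas as pd
# Hypothetical messy column, similar to what read_html often returns
messy = pd.Series(['1,234', '2,345[a]', 'N/A', '3,456'])
# Strip everything except digits and dots, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(messy.str.replace(r'[^\d.]', '', regex=True), errors='coerce')
print(cleaned)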
Summary
In this tutorial, you learned how to:
- Use pandas' built-in functions to scrape HTML tables
- Work with BeautifulSoup to scrape more complex websites
- Handle pagination to collect data from multiple pages
- Scrape dynamic websites using Selenium
- Apply these techniques to real-world data analysis tasks
- Handle errors and clean scraped data
Web scraping combined with pandas provides a powerful workflow for collecting and analyzing data that is not readily available through APIs or downloads. By following ethical guidelines and implementing robust error handling, you can reliably gather data for your analysis needs.
Additional Resources
- pandas documentation on read_html()
- BeautifulSoup documentation
- Selenium with Python documentation
- Web Scraping Ethics and Legal Considerations
Exercises
- Basic Exercise: Scrape a Wikipedia table of your choice and perform basic data cleaning.
- Intermediate Exercise: Create a script that scrapes weather data from a weather website for multiple cities and compares their average temperatures.
- Advanced Exercise: Build a dashboard that automatically scrapes financial data for a list of stocks, analyzes their performance, and generates visualizations.
- Challenge: Create a web scraper that extracts product information from an e-commerce site (that permits scraping) and identifies price trends or product availability patterns.
Remember to always respect websites' terms of service and robots.txt files when scraping, and consider using official APIs when available.