Python Web Scraping
Introduction
Web scraping is the process of automatically extracting information from websites. It's an essential skill for any Python developer who works with data or builds web applications. With web scraping, you can collect data that isn't available through APIs, automate repetitive tasks, or monitor websites for changes.
In this tutorial, you'll learn:
- The basics of web scraping ethics and legality
- How to use Python libraries for web scraping
- How to parse HTML content and extract specific data
- How to handle common challenges in web scraping
- Building practical web scraping projects
Prerequisites
Before starting, make sure you have:
- Basic knowledge of Python
- Understanding of HTML structure
- Python installed on your system
- Familiarity with pip (Python package installer)
Understanding Web Scraping Ethics
Before diving into the technical aspects, it's important to understand the ethical and legal considerations of web scraping:
- Check Terms of Service: Always review a website's Terms of Service before scraping it
- Respect robots.txt: This file outlines which parts of a site may be crawled (see the example after this list)
- Throttle your requests: Don't overload servers with too many requests
- Identify your scraper: Use custom user-agents to identify your bot
- Data usage: Be mindful of how you use and share scraped data
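To make the robots.txt point concrete, here is a minimal sketch using Python's built-in urllib.robotparser. The site URL and the "MyTutorialBot" user-agent string are just illustrative placeholders:

from urllib import robotparser

# Point the parser at the site's robots.txt (example URL)
rp = robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

# Ask whether our (hypothetical) bot may fetch a specific path
user_agent = "MyTutorialBot"
url = "https://quotes.toscrape.com/page/2/"
if rp.can_fetch(user_agent, url):
    print("Allowed to scrape:", url)
else:
    print("Disallowed by robots.txt:", url)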
Required Libraries
Let's install the main libraries we'll use for web scraping:
pip install requests beautifulsoup4 lxml
These libraries serve different purposes:
- Requests: Sends HTTP requests to websites and retrieves their content
- BeautifulSoup4: Parses HTML and XML documents for easy data extraction
- lxml: A fast HTML parser that works with BeautifulSoup
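As a quick illustration of how these pieces fit together (a minimal sketch: requests fetches the raw HTML, and lxml is passed to BeautifulSoup as its parser backend, with Python's built-in html.parser shown as a slower fallback):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://quotes.toscrape.com").text

# Fast lxml parser (requires the lxml package)
soup_fast = BeautifulSoup(html, 'lxml')

# The built-in parser also works, just more slowly
soup_builtin = BeautifulSoup(html, 'html.parser')

print(soup_fast.title.text)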
Basic Web Scraping with Requests and BeautifulSoup
Let's start with a simple example to scrape a webpage:
import requests
from bs4 import BeautifulSoup
# Send a GET request to the URL
url = "https://quotes.toscrape.com"
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'lxml')

    # Print the title of the page
    print(soup.title.text)

    # Find all quote elements
    quotes = soup.find_all('span', class_='text')

    # Print the first 5 quotes
    for i, quote in enumerate(quotes[:5], 1):
        print(f"{i}. {quote.text}")
else:
    print(f"Failed to retrieve the page: {response.status_code}")
Output:
Quotes to Scrape
1. "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
2. "It is our choices, Harry, that show what we truly are, far more than our abilities."
3. "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
4. "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
5. "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
Breaking Down the Code
- We import the requests library to make HTTP requests and BeautifulSoup to parse HTML.
- We send a GET request to "quotes.toscrape.com" and store the response.
- We check if the request was successful (status code 200).
- We create a BeautifulSoup object to parse the HTML content.
- We extract the page title and print it.
- We find all elements with the tag span and class text (which contain the quotes).
- We print the first 5 quotes.
Navigating HTML Structure
BeautifulSoup provides several methods to navigate through HTML and find elements:
Find Elements by Tags
# Find the first h1 tag
heading = soup.find('h1')
# Find all paragraph tags
paragraphs = soup.find_all('p')
# Find elements with specific attributes
links = soup.find_all('a', href=True)
CSS Selectors
BeautifulSoup also supports CSS selectors for more complex queries:
# Find elements using CSS selectors
results = soup.select('div.quote span.text')
# Find nested elements
authors = soup.select('div.quote span.author')
Practical Example: Scraping Product Information
Let's build a more practical example by scraping product information from a fictional bookstore website:
import requests
from bs4 import BeautifulSoup
import csv
def scrape_books():
    books = []
    url = "http://books.toscrape.com/"
    response = requests.get(url)

    if response.status_code != 200:
        print("Failed to retrieve the page")
        return books

    soup = BeautifulSoup(response.text, 'lxml')

    # Find all book containers
    book_containers = soup.select('article.product_pod')

    for book in book_containers:
        # Extract book details
        title = book.h3.a['title']
        price = book.select_one('div.product_price p.price_color').text
        availability = book.select_one('div.product_price p.availability').text.strip()
        rating = book.p['class'][1]  # The rating is stored as a class name

        books.append({
            'title': title,
            'price': price,
            'availability': availability,
            'rating': rating
        })

    return books

# Run the scraper and save to CSV
books = scrape_books()

# Print the first 3 books
for book in books[:3]:
    print(f"Title: {book['title']}")
    print(f"Price: {book['price']}")
    print(f"Availability: {book['availability']}")
    print(f"Rating: {book['rating']}")
    print("-" * 50)

# Save to CSV
with open('books.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'price', 'availability', 'rating']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for book in books:
        writer.writerow(book)

print(f"Total books scraped: {len(books)}")
Output:
Title: A Light in the Attic
Price: £51.77
Availability: In stock
Rating: Three
--------------------------------------------------
Title: Tipping the Velvet
Price: £53.74
Availability: In stock
Rating: One
--------------------------------------------------
Title: Soumission
Price: £50.10
Availability: In stock
Rating: One
--------------------------------------------------
Total books scraped: 20
This example demonstrates a more complex scraping task where we:
- Extract multiple pieces of information from each product
- Organize the data in a structured format
- Save the results to a CSV file for further analysis
Handling Pagination
Many websites display content across multiple pages. To scrape all available data, we need to implement pagination:
def scrape_all_book_pages():
    base_url = "http://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    page = 1

    while True:
        url = base_url.format(page)
        response = requests.get(url)

        # Check if we've reached the end of available pages
        if response.status_code != 200:
            break

        print(f"Scraping page {page}...")
        soup = BeautifulSoup(response.text, 'lxml')

        # Find all book containers
        book_containers = soup.select('article.product_pod')

        # If no books found, break
        if not book_containers:
            break

        for book in book_containers:
            # Extract book details (same as before)
            title = book.h3.a['title']
            price = book.select_one('div.product_price p.price_color').text

            all_books.append({
                'title': title,
                'price': price,
                'page': page
            })

        page += 1

    return all_books

# Run the pagination scraper
all_books = scrape_all_book_pages()
print(f"Total books scraped: {len(all_books)}")
print(f"Total pages scraped: {all_books[-1]['page']}")
Handling Dynamic Content with Selenium
Some websites load content dynamically using JavaScript. In these cases, we need to use Selenium, a browser automation tool:
# First install the packages: pip install selenium webdriver-manager
# webdriver-manager automatically downloads a matching driver for your browser of choice
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Set up the driver
options = Options()
options.add_argument("--headless") # Run in headless mode (no browser window)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# Navigate to a page with dynamic content
driver.get("https://quotes.toscrape.com/js/")
# Wait for JavaScript to load the content
time.sleep(2)
# Now extract the data
quotes = driver.find_elements(By.CLASS_NAME, "quote")
for quote in quotes[:3]:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    print(f"{text} - {author}")
# Close the browser
driver.quit()
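The fixed time.sleep(2) above is simple but fragile: if the page loads slowly, the content may not be there yet. A common alternative is an explicit wait with Selenium's WebDriverWait and expected_conditions, sketched below assuming the same driver setup as above (run it before driver.quit()):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Re-load the page, then wait up to 10 seconds for quote elements to appear
driver.get("https://quotes.toscrape.com/js/")
quotes = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "quote"))
)
print(f"Loaded {len(quotes)} quotes")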
Best Practices for Web Scraping
- Rate Limiting: Add delays between requests to avoid overloading the server
import time
import random
# Add a random delay between 1-3 seconds
time.sleep(random.uniform(1, 3))
- User-Agent Rotation: Vary your user-agent to mimic different browsers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
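The snippet above sets a single User-Agent. To actually rotate, you can keep a small list and pick one at random for each request (a minimal sketch; the strings are just examples of common browser user-agents):

import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
]

# Choose a different user-agent for each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)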
- Use Proxies: Rotate IP addresses to avoid being blocked
proxies = {
    'http': 'http://10.10.10.10:8000',
    'https': 'http://10.10.10.10:8000'
}
response = requests.get(url, proxies=proxies)
- Handle Errors Gracefully: Implement error handling and retries
def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for error status codes
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt + 1 < max_retries:
                time.sleep(5)  # Wait before retrying
    return None
- Save Progress: For large scraping jobs, save progress incrementally
import json
def save_progress(data, filename="scrape_progress.json"):
    with open(filename, 'w') as f:
        json.dump(data, f)

def load_progress(filename="scrape_progress.json"):
    try:
        with open(filename, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return []
Advanced Project: Building a News Aggregator
Let's put everything together in a more advanced project that scrapes news headlines from multiple sources (this example also uses pandas for saving the results, installed with pip install pandas):
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
from datetime import datetime
class NewsAggregator:
    def __init__(self):
        self.sources = {
            'BBC': {
                'url': 'https://www.bbc.com/news',
                'headline_selector': 'h3.gs-c-promo-heading__title'
            },
            'CNN': {
                'url': 'https://www.cnn.com',
                'headline_selector': 'span.container__headline-text'
            }
        }
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_headlines(self, source_name):
        source_info = self.sources.get(source_name)
        if not source_info:
            print(f"Source {source_name} not found!")
            return []

        print(f"Fetching headlines from {source_name}...")
        try:
            response = requests.get(source_info['url'], headers=self.headers)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')
            headlines = soup.select(source_info['headline_selector'])

            results = []
            for headline in headlines:
                results.append({
                    'source': source_name,
                    'headline': headline.text.strip(),
                    'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                })

            print(f"Found {len(results)} headlines")
            return results
        except Exception as e:
            print(f"Error scraping {source_name}: {e}")
            return []

    def get_all_headlines(self):
        all_headlines = []
        for source_name in self.sources.keys():
            headlines = self.get_headlines(source_name)
            all_headlines.extend(headlines)
            # Be nice to the websites
            time.sleep(random.uniform(2, 5))
        return all_headlines

    def save_to_csv(self, headlines, filename="headlines.csv"):
        if not headlines:
            print("No headlines to save!")
            return
        df = pd.DataFrame(headlines)
        df.to_csv(filename, index=False)
        print(f"Saved {len(headlines)} headlines to {filename}")

# Run the aggregator
if __name__ == "__main__":
    aggregator = NewsAggregator()
    headlines = aggregator.get_all_headlines()

    # Display some sample headlines
    for headline in headlines[:5]:
        print(f"[{headline['source']}] {headline['headline']}")

    # Save all headlines
    aggregator.save_to_csv(headlines)
This advanced example demonstrates:
- Object-oriented approach to web scraping
- Handling multiple sources with different HTML structures
- Error handling and proper delays
- Data processing and storage
Common Challenges and Solutions
1. Handling CAPTCHAs
Some websites implement CAPTCHAs to prevent scraping:
- Use specialized CAPTCHA solving services
- Implement browser automation with tools like Selenium
- Consider using legitimate APIs instead of scraping
2. Websites that Block Scrapers
Strategies for bypassing blocking mechanisms:
- Respect robots.txt and terms of service
- Implement delays between requests
- Rotate user agents and IP addresses
- Use headless browsers that execute JavaScript
3. Dealing with Website Changes
Websites frequently update their structure, breaking scrapers:
- Design your code to be resilient to minor changes
- Implement logging and monitoring
- Set up alerts when scraping patterns fail
- Use more robust selectors that are less likely to change
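One way to make extraction more resilient is to try several candidate selectors in order and log when none of them match. Below is a minimal sketch; the selector strings and the extract_first helper are illustrative, not tied to any specific site:

import logging

logging.basicConfig(level=logging.WARNING)

def extract_first(soup, selectors):
    """Try each CSS selector in turn; return the first match's text, or None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element.text.strip()
    logging.warning("No selector matched: %s", selectors)
    return None

# Usage: prefer the current selector, fall back to older/alternative ones
# (assumes 'soup' is a BeautifulSoup object for the page being scraped)
title = extract_first(soup, ['h1.product-title', 'h1.title', 'h1'])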
Summary
In this tutorial, you've learned:
- The basics of web scraping with Python
- How to extract data using requests and BeautifulSoup
- Handling pagination and dynamic content
- Best practices for ethical and efficient scraping
- Building practical web scraping applications
Web scraping is a powerful skill that opens up vast possibilities for data collection, automation, and analysis. Remember to use these techniques responsibly and ethically, respecting website terms of service and robots.txt directives.
Additional Resources
- Books and Documentation:
  - BeautifulSoup Documentation
  - Requests Library Documentation
  - "Web Scraping with Python" by Ryan Mitchell
- Practice Websites:
  - quotes.toscrape.com and books.toscrape.com (the sandbox sites used throughout this tutorial)
- Advanced Topics:
  - Scrapy Framework - For building more scalable scrapers
  - Playwright - Modern alternative to Selenium
  - API Integration - When APIs are available instead of scraping
Exercises
- Basic Exercise: Scrape a list of popular movies from IMDb's front page.
- Intermediate Exercise: Create a weather tracker that scrapes forecast data from a weather site.
- Advanced Exercise: Build a job listing aggregator that collects job postings from multiple sources.
- Challenge: Implement a web scraper that monitors a product's price on an e-commerce site and sends an email alert when the price drops.
Remember that the best way to learn web scraping is through practice. Start with simple projects and gradually tackle more complex challenges as you become comfortable with the techniques.