Python Web Scraping

Introduction

Web scraping is the process of automatically extracting information from websites. It's an essential skill for any Python developer who works with data or builds web applications. With web scraping, you can collect data that isn't available through APIs, automate repetitive tasks, or monitor websites for changes.

In this tutorial, you'll learn:

  • The basics of web scraping ethics and legality
  • How to use Python libraries for web scraping
  • How to parse HTML content and extract specific data
  • How to handle common challenges in web scraping
  • Building practical web scraping projects

Prerequisites

Before starting, make sure you have:

  • Basic knowledge of Python
  • Understanding of HTML structure
  • Python installed on your system
  • Familiarity with pip (Python package installer)

Understanding Web Scraping Ethics

Before diving into the technical aspects, it's important to understand the ethical and legal considerations of web scraping:

  1. Check Terms of Service: Always review a website's Terms of Service before scraping it
  2. Respect robots.txt: This file outlines which parts of a site can be scraped (a programmatic check is sketched after this list)
  3. Throttle your requests: Don't overload servers with too many requests
  4. Identify your scraper: Use custom user-agents to identify your bot
  5. Data usage: Be mindful of how you use and share scraped data
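
The robots.txt check in point 2 can be automated. Below is a minimal sketch using Python's built-in urllib.robotparser module; the target site and the user-agent string are placeholders for illustration:

python
from urllib import robotparser

# Download and parse the site's robots.txt (quotes.toscrape.com is just an example target)
rp = robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

# Ask whether our bot (identified by its user-agent) may fetch a given URL
user_agent = "MyScraperBot"  # hypothetical user-agent name
if rp.can_fetch(user_agent, "https://quotes.toscrape.com/page/1/"):
    print("robots.txt allows scraping this URL")
else:
    print("robots.txt disallows this URL - skip it")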

Required Libraries

Let's install the main libraries we'll use for web scraping:

bash
pip install requests beautifulsoup4 lxml

These libraries serve different purposes:

  • Requests: Sends HTTP requests to websites and retrieves their content
  • BeautifulSoup4: Parses HTML and XML documents for easy data extraction
  • lxml: A fast HTML parser that works with BeautifulSoup

Basic Web Scraping with Requests and BeautifulSoup

Let's start with a simple example to scrape a webpage:

python
import requests
from bs4 import BeautifulSoup

# Send a GET request to the URL
url = "https://quotes.toscrape.com"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'lxml')

    # Print the title of the page
    print(soup.title.text)

    # Find all quote elements
    quotes = soup.find_all('span', class_='text')

    # Print the first 5 quotes
    for i, quote in enumerate(quotes[:5], 1):
        print(f"{i}. {quote.text}")
else:
    print(f"Failed to retrieve the page: {response.status_code}")

Output:

Quotes to Scrape
1. "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
2. "It is our choices, Harry, that show what we truly are, far more than our abilities."
3. "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
4. "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
5. "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."

Breaking Down the Code

  1. We import the requests library to make HTTP requests and BeautifulSoup to parse HTML.
  2. We send a GET request to "quotes.toscrape.com" and store the response.
  3. We check if the request was successful (status code 200).
  4. We create a BeautifulSoup object to parse the HTML content.
  5. We extract the page title and print it.
  6. We find all elements with the tag span and class text (which contain quotes).
  7. We print the first 5 quotes.

BeautifulSoup provides several methods to navigate through HTML and find elements:

Find Elements by Tags

python
# Find the first h1 tag
heading = soup.find('h1')

# Find all paragraph tags
paragraphs = soup.find_all('p')

# Find elements with specific attributes
links = soup.find_all('a', href=True)

CSS Selectors

BeautifulSoup also supports CSS selectors for more complex queries:

python
# Find elements using CSS selectors
results = soup.select('div.quote span.text')

# Find nested elements
authors = soup.select('div.quote span.author')
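
BeautifulSoup also provides select_one, which returns only the first element matching a selector (or None if nothing matches); later examples in this tutorial rely on it. A quick sketch using the quotes page markup from above:

python
# select_one returns the first match, or None when the selector finds nothing
first_quote = soup.select_one('div.quote span.text')
if first_quote is not None:
    print(first_quote.get_text(strip=True))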

Practical Example: Scraping Product Information

Let's build a more practical example by scraping product information from a fictional bookstore website:

python
import requests
from bs4 import BeautifulSoup
import csv

def scrape_books():
    books = []
    url = "http://books.toscrape.com/"

    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to retrieve the page")
        return books

    soup = BeautifulSoup(response.text, 'lxml')

    # Find all book containers
    book_containers = soup.select('article.product_pod')

    for book in book_containers:
        # Extract book details
        title = book.h3.a['title']
        price = book.select_one('div.product_price p.price_color').text
        availability = book.select_one('div.product_price p.availability').text.strip()
        rating = book.p['class'][1]  # The rating is stored as a class name

        books.append({
            'title': title,
            'price': price,
            'availability': availability,
            'rating': rating
        })

    return books

# Run the scraper and save to CSV
books = scrape_books()

# Print the first 3 books
for book in books[:3]:
    print(f"Title: {book['title']}")
    print(f"Price: {book['price']}")
    print(f"Availability: {book['availability']}")
    print(f"Rating: {book['rating']}")
    print("-" * 50)

# Save to CSV
with open('books.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'price', 'availability', 'rating']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for book in books:
        writer.writerow(book)

print(f"Total books scraped: {len(books)}")

Output:

Title: A Light in the Attic
Price: £51.77
Availability: In stock
Rating: Three
--------------------------------------------------
Title: Tipping the Velvet
Price: £53.74
Availability: In stock
Rating: One
--------------------------------------------------
Title: Soumission
Price: £50.10
Availability: In stock
Rating: One
--------------------------------------------------
Total books scraped: 20

This example demonstrates a more complex scraping task where we:

  1. Extract multiple pieces of information from each product
  2. Organize the data in a structured format
  3. Save the results to a CSV file for further analysis

Handling Pagination

Many websites display content across multiple pages. To scrape all available data, we need to implement pagination:

python
def scrape_all_book_pages():
    base_url = "http://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    page = 1

    while True:
        url = base_url.format(page)
        response = requests.get(url)

        # Check if we've reached the end of available pages
        if response.status_code != 200:
            break

        print(f"Scraping page {page}...")
        soup = BeautifulSoup(response.text, 'lxml')

        # Find all book containers
        book_containers = soup.select('article.product_pod')

        # If no books found, break
        if not book_containers:
            break

        for book in book_containers:
            # Extract book details (same as before)
            title = book.h3.a['title']
            price = book.select_one('div.product_price p.price_color').text

            all_books.append({
                'title': title,
                'price': price,
                'page': page
            })

        page += 1

    return all_books

# Run the pagination scraper
all_books = scrape_all_book_pages()
print(f"Total books scraped: {len(all_books)}")
print(f"Total pages scraped: {all_books[-1]['page']}")

Handling Dynamic Content with Selenium

Some websites load content dynamically using JavaScript. In these cases, we need to use Selenium, a browser automation tool:

python
# First install the dependencies: pip install selenium webdriver-manager
# webdriver-manager downloads a matching ChromeDriver for your installed browser
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Set up the driver
options = Options()
options.add_argument("--headless") # Run in headless mode (no browser window)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Navigate to a page with dynamic content
driver.get("https://quotes.toscrape.com/js/")

# Wait for JavaScript to load the content
time.sleep(2)

# Now extract the data
quotes = driver.find_elements(By.CLASS_NAME, "quote")
for quote in quotes[:3]:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    print(f"{text} - {author}")

# Close the browser
driver.quit()
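
A fixed time.sleep(2) works for a demo, but it either wastes time or fails when the page loads slowly. Selenium's explicit waits poll until the content actually appears (up to a timeout), so the sleep call above could be replaced with something like this:

python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one quote element to be rendered by JavaScript
quotes = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "quote"))
)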

Best Practices for Web Scraping

  1. Rate Limiting: Add delays between requests to avoid overloading the server

python
import time
import random

# Add a random delay between 1-3 seconds
time.sleep(random.uniform(1, 3))

  2. User-Agent Rotation: Vary your user-agent to mimic different browsers

python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

  3. Use Proxies: Rotate IP addresses to avoid being blocked

python
proxies = {
    'http': 'http://10.10.10.10:8000',
    'https': 'http://10.10.10.10:8000'
}
response = requests.get(url, proxies=proxies)

  4. Handle Errors Gracefully: Implement error handling and retries

python
def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for error status codes
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt + 1 < max_retries:
                time.sleep(5)  # Wait before retrying

    return None

  5. Save Progress: For large scraping jobs, save progress incrementally

python
import json

def save_progress(data, filename="scrape_progress.json"):
    with open(filename, 'w') as f:
        json.dump(data, f)

def load_progress(filename="scrape_progress.json"):
    try:
        with open(filename, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return []

Advanced Project: Building a News Aggregator

Let's put everything together in a more advanced project that scrapes news headlines from multiple sources:

python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
from datetime import datetime

class NewsAggregator:
    def __init__(self):
        self.sources = {
            'BBC': {
                'url': 'https://www.bbc.com/news',
                'headline_selector': 'h3.gs-c-promo-heading__title'
            },
            'CNN': {
                'url': 'https://www.cnn.com',
                'headline_selector': 'span.container__headline-text'
            }
        }
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_headlines(self, source_name):
        source_info = self.sources.get(source_name)
        if not source_info:
            print(f"Source {source_name} not found!")
            return []

        print(f"Fetching headlines from {source_name}...")

        try:
            response = requests.get(source_info['url'], headers=self.headers)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')
            headlines = soup.select(source_info['headline_selector'])

            results = []
            for headline in headlines:
                results.append({
                    'source': source_name,
                    'headline': headline.text.strip(),
                    'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                })

            print(f"Found {len(results)} headlines")
            return results

        except Exception as e:
            print(f"Error scraping {source_name}: {e}")
            return []

    def get_all_headlines(self):
        all_headlines = []

        for source_name in self.sources.keys():
            headlines = self.get_headlines(source_name)
            all_headlines.extend(headlines)

            # Be nice to the websites
            time.sleep(random.uniform(2, 5))

        return all_headlines

    def save_to_csv(self, headlines, filename="headlines.csv"):
        if not headlines:
            print("No headlines to save!")
            return

        df = pd.DataFrame(headlines)
        df.to_csv(filename, index=False)
        print(f"Saved {len(headlines)} headlines to {filename}")

# Run the aggregator
if __name__ == "__main__":
    aggregator = NewsAggregator()
    headlines = aggregator.get_all_headlines()

    # Display some sample headlines
    for headline in headlines[:5]:
        print(f"[{headline['source']}] {headline['headline']}")

    # Save all headlines
    aggregator.save_to_csv(headlines)

This advanced example demonstrates:

  • Object-oriented approach to web scraping
  • Handling multiple sources with different HTML structures
  • Error handling and proper delays
  • Data processing and storage

Common Challenges and Solutions

1. Handling CAPTCHAs

Some websites implement CAPTCHAs to prevent scraping:

  • Use specialized CAPTCHA solving services
  • Implement browser automation with tools like Selenium
  • Consider using legitimate APIs instead of scraping

2. Websites that Block Scrapers

Strategies for avoiding blocks while staying within a site's rules:

  • Respect robots.txt and terms of service
  • Implement delays between requests
  • Rotate user agents and IP addresses
  • Use headless browsers that execute JavaScript

3. Dealing with Website Changes

Websites frequently update their structure, breaking scrapers:

  • Design your code to be resilient to minor changes
  • Implement logging and monitoring
  • Set up alerts when scraping patterns fail
  • Use more robust selectors that are less likely to change (one approach is sketched below)
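
One way to make a scraper more resilient is to try several selectors in order and log which one matched (or that none did). This is only a sketch; the selector list and field name are placeholders you would adapt to your target site:

python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

# Candidate selectors, ordered from most specific to most generic (placeholders)
TITLE_SELECTORS = ['h1.product-title', 'h1[itemprop="name"]', 'h1']

def extract_title(soup):
    for selector in TITLE_SELECTORS:
        element = soup.select_one(selector)
        if element:
            logger.info("Matched title with selector %r", selector)
            return element.get_text(strip=True)
    logger.warning("No title selector matched - the page layout may have changed")
    return None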

Summary

In this tutorial, you've learned:

  • The basics of web scraping with Python
  • How to extract data using requests and BeautifulSoup
  • Handling pagination and dynamic content
  • Best practices for ethical and efficient scraping
  • Building practical web scraping applications

Web scraping is a powerful skill that opens up vast possibilities for data collection, automation, and analysis. Remember to use these techniques responsibly and ethically, respecting website terms of service and robots.txt directives.

Additional Resources

  1. Books and Documentation:

  2. Practice Websites:

  3. Advanced Topics:

Exercises

  1. Basic Exercise: Scrape a list of popular movies from IMDb's front page.
  2. Intermediate Exercise: Create a weather tracker that scrapes forecast data from a weather site.
  3. Advanced Exercise: Build a job listing aggregator that collects job postings from multiple sources.
  4. Challenge: Implement a web scraper that monitors a product's price on an e-commerce site and sends an email alert when the price drops.

Remember that the best way to learn web scraping is through practice. Start with simple projects and gradually tackle more complex challenges as you become comfortable with the techniques.


