PyTorch Object Detection

Introduction

Object detection is a computer vision technique that identifies and locates objects within an image. Unlike image classification, which only tells us what objects are present, object detection provides the precise location of each object using bounding boxes.

In this tutorial, we'll explore how to implement object detection using PyTorch, one of the most popular deep learning frameworks. We'll cover:

The fundamentals of object detection
Pre-trained object detection models in PyTorch
How to use torchvision's detection models
Fine-tuning detection models on custom datasets
Evaluating object detection performance

By the end of this tutorial, you'll be able to create a complete object detection pipeline in PyTorch.

Understanding Object Detection

Object detection combines two tasks:

Classification: Identifying what objects are in an image
Localization: Determining where those objects are located

The output of an object detection model typically includes:

A class label for each detected object
A bounding box (x, y, width, height) for each object
A confidence score for each detection

Object Detection Architectures

Several popular object detection architectures are available in PyTorch:

R-CNN family: Region-based CNN models (R-CNN, Fast R-CNN, Faster R-CNN)
SSD: Single Shot MultiBox Detector
YOLO: You Only Look Once
RetinaNet: Focal Loss for Dense Object Detection

In this tutorial, we'll focus on Faster R-CNN, which is readily available in PyTorch's torchvision library.

Setting Up Your Environment

First, make sure you have the necessary libraries installed:

bash
pip install torch torchvision opencv-python matplotlib

Let's start by importing the required libraries:

python
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as F
import cv2
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

Using Pre-trained Object Detection Models

PyTorch's torchvision package provides several pre-trained object detection models. Let's load a pre-trained Faster R-CNN model:

python
# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load a pre-trained Faster R-CNN model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
model.eval()  # Set the model to evaluation mode

Output:

Using device: cuda

The Faster R-CNN model we loaded is pre-trained on the COCO dataset, which can detect 91 different object categories.

Performing Object Detection on an Image

Let's write a function to perform object detection on a given image:

python
def detect_objects(image_path, model, device, threshold=0.7):
    # Load the image
    image = Image.open(image_path).convert("RGB")
    
    # Transform the image to tensor
    image_tensor = F.to_tensor(image)
    
    # Add a batch dimension
    image_tensor = image_tensor.unsqueeze(0).to(device)
    
    # Perform prediction
    with torch.no_grad():
        predictions = model(image_tensor)
    
    # Get the predictions for the first image in the batch
    pred = predictions[0]
    
    # Get boxes, labels and scores above threshold
    boxes = pred['boxes'][pred['scores'] > threshold].cpu().numpy().astype(np.int32)
    labels = pred['labels'][pred['scores'] > threshold].cpu().numpy()
    scores = pred['scores'][pred['scores'] > threshold].cpu().numpy()
    
    return image, boxes, labels, scores

# COCO dataset class names
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

Now, let's create a function to visualize the detection results:

python
def visualize_detections(image, boxes, labels, scores):
    # Convert PIL image to numpy array for OpenCV
    image_np = np.array(image)
    image_np = cv2.cvtColor(image_np, cv2.COLOR_RGB2BGR)
    
    # Draw bounding boxes and labels
    for box, label, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box
        
        # Draw bounding box
        cv2.rectangle(image_np, (x1, y1), (x2, y2), (0, 255, 0), 2)
        
        # Create label text
        text = f"{COCO_CLASSES[label]}: {score:.2f}"
        
        # Position the label text above the box
        cv2.putText(image_np, text, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    
    # Convert back to RGB for matplotlib
    image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
    
    # Display the image
    plt.figure(figsize=(12, 8))
    plt.imshow(image_np)
    plt.axis('off')
    plt.show()

Now, let's use these functions to perform object detection on a sample image:

python
# Replace with your image path
image_path = "path/to/your/image.jpg"

# Detect objects
image, boxes, labels, scores = detect_objects(image_path, model, device)

# Visualize the detections
visualize_detections(image, boxes, labels, scores)

Expected output will be your image with bounding boxes around detected objects, each labeled with the class name and confidence score.

Fine-tuning Object Detection Models

You might want to fine-tune a pre-trained model on your custom dataset. Here's how to do it:

1. Prepare Your Dataset

Create a custom dataset class that inherits from torch.utils.data.Dataset:

python
from torch.utils.data import Dataset, DataLoader

class CustomObjectDetectionDataset(Dataset):
    def __init__(self, image_paths, annotations, transform=None):
        self.image_paths = image_paths
        self.annotations = annotations
        self.transform = transform
        
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        # Load image
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert("RGB")
        
        # Get annotation
        boxes = self.annotations[idx]["boxes"]
        labels = self.annotations[idx]["labels"]
        
        # Convert to tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)
        
        # Create target dictionary
        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]),
            "iscrowd": torch.zeros((len(boxes),), dtype=torch.int64)
        }
        
        # Apply transformations
        if self.transform:
            image = self.transform(image)
            
        return image, target

2. Define a Collate Function for Batching

python
def collate_fn(batch):
    return tuple(zip(*batch))

3. Create DataLoader

python
# Example data - you should replace this with your actual dataset
image_paths = ["image1.jpg", "image2.jpg"]
annotations = [
    {"boxes": [[10, 20, 100, 200]], "labels": [1]},
    {"boxes": [[50, 60, 150, 250]], "labels": [2]}
]

# Create dataset
dataset = CustomObjectDetectionDataset(image_paths, annotations)

# Create data loader
data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

4. Fine-tuning

python
def train_one_epoch(model, optimizer, data_loader, device):
    model.train()
    total_loss = 0
    
    for images, targets in data_loader:
        images = [image.to(device) for image in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        
        # Forward pass
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        
        # Backward pass
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        
        total_loss += losses.item()
    
    return total_loss / len(data_loader)

def fine_tune_model(model, dataset, num_epochs=10, learning_rate=0.005):
    # Move model to the right device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    
    # Create data loader
    data_loader = DataLoader(
        dataset, batch_size=2, shuffle=True, collate_fn=collate_fn
    )
    
    # Get parameters
    params = [p for p in model.parameters() if p.requires_grad]
    
    # Create optimizer
    optimizer = torch.optim.SGD(params, lr=learning_rate, momentum=0.9, weight_decay=0.0005)
    
    # Training loop
    for epoch in range(num_epochs):
        # Train for one epoch
        loss = train_one_epoch(model, optimizer, data_loader, device)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
        
    return model

Evaluating Object Detection Models

After training, it's important to evaluate your model's performance:

python
from torchvision.utils import draw_bounding_boxes

def evaluate_model(model, test_loader, device):
    model.eval()
    results = []
    
    with torch.no_grad():
        for images, targets in test_loader:
            images = [img.to(device) for img in images]
            
            # Get predictions
            predictions = model(images)
            
            # Convert predictions and targets to CPU for evaluation
            cpu_predictions = [{k: v.cpu() for k, v in pred.items()} for pred in predictions]
            cpu_targets = [{k: v.cpu() for k, v in target.items()} for target in targets]
            
            results.append((cpu_predictions, cpu_targets))
    
    return results

Common metrics for object detection include:

Precision: How many of the predicted bounding boxes are correct
Recall: How many of the ground-truth objects are detected
mAP (mean Average Precision): The area under the precision-recall curve, averaged across classes
IoU (Intersection over Union): Measures the overlap between predicted and ground-truth boxes

Real-world Application: Object Detection in Video

Let's implement an application that detects objects in a video stream:

python
def detect_objects_in_video(video_path, model, device, output_path=None, threshold=0.7):
    # Open the video
    cap = cv2.VideoCapture(video_path)
    
    # Get video properties
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    
    # Set up video writer if output path is provided
    if output_path:
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
    
    # Process frames
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        # Convert to PIL Image
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        
        # Convert to tensor
        image_tensor = F.to_tensor(pil_image).unsqueeze(0).to(device)
        
        # Perform inference
        with torch.no_grad():
            prediction = model(image_tensor)[0]
        
        # Filter predictions by threshold
        mask = prediction['scores'] > threshold
        boxes = prediction['boxes'][mask].cpu().numpy().astype(np.int32)
        labels = prediction['labels'][mask].cpu().numpy()
        scores = prediction['scores'][mask].cpu().numpy()
        
        # Draw bounding boxes
        for box, label, score in zip(boxes, labels, scores):
            x1, y1, x2, y2 = box
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            text = f"{COCO_CLASSES[label]}: {score:.2f}"
            cv2.putText(frame, text, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        
        # Save or display frame
        if output_path:
            out.write(frame)
        else:
            cv2.imshow('Object Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
    
    # Release resources
    cap.release()
    if output_path:
        out.release()
    cv2.destroyAllWindows()

To run this function:

python
# For video file
video_path = "path/to/your/video.mp4"
output_path = "output.mp4"
detect_objects_in_video(video_path, model, device, output_path)

# For webcam
# detect_objects_in_video(0, model, device)

Summary

In this tutorial, we covered:

The basics of object detection - understanding what object detection is and how it differs from classification
Using pre-trained models - loading and utilizing torchvision's Faster R-CNN model
Performing inference - detecting objects in images and visualizing the results
Fine-tuning - adapting pre-trained models for custom object detection tasks
Evaluation - measuring the performance of object detection models
Real-world application - implementing object detection for video streams

Object detection is a powerful computer vision technique with numerous real-world applications, including autonomous driving, surveillance, medical imaging, and more.

Additional Resources

Exercises

Use the provided code to detect objects in your own images and videos.
Try different pre-trained models like SSD or YOLO and compare their performance.
Create a small custom dataset and fine-tune a pre-trained model.
Implement real-time object detection using your webcam.
Experiment with different confidence thresholds and see how they affect detection results.
Implement object tracking along with detection for videos.
Try to improve inference speed using techniques like model quantization.

With these skills, you're now equipped to develop powerful object detection applications using PyTorch!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Object Detection​

Object Detection Architectures​

Setting Up Your Environment​

Using Pre-trained Object Detection Models​

Performing Object Detection on an Image​

Fine-tuning Object Detection Models​

1. Prepare Your Dataset​

2. Define a Collate Function for Batching​

3. Create DataLoader​

4. Fine-tuning​

Evaluating Object Detection Models​

Real-world Application: Object Detection in Video​

Summary​

Additional Resources​

Exercises​