PyTorch CI/CD Integration

Introduction

Continuous Integration and Continuous Deployment (CI/CD) are development practices that help teams deliver code changes more frequently and reliably. When working with PyTorch projects, proper CI/CD integration ensures that your models are consistently tested, validated, and deployed without manual intervention.

In this guide, you'll learn how to:

Set up CI/CD pipelines for PyTorch projects
Automate testing of PyTorch models
Configure deployment workflows for model serving
Implement best practices for reliable PyTorch CI/CD

Whether you're developing a simple classification model or a complex deep learning system, integrating CI/CD into your PyTorch workflow will save time and increase reliability.

Why CI/CD Matters for PyTorch Projects

PyTorch projects present unique challenges for CI/CD pipelines:

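Computational requirements: Training and testing models often require significant compute, and standard GitHub-hosted runners have no GPU
Environment dependencies: PyTorch wheels are built against specific CUDA versions, so the installed build must match the target hardware and driver
Model artifacts: Trained weights are large binary files that need special handling (artifact storage, a model registry, or Git LFS) rather than plain Git commits
Reproducibility: Results can vary across hardware, library versions, and random seeds, so tests must either control or tolerate that variation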

A proper CI/CD setup addresses these challenges and helps maintain code quality as your project evolves.

Getting Started with PyTorch CI/CD

Setting Up Your Project Structure

Before implementing CI/CD, organize your PyTorch project with testability and automation in mind:

pytorch-project/
├── .github/workflows/  # GitHub Actions configurations
├── src/               # Source code
│   └── models/        # PyTorch models
├── tests/             # Test scripts
├── configs/           # Configuration files
├── requirements.txt   # Dependencies
└── setup.py           # Package setup

Creating Essential Test Cases

First, create basic test cases to verify your PyTorch models work correctly. Here's a simple example using pytest:

```python
# tests/test_model.py
import torch

from src.models.simple_cnn import SimpleCNN


def test_model_forward_pass():
    # Create a sample batch shaped like CIFAR-10 images
    batch_size, channels, height, width = 2, 3, 32, 32
    x = torch.randn(batch_size, channels, height, width)

    # Initialize model in inference mode
    model = SimpleCNN(num_classes=10)
    model.eval()

    # Perform forward pass without tracking gradients
    with torch.no_grad():
        output = model(x)

    # Check output dimensions: one row of 10 class scores per input
    expected = (batch_size, 10)
    assert output.shape == expected, f"Expected shape {expected}, got {tuple(output.shape)}"
```

This test verifies that your model produces outputs with the expected shape.
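The tests in this guide import SimpleCNN from src/models/simple_cnn.py, a file the guide never shows. The sketch below is one hypothetical version, assuming 3x32x32 inputs such as CIFAR-10 images; the layer sizes are illustrative, not the author's architecture. It includes a dropout layer, which is one reason the tests call model.eval() before comparing outputs.

```python
# src/models/simple_cnn.py (hypothetical sketch; the original guide does not show this file)
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Small CNN for 3x32x32 inputs (e.g. CIFAR-10)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # -> 16 x 32 x 32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16 x 16 x 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32 x 16 x 16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32 x 8 x 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 2048 features
            nn.Dropout(0.25),                             # active only in train() mode
            nn.Linear(32 * 8 * 8, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```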

GitHub Actions for PyTorch CI/CD

GitHub Actions is a popular choice for setting up CI/CD pipelines. Let's create a workflow file for a PyTorch project.

Basic Testing Workflow

Create a file .github/workflows/test.yml:

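```yaml
name: PyTorch Tests

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest
        # CPU-only wheels: hosted runners have no GPU, and these are far smaller than CUDA builds
        pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
        pip install -r requirements.txt

    - name: Run tests
      run: |
        # "python -m pytest" puts the repository root on sys.path, so "from src..." imports resolve
        python -m pytest tests/
```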
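Randomly generated test inputs can make failures hard to reproduce. One option is a pytest conftest.py that seeds the random number generators before every test; pytest discovers fixtures in conftest.py automatically. This is a sketch: seed_everything and fixed_seed are hypothetical names, and you would add numpy.random.seed if your tests also use NumPy.

```python
# tests/conftest.py (hypothetical helper names; conftest.py itself is a pytest convention)
import random

import pytest
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed Python's and PyTorch's RNGs so random test inputs repeat across runs."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and all CUDA devices


@pytest.fixture(autouse=True)
def fixed_seed():
    # autouse=True: runs before every test without being requested explicitly
    seed_everything(0)
    yield
```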

Adding GPU Support for Testing

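Standard GitHub-hosted runners such as ubuntu-latest have no GPU, so torch.cuda.is_available() returns False there and CUDA tests cannot run. To test PyTorch models on a GPU, point the job at a runner with an NVIDIA GPU and driver installed, typically a self-hosted runner registered with a custom label such as gpu (GPU-enabled larger runners are another option on plans that offer them):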

```yaml
name: PyTorch GPU Tests

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test-gpu:
    # Assumes a self-hosted runner with an NVIDIA GPU, registered with the label "gpu"
    runs-on: [self-hosted, gpu]

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Check GPU visibility
      run: nvidia-smi

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest
        # Install PyTorch built against CUDA 12.1; pick the index that matches your driver
        pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
        pip install -r requirements.txt

    - name: Run GPU tests
      run: |
        python -m pytest tests/test_gpu.py
```
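The workflow above runs tests/test_gpu.py, which the guide does not show. A minimal sketch, assuming the goal is to check that CPU and GPU produce matching outputs: the test is skipped when CUDA is unavailable, so the same file is safe to collect on CPU-only runners. make_model is a hypothetical stand-in that keeps the example self-contained; swap in SimpleCNN in a real project. The 1e-3 tolerance is an assumption, chosen because GPU kernels (cuDNN, TF32 on recent NVIDIA GPUs) do not reproduce CPU results bit for bit.

```python
# tests/test_gpu.py (sketch): compare CPU and GPU outputs; skipped when CUDA is unavailable
import copy

import pytest
import torch
import torch.nn as nn


def make_model() -> nn.Module:
    # Hypothetical stand-in so the sketch is self-contained; use SimpleCNN(num_classes=10) in practice
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    )


@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
def test_cpu_gpu_consistency():
    torch.manual_seed(0)
    cpu_model = make_model().eval()
    gpu_model = copy.deepcopy(cpu_model).to("cuda")  # identical weights, moved to the GPU
    x = torch.randn(4, 3, 32, 32)

    with torch.no_grad():
        cpu_out = cpu_model(x)
        gpu_out = gpu_model(x.to("cuda")).cpu()

    # GPU kernels differ numerically from CPU ones, so compare with a tolerance
    torch.testing.assert_close(gpu_out, cpu_out, rtol=1e-3, atol=1e-3)
```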

Model Testing Best Practices

1. Test Model Loading and Saving

```python
# tests/test_model_io.py
import torch

from src.models.simple_cnn import SimpleCNN


def test_model_save_load(tmp_path):
    # Create a model and save its weights to a temporary directory managed by pytest
    model = SimpleCNN(num_classes=10)
    model.eval()  # disable dropout so both models compute the same deterministic function
    path = tmp_path / "model.pth"
    torch.save(model.state_dict(), path)

    # Load the weights into a fresh instance
    loaded_model = SimpleCNN(num_classes=10)
    loaded_model.load_state_dict(torch.load(path))
    loaded_model.eval()

    # Verify both models produce the same output for the same input
    x = torch.randn(1, 3, 32, 32)
    with torch.no_grad():
        output1 = model(x)
        output2 = loaded_model(x)

    # No manual cleanup needed: pytest removes old tmp_path directories automatically
    assert torch.allclose(output1, output2, atol=1e-7)
```

2. Test Numerical Stability

```python
# tests/test_numerical_stability.py
import torch

from src.models.simple_cnn import SimpleCNN


def test_backward_pass():
    # Create inputs and targets
    x = torch.randn(4, 3, 32, 32)
    target = torch.randint(0, 10, (4,))

    # Initialize model, optimizer, and loss
    model = SimpleCNN(num_classes=10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    # Forward and backward pass
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, target)
    assert torch.isfinite(loss), "Loss is NaN or Inf"
    loss.backward()

    # Check that every parameter received a finite gradient (not NaN or Inf)
    for name, param in model.named_parameters():
        assert param.grad is not None, f"Parameter {name} received no gradient"
        assert torch.isfinite(param.grad).all(), f"Parameter {name} has non-finite gradients"

    # Apply optimizer step
    optimizer.step()
```
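Finite gradients do not prove the model can learn. A common complementary sanity check, sketched here as an addition rather than part of the original guide, is to train on one small fixed batch and require the loss to drop sharply; a model that cannot memorize eight samples usually has a wiring bug (detached tensors, wrong loss inputs, frozen parameters). The linear stand-in model and the halving threshold are assumptions chosen so the test runs in well under a second.

```python
# tests/test_overfit.py (sketch): a quick "can it learn at all?" check on one fixed batch
import torch
import torch.nn as nn


def test_loss_decreases_on_single_batch():
    torch.manual_seed(0)
    # Hypothetical stand-in keeps the sketch self-contained; use SimpleCNN(num_classes=10) in practice
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    x = torch.randn(8, 3, 32, 32)
    target = torch.randint(0, 10, (8,))

    with torch.no_grad():
        initial_loss = criterion(model(x), target).item()

    model.train()
    for _ in range(50):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()

    # Repeatedly fitting 8 fixed samples should at least halve the loss
    assert loss.item() < 0.5 * initial_loss
```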

Automating Model Training in CI/CD

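For small models, you can include training in your CI/CD pipeline. Standard GitHub-hosted runners are CPU-only and limit each job to six hours, so this approach suits small models or reduced datasets: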

```yaml
name: Train Model

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday at midnight (UTC)
  workflow_dispatch:  # Manual trigger

jobs:
  train:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Train model
      run: |
        # Run as a module from the repository root so "from src.models ..." resolves
        python -m src.train --config configs/training_config.yaml

    - name: Save model artifact
      uses: actions/upload-artifact@v4
      with:
        name: trained-model
        path: outputs/model.pth
```
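Training on the full CIFAR-10 training set on a CPU-only hosted runner can take a long time. One way to keep scheduled CI runs short is to cap the number of training samples with torch.utils.data.Subset. The helper and its max_samples parameter are hypothetical; the original src/train.py does not read such a setting.

```python
# Hypothetical helper: cap the training set so a CI training run finishes quickly
import torch
from torch.utils.data import DataLoader, Dataset, Subset, TensorDataset


def make_ci_loader(dataset: Dataset, max_samples: int, batch_size: int) -> DataLoader:
    """Return a DataLoader over at most max_samples items of dataset."""
    n = min(max_samples, len(dataset))
    # The first n indices keep runs comparable; torch.randperm(len(dataset))[:n] gives a random slice
    subset = Subset(dataset, list(range(n)))
    return DataLoader(subset, batch_size=batch_size, shuffle=True)


if __name__ == "__main__":
    # Stand-in dataset; in src/train.py you would pass the CIFAR-10 dataset instead
    fake = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
    loader = make_ci_loader(fake, max_samples=256, batch_size=64)
    print(len(loader.dataset), len(loader))  # 256 4
```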

Example Training Script

```python
# src/train.py
# Run from the repository root: python -m src.train --config configs/training_config.yaml
import argparse
import os

import torch
import torch.nn as nn
import torch.optim as optim
import yaml
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from src.models.simple_cnn import SimpleCNN


def train(config):
    # Set up device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load data (CIFAR-10 images are already 32x32; Resize keeps the expected size explicit)
    transform = transforms.Compose([
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    train_dataset = datasets.CIFAR10(
        root='./data',
        train=True,
        download=True,
        transform=transform
    )

    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True
    )

    # Initialize model
    model = SimpleCNN(num_classes=10).to(device)

    # Set up training
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config['learning_rate'])

    # Training loop
    for epoch in range(config['epochs']):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f"Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] Loss: {loss.item():.6f}")

    # Save model, creating the output directory (e.g. outputs/) if it does not exist yet
    os.makedirs(os.path.dirname(config['output_path']) or ".", exist_ok=True)
    torch.save(model.state_dict(), config['output_path'])
    print(f"Model saved to {config['output_path']}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--config', type=str, required=True, help='Path to config file')
    args = parser.parse_args()

    with open(args.config, 'r') as f:
        config = yaml.safe_load(f)

    train(config)
```
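src/train.py reads four keys from its YAML config (batch_size, learning_rate, epochs, output_path), but a typo only surfaces as a KeyError partway through a CI run. A hypothetical fail-fast check you could call right after yaml.safe_load is sketched below; validate_config and REQUIRED_KEYS are illustrative names. It also catches a common PyYAML pitfall: safe_load reads 1e-3 as the string '1e-3' (YAML 1.1 floats need a decimal point), so write 0.001 or 1.0e-3 instead.

```python
# Hypothetical fail-fast check for the keys src/train.py reads from its config
from typing import Any, Dict

REQUIRED_KEYS = {
    "batch_size": int,
    "learning_rate": float,
    "epochs": int,
    "output_path": str,
}


def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError if a required key is missing or has the wrong type."""
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            raise ValueError(f"missing config key: {key!r}")
        value = config[key]
        # Accept an int where a float is expected (e.g. learning_rate: 1)
        ok = isinstance(value, expected) or (expected is float and isinstance(value, int))
        # bool is a subclass of int in Python, so reject it explicitly
        if not ok or isinstance(value, bool):
            raise ValueError(f"{key!r} should be {expected.__name__}, got {type(value).__name__}")
    return config
```

A config such as batch_size: 64, learning_rate: 0.001, epochs: 2, output_path: outputs/model.pth (illustrative values) passes; learning_rate: 1e-3 fails with a clear message instead of an obscure error inside the optimizer.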

Deploying PyTorch Models

After testing, you'll want to deploy your models to a production environment. Here's how to automate deployment using CI/CD:

Model Deployment with Docker

Create a Dockerfile for your PyTorch application:

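```dockerfile
FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copies the source code and outputs/model.pth, which must exist in the build context
COPY . .

EXPOSE 8000

# Run as a module from /app so "from src.models ..." resolves
CMD ["python", "-m", "src.serve"]
```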

Add a model serving script:

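```python
# src/serve.py
# Run from the repository root (python -m src.serve) so "src" is importable
from typing import List

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from src.models.simple_cnn import SimpleCNN

app = FastAPI(title="PyTorch Model API")

# Load the trained weights once at startup, on CPU
model = SimpleCNN(num_classes=10)
model.load_state_dict(torch.load("outputs/model.pth", map_location="cpu"))
model.eval()


class ImageData(BaseModel):
    # Flattened 3x32x32 image (3072 floats), normalized the same way as the training data
    image: List[float]


@app.post("/predict")
def predict(data: ImageData):
    # A plain "def" endpoint runs in FastAPI's thread pool, so inference does not block the event loop
    if len(data.image) != 3 * 32 * 32:
        raise HTTPException(status_code=422, detail="image must contain 3 * 32 * 32 = 3072 values")

    # Convert input data to a (1, 3, 32, 32) tensor
    image_tensor = torch.tensor(data.image, dtype=torch.float32).reshape(1, 3, 32, 32)

    # Make prediction
    with torch.no_grad():
        output = model(image_tensor)
        prediction = output.argmax(dim=1).item()

    return {"prediction": prediction}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```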
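serve.py passes the request values straight to the model, so clients must apply the same preprocessing used in training (ToTensor scaling to [0, 1], then Normalize with mean 0.5 and std 0.5) and send a flat list of 3 * 32 * 32 = 3072 floats. A client-side sketch, assuming the image is already a 3x32x32 uint8 tensor in channel-first order; to_payload and predict_remote are hypothetical helper names.

```python
# Hypothetical client helpers: reproduce the training transforms, then call /predict
import json
import urllib.request

import torch


def to_payload(image_uint8: torch.Tensor) -> dict:
    """Convert a 3x32x32 uint8 image (channels first) into the JSON body /predict accepts."""
    if image_uint8.shape != (3, 32, 32):
        raise ValueError(f"expected shape (3, 32, 32), got {tuple(image_uint8.shape)}")
    x = image_uint8.float() / 255.0  # same scaling as transforms.ToTensor()
    x = (x - 0.5) / 0.5              # same as Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    return {"image": x.flatten().tolist()}  # row-major order matches reshape(1, 3, 32, 32)


def predict_remote(url: str, image_uint8: torch.Tensor) -> int:
    """POST the image to a running server and return the predicted class index."""
    body = json.dumps(to_payload(image_uint8)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["prediction"]
```

With the container running locally, predict_remote("http://localhost:8000/predict", image) would return the class index; ToTensor also converts HWC images to CHW, so permute height-width-channel arrays before calling it.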

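Create a deployment workflow. Note that the paths filter below fires only when outputs/model.pth is committed to the repository (for example with Git LFS); the artifact uploaded by the training workflow is not committed automatically.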

```yaml
name: Deploy Model

on:
  push:
    branches: [ main ]
    paths:
      - 'outputs/model.pth'

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
    # Add "with: lfs: true" here if the model file is stored with Git LFS
    - uses: actions/checkout@v4

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Login to DockerHub
      uses: docker/login-action@v3
      with:
        username: ${{ secrets.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_PASSWORD }}  # a Docker Hub access token also works here

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        # Also tag with the commit SHA so each deployment can be traced and rolled back
        tags: |
          yourusername/pytorch-model:latest
          yourusername/pytorch-model:${{ github.sha }}
```
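Pushing an image does not prove it serves requests. A hypothetical smoke check, run after the image is deployed, can POST a dummy all-zeros image and retry while the container starts; smoke_check and main are illustrative names, and the retry count and delay are assumptions to tune for your infrastructure. It uses only the standard library, so it adds no dependencies to CI.

```python
# Hypothetical post-deploy smoke check: poll /predict until the service answers
import json
import time
import urllib.request


def smoke_check(url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    """POST an all-zeros image to url; return True once a prediction comes back."""
    body = json.dumps({"image": [0.0] * (3 * 32 * 32)}).encode("utf-8")
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=5) as resp:
                return "prediction" in json.load(resp)
        except (OSError, ValueError):
            # OSError covers URLError, HTTPError, and timeouts; ValueError covers bad JSON
            if attempt < attempts - 1:
                time.sleep(delay)  # the container may still be starting
    return False


def main(argv: list) -> int:
    """Exit code for a CI step: 0 if the service responded, 1 otherwise."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/predict"
    return 0 if smoke_check(url) else 1
```

Saved as a script ending in sys.exit(main(sys.argv)), it makes the CI step fail when the deployed service never responds.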

Real-World Example: End-to-End PyTorch CI/CD Pipeline

Let's put everything together into a complete CI/CD pipeline for a PyTorch project.

Project Structure

pytorch-project/
├── .github/workflows/
│   ├── test.yml           # Run tests on PR and push
│   ├── train.yml          # Weekly model training
│   └── deploy.yml         # Deploy when model changes
├── src/
│   ├── models/
│   │   └── simple_cnn.py  # Model definition (SimpleCNN)
│   ├── train.py           # Training script
│   └── serve.py           # API server
├── tests/
│   ├── test_model.py      # Model tests
│   └── test_api.py        # API tests
├── configs/
│   └── training_config.yaml  # Training parameters
├── Dockerfile
├── requirements.txt
└── README.md

Complete CI/CD Workflow

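Developer pushes code changes or opens a pull request
GitHub Actions runs the test workflow to validate the changes
On its weekly schedule (or when triggered manually), the training workflow trains a new model; it runs independently of the test workflow
The trained model is uploaded as a workflow artifact
When the updated model file is committed to the repository, the deployment workflow builds and pushes a Docker image
The image is then deployed to your infrastructure (e.g., Kubernetes); this last step is outside the workflows shown above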

This approach keeps your PyTorch models consistently tested, trained, and deployed with minimal manual intervention.
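Tests confirm that the code works, but not that a newly trained model is at least as good as the one it replaces. A common addition, sketched here under stated assumptions, is an accuracy gate between training and deployment: evaluate the new weights on held-out data (for example the CIFAR-10 test split, loaded with train=False) and stop the pipeline below a floor you choose. evaluate_accuracy and passes_gate are hypothetical helpers.

```python
# Hypothetical quality gate: block deployment if held-out accuracy falls below a floor
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


@torch.no_grad()
def evaluate_accuracy(model: nn.Module, loader: DataLoader) -> float:
    """Fraction of samples in loader whose argmax prediction matches the label."""
    model.eval()
    correct, total = 0, 0
    for data, target in loader:
        pred = model(data).argmax(dim=1)
        correct += (pred == target).sum().item()
        total += target.numel()
    return correct / total if total else 0.0


def passes_gate(accuracy: float, min_accuracy: float) -> bool:
    """True if the candidate model is good enough to deploy."""
    return accuracy >= min_accuracy
```

In a workflow step, a script could load outputs/model.pth into SimpleCNN, compute the accuracy, and exit non-zero when passes_gate returns False, so the model is never committed or deployed.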

Monitoring Model Performance in Production

After deployment, it's important to monitor how your model behaves. Start by adding logging to your serving script so that every prediction and every failure is recorded:

```python
# src/serve.py (modified)
import logging
from typing import List

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from src.models.simple_cnn import SimpleCNN

# Configure logging to both a file and stdout
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("api.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

app = FastAPI(title="PyTorch Model API")

# Load model; fail fast (and log the traceback) if the weights are missing or incompatible
try:
    model = SimpleCNN(num_classes=10)
    model.load_state_dict(torch.load("outputs/model.pth", map_location="cpu"))
    model.eval()
    logger.info("Model loaded successfully")
except Exception:
    logger.exception("Failed to load model")
    raise


class ImageData(BaseModel):
    image: List[float]  # flattened 3x32x32 image


@app.post("/predict")
def predict(data: ImageData):
    try:
        # Convert input data to tensor (reshape raises RuntimeError unless there are 3072 values)
        image_tensor = torch.tensor(data.image, dtype=torch.float32).reshape(1, 3, 32, 32)
    except RuntimeError as e:
        logger.warning("Rejected malformed input: %s", e)
        raise HTTPException(status_code=422, detail="image must contain 3072 values")

    try:
        # Make prediction
        with torch.no_grad():
            output = model(image_tensor)
            prediction = output.argmax(dim=1).item()
    except Exception:
        logger.exception("Prediction error")
        raise HTTPException(status_code=500, detail="prediction failed")

    logger.info("Prediction made: %d", prediction)
    return {"prediction": prediction}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
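Logs record individual requests but do not summarize behavior over time. Without ground-truth labels you cannot measure accuracy in production directly, so a cheap proxy is to track request latency and the distribution of predicted classes; a sudden shift in that distribution can signal data drift. The in-process tracker below is a hypothetical sketch (a dedicated metrics system such as Prometheus is a common production choice); PredictionStats is an illustrative name, not part of FastAPI or PyTorch.

```python
# Hypothetical in-process metrics for the /predict handler
from collections import Counter, deque
from statistics import median
from typing import Optional


class PredictionStats:
    """Rolling latency window plus running counts of predicted classes."""

    def __init__(self, window: int = 1000) -> None:
        self.total_requests = 0
        self.latencies_ms = deque(maxlen=window)  # keeps only the most recent `window` values
        self.class_counts = Counter()

    def record(self, prediction: int, latency_ms: float) -> None:
        self.total_requests += 1
        self.latencies_ms.append(latency_ms)
        self.class_counts[prediction] += 1

    def summary(self) -> dict:
        recent: Optional[float] = median(self.latencies_ms) if self.latencies_ms else None
        return {
            "requests": self.total_requests,
            "median_latency_ms_recent": recent,
            "class_counts": dict(self.class_counts),
        }


stats = PredictionStats()
```

In predict(), you could time the forward pass with time.perf_counter(), call stats.record(prediction, elapsed_ms), and return stats.summary() from an extra GET endpoint. With several Uvicorn worker processes, each process keeps its own counts.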

Summary

In this guide, we've explored how to:

Set up CI/CD pipelines specifically for PyTorch projects
Write tests for PyTorch models (forward pass, save/load, gradient checks)
Automate model training within CI/CD workflows
Deploy models using Docker and CI/CD
Implement a complete end-to-end pipeline

By integrating PyTorch with CI/CD practices, you can ensure your models are consistently tested, validated, and deployed, saving time and reducing errors in your development workflow.

Exercises

Set up a basic CI pipeline for a simple PyTorch model using GitHub Actions.
Create tests that validate model outputs are consistent across CPU and GPU.
Implement automated hyperparameter tuning in your CI/CD pipeline.
Create a Docker container that serves a PyTorch model with FastAPI.
Add performance monitoring to track your model's accuracy in production.
