PyTorch GPU Tensors
Introduction
One of PyTorch's most powerful features is its seamless integration with GPUs (Graphics Processing Units). GPUs can significantly accelerate deep learning computations, often by orders of magnitude compared to CPUs. In this tutorial, we'll explore how to use PyTorch tensors on GPUs, which is essential knowledge for building and training modern deep learning models efficiently.
GPUs excel at parallel processing, making them ideal for the matrix operations that are fundamental to deep learning. PyTorch makes it incredibly easy to move your computations to the GPU with just a few lines of code.
Prerequisites
Before diving into GPU tensors, make sure you have:
- PyTorch installed (preferably with CUDA support)
- A compatible NVIDIA GPU (for CUDA support)
- Basic understanding of PyTorch tensors
Checking GPU Availability
Let's first check if a GPU is available for PyTorch:
import torch
# Check if CUDA (NVIDIA's GPU computing platform) is available
print(f"CUDA available: {torch.cuda.is_available()}")
# If available, get the number of GPUs
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")  # Name of the first GPU
Example output:
CUDA available: True
Number of GPUs: 1
GPU name: NVIDIA GeForce RTX 3080
If torch.cuda.is_available() returns False, it means one of the following:
- You don't have a compatible NVIDIA GPU
- You haven't installed the CUDA version of PyTorch
- Your CUDA drivers aren't properly installed
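A quick way to tell whether your installed PyTorch build was compiled with CUDA support at all is to inspect torch.version.cuda (a small sketch; it is None for CPU-only builds):
import torch
# torch.version.cuda is None for CPU-only builds, otherwise it is the CUDA version string
print(f"PyTorch version: {torch.__version__}")
print(f"Built with CUDA: {torch.version.cuda}")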
Creating GPU Tensors
There are several ways to create tensors on the GPU:
Method 1: Create on CPU and move to GPU
# Create a tensor on CPU
cpu_tensor = torch.tensor([1, 2, 3, 4])
print(f"CPU tensor: {cpu_tensor} (device: {cpu_tensor.device})")
# Move tensor to GPU
gpu_tensor = cpu_tensor.to('cuda')
print(f"GPU tensor: {gpu_tensor} (device: {gpu_tensor.device})")
# You can also pass a torch.device object, e.g. torch.device('cuda:0') for the first GPU
gpu_tensor_alt = cpu_tensor.to(torch.device('cuda:0'))
print(f"GPU tensor (alt): {gpu_tensor_alt} (device: {gpu_tensor_alt.device})")
Example output:
CPU tensor: tensor([1, 2, 3, 4]) (device: cpu)
GPU tensor: tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
GPU tensor (alt): tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Method 2: Create directly on GPU
# Create tensor directly on GPU
direct_gpu_tensor = torch.tensor([1, 2, 3, 4], device='cuda')
print(f"Direct GPU tensor: {direct_gpu_tensor} (device: {direct_gpu_tensor.device})")
# Other tensor creation functions also accept a device argument
zeros_gpu = torch.zeros(3, 4, device='cuda')
print(f"Zeros on GPU: {zeros_gpu} (device: {zeros_gpu.device})")
Example output:
Direct GPU tensor: tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Zeros on GPU: tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], device='cuda:0') (device: cuda:0)
Moving Tensors Between Devices
You can easily move tensors between CPU and GPU:
# Move GPU tensor back to CPU
back_to_cpu = gpu_tensor.cpu()
print(f"Back to CPU: {back_to_cpu} (device: {back_to_cpu.device})")
# Shorthand for moving to current CUDA device
if torch.cuda.is_available():
    gpu_tensor_short = cpu_tensor.cuda()
    print(f"GPU tensor (shorthand): {gpu_tensor_short} (device: {gpu_tensor_short.device})")
Example output:
Back to CPU: tensor([1, 2, 3, 4]) (device: cpu)
GPU tensor (shorthand): tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Operations on GPU Tensors
Operations between tensors can only be performed if they are on the same device. PyTorch will throw an error if you try to operate on tensors on different devices.
if torch.cuda.is_available():
    # Create two GPU tensors
    a = torch.tensor([1, 2, 3], device='cuda')
    b = torch.tensor([4, 5, 6], device='cuda')
    # Operations work normally on the GPU
    c = a + b
    print(f"Addition result: {c} (device: {c.device})")
    # This would raise an error:
    # cpu_tensor = torch.tensor([1, 2, 3])
    # error_result = a + cpu_tensor  # Error: cannot perform operations across devices
    # Correct way:
    cpu_tensor = torch.tensor([1, 2, 3])
    result = a + cpu_tensor.cuda()  # Move the CPU tensor to the GPU first
    print(f"Mixed operation result: {result} (device: {result.device})")
Example output:
Addition result: tensor([5, 7, 9], device='cuda:0') (device: cuda:0)
Mixed operation result: tensor([2, 4, 6], device='cuda:0') (device: cuda:0)
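If you want to see the failure for yourself, you can trigger it and catch the RuntimeError (a minimal sketch; the exact error message depends on your PyTorch version):
if torch.cuda.is_available():
    a = torch.tensor([1, 2, 3], device='cuda')
    cpu_tensor = torch.tensor([1, 2, 3])
    try:
        _ = a + cpu_tensor  # These tensors live on different devices
    except RuntimeError as e:
        print(f"Cross-device operation failed: {e}")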
Performance Comparison: CPU vs GPU
Let's compare the performance of a common operation on both CPU and GPU:
import time
# Matrix multiplication - a common deep learning operation
matrix_size = 5000
# CPU computation
cpu_matrix1 = torch.randn(matrix_size, matrix_size)
cpu_matrix2 = torch.randn(matrix_size, matrix_size)
start_time = time.time()
cpu_result = torch.matmul(cpu_matrix1, cpu_matrix2)
cpu_time = time.time() - start_time
print(f"CPU time: {cpu_time:.4f} seconds")
# GPU computation (if available)
if torch.cuda.is_available():
    gpu_matrix1 = cpu_matrix1.cuda()
    gpu_matrix2 = cpu_matrix2.cuda()
    # First run (includes one-time CUDA initialization and warm-up overhead)
    start_time = time.time()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    torch.cuda.synchronize()  # Wait for the GPU operation to complete
    first_gpu_time = time.time() - start_time
    print(f"GPU time (first run): {first_gpu_time:.4f} seconds")
    # Second run (just the computation)
    start_time = time.time()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    torch.cuda.synchronize()  # Wait for the GPU operation to complete
    second_gpu_time = time.time() - start_time
    print(f"GPU time (second run): {second_gpu_time:.4f} seconds")
    # Calculate speedup
    print(f"GPU speedup: {cpu_time / second_gpu_time:.2f}x")
Example output:
CPU time: 10.2456 seconds
GPU time (first run): 0.3421 seconds
GPU time (second run): 0.0824 seconds
GPU speedup: 124.34x
Note: The actual performance improvement will vary based on your specific hardware, but GPUs typically provide significant speedups for large tensor operations.
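Because GPU kernels run asynchronously, wall-clock timing is only meaningful if you call torch.cuda.synchronize() as above. An alternative is to time with CUDA events (a sketch that reuses gpu_matrix1 and gpu_matrix2 from the benchmark above):
if torch.cuda.is_available():
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    end.record()
    torch.cuda.synchronize()  # Wait until both recorded events have completed
    print(f"GPU time (CUDA events): {start.elapsed_time(end) / 1000:.4f} seconds")  # elapsed_time is in milliseconds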
Real-World Example: Training a Neural Network
Let's see a simple example of how to use GPU tensors in a neural network training scenario:
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Choose device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create model and move it to the selected device
model = SimpleNN().to(device)
# Create dummy data (on the selected device)
inputs = torch.randn(100, 10).to(device) # 100 samples, 10 features each
targets = torch.randn(100, 1).to(device) # 100 target values
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Training loop
for epoch in range(5):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
Example output:
Using device: cuda
Epoch 1, Loss: 1.0279
Epoch 2, Loss: 1.0248
Epoch 3, Loss: 1.0217
Epoch 4, Loss: 1.0187
Epoch 5, Loss: 1.0156
Best Practices for GPU Tensors
- Move data to GPU in batches: Avoid transferring individual tensors; instead, batch your transfers to minimize overhead.
- Move your model to GPU first: Use model.to(device) to move your entire neural network to the GPU at once.
- Be mindful of memory: GPUs have limited memory. For large datasets:
  - Use smaller batch sizes
  - Free unnecessary tensors with del tensor
  - Use torch.cuda.empty_cache() if needed
- Leverage GPU-accelerated libraries: Many PyTorch extensions (like torchvision, torchaudio) have GPU implementations.
- Watch for data transfers: Moving data between CPU and GPU is expensive. Keep operations on one device when possible.
- Use mixed precision training: On newer GPUs, use torch.cuda.amp to speed up training with less memory usage (see the sketch right after this list).
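As an illustration of the last point, here is a minimal mixed-precision training step using torch.cuda.amp (a sketch that reuses the model, inputs, targets, criterion, and optimizer from the training example above):
if torch.cuda.is_available():
    scaler = torch.cuda.amp.GradScaler()
    for epoch in range(5):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # Run the forward pass in mixed precision
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()  # Scale the loss to avoid underflow in float16 gradients
        scaler.step(optimizer)         # Unscale the gradients and take an optimizer step
        scaler.update()                # Adjust the scale factor for the next iteration
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")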
# Example of clearing GPU memory
if torch.cuda.is_available():
    # Create a large tensor
    large_tensor = torch.randn(10000, 10000, device='cuda')
    # Check memory usage
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    # Free memory
    del large_tensor
    torch.cuda.empty_cache()
    # Check memory usage after freeing
    print(f"Memory allocated after clearing: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
Example output:
Memory allocated: 0.38 GB
Memory allocated after clearing: 0.00 GB
Multiple GPUs
If you have multiple GPUs, you can specify which one to use:
if torch.cuda.device_count() > 1:
    print(f"You have {torch.cuda.device_count()} GPUs!")
    # Create tensors on specific GPUs
    tensor_gpu0 = torch.tensor([1, 2, 3], device='cuda:0')  # First GPU
    tensor_gpu1 = torch.tensor([4, 5, 6], device='cuda:1')  # Second GPU
    print(f"Tensor on GPU 0: {tensor_gpu0} (device: {tensor_gpu0.device})")
    print(f"Tensor on GPU 1: {tensor_gpu1} (device: {tensor_gpu1.device})")
    # Moving between GPUs
    moved_tensor = tensor_gpu0.to('cuda:1')
    print(f"Moved tensor: {moved_tensor} (device: {moved_tensor.device})")
For training on multiple GPUs, PyTorch provides DataParallel and DistributedDataParallel for parallel processing, but that's a more advanced topic.
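As a taste, wrapping a model in DataParallel only takes one line (a minimal sketch that reuses the SimpleNN class from the training example and assumes at least two GPUs are visible):
if torch.cuda.device_count() > 1:
    parallel_model = nn.DataParallel(SimpleNN()).to('cuda')
    batch = torch.randn(64, 10, device='cuda')  # Input batch on the default GPU
    outputs = parallel_model(batch)  # The batch is split across GPUs; outputs are gathered on cuda:0
    print(f"Output shape: {outputs.shape} (device: {outputs.device})")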
GPU Usage Without CUDA
If you don't have an NVIDIA GPU with CUDA, PyTorch offers alternative acceleration options:
- MPS (Metal Performance Shaders) for macOS with Apple silicon (M1/M2) chips:
# Check if MPS is available (macOS with Apple silicon chips)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Using MPS device: {device}")
    # Create a tensor on the MPS device
    mps_tensor = torch.tensor([1, 2, 3], device=device)
    print(f"MPS tensor: {mps_tensor} (device: {mps_tensor.device})")
Summary
GPU tensors in PyTorch provide a powerful way to accelerate your deep learning workflows:
- Use .to('cuda') or .cuda() to move tensors to the GPU
- Create tensors directly on the GPU with device='cuda'
- Keep operations on the same device to avoid unnecessary transfers
- Be mindful of GPU memory usage
- Use device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') for code that works with or without a GPU
By leveraging GPU acceleration, your deep learning models can train much faster, allowing you to iterate and experiment more quickly.
Exercises
- Create a benchmark script that compares CPU vs GPU performance for different tensor sizes and operations.
- Modify the neural network example to use a real dataset like MNIST and train it on both CPU and GPU. Compare the training times.
- Implement a function that safely moves data to the best available device (CUDA, MPS, or CPU).
- Experiment with PyTorch's memory profiling tools (torch.cuda.memory_summary()) to analyze your GPU memory usage.
- Write a function that takes a model and dataset and automatically chooses the largest possible batch size for GPU training without running out of memory.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)