PyTorch GPU Tensors
Introduction
One of PyTorch's most powerful features is its seamless integration with GPUs (Graphics Processing Units). GPUs can significantly accelerate deep learning computations, often by orders of magnitude compared to CPUs. In this tutorial, we'll explore how to use PyTorch tensors on GPUs, which is essential knowledge for building and training modern deep learning models efficiently.
GPUs excel at parallel processing, making them ideal for the matrix operations that are fundamental to deep learning. PyTorch makes it incredibly easy to move your computations to the GPU with just a few lines of code.
Prerequisites
Before diving into GPU tensors, make sure you have:
- PyTorch installed (preferably with CUDA support)
- A compatible NVIDIA GPU (for CUDA support)
- Basic understanding of PyTorch tensors
Checking GPU Availability
Let's first check if a GPU is available for PyTorch:
import torch
# Check if CUDA (NVIDIA's GPU computing platform) is available
print(f"CUDA available: {torch.cuda.is_available()}")
# If available, get the number of GPUs
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")  # Name of the first GPU
Example output:
CUDA available: True
Number of GPUs: 1
GPU name: NVIDIA GeForce RTX 3080
If torch.cuda.is_available() returns False, it means one of the following:
- You don't have a compatible NVIDIA GPU
- You haven't installed the CUDA version of PyTorch
- Your CUDA drivers aren't properly installed
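A quick way to tell whether your installed PyTorch build was compiled with CUDA support at all is to inspect torch.version.cuda (a small sketch; it is None for CPU-only builds):
import torch
# torch.version.cuda is None for CPU-only builds, otherwise it is the CUDA version string
print(f"PyTorch version: {torch.__version__}")
print(f"Built with CUDA: {torch.version.cuda}")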
Creating GPU Tensors
There are several ways to create tensors on the GPU:
Method 1: Create on CPU and move to GPU
# Create a tensor on CPU
cpu_tensor = torch.tensor([1, 2, 3, 4])
print(f"CPU tensor: {cpu_tensor} (device: {cpu_tensor.device})")
# Move tensor to GPU
gpu_tensor = cpu_tensor.to('cuda')
print(f"GPU tensor: {gpu_tensor} (device: {gpu_tensor.device})")
# You can also pass a torch.device object, e.g. torch.device('cuda:0') for the first GPU
gpu_tensor_alt = cpu_tensor.to(torch.device('cuda:0'))
print(f"GPU tensor (alt): {gpu_tensor_alt} (device: {gpu_tensor_alt.device})")
Example output:
CPU tensor: tensor([1, 2, 3, 4]) (device: cpu)
GPU tensor: tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
GPU tensor (alt): tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Method 2: Create directly on GPU
# Create tensor directly on GPU
direct_gpu_tensor = torch.tensor([1, 2, 3, 4], device='cuda')
print(f"Direct GPU tensor: {direct_gpu_tensor} (device: {direct_gpu_tensor.device})")
# Other tensor creation functions also accept a device argument
zeros_gpu = torch.zeros(3, 4, device='cuda')
print(f"Zeros on GPU: {zeros_gpu} (device: {zeros_gpu.device})")
Example output:
Direct GPU tensor: tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Zeros on GPU: tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], device='cuda:0') (device: cuda:0)
Moving Tensors Between Devices
You can easily move tensors between CPU and GPU:
# Move GPU tensor back to CPU
back_to_cpu = gpu_tensor.cpu()
print(f"Back to CPU: {back_to_cpu} (device: {back_to_cpu.device})")
# Shorthand for moving to current CUDA device
if torch.cuda.is_available():
    gpu_tensor_short = cpu_tensor.cuda()
    print(f"GPU tensor (shorthand): {gpu_tensor_short} (device: {gpu_tensor_short.device})")
Example output:
Back to CPU: tensor([1, 2, 3, 4]) (device: cpu)
GPU tensor (shorthand): tensor([1, 2, 3, 4], device='cuda:0') (device: cuda:0)
Operations on GPU Tensors
Operations between tensors can only be performed if they are on the same device. PyTorch will throw an error if you try to operate on tensors on different devices.
if torch.cuda.is_available():
    # Create two GPU tensors
    a = torch.tensor([1, 2, 3], device='cuda')
    b = torch.tensor([4, 5, 6], device='cuda')
    # Operations work normally on the GPU
    c = a + b
    print(f"Addition result: {c} (device: {c.device})")
    # This would raise an error:
    # cpu_tensor = torch.tensor([1, 2, 3])
    # error_result = a + cpu_tensor  # Error: cannot perform operations across devices
    # Correct way:
    cpu_tensor = torch.tensor([1, 2, 3])
    result = a + cpu_tensor.cuda()  # Move the CPU tensor to the GPU first
    print(f"Mixed operation result: {result} (device: {result.device})")
Example output:
Addition result: tensor([5, 7, 9], device='cuda:0') (device: cuda:0)
Mixed operation result: tensor([2, 4, 6], device='cuda:0') (device: cuda:0)
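If you want to see the failure for yourself, you can trigger it and catch the RuntimeError (a minimal sketch; the exact error message depends on your PyTorch version):
if torch.cuda.is_available():
    a = torch.tensor([1, 2, 3], device='cuda')
    cpu_tensor = torch.tensor([1, 2, 3])
    try:
        _ = a + cpu_tensor  # These tensors live on different devices
    except RuntimeError as e:
        print(f"Cross-device operation failed: {e}")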
Performance Comparison: CPU vs GPU
Let's compare the performance of a common operation on both CPU and GPU:
import time
# Matrix multiplication - a common deep learning operation
matrix_size = 5000
# CPU computation
cpu_matrix1 = torch.randn(matrix_size, matrix_size)
cpu_matrix2 = torch.randn(matrix_size, matrix_size)
start_time = time.time()
cpu_result = torch.matmul(cpu_matrix1, cpu_matrix2)
cpu_time = time.time() - start_time
print(f"CPU time: {cpu_time:.4f} seconds")
# GPU computation (if available)
if torch.cuda.is_available():
    gpu_matrix1 = cpu_matrix1.cuda()
    gpu_matrix2 = cpu_matrix2.cuda()
    # First run (includes one-time CUDA initialization and warm-up overhead)
    start_time = time.time()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    torch.cuda.synchronize()  # Wait for the GPU operation to complete
    first_gpu_time = time.time() - start_time
    print(f"GPU time (first run): {first_gpu_time:.4f} seconds")
    # Second run (just the computation)
    start_time = time.time()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    torch.cuda.synchronize()  # Wait for the GPU operation to complete
    second_gpu_time = time.time() - start_time
    print(f"GPU time (second run): {second_gpu_time:.4f} seconds")
    # Calculate speedup
    print(f"GPU speedup: {cpu_time / second_gpu_time:.2f}x")
Example output:
CPU time: 10.2456 seconds
GPU time (first run): 0.3421 seconds
GPU time (second run): 0.0824 seconds
GPU speedup: 124.34x
Note: The actual performance improvement will vary based on your specific hardware, but GPUs typically provide significant speedups for large tensor operations.
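Because GPU kernels run asynchronously, wall-clock timing is only meaningful if you call torch.cuda.synchronize() as above. An alternative is to time with CUDA events (a sketch that reuses gpu_matrix1 and gpu_matrix2 from the benchmark above):
if torch.cuda.is_available():
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    gpu_result = torch.matmul(gpu_matrix1, gpu_matrix2)
    end.record()
    torch.cuda.synchronize()  # Wait until both recorded events have completed
    print(f"GPU time (CUDA events): {start.elapsed_time(end) / 1000:.4f} seconds")  # elapsed_time is in milliseconds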
Real-World Example: Training a Neural Network
Let's see a simple example of how to use GPU tensors in a neural network training scenario:
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Choose device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create model and move it to the selected device
model = SimpleNN().to(device)
# Create dummy data (on the selected device)
inputs = torch.randn(100, 10).to(device) # 100 samples, 10 features each
targets = torch.randn(100, 1).to(device) # 100 target values
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Training loop
for epoch in range(5):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
Example output:
Using device: cuda
Epoch 1, Loss: 1.0279
Epoch 2, Loss: 1.0248
Epoch 3, Loss: 1.0217
Epoch 4, Loss: 1.0187
Epoch 5, Loss: 1.0156
Best Practices for GPU Tensors
- Move data to GPU in batches: Avoid transferring individual tensors; instead, batch your transfers to minimize overhead.
- Move your model to GPU first: Use model.to(device) to move your entire neural network to the GPU at once.
- Be mindful of memory: GPUs have limited memory. For large datasets:
  - Use smaller batch sizes
  - Free unnecessary tensors with del tensor
  - Use torch.cuda.empty_cache() if needed
- Leverage GPU-accelerated libraries: Many PyTorch extensions (like torchvision, torchaudio) have GPU implementations.
- Watch for data transfers: Moving data between CPU and GPU is expensive. Keep operations on one device when possible.
- Use mixed precision training: On newer GPUs, use torch.cuda.amp to speed up training with less memory usage (see the sketch right after this list).
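As an illustration of the last point, here is a minimal mixed-precision training step using torch.cuda.amp (a sketch that reuses the model, inputs, targets, criterion, and optimizer from the training example above):
if torch.cuda.is_available():
    scaler = torch.cuda.amp.GradScaler()
    for epoch in range(5):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # Run the forward pass in mixed precision
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()  # Scale the loss to avoid underflow in float16 gradients
        scaler.step(optimizer)         # Unscale the gradients and take an optimizer step
        scaler.update()                # Adjust the scale factor for the next iteration
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")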
# Example of clearing GPU memory
if torch.cuda.is_available():
    # Create a large tensor
    large_tensor = torch.randn(10000, 10000, device='cuda')
    # Check memory usage
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    # Free memory
    del large_tensor
    torch.cuda.empty_cache()
    # Check memory usage after freeing
    print(f"Memory allocated after clearing: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
Example output:
Memory allocated: 0.38 GB
Memory allocated after clearing: 0.00 GB
Multiple GPUs
If you have multiple GPUs, you can specify which one to use:
if torch.cuda.device_count() > 1:
    print(f"You have {torch.cuda.device_count()} GPUs!")
    # Create tensors on specific GPUs
    tensor_gpu0 = torch.tensor([1, 2, 3], device='cuda:0')  # First GPU
    tensor_gpu1 = torch.tensor([4, 5, 6], device='cuda:1')  # Second GPU
    print(f"Tensor on GPU 0: {tensor_gpu0} (device: {tensor_gpu0.device})")
    print(f"Tensor on GPU 1: {tensor_gpu1} (device: {tensor_gpu1.device})")
    # Moving between GPUs
    moved_tensor = tensor_gpu0.to('cuda:1')
    print(f"Moved tensor: {moved_tensor} (device: {moved_tensor.device})")
For training on multiple GPUs, PyTorch provides DataParallel and DistributedDataParallel for parallel processing, but that's a more advanced topic.
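As a taste, wrapping a model in DataParallel only takes one line (a minimal sketch that reuses the SimpleNN class from the training example and assumes at least two GPUs are visible):
if torch.cuda.device_count() > 1:
    parallel_model = nn.DataParallel(SimpleNN()).to('cuda')
    batch = torch.randn(64, 10, device='cuda')  # Input batch on the default GPU
    outputs = parallel_model(batch)  # The batch is split across GPUs; outputs are gathered on cuda:0
    print(f"Output shape: {outputs.shape} (device: {outputs.device})")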
GPU Usage Without CUDA
If you don't have an NVIDIA GPU with CUDA, PyTorch offers alternative acceleration options:
- MPS (Metal Performance Shaders) for macOS with Apple silicon (M1/M2) chips:
# Check if MPS is available (macOS with Apple silicon chips)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Using MPS device: {device}")
    # Create a tensor on the MPS device
    mps_tensor = torch.tensor([1, 2, 3], device=device)
    print(f"MPS tensor: {mps_tensor} (device: {mps_tensor.device})")
Summary
GPU tensors in PyTorch provide a powerful way to accelerate your deep learning workflows:
- Use .to('cuda') or .cuda() to move tensors to the GPU
- Create tensors directly on the GPU with device='cuda'
- Keep operations on the same device to avoid unnecessary transfers
- Be mindful of GPU memory usage
- Use device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') for code that works with or without a GPU
By leveraging GPU acceleration, your deep learning models can train much faster, allowing you to iterate and experiment more quickly.
Exercises
- Create a benchmark script that compares CPU vs GPU performance for different tensor sizes and operations.
- Modify the neural network example to use a real dataset like MNIST and train it on both CPU and GPU. Compare the training times.
- Implement a function that safely moves data to the best available device (CUDA, MPS, or CPU).
- Experiment with PyTorch's memory profiling tools (torch.cuda.memory_summary()) to analyze your GPU memory usage.
- Write a function that takes a model and dataset and automatically chooses the largest possible batch size for GPU training without running out of memory.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)