
PyTorch Recurrent Networks

Recurrent Neural Networks (RNNs) are a powerful class of neural networks designed to work with sequential data. In Natural Language Processing (NLP), these networks are fundamental because text is inherently sequential - the meaning of a word depends on the words that come before it. In this tutorial, we'll explore how to implement and use RNNs in PyTorch for various NLP tasks.

Introduction to Recurrent Neural Networks

Unlike feedforward neural networks, RNNs have connections that feed back into the network, allowing them to maintain an internal memory of previous inputs. This makes them particularly well-suited for tasks involving sequential data such as:

  • Text generation
  • Machine translation
  • Speech recognition
  • Sentiment analysis
  • Named entity recognition

Why Regular Neural Networks Fall Short for Sequential Data

Standard neural networks process each input independently, with no memory of previous inputs. For language tasks, this is problematic:

"The cat sat on the mat."

To understand what "it" refers to in "it sat on the mat," we need to remember that "the cat" came earlier in the sequence. RNNs solve this by maintaining a hidden state that carries information from previous steps.
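
Conceptually, the hidden state is updated at every time step from the current input and the previous hidden state. The toy sketch below (illustrative, untrained weights, using the tanh update that nn.RNN applies by default) writes this recurrence out by hand:

python
import torch

input_size, hidden_size = 10, 20

# Illustrative, untrained weights - nn.RNN learns these during training
W_ih = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)

h = torch.zeros(hidden_size)        # initial hidden state ("memory")
x_seq = torch.randn(5, input_size)  # a sequence of 5 input vectors

for x_t in x_seq:
    # Each step mixes the current input with what has been remembered so far
    h = torch.tanh(W_ih @ x_t + W_hh @ h + b)

print(h.shape)  # torch.Size([20]) - a summary of everything seen so far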

Basic RNN Structure in PyTorch

PyTorch provides built-in modules for creating recurrent networks. Let's start with the basic RNN module:

python
import torch
import torch.nn as nn

# Parameters
input_size = 10 # Size of input features (e.g., vocabulary size or embedding dimension)
hidden_size = 20 # Size of hidden state
num_layers = 1 # Number of recurrent layers
batch_size = 3 # Number of sequences processed in parallel
seq_length = 5 # Length of input sequences

# Create an RNN
rnn = nn.RNN(input_size=input_size,
             hidden_size=hidden_size,
             num_layers=num_layers,
             batch_first=True)  # batch comes first in input dimensions

# Create example input (batch_size, seq_length, input_size)
x = torch.randn(batch_size, seq_length, input_size)

# Initialize hidden state (num_layers, batch_size, hidden_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass
output, hn = rnn(x, h0)

print(f"Output shape: {output.shape}") # Should be (batch_size, seq_length, hidden_size)
print(f"Hidden state shape: {hn.shape}") # Should be (num_layers, batch_size, hidden_size)

Output:

Output shape: torch.Size([3, 5, 20])
Hidden state shape: torch.Size([1, 3, 20])

In this example:

  • output contains the output from the RNN for each time step
  • hn is the final hidden state
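
For a single-layer, unidirectional RNN, the last time step of output and the final hidden state hn hold the same values, which you can check directly (continuing the example above):

python
# The last time step of `output` for each sequence equals the final hidden state
print(torch.allclose(output[:, -1, :], hn[0]))  # True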

Limitations of Simple RNNs and Better Alternatives

Basic RNNs suffer from two significant problems (a toy numeric illustration follows the list):

  1. Vanishing gradient problem: Gradients can become too small during backpropagation, preventing the network from learning long-term dependencies
  2. Exploding gradient problem: Gradients can become extremely large, causing the network to become unstable
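
To build intuition for both problems, consider a toy calculation (not actual RNN gradients): backpropagating through many time steps multiplies many similar factors together, so factors slightly below 1 shrink toward zero while factors slightly above 1 blow up:

python
# Repeated multiplication over 100 "time steps"
print(0.9 ** 100)  # ~2.7e-05 - the vanishing regime
print(1.1 ** 100)  # ~13780.6 - the exploding regime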

To address these issues, more sophisticated recurrent architectures were developed:

Long Short-Term Memory (LSTM) Networks

LSTMs introduce a memory cell with gating mechanisms that control what information to remember, forget, and output:

python
# Create an LSTM
lstm = nn.LSTM(input_size=input_size,
               hidden_size=hidden_size,
               num_layers=num_layers,
               batch_first=True)

# Forward pass requires both hidden state and cell state
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass
output, (hn, cn) = lstm(x, (h0, c0))

print(f"Output shape: {output.shape}") # Should be (batch_size, seq_length, hidden_size)
print(f"Final hidden state shape: {hn.shape}") # Should be (num_layers, batch_size, hidden_size)
print(f"Final cell state shape: {cn.shape}") # Should be (num_layers, batch_size, hidden_size)

Output:

Output shape: torch.Size([3, 5, 20])
Final hidden state shape: torch.Size([1, 3, 20])
Final cell state shape: torch.Size([1, 3, 20])

Gated Recurrent Units (GRUs)

GRUs are a simplified version of LSTMs that combine the forget and input gates into a single "update gate":

python
# Create a GRU
gru = nn.GRU(input_size=input_size,
             hidden_size=hidden_size,
             num_layers=num_layers,
             batch_first=True)

# Forward pass requires only hidden state (no cell state)
h0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass
output, hn = gru(x, h0)

print(f"Output shape: {output.shape}") # Should be (batch_size, seq_length, hidden_size)
print(f"Final hidden state shape: {hn.shape}") # Should be (num_layers, batch_size, hidden_size)

Output:

Output shape: torch.Size([3, 5, 20])
Final hidden state shape: torch.Size([1, 3, 20])

Creating a Complete RNN Model for Sentiment Analysis

Let's build a complete sentiment analysis model using PyTorch's recurrent modules:

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # LSTM layer
        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout if n_layers > 1 else 0,
                           batch_first=True)

        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text = [batch size, sentence length]

        # Pass text through embedding layer
        embedded = self.embedding(text)

        # Pack sequence to handle variable length inputs efficiently
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(),
                                                            batch_first=True, enforce_sorted=False)

        # Run through RNN
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # Unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

        # For classification, we'll use the final hidden state
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        else:
            hidden = self.dropout(hidden[-1, :, :])

        # Pass through linear layer
        return self.fc(hidden)
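
Before wiring up a real dataset, it can help to smoke-test the model with random token IDs; the hyperparameters below are arbitrary placeholders chosen only to check that the shapes line up:

python
# Quick shape check with made-up values
toy_model = SentimentRNN(vocab_size=1000, embedding_dim=100, hidden_dim=256,
                         output_dim=1, n_layers=2, bidirectional=True,
                         dropout=0.5, pad_idx=0)

dummy_text = torch.randint(0, 1000, (4, 12))   # batch of 4 sequences of length 12
dummy_lengths = torch.tensor([12, 10, 7, 5])   # pretend some sequences are padded
print(toy_model(dummy_text, dummy_lengths).shape)  # torch.Size([4, 1])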

Example: Sentiment Classification with IMDB Dataset

Here's how you might use the model with the IMDB movie review dataset:

python
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torch.utils.data as data

# 1. Load and preprocess data
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Get training data
train_iter = IMDB(split='train')
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# Note: depending on your torchtext version, IMDB yields labels as the strings "pos"/"neg"
# or as the integers 1/2 - adjust this mapping to match what your version returns.
label_pipeline = lambda x: 1 if x == "pos" else 0

# 2. Create dataset and dataloader
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(len(processed_text))

    # Pad sequences
    padded_text = nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])
    return torch.tensor(label_list), padded_text, torch.tensor(lengths)

# Create data loader
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

train_dataloader = data.DataLoader(list(train_iter), batch_size=64, shuffle=True, collate_fn=collate_batch)
test_dataloader = data.DataLoader(list(test_iter), batch_size=64, shuffle=False, collate_fn=collate_batch)

# 3. Initialize model, loss, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SentimentRNN(
    vocab_size=len(vocab),
    embedding_dim=100,
    hidden_dim=256,
    output_dim=1,  # Binary classification
    n_layers=2,
    bidirectional=True,
    dropout=0.5,
    pad_idx=vocab["<pad>"]
).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

# 4. Training function
def train(model, dataloader, optimizer, criterion):
    model.train()
    epoch_loss = 0
    correct_preds = 0
    total_preds = 0

    for labels, text, lengths in dataloader:
        labels, text, lengths = labels.to(device), text.to(device), lengths.to(device)

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass
        predictions = model(text, lengths).squeeze(1)

        # Calculate loss
        loss = criterion(predictions, labels.float())

        # Backpropagation
        loss.backward()

        # Update weights
        optimizer.step()

        epoch_loss += loss.item()

        # Calculate accuracy
        predicted_labels = torch.round(torch.sigmoid(predictions))
        correct_preds += (predicted_labels == labels).sum().item()
        total_preds += labels.size(0)

    return epoch_loss / len(dataloader), correct_preds / total_preds

# 5. Evaluation function
def evaluate(model, dataloader, criterion):
    model.eval()
    epoch_loss = 0
    correct_preds = 0
    total_preds = 0

    with torch.no_grad():
        for labels, text, lengths in dataloader:
            labels, text, lengths = labels.to(device), text.to(device), lengths.to(device)
            predictions = model(text, lengths).squeeze(1)
            loss = criterion(predictions, labels.float())
            epoch_loss += loss.item()

            predicted_labels = torch.round(torch.sigmoid(predictions))
            correct_preds += (predicted_labels == labels).sum().item()
            total_preds += labels.size(0)

    return epoch_loss / len(dataloader), correct_preds / total_preds

# 6. Training loop (just a demonstration - this would take a while to run)
N_EPOCHS = 5

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, test_dataloader, criterion)

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')

This code demonstrates a complete workflow for training a sentiment analysis model with PyTorch RNNs; for simplicity, the test split doubles as a validation set here. Note that running this example would require a significant amount of time and computational resources.
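
After training, you may want to score individual reviews. A small helper along these lines (an illustrative addition that reuses text_pipeline, device, and the trained model from above) converts raw text into a probability:

python
def predict_sentiment(model, sentence):
    model.eval()
    tokens = torch.tensor(text_pipeline(sentence), dtype=torch.int64).unsqueeze(0).to(device)
    length = torch.tensor([tokens.size(1)])
    with torch.no_grad():
        prob = torch.sigmoid(model(tokens, length)).item()
    return prob  # near 1.0 = positive, near 0.0 = negative

print(predict_sentiment(model, "This film was absolutely wonderful!"))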

Bidirectional RNNs

In many NLP tasks, context from both directions (past and future) can be important. Bidirectional RNNs process the input sequence in both forward and backward directions:

python
# Create a bidirectional LSTM
bi_lstm = nn.LSTM(input_size=input_size,
                  hidden_size=hidden_size,
                  num_layers=num_layers,
                  batch_first=True,
                  bidirectional=True)  # This makes it bidirectional

# Forward pass
output, (hn, cn) = bi_lstm(x, (h0.repeat(2, 1, 1), c0.repeat(2, 1, 1))) # Double the first dimension for bidirectional

print(f"Output shape: {output.shape}") # Should be (batch_size, seq_length, hidden_size*2)
print(f"Hidden state shape: {hn.shape}") # Should be (num_layers*2, batch_size, hidden_size)

Output:

Output shape: torch.Size([3, 5, 40])
Hidden state shape: torch.Size([2, 3, 20])

Note that the output dimension is doubled because we get one set of outputs from the forward direction and another from the backward direction.
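
If you ever need the two directions separately, the last dimension of output is just the forward and backward outputs concatenated, so you can slice it apart (continuing the example above):

python
forward_out = output[:, :, :hidden_size]   # outputs from the forward direction
backward_out = output[:, :, hidden_size:]  # outputs from the backward direction
print(forward_out.shape, backward_out.shape)  # torch.Size([3, 5, 20]) torch.Size([3, 5, 20])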

Multi-layer RNNs

For more complex tasks, stacking multiple RNN layers can improve performance:

python
# Create a multi-layer LSTM
multi_lstm = nn.LSTM(input_size=input_size,
                     hidden_size=hidden_size,
                     num_layers=3,  # 3 stacked LSTM layers
                     batch_first=True)

# Initialize hidden and cell states
h0 = torch.zeros(3, batch_size, hidden_size) # 3 layers
c0 = torch.zeros(3, batch_size, hidden_size) # 3 layers

# Forward pass
output, (hn, cn) = multi_lstm(x, (h0, c0))

print(f"Output shape: {output.shape}") # Should be (batch_size, seq_length, hidden_size)
print(f"Hidden state shape: {hn.shape}") # Should be (num_layers, batch_size, hidden_size)

Output:

Output shape: torch.Size([3, 5, 20])
Hidden state shape: torch.Size([3, 3, 20])

Variable Length Sequences

When dealing with text data, sequences often have different lengths. PyTorch provides tools to handle this efficiently:

python
# Sample data with variable lengths
batch_size = 3
max_length = 10
embedding_dim = 8

# Create sequences of different lengths
sequences = [
    torch.randn(7, embedding_dim),   # 7 timesteps
    torch.randn(10, embedding_dim),  # 10 timesteps
    torch.randn(5, embedding_dim)    # 5 timesteps
]

# Record actual lengths
seq_lengths = torch.tensor([len(seq) for seq in sequences])

# Pad sequences
padded_sequences = nn.utils.rnn.pad_sequence(sequences, batch_first=True)
print(f"Padded sequence shape: {padded_sequences.shape}") # Should be [3, 10, 8]

# Pack padded sequence
packed_sequences = nn.utils.rnn.pack_padded_sequence(
    padded_sequences,
    seq_lengths.cpu(),
    batch_first=True,
    enforce_sorted=False
)

# Create RNN
rnn = nn.LSTM(embedding_dim, hidden_size=20, batch_first=True)

# Process packed sequences
packed_output, (hn, cn) = rnn(packed_sequences)

# Unpack sequences
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

print(f"Output shape after unpacking: {output.shape}")

Output:

Padded sequence shape: torch.Size([3, 10, 8])
Output shape after unpacking: torch.Size([3, 10, 20])

This approach ensures that computation is only performed on actual data rather than padding, making processing more efficient.
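
You can verify that nothing was computed for the padded positions: after pad_packed_sequence, time steps beyond each sequence's true length are filled with the padding value (zero by default), and the original lengths are returned alongside the output:

python
print(output_lengths)                # tensor([ 7, 10,  5]) - original lengths, original order
print(output[2, 5:, :].abs().sum())  # tensor(0.) - the 5-step sequence has only padding after step 5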

Practical Tips for Training RNNs

  1. Gradient Clipping: Prevent exploding gradients by clipping them to a maximum value (the sketch after this list shows where this call goes in a training step):

    python
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
  2. Use Packed Sequences: Always use packed sequences when dealing with variable-length inputs to improve efficiency.

  3. Bidirectional vs. Unidirectional: Use bidirectional RNNs when future context is available during processing (most NLP tasks), but unidirectional when generating sequences (like text generation).

  4. GRU vs LSTM: GRUs are typically faster with fewer parameters, while LSTMs might capture longer-term dependencies better. Experiment to see which works best for your task.

  5. Embedding Layers: Almost always use an embedding layer as the first layer for text data:

    python
    embedding = nn.Embedding(vocab_size, embedding_dim)
  6. Dropout: Apply dropout to prevent overfitting, especially between layers in a multi-layer RNN:

    python
    lstm = nn.LSTM(..., dropout=0.5)  # Applies dropout between stacked layers (not after the last layer)
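
For context, the gradient-clipping call from tip 1 belongs between the backward pass and the optimizer step. Here is a minimal sketch of one training iteration, reusing the names from the sentiment example above:

python
optimizer.zero_grad()
predictions = model(text, lengths).squeeze(1)
loss = criterion(predictions, labels.float())
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)  # clip before updating weights
optimizer.step()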

Summary

In this tutorial, you've learned:

  • The fundamentals of recurrent neural networks and why they're essential for NLP
  • How to implement basic RNNs, LSTMs, and GRUs in PyTorch
  • Techniques for handling variable-length sequences
  • How to build a complete sentiment analysis model
  • Best practices for training RNNs effectively

RNNs remain a powerful tool for NLP tasks, though they are increasingly being supplemented or replaced by transformer models for many applications. Understanding RNNs provides a solid foundation for more advanced deep learning for NLP.

Additional Resources and Exercises

Resources

  1. PyTorch RNN Documentation
  2. The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy's influential blog post
  3. Understanding LSTM Networks - Christopher Olah's visualization-rich explanation

Exercises

  1. Text Generation: Implement a character-level RNN to generate text in the style of a particular author.

  2. Named Entity Recognition: Create an LSTM model that identifies entities (people, organizations, locations) in text.

  3. Sequence-to-Sequence Translation: Build a simple machine translation system using an encoder-decoder RNN architecture.

  4. Time Series Prediction: Apply RNNs to predict stock prices or weather patterns using historical data.

  5. Hyperparameter Tuning: Experiment with different RNN architectures, hidden sizes, and layer counts to optimize performance on a specific task.

Happy coding!


