PyTorch Sequence Modeling
Introduction
Sequence modeling is an essential part of Natural Language Processing (NLP): it deals with sequential data such as text, speech, or time series. In NLP, the words in a sentence follow a sequential order that carries semantic meaning. PyTorch provides powerful tools for building and training sequence models that capture this sequential nature of language data.
In this tutorial, we'll explore how to implement sequence models using PyTorch for NLP tasks. We'll start with the basics of sequence representation and gradually move to more complex architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and get a glimpse of modern Transformer-based approaches.
Prerequisites
Before diving in, you should have:
- Basic understanding of PyTorch tensors and neural networks
- Familiarity with Python programming
- Basic knowledge of NLP concepts
Let's start our journey into sequence modeling with PyTorch!
Representing Sequences in PyTorch
Before we can build models, we need to represent text data as numerical tensors that PyTorch can process.
Text Preprocessing
import torch
import torch.nn as nn
import numpy as np
# Sample text data
text = "PyTorch makes sequence modeling easy and efficient"
# Simple preprocessing: lowercase and tokenize
tokens = text.lower().split()
print(f"Tokens: {tokens}")
# Create a vocabulary (mapping from words to indices)
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")
Output:
Tokens: ['pytorch', 'makes', 'sequence', 'modeling', 'easy', 'and', 'efficient']
Vocabulary: {'and': 0, 'easy': 1, 'efficient': 2, 'makes': 3, 'modeling': 4, 'pytorch': 5, 'sequence': 6}
Vocabulary size: 7
Converting Text to Tensors
Now let's convert our tokens into numerical tensors:
# Convert tokens to tensor of indices
indices = [vocab[token] for token in tokens]
tensor_sequence = torch.tensor(indices, dtype=torch.long)
print(f"Tensor representation: {tensor_sequence}")
# One-hot encoding
def one_hot_encode(tensor, vocab_size):
    one_hot = torch.zeros(tensor.size(0), vocab_size)
    one_hot.scatter_(1, tensor.unsqueeze(1), 1)
    return one_hot
one_hot_sequence = one_hot_encode(tensor_sequence, vocab_size)
print(f"One-hot encoded shape: {one_hot_sequence.shape}")
print(f"First token one-hot: {one_hot_sequence[0]}")
Output:
Tensor representation: tensor([5, 3, 6, 4, 1, 0, 2])
One-hot encoded shape: torch.Size([7, 7])
First token one-hot: tensor([0., 0., 0., 0., 0., 1., 0.])
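One-hot vectors grow with the vocabulary and carry no notion of similarity between words, so in practice models usually learn dense embeddings instead (as the LSTM examples later in this tutorial do). Here is a minimal sketch using nn.Embedding; the embedding dimension of 4 is an arbitrary choice for illustration:
# Dense embeddings: a compact alternative to one-hot vectors (embedding_dim=4 is arbitrary)
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)
# Reusing tensor_sequence from above: tensor([5, 3, 6, 4, 1, 0, 2])
embedded = embedding(tensor_sequence)
print(f"Embedded shape: {embedded.shape}")  # torch.Size([7, 4]) - one dense vector per token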
Recurrent Neural Networks (RNNs) in PyTorch
RNNs are the foundation of sequence modeling. They process sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
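Concretely, at each time step t an RNN combines the current input x_t with the previous hidden state h_{t-1}, typically as h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh). The sketch below unrolls this recurrence by hand with nn.RNNCell (dimensions are arbitrary) to make the hidden-state update explicit; nn.RNN, used in the next section, applies the same step over a whole sequence at once.
import torch
import torch.nn as nn
# Unrolling the RNN recurrence by hand with RNNCell (illustrative dimensions)
input_size, hidden_size, seq_len = 4, 6, 3
cell = nn.RNNCell(input_size, hidden_size)
x = torch.randn(seq_len, input_size)   # one sequence, no batch dimension here
h = torch.zeros(1, hidden_size)        # initial hidden state
for t in range(seq_len):
    # h_t = tanh(W_ih @ x_t + b_ih + W_hh @ h_{t-1} + b_hh)
    h = cell(x[t].unsqueeze(0), h)
    print(f"step {t}: hidden state shape {h.shape}")  # torch.Size([1, 6])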
Simple RNN Implementation
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        # RNN layer
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden state with zeros
        batch_size = x.size(0)
        h0 = torch.zeros(1, batch_size, self.hidden_size).to(x.device)
        # Forward propagate RNN
        out, _ = self.rnn(x, h0)
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out
# Example usage
input_size = 10 # Size of each input vector
hidden_size = 20 # Size of hidden state
output_size = 2 # Size of output (e.g., for classification)
# Create model instance
model = SimpleRNN(input_size, hidden_size, output_size)
# Create a random batch of sequences: (batch_size, sequence_length, input_size)
batch_size = 3
sequence_length = 5
x = torch.randn(batch_size, sequence_length, input_size)
# Forward pass
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
Output:
Input shape: torch.Size([3, 5, 10])
Output shape: torch.Size([3, 2])
Long Short-Term Memory Networks (LSTMs)
LSTMs are a special kind of RNN designed to address the vanishing gradient problem, making them better at capturing long-term dependencies.
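In addition to the hidden state, an LSTM maintains a cell state that acts as long-term memory, and nn.LSTM therefore returns both. The short sketch below (with arbitrary dimensions) shows the shapes of the three tensors it produces, which the model class that follows relies on:
import torch
import torch.nn as nn
# Standalone nn.LSTM call to inspect its return values (illustrative dimensions)
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(2, 5, 8)                 # [batch_size, seq_len, input_size]
output, (hidden, cell) = lstm(x)
print(output.shape)   # torch.Size([2, 5, 16]) - hidden state at every time step
print(hidden.shape)   # torch.Size([1, 2, 16]) - final hidden state per layer
print(cell.shape)     # torch.Size([1, 2, 16]) - final cell state per layer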
LSTM Implementation
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers=1, dropout=0.5):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            dropout=dropout if n_layers > 1 else 0,
                            batch_first=True)
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, seq_len]
        # Embedding: [batch_size, seq_len, embedding_dim]
        embedded = self.embedding(text)
        # LSTM output shape: [batch_size, seq_len, hidden_dim]
        output, (hidden, cell) = self.lstm(embedded)
        # We'll use the final hidden state of the last layer
        hidden = self.dropout(hidden[-1])
        # Dense layer for prediction
        prediction = self.fc(hidden)
        return prediction
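As a quick sanity check, the model can be run on dummy data; the sizes below are arbitrary and only illustrate the expected shapes:
# Sanity check with dummy data (arbitrary sizes)
demo_model = LSTMModel(vocab_size=100, embedding_dim=32, hidden_dim=64, output_dim=1)
dummy_batch = torch.randint(0, 100, (4, 12))   # [batch_size=4, seq_len=12] of token indices
print(demo_model(dummy_batch).shape)           # torch.Size([4, 1])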
Practical Example: Sentiment Analysis with LSTM
Let's implement a sentiment classifier using LSTM on a simple dataset:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
# Sample data - in real applications, you'd use datasets like IMDB, SST, etc.
texts = [
"i love this movie it was amazing",
"great film with excellent actors",
"enjoyed watching this wonderful movie",
"terrible waste of time",
"i hated everything about this film",
"worst movie ever absolutely disappointing"
]
labels = [1, 1, 1, 0, 0, 0] # 1: positive, 0: negative
# Create simple vocabulary
all_words = ' '.join(texts).split()
vocab = {word: idx+1 for idx, word in enumerate(sorted(set(all_words)))}
vocab['<PAD>'] = 0 # Add padding token
vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")
# Convert texts to sequences
max_length = 10 # Define maximum sequence length
def text_to_sequence(text, vocab, max_length):
    tokens = text.split()
    # Map unknown words to the padding index (0)
    sequence = [vocab.get(token, 0) for token in tokens]
    # Truncate or pad as needed
    if len(sequence) > max_length:
        sequence = sequence[:max_length]
    else:
        sequence = sequence + [0] * (max_length - len(sequence))
    return sequence
# Create dataset
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_length):
        self.texts = [text_to_sequence(text, vocab, max_length) for text in texts]
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.tensor(self.texts[idx]), torch.tensor(self.labels[idx], dtype=torch.float32)
# Create model
embedding_dim = 50
hidden_dim = 64
output_dim = 1 # Binary classification
model = LSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim)
# Prepare data
dataset = SentimentDataset(texts, labels, vocab, max_length)
dataloader = DataLoader(dataset, batch_size=2)
# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    total_loss = 0
    for sequences, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(sequences).squeeze(1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader):.4f}")
# Test on sample sentences
test_sentences = [
"this movie was wonderful",
"absolutely terrible film"
]
model.eval()
with torch.no_grad():
    for sentence in test_sentences:
        sequence = torch.tensor([text_to_sequence(sentence, vocab, max_length)])
        prediction = torch.sigmoid(model(sequence).squeeze(1))
        sentiment = "positive" if prediction.item() > 0.5 else "negative"
        conf = prediction.item() if sentiment == "positive" else 1 - prediction.item()
        print(f"'{sentence}' → {sentiment} (confidence: {conf:.4f})")
Output:
Vocabulary size: 27
Epoch 10, Loss: 0.6593
Epoch 20, Loss: 0.5684
Epoch 30, Loss: 0.4071
Epoch 40, Loss: 0.2986
Epoch 50, Loss: 0.2258
Epoch 60, Loss: 0.1701
Epoch 70, Loss: 0.1309
Epoch 80, Loss: 0.1039
Epoch 90, Loss: 0.0844
Epoch 100, Loss: 0.0703
'this movie was wonderful' → positive (confidence: 0.8573)
'absolutely terrible film' → negative (confidence: 0.9134)
Bidirectional LSTM
For many NLP tasks, information from both past and future words is useful. Bidirectional LSTMs process the sequence in both directions.
class BiLSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers=1, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=True,  # This makes it bidirectional
                            dropout=dropout if n_layers > 1 else 0,
                            batch_first=True)
        # Note: the input to the dense layer is doubled because the LSTM is bidirectional
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.embedding(text)
        # output shape: [batch_size, seq_len, hidden_dim * 2]
        # hidden contains both the forward and the backward final hidden states
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
# Example usage:
# model = BiLSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim)
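A quick shape check makes the doubling visible; the sizes below are arbitrary and only illustrative:
# Shape check for the bidirectional model (arbitrary sizes)
bi_model = BiLSTMModel(vocab_size=100, embedding_dim=32, hidden_dim=64, output_dim=1)
dummy_batch = torch.randint(0, 100, (4, 12))       # [batch_size=4, seq_len=12]
print(bi_model(dummy_batch).shape)                 # torch.Size([4, 1])
# The per-token features produced by the LSTM itself are doubled:
output, _ = bi_model.lstm(bi_model.embedding(dummy_batch))
print(output.shape)                                # torch.Size([4, 12, 128]) - hidden_dim * 2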
Introduction to Transformers for Sequence Modeling
While RNNs and LSTMs have been foundational for sequence modeling, Transformers have revolutionized NLP in recent years. Let's briefly introduce them:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create a tensor of shape [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        # Positions: tensor of shape [max_len, 1]
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Apply sine to even dimensions and cosine to odd dimensions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension: [1, max_len, d_model]
        pe = pe.unsqueeze(0)
        # Register as a buffer (not a parameter, but saved with the model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encoding to the input embeddings
        # x shape: [batch_size, seq_len, d_model]
        return x + self.pe[:, :x.size(1)]
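# Quick shape check: positional encodings are simply added to the embeddings,
# so the tensor shape is unchanged (sizes here are arbitrary).
# pos = PositionalEncoding(d_model=16)
# x = torch.zeros(2, 5, 16)       # [batch_size, seq_len, d_model]
# print(pos(x).shape)             # torch.Size([2, 5, 16])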
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, dim_feedforward, output_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        # A TransformerEncoder stacks multiple TransformerEncoderLayers
        encoder_layers = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        self.fc = nn.Linear(d_model, output_dim)
        self.d_model = d_model

    def forward(self, src, src_mask=None):
        # src shape: [batch_size, seq_len]
        # Embedding (scaled as in the original Transformer paper): [batch_size, seq_len, d_model]
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        # Encoder output: [batch_size, seq_len, d_model]
        output = self.transformer_encoder(src, src_mask)
        # Use the representation of a [CLS] token (for a BERT-like approach)
        # or pool over the sequence length; here we use mean pooling for simplicity
        output = output.mean(dim=1)  # [batch_size, d_model]
        return self.fc(output)
# To actually use this model:
# vocab_size = 10000
# d_model = 512 # Embedding dimension
# nhead = 8 # Number of attention heads
# num_layers = 2 # Number of transformer layers
# dim_feedforward = 2048 # Hidden dimension of feedforward network
# output_dim = 2 # Number of output classes
#
# model = SimpleTransformer(vocab_size, d_model, nhead, num_layers, dim_feedforward, output_dim)
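If you want to verify the model end to end, a small forward pass on random token indices is enough; the sizes below are deliberately tiny and arbitrary:
# Tiny end-to-end forward pass (arbitrary sizes, just to check shapes)
tiny_model = SimpleTransformer(vocab_size=1000, d_model=32, nhead=4,
                               num_layers=2, dim_feedforward=64, output_dim=2)
dummy_tokens = torch.randint(0, 1000, (4, 10))   # [batch_size=4, seq_len=10]
print(tiny_model(dummy_tokens).shape)            # torch.Size([4, 2])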
Note: The transformer example above is just an introduction. In practice, it is more common to use pre-trained models such as BERT, GPT, or RoBERTa via libraries like Hugging Face's Transformers, since training Transformers from scratch requires large datasets and substantial compute.
Using Pre-trained Models for Sequence Modeling
For real-world applications, you can leverage pre-trained models via the Hugging Face Transformers library:
# !pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example sentences
sentences = [
"I've been waiting for this movie for years and it didn't disappoint!",
"The storyline was confusing and the characters were poorly developed."
]
# Tokenize and prepare input
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Show results (for this model, index 0 = negative, index 1 = positive)
for sent, pred in zip(sentences, predictions):
    sentiment = "Positive" if pred[1] > pred[0] else "Negative"
    confidence = pred[1] if sentiment == "Positive" else pred[0]
    print(f"Sentence: {sent}")
    print(f"Sentiment: {sentiment} (confidence: {confidence.item():.4f})")
    print("-" * 50)
Summary
In this tutorial, we've covered the fundamentals of sequence modeling in PyTorch for NLP tasks:
- Text Representation: Converting text to numerical tensors
- Recurrent Neural Networks (RNNs): The basic building blocks for sequence modeling
- Long Short-Term Memory (LSTM): More advanced RNNs that handle long-term dependencies
- Bidirectional LSTMs: Processing sequences in both directions
- Transformers: The modern architecture revolutionizing NLP
- Pre-trained Models: Leveraging existing models for quick development
Sequence modeling is a vast field with applications ranging from sentiment analysis and machine translation to question answering and text generation. The architectures we discussed form the backbone of these applications.
Further Exercises
- Implement a character-level language model using LSTMs
- Create a sequence-to-sequence model for a task like machine translation
- Fine-tune a pre-trained transformer model like BERT on a custom dataset
- Implement an attention mechanism from scratch and integrate it with an LSTM
- Build a named entity recognition system using a BiLSTM with a CRF layer
Additional Resources
- PyTorch documentation on RNNs
- The Illustrated Transformer
- Hugging Face's Transformers Library
- Deep Learning for NLP with PyTorch
- Sequence Models Coursera Course
Happy sequence modeling with PyTorch!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)