PyTorch Sequence Modeling
Introduction
Sequence modeling is an essential part of Natural Language Processing (NLP): it deals with sequential data such as text, speech, or time series. In NLP, the words in a sentence follow a sequential order that carries semantic meaning. PyTorch provides powerful tools for building and training sequence models that capture this sequential nature of language data.
In this tutorial, we'll explore how to implement sequence models using PyTorch for NLP tasks. We'll start with the basics of sequence representation and gradually move to more complex architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and get a glimpse of modern Transformer-based approaches.
Prerequisites
Before diving in, you should have:
- Basic understanding of PyTorch tensors and neural networks
- Familiarity with Python programming
- Basic knowledge of NLP concepts
Let's start our journey into sequence modeling with PyTorch!
Representing Sequences in PyTorch
Before we can build models, we need to represent text data as numerical tensors that PyTorch can process.
Text Preprocessing
import torch
import torch.nn as nn
import numpy as np
# Sample text data
text = "PyTorch makes sequence modeling easy and efficient"
# Simple preprocessing: lowercase and tokenize
tokens = text.lower().split()
print(f"Tokens: {tokens}")
# Create a vocabulary (mapping from words to indices)
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")
Output:
Tokens: ['pytorch', 'makes', 'sequence', 'modeling', 'easy', 'and', 'efficient']
Vocabulary: {'and': 0, 'easy': 1, 'efficient': 2, 'makes': 3, 'modeling': 4, 'pytorch': 5, 'sequence': 6}
Vocabulary size: 7
Converting Text to Tensors
Now let's convert our tokens into numerical tensors:
# Convert tokens to tensor of indices
indices = [vocab[token] for token in tokens]
tensor_sequence = torch.tensor(indices, dtype=torch.long)
print(f"Tensor representation: {tensor_sequence}")
# One-hot encoding
def one_hot_encode(tensor, vocab_size):
    one_hot = torch.zeros(tensor.size(0), vocab_size)
    one_hot.scatter_(1, tensor.unsqueeze(1), 1)
    return one_hot
one_hot_sequence = one_hot_encode(tensor_sequence, vocab_size)
print(f"One-hot encoded shape: {one_hot_sequence.shape}")
print(f"First token one-hot: {one_hot_sequence[0]}")
Output:
Tensor representation: tensor([5, 3, 6, 4, 1, 0, 2])
One-hot encoded shape: torch.Size([7, 7])
First token one-hot: tensor([0., 0., 0., 0., 0., 1., 0.])
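One-hot vectors grow with the vocabulary and carry no notion of similarity between words, so in practice models usually learn dense embeddings instead (as the LSTM examples later in this tutorial do). Here is a minimal sketch using nn.Embedding; the embedding dimension of 4 is an arbitrary choice for illustration:
# Dense embeddings: a compact alternative to one-hot vectors (embedding_dim=4 is arbitrary)
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)
# Reusing tensor_sequence from above: tensor([5, 3, 6, 4, 1, 0, 2])
embedded = embedding(tensor_sequence)
print(f"Embedded shape: {embedded.shape}")  # torch.Size([7, 4]) - one dense vector per token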
Recurrent Neural Networks (RNNs) in PyTorch
RNNs are the foundation of sequence modeling. They process sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
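Concretely, at each time step t an RNN combines the current input x_t with the previous hidden state h_{t-1}, typically as h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh). The sketch below unrolls this recurrence by hand with nn.RNNCell (dimensions are arbitrary) to make the hidden-state update explicit; nn.RNN, used in the next section, applies the same step over a whole sequence at once.
import torch
import torch.nn as nn
# Unrolling the RNN recurrence by hand with RNNCell (illustrative dimensions)
input_size, hidden_size, seq_len = 4, 6, 3
cell = nn.RNNCell(input_size, hidden_size)
x = torch.randn(seq_len, input_size)   # one sequence, no batch dimension here
h = torch.zeros(1, hidden_size)        # initial hidden state
for t in range(seq_len):
    # h_t = tanh(W_ih @ x_t + b_ih + W_hh @ h_{t-1} + b_hh)
    h = cell(x[t].unsqueeze(0), h)
    print(f"step {t}: hidden state shape {h.shape}")  # torch.Size([1, 6])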
Simple RNN Implementation
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        # RNN layer
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden state with zeros
        batch_size = x.size(0)
        h0 = torch.zeros(1, batch_size, self.hidden_size).to(x.device)
        # Forward propagate RNN
        out, _ = self.rnn(x, h0)
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out
# Example usage
input_size = 10 # Size of each input vector
hidden_size = 20 # Size of hidden state
output_size = 2 # Size of output (e.g., for classification)
# Create model instance
model = SimpleRNN(input_size, hidden_size, output_size)
# Create a random batch of sequences: (batch_size, sequence_length, input_size)
batch_size = 3
sequence_length = 5
x = torch.randn(batch_size, sequence_length, input_size)
# Forward pass
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
Output:
Input shape: torch.Size([3, 5, 10])
Output shape: torch.Size([3, 2])
Long Short-Term Memory Networks (LSTMs)
LSTMs are a special kind of RNN designed to address the vanishing gradient problem, making them better at capturing long-term dependencies.
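In addition to the hidden state, an LSTM maintains a cell state that acts as long-term memory, and nn.LSTM therefore returns both. The short sketch below (with arbitrary dimensions) shows the shapes of the three tensors it produces, which the model class that follows relies on:
import torch
import torch.nn as nn
# Standalone nn.LSTM call to inspect its return values (illustrative dimensions)
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(2, 5, 8)                 # [batch_size, seq_len, input_size]
output, (hidden, cell) = lstm(x)
print(output.shape)   # torch.Size([2, 5, 16]) - hidden state at every time step
print(hidden.shape)   # torch.Size([1, 2, 16]) - final hidden state per layer
print(cell.shape)     # torch.Size([1, 2, 16]) - final cell state per layer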
LSTM Implementation
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers=1, dropout=0.5):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            dropout=dropout if n_layers > 1 else 0,
                            batch_first=True)
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, seq_len]
        # Embedding: [batch_size, seq_len, embedding_dim]
        embedded = self.embedding(text)
        # LSTM output shape: [batch_size, seq_len, hidden_dim]
        output, (hidden, cell) = self.lstm(embedded)
        # We'll use the final hidden state of the last layer
        hidden = self.dropout(hidden[-1])
        # Dense layer for prediction
        prediction = self.fc(hidden)
        return prediction
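As a quick sanity check, the model can be run on dummy data; the sizes below are arbitrary and only illustrate the expected shapes:
# Sanity check with dummy data (arbitrary sizes)
demo_model = LSTMModel(vocab_size=100, embedding_dim=32, hidden_dim=64, output_dim=1)
dummy_batch = torch.randint(0, 100, (4, 12))   # [batch_size=4, seq_len=12] of token indices
print(demo_model(dummy_batch).shape)           # torch.Size([4, 1])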
Practical Example: Sentiment Analysis with LSTM
Let's implement a sentiment classifier using LSTM on a simple dataset:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
# Sample data - in real applications, you'd use datasets like IMDB, SST, etc.
texts = [
"i love this movie it was amazing",
"great film with excellent actors",
"enjoyed watching this wonderful movie",
"terrible waste of time",
"i hated everything about this film",
"worst movie ever absolutely disappointing"
]
labels = [1, 1, 1, 0, 0, 0] # 1: positive, 0: negative
# Create simple vocabulary
all_words = ' '.join(texts).split()
vocab = {word: idx+1 for idx, word in enumerate(sorted(set(all_words)))}
vocab['<PAD>'] = 0 # Add padding token
vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")
# Convert texts to sequences
max_length = 10 # Define maximum sequence length
def text_to_sequence(text, vocab, max_length):
    tokens = text.split()
    # Map unknown words to the padding index (0)
    sequence = [vocab.get(token, 0) for token in tokens]
    # Truncate or pad as needed
    if len(sequence) > max_length:
        sequence = sequence[:max_length]
    else:
        sequence = sequence + [0] * (max_length - len(sequence))
    return sequence
# Create dataset
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_length):
        self.texts = [text_to_sequence(text, vocab, max_length) for text in texts]
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.tensor(self.texts[idx]), torch.tensor(self.labels[idx], dtype=torch.float32)
# Create model
embedding_dim = 50
hidden_dim = 64
output_dim = 1 # Binary classification
model = LSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim)
# Prepare data
dataset = SentimentDataset(texts, labels, vocab, max_length)
dataloader = DataLoader(dataset, batch_size=2)
# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    total_loss = 0
    for sequences, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(sequences).squeeze(1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader):.4f}")
# Test on sample sentences
test_sentences = [
"this movie was wonderful",
"absolutely terrible film"
]
model.eval()
with torch.no_grad():
    for sentence in test_sentences:
        sequence = torch.tensor([text_to_sequence(sentence, vocab, max_length)])
        prediction = torch.sigmoid(model(sequence).squeeze(1))
        sentiment = "positive" if prediction.item() > 0.5 else "negative"
        conf = prediction.item() if sentiment == "positive" else 1 - prediction.item()
        print(f"'{sentence}' → {sentiment} (confidence: {conf:.4f})")
Output:
Vocabulary size: 27
Epoch 10, Loss: 0.6593
Epoch 20, Loss: 0.5684
Epoch 30, Loss: 0.4071
Epoch 40, Loss: 0.2986
Epoch 50, Loss: 0.2258
Epoch 60, Loss: 0.1701
Epoch 70, Loss: 0.1309
Epoch 80, Loss: 0.1039
Epoch 90, Loss: 0.0844
Epoch 100, Loss: 0.0703
'this movie was wonderful' → positive (confidence: 0.8573)
'absolutely terrible film' → negative (confidence: 0.9134)
Bidirectional LSTM
For many NLP tasks, information from both past and future words is useful. Bidirectional LSTMs process the sequence in both directions.
class BiLSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers=1, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=True,  # This makes it bidirectional
                            dropout=dropout if n_layers > 1 else 0,
                            batch_first=True)
        # Note: the input to the dense layer is doubled because the LSTM is bidirectional
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.embedding(text)
        # output shape: [batch_size, seq_len, hidden_dim * 2]
        # hidden contains both the forward and the backward final hidden states
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
# Example usage:
# model = BiLSTMModel(vocab_size, embedding_dim, hidden_dim, output_dim)
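A quick shape check makes the doubling visible; the sizes below are arbitrary and only illustrative:
# Shape check for the bidirectional model (arbitrary sizes)
bi_model = BiLSTMModel(vocab_size=100, embedding_dim=32, hidden_dim=64, output_dim=1)
dummy_batch = torch.randint(0, 100, (4, 12))       # [batch_size=4, seq_len=12]
print(bi_model(dummy_batch).shape)                 # torch.Size([4, 1])
# The per-token features produced by the LSTM itself are doubled:
output, _ = bi_model.lstm(bi_model.embedding(dummy_batch))
print(output.shape)                                # torch.Size([4, 12, 128]) - hidden_dim * 2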
Introduction to Transformers for Sequence Modeling
While RNNs and LSTMs have been foundational for sequence modeling, Transformers have revolutionized NLP in recent years. Let's briefly introduce them:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create a tensor of shape [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        # Positions: tensor of shape [max_len, 1]
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Apply sine to even dimensions and cosine to odd dimensions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension: [1, max_len, d_model]
        pe = pe.unsqueeze(0)
        # Register as a buffer (not a parameter, but saved with the model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encoding to the input embeddings
        # x shape: [batch_size, seq_len, d_model]
        return x + self.pe[:, :x.size(1)]
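# Quick shape check: positional encodings are simply added to the embeddings,
# so the tensor shape is unchanged (sizes here are arbitrary).
# pos = PositionalEncoding(d_model=16)
# x = torch.zeros(2, 5, 16)       # [batch_size, seq_len, d_model]
# print(pos(x).shape)             # torch.Size([2, 5, 16])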
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, dim_feedforward, output_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        # A TransformerEncoder stacks multiple TransformerEncoderLayers
        encoder_layers = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        self.fc = nn.Linear(d_model, output_dim)
        self.d_model = d_model

    def forward(self, src, src_mask=None):
        # src shape: [batch_size, seq_len]
        # Embedding (scaled as in the original Transformer paper): [batch_size, seq_len, d_model]
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        # Encoder output: [batch_size, seq_len, d_model]
        output = self.transformer_encoder(src, src_mask)
        # Use the representation of a [CLS] token (for a BERT-like approach)
        # or pool over the sequence length; here we use mean pooling for simplicity
        output = output.mean(dim=1)  # [batch_size, d_model]
        return self.fc(output)
# To actually use this model:
# vocab_size = 10000
# d_model = 512 # Embedding dimension
# nhead = 8 # Number of attention heads
# num_layers = 2 # Number of transformer layers
# dim_feedforward = 2048 # Hidden dimension of feedforward network
# output_dim = 2 # Number of output classes
#
# model = SimpleTransformer(vocab_size, d_model, nhead, num_layers, dim_feedforward, output_dim)
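If you want to verify the model end to end, a small forward pass on random token indices is enough; the sizes below are deliberately tiny and arbitrary:
# Tiny end-to-end forward pass (arbitrary sizes, just to check shapes)
tiny_model = SimpleTransformer(vocab_size=1000, d_model=32, nhead=4,
                               num_layers=2, dim_feedforward=64, output_dim=2)
dummy_tokens = torch.randint(0, 1000, (4, 10))   # [batch_size=4, seq_len=10]
print(tiny_model(dummy_tokens).shape)            # torch.Size([4, 2])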
Note: The transformer example above is just an introduction. In practice, it is more common to use pre-trained models such as BERT, GPT, or RoBERTa via libraries like Hugging Face's Transformers, since training Transformers from scratch requires large datasets and substantial compute.
Using Pre-trained Models for Sequence Modeling
For real-world applications, you can leverage pre-trained models via the Hugging Face Transformers library:
# !pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example sentences
sentences = [
"I've been waiting for this movie for years and it didn't disappoint!",
"The storyline was confusing and the characters were poorly developed."
]
# Tokenize and prepare input
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Show results (for this model, index 0 = negative, index 1 = positive)
for sent, pred in zip(sentences, predictions):
    sentiment = "Positive" if pred[1] > pred[0] else "Negative"
    confidence = pred[1] if sentiment == "Positive" else pred[0]
    print(f"Sentence: {sent}")
    print(f"Sentiment: {sentiment} (confidence: {confidence.item():.4f})")
    print("-" * 50)
Summary
In this tutorial, we've covered the fundamentals of sequence modeling in PyTorch for NLP tasks:
- Text Representation: Converting text to numerical tensors
- Recurrent Neural Networks (RNNs): The basic building blocks for sequence modeling
- Long Short-Term Memory (LSTM): More advanced RNNs that handle long-term dependencies
- Bidirectional LSTMs: Processing sequences in both directions
- Transformers: The modern architecture revolutionizing NLP
- Pre-trained Models: Leveraging existing models for quick development
Sequence modeling is a vast field with applications ranging from sentiment analysis and machine translation to question answering and text generation. The architectures we discussed form the backbone of these applications.
Further Exercises
- Implement a character-level language model using LSTMs
- Create a sequence-to-sequence model for a task like machine translation
- Fine-tune a pre-trained transformer model like BERT on a custom dataset
- Implement an attention mechanism from scratch and integrate it with an LSTM
- Build a named entity recognition system using a BiLSTM with a CRF layer
Additional Resources
- PyTorch documentation on RNNs
- The Illustrated Transformer
- Hugging Face's Transformers Library
- Deep Learning for NLP with PyTorch
- Sequence Models Coursera Course
Happy sequence modeling with PyTorch!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)