PyTorch LSTM Models

In natural language processing (NLP), handling sequential data efficiently is crucial. Long Short-Term Memory (LSTM) networks are specialized recurrent neural networks designed to address the challenges of learning long-term dependencies in sequence data. In this tutorial, we'll explore how to implement and use LSTM models in PyTorch for NLP tasks.

Introduction to LSTMs

LSTMs were introduced to solve the "vanishing gradient problem" that standard recurrent neural networks (RNNs) face when processing long sequences. They accomplish this through a sophisticated gating mechanism that controls information flow.

Key Components of an LSTM Cell

An LSTM cell contains three gates:

  • Input gate: Decides what new information to store
  • Forget gate: Decides what information to discard
  • Output gate: Decides what to output based on the cell state

These gates enable LSTMs to remember information for long periods and selectively update or forget information as needed.
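
To make the gating concrete, here is a minimal sketch (with arbitrary dimensions) that steps an nn.LSTMCell through a sequence one time step at a time. The gate computations happen inside the cell; at each step they update the hidden state h (the output) and the cell state c (the long-term memory):

python
import torch
import torch.nn as nn

# One LSTM cell, applied manually at each time step
cell = nn.LSTMCell(input_size=10, hidden_size=20)

x = torch.randn(5, 3, 10)   # [batch_size, seq_length, input_size]
h = torch.zeros(5, 20)      # initial hidden state
c = torch.zeros(5, 20)      # initial cell state

for t in range(x.size(1)):
    # The input, forget, and output gates fire inside the cell
    h, c = cell(x[:, t, :], (h, c))

print(h.shape)  # torch.Size([5, 20])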

Setting Up Your Environment

Before we dive into implementation, let's make sure we have the necessary libraries installed:

bash
pip install torch torchtext numpy matplotlib

Let's import the libraries we'll need:

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

Building a Basic LSTM Model in PyTorch

Let's start by creating a simple LSTM model for sequence classification:

python
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()

        # Embedding layer to convert tokens to vectors
        self.embedding = nn.Embedding(input_dim, embedding_dim)

        # LSTM layer
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            bidirectional=False,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )

        # Fully connected layer for classification
        self.fc = nn.Linear(hidden_dim, output_dim)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, seq_length]

        # Apply embedding layer
        embedded = self.embedding(text)
        # embedded shape: [batch_size, seq_length, embedding_dim]

        # Apply LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        # hidden shape: [num_layers * directions, batch_size, hidden_dim]

        # Use the final hidden state
        hidden = self.dropout(hidden[-1, :, :])
        # hidden shape: [batch_size, hidden_dim]

        # Apply linear layer
        return self.fc(hidden)

Understanding the LSTM Parameters

Let's break down the important parameters of the nn.LSTM module:

  • input_size: Size of the input features (embedding_dim in our case)
  • hidden_size: Number of features in the hidden state
  • num_layers: Number of stacked LSTM layers
  • batch_first: If True, input shape should be [batch_size, seq_len, input_size]
  • bidirectional: If True, creates a bidirectional LSTM
  • dropout: Dropout rate between LSTM layers (only applies if num_layers > 1)
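
As a quick illustration of how these parameters shape the tensors, the following sketch (dimensions chosen arbitrarily) pushes a dummy batch through an nn.LSTM:

python
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)

x = torch.randn(32, 50, 100)         # [batch_size, seq_len, input_size]
outputs, (hidden, cell) = lstm(x)

print(outputs.shape)  # torch.Size([32, 50, 256]) - every time step, last layer
print(hidden.shape)   # torch.Size([2, 32, 256])  - final time step, every layer
print(cell.shape)     # torch.Size([2, 32, 256])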

Example: Sentiment Analysis with LSTMs

Let's implement a sentiment analysis model using LSTM. We'll use a simplified approach with dummy data for demonstration:

python
# Define hyperparameters
VOCAB_SIZE = 10000      # Size of vocabulary
EMBEDDING_DIM = 100     # Embedding dimension
HIDDEN_DIM = 256        # Hidden dimension for LSTM
OUTPUT_DIM = 2          # Binary classification (positive/negative)
NUM_LAYERS = 2          # Number of LSTM layers
DROPOUT = 0.5           # Dropout rate
BATCH_SIZE = 64         # Batch size
LEARNING_RATE = 0.001   # Learning rate

# Create model instance
model = LSTMClassifier(
    input_dim=VOCAB_SIZE,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=OUTPUT_DIM,
    num_layers=NUM_LAYERS,
    dropout=DROPOUT
)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Print model architecture
print(model)

Output:

LSTMClassifier(
  (embedding): Embedding(10000, 100)
  (lstm): LSTM(100, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
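
Before training, it's worth sanity-checking the model with a dummy batch of random token indices (a quick sketch; the sequence length of 50 is arbitrary):

python
dummy_batch = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, 50))  # [batch_size, seq_len]
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([64, 2]) - one score per class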

Training the LSTM Model

Let's create a training function:

python
def train(model, dataloader, optimizer, criterion, device):
    model.train()  # Set model to training mode
    epoch_loss = 0
    epoch_acc = 0

    for batch in dataloader:
        # Get text and labels
        text, labels = batch

        # Move to device
        text = text.to(device)
        labels = labels.to(device)

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass
        predictions = model(text)

        # Calculate loss
        loss = criterion(predictions, labels)

        # Backpropagation
        loss.backward()

        # Update parameters
        optimizer.step()

        # Calculate accuracy
        _, predicted = torch.max(predictions, 1)
        correct = (predicted == labels).float().sum()
        accuracy = correct / len(labels)

        # Update running metrics
        epoch_loss += loss.item()
        epoch_acc += accuracy.item()

    # Return average metrics
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)
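
A matching evaluation function is useful for validation data; this is a sketch that follows the same pattern, with dropout disabled and gradient tracking turned off (we'll use it again when discussing learning rate scheduling):

python
def evaluate(model, dataloader, criterion, device):
    model.eval()  # Set model to evaluation mode (disables dropout)
    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():  # No gradients needed for evaluation
        for text, labels in dataloader:
            text = text.to(device)
            labels = labels.to(device)

            predictions = model(text)
            loss = criterion(predictions, labels)

            _, predicted = torch.max(predictions, 1)
            accuracy = (predicted == labels).float().mean()

            epoch_loss += loss.item()
            epoch_acc += accuracy.item()

    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)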

Creating a Dummy Dataset for Testing

Let's create a simple dataset to test our model:

python
# Create dummy data
def create_dummy_data(num_samples=1000, max_length=100):
    X = torch.randint(0, VOCAB_SIZE, (num_samples, max_length))  # Random token indices
    y = torch.randint(0, OUTPUT_DIM, (num_samples,))             # Random labels
    return X, y

# Create dummy datasets
X_train, y_train = create_dummy_data(num_samples=800)
X_val, y_val = create_dummy_data(num_samples=200)

# Create DataLoader
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
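
A quick check confirms the dataloader yields batches of the expected shape:

python
text_batch, label_batch = next(iter(train_dataloader))
print(text_batch.shape)   # torch.Size([64, 100])
print(label_batch.shape)  # torch.Size([64])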

Training Loop

Now let's run the training:

python
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Training loop
epochs = 5
train_losses = []
train_accs = []

for epoch in range(epochs):
    # Train model
    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion, device)

    # Store metrics
    train_losses.append(train_loss)
    train_accs.append(train_acc)

    print(f"Epoch {epoch+1}/{epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")

Example Output:

Epoch 1/5
Train Loss: 0.6932, Train Acc: 0.4992
Epoch 2/5
Train Loss: 0.6931, Train Acc: 0.5023
Epoch 3/5
Train Loss: 0.6929, Train Acc: 0.5063
Epoch 4/5
Train Loss: 0.6926, Train Acc: 0.5102
Epoch 5/5
Train Loss: 0.6921, Train Acc: 0.5148

Note: Since we're working with random data, you won't see much improvement in accuracy.

Visualizing Results

Let's visualize the training performance:

python
plt.figure(figsize=(12, 4))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(train_losses)
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')

# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(train_accs)
plt.title('Training Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')

plt.tight_layout()
plt.show()

Practical Example: Text Classification with a Real Dataset

For a more realistic example, let's use a small subset of the IMDb movie review dataset:

python
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Load the dataset
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

# Create the tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>'])

# Text pipeline: tokenize and convert to integers
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# Label pipeline: convert to integer
# Note: depending on the torchtext version, IMDB may yield integer labels
# (1 = negative, 2 = positive) rather than 'pos'/'neg' strings; in that case
# use label_pipeline = lambda x: x - 1 instead
label_pipeline = lambda x: 1 if x == 'pos' else 0

# Collate function for DataLoader
def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)

    # Pad sequences to the longest sequence in the batch
    text_list = pad_sequence(text_list, batch_first=True, padding_value=vocab['<pad>'])
    label_list = torch.tensor(label_list, dtype=torch.int64)
    return text_list, label_list

# Re-create the iterators (they were exhausted while building the vocabulary)
train_iter, test_iter = IMDB(split='train'), IMDB(split='test')
train_dataloader = DataLoader(list(train_iter)[:1000], batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(list(test_iter)[:200], batch_size=BATCH_SIZE,
                             collate_fn=collate_batch)

# Create model
model = LSTMClassifier(
    input_dim=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    output_dim=2,  # positive or negative
    num_layers=NUM_LAYERS,
    dropout=DROPOUT
)
model = model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

Note: The above code is for illustrative purposes. In a real implementation, you would need to handle the dataset properly and adjust model parameters.
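
With these pieces in place, training reuses the same train() function from earlier (a sketch; expect slow epochs on CPU, since real reviews are much longer than the dummy sequences):

python
for epoch in range(epochs):
    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion, device)
    print(f"Epoch {epoch+1}/{epochs} - Loss: {train_loss:.4f}, Acc: {train_acc:.4f}")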

Advanced LSTM Techniques

Bidirectional LSTM

Bidirectional LSTMs process sequences in both forward and backward directions, capturing more context:

python
class BidirectionalLSTM(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            bidirectional=True,  # Bidirectional!
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )

        # Note: hidden_dim * 2 because bidirectional
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.embedding(text)

        outputs, (hidden, cell) = self.lstm(embedded)

        # Concatenate the final forward and backward hidden states
        hidden = torch.cat([hidden[-2, :, :], hidden[-1, :, :]], dim=1)
        hidden = self.dropout(hidden)

        return self.fc(hidden)
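
For a bidirectional LSTM, hidden has shape [num_layers * 2, batch_size, hidden_dim], with the forward and backward states of each layer stacked along the first dimension. The last two entries, hidden[-2] and hidden[-1], are the top layer's final forward and backward states, which is why the forward pass concatenates exactly those two before the linear layer.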

LSTM with Attention Mechanism

Adding attention helps the model focus on important parts of the sequence:

python
class LSTMWithAttention(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )

        self.attention = nn.Linear(hidden_dim * 2, 1)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, seq_length]
        embedded = self.embedding(text)
        # embedded shape: [batch_size, seq_length, embedding_dim]

        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs shape: [batch_size, seq_length, hidden_dim * 2]

        # Calculate attention weights
        attention = self.attention(outputs)
        # attention shape: [batch_size, seq_length, 1]

        # Apply softmax to get normalized weights
        attention_weights = torch.softmax(attention.squeeze(-1), dim=1)
        # attention_weights shape: [batch_size, seq_length]

        # Restore the trailing dimension for broadcasting
        attention_weights = attention_weights.unsqueeze(-1)
        # attention_weights shape: [batch_size, seq_length, 1]

        # Weighted sum of LSTM outputs
        context = torch.sum(outputs * attention_weights, dim=1)
        # context shape: [batch_size, hidden_dim * 2]

        # Apply dropout and linear layer
        context = self.dropout(context)
        return self.fc(context)
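
This is a simple learned scoring scheme: a single linear layer assigns one score per time step, softmax turns the scores into weights that sum to 1, and the context vector is a weighted average of all LSTM outputs rather than just the final hidden state. Returning attention_weights from forward alongside the logits is a handy extension if you want to inspect which tokens the model attends to (see Exercise 5 below).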

Common Issues and Optimization Tips

Gradient Clipping

To prevent exploding gradients (a common issue in RNNs):

python
# In your training loop
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
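
Clipping belongs between the backward pass and the optimizer step. In the train() function above, the relevant lines would look like this:

python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip before stepping
optimizer.step()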

Weight Initialization

Proper initialization can help with training:

python
def init_weights(model):
    for name, param in model.named_parameters():
        if 'weight' in name:
            nn.init.xavier_normal_(param)
        elif 'bias' in name:
            nn.init.zeros_(param)

# Call the function directly; it already walks every parameter.
# (model.apply would pass each submodule in turn and re-initialize
# the same parameters multiple times.)
init_weights(model)

Learning Rate Scheduling

Adjust learning rate over time:

python
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=2)

# In training loop
scheduler.step(val_loss)
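
In context, the scheduler call sits at the end of each epoch, driven by the validation loss from the evaluate() function sketched earlier:

python
for epoch in range(epochs):
    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_dataloader, criterion, device)
    scheduler.step(val_loss)  # Reduce the LR when val_loss plateaus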

Saving and Loading LSTM Models

python
# Save model
def save_model(model, filepath):
torch.save({
'model_state_dict': model.state_dict(),
'vocab': vocab, # Save vocabulary with the model
}, filepath)
print(f"Model saved to {filepath}")

# Load model
def load_model(filepath, model_class, model_args):
checkpoint = torch.load(filepath)
model = model_class(**model_args)
model.load_state_dict(checkpoint['model_state_dict'])
vocab = checkpoint['vocab']
return model, vocab

# Example usage
save_model(model, "lstm_sentiment_model.pt")

# Later, to load:
model_args = {
'input_dim': len(vocab),
'embedding_dim': EMBEDDING_DIM,
'hidden_dim': HIDDEN_DIM,
'output_dim': OUTPUT_DIM,
'num_layers': NUM_LAYERS,
'dropout': DROPOUT
}
loaded_model, loaded_vocab = load_model("lstm_sentiment_model.pt", LSTMClassifier, model_args)

Summary

In this tutorial, we've explored how to implement LSTM models in PyTorch for natural language processing tasks:

  1. We started with the fundamentals of LSTM networks and why they're valuable for sequence data
  2. We built a basic LSTM model for text classification
  3. We implemented a training loop and tested it on dummy data
  4. We examined more advanced architectures like Bidirectional LSTMs and LSTMs with Attention
  5. We discussed common issues and optimization techniques
  6. We learned how to save and load trained models

LSTMs are powerful tools for sequential data processing in NLP, serving as the foundation for many applications including sentiment analysis, machine translation, text generation, and more.

Exercises

  1. Implement a Character-Level Language Model: Create an LSTM model that generates text character by character after training on a corpus of text.

  2. Named Entity Recognition: Modify the LSTM classifier to perform NER by predicting entity tags for each token in a sentence.

  3. Hyperparameter Tuning: Experiment with different LSTM configurations by varying hidden dimensions, number of layers, and dropout rates. Track the impact on performance.

  4. Pre-trained Embeddings: Modify the model to use pre-trained GloVe or Word2Vec embeddings instead of learning them from scratch.

  5. Attention Visualization: For the LSTM with attention model, implement a function to visualize which words the model pays attention to when making predictions.
