PyTorch Transformers
Introduction
Transformers have revolutionized the field of Natural Language Processing (NLP) since their introduction in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike earlier sequence models such as RNNs and LSTMs, which process data one token at a time, transformers process entire sequences simultaneously, enabling efficient parallelization and capturing long-range dependencies in text.
In this tutorial, we'll explore how to use transformer models in PyTorch for NLP tasks. We'll cover:
- The basic architecture of transformers
- How to implement a simple transformer model in PyTorch
- Using pre-trained transformers from the Hugging Face library
- Fine-tuning transformer models for specific NLP tasks
By the end of this tutorial, you'll have a solid understanding of how transformers work and how to apply them to your own NLP projects.
Prerequisites
Before diving into transformers, you should have:
- Basic knowledge of PyTorch
- An understanding of neural network fundamentals
- Familiarity with NLP concepts like tokenization and embeddings
Let's make sure we have all the necessary libraries installed:
# Install required libraries
!pip install torch transformers datasets sentencepiece  # sentencepiece is needed by the T5 tokenizer used later
Now, let's import the libraries we'll need:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
Transformer Architecture: Building Blocks
The transformer architecture consists of several key components:
- Embedding Layer: Converts input tokens to vectors
- Positional Encoding: Adds information about token positions
- Multi-Head Attention: Allows the model to focus on different parts of the input
- Feed-Forward Networks: Process the attention outputs
- Layer Normalization & Residual Connections: Stabilize training
Let's understand each component by implementing them in PyTorch.
1. Embedding and Positional Encoding
The embedding layer maps tokens to vectors, but doesn't contain information about their position in the sequence. Positional encoding adds this information:
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_seq_length=5000):
super(PositionalEncoding, self).__init__()
# Create positional encoding matrix
pe = torch.zeros(max_seq_length, d_model)
position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
# Apply sine to even indices
pe[:, 0::2] = torch.sin(position * div_term)
# Apply cosine to odd indices
pe[:, 1::2] = torch.cos(position * div_term)
# Add batch dimension and register as buffer (not a parameter)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
# Add positional encoding to the embedding
# x shape: [batch_size, seq_len, embedding_dim]
return x + self.pe[:, :x.size(1), :]
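Before moving on, here is a minimal, hypothetical shape check for this module; the batch size, sequence length, and model dimension below are arbitrary values chosen only for illustration:
# Quick sanity check with arbitrary dimensions
pos_encoder = PositionalEncoding(d_model=512)
dummy_embeddings = torch.zeros(2, 10, 512)  # [batch_size, seq_len, embedding_dim]
encoded = pos_encoder(dummy_embeddings)
print(encoded.shape)  # torch.Size([2, 10, 512])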
2. Multi-Head Attention
The attention mechanism is the heart of the transformer. It allows the model to focus on relevant parts of the input sequence when making predictions. Multi-head attention runs this mechanism multiple times in parallel:
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
self.d_model = d_model
self.num_heads = num_heads
self.head_dim = d_model // num_heads
# Check if dimensions are compatible
assert self.head_dim * num_heads == d_model, "d_model must be divisible by num_heads"
# Linear projections
self.q_linear = nn.Linear(d_model, d_model)
self.k_linear = nn.Linear(d_model, d_model)
self.v_linear = nn.Linear(d_model, d_model)
self.out_linear = nn.Linear(d_model, d_model)
def forward(self, query, key, value, mask=None):
batch_size = query.size(0)
# Linear projections and reshape for multi-head attention
q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
k = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
v = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
# Compute attention scores
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
# Apply mask if provided (used in decoder for causal attention)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
# Apply softmax to get attention weights
attention_weights = F.softmax(scores, dim=-1)
# Apply attention weights to values
output = torch.matmul(attention_weights, v)
# Reshape and apply final linear projection
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
output = self.out_linear(output)
return output
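As a quick, illustrative sketch (the tensor shapes and the causal mask below are assumptions chosen purely for demonstration), you can exercise the module like this:
# Self-attention over a batch of 2 sequences, 5 tokens each, with 8 heads
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 5, 512)  # [batch_size, seq_len, d_model]
# Lower-triangular causal mask so position i only attends to positions <= i
causal_mask = torch.tril(torch.ones(5, 5)).unsqueeze(0).unsqueeze(0)  # [1, 1, 5, 5]
out = mha(x, x, x, mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 512])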
3. Feed-Forward Network
Each transformer block includes a feed-forward network that processes the outputs of the attention layer:
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff=2048, dropout=0.1):
super(FeedForward, self).__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.linear2 = nn.Linear(d_ff, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
x = F.relu(self.linear1(x))
x = self.dropout(x)
x = self.linear2(x)
return x
4. Transformer Encoder Layer
Now, let's combine these components to create a complete encoder layer:
class EncoderLayer(nn.Module):
def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
super(EncoderLayer, self).__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.feed_forward = FeedForward(d_model, d_ff, dropout)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Self-attention with residual connection and layer normalization
attn_output = self.self_attn(x, x, x, mask)
x = self.norm1(x + self.dropout(attn_output))
# Feed-forward with residual connection and layer normalization
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout(ff_output))
return x
Building a Complete Transformer Encoder
Let's assemble a full transformer encoder using the components we've built:
class TransformerEncoder(nn.Module):
def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff=2048, dropout=0.1, max_seq_len=5000):
super(TransformerEncoder, self).__init__()
self.embedding = nn.Embedding(vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model, max_seq_len)
self.dropout = nn.Dropout(dropout)
# Stack of encoder layers
self.layers = nn.ModuleList([
EncoderLayer(d_model, num_heads, d_ff, dropout)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(d_model)
def forward(self, x, mask=None):
# Convert input indices to embeddings
x = self.embedding(x)
# Add positional encoding
x = self.positional_encoding(x)
# Apply dropout
x = self.dropout(x)
# Pass through each encoder layer
for layer in self.layers:
x = layer(x, mask)
# Apply final normalization
x = self.norm(x)
return x
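To confirm the pieces fit together, here is a small, hypothetical forward pass; the vocabulary size, hyperparameters, and padding mask shape below are illustrative assumptions rather than recommended settings:
# A small encoder with arbitrary hyperparameters
encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_heads=8, num_layers=6)
# Batch of 2 sequences of 20 token IDs each
tokens = torch.randint(0, 10000, (2, 20))
# Padding mask: 1 for real tokens, 0 for padding, broadcast to [batch_size, 1, 1, seq_len]
padding_mask = torch.ones(2, 20).unsqueeze(1).unsqueeze(2)
output = encoder(tokens, mask=padding_mask)
print(output.shape)  # torch.Size([2, 20, 512])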
Using Hugging Face Transformers Library
While building transformers from scratch is educational, in practice it's usually better to rely on established libraries. Hugging Face's Transformers library provides pre-trained models that can be fine-tuned for specific tasks.
Let's see how to use pre-trained transformers for text classification:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Example input texts and labels (sentiment analysis)
texts = [
"I love this movie, it's amazing!",
"This film is terrible, I hated it.",
"The acting was great but the plot was confusing.",
"A masterpiece of modern cinema."
]
# Binary labels (0: negative, 1: positive)
labels = torch.tensor([1, 0, 0, 1])
# Tokenize and prepare inputs
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=128)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
# Create dataset and dataloader
dataset = TensorDataset(input_ids, attention_mask, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
# Configure training
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Fine-tuning loop
model.train()
for epoch in range(3): # 3 epochs
for batch in dataloader:
batch_input_ids, batch_attention_mask, batch_labels = [b.to(device) for b in batch]
# Zero gradients
optimizer.zero_grad()
# Forward pass
outputs = model(
input_ids=batch_input_ids,
attention_mask=batch_attention_mask,
labels=batch_labels
)
loss = outputs.loss
# Backward pass and optimization
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}, Loss: {loss.item()}")
# Inference example
model.eval()
with torch.no_grad():
test_text = "This movie exceeded all my expectations!"
inputs = tokenizer(test_text, return_tensors="pt", padding=True, truncation=True).to(device)
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()
print(f"Text: {test_text}")
print(f"Sentiment: {'Positive' if prediction == 1 else 'Negative'}")
Example output:
Epoch 1, Loss: 0.6931
Epoch 2, Loss: 0.6537
Epoch 3, Loss: 0.5814
Text: This movie exceeded all my expectations!
Sentiment: Positive
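Once fine-tuning finishes, you will usually want to save the model so it can be reloaded later without retraining. Here is a brief sketch using the library's standard save/load methods (the directory name is just an example):
# Save the fine-tuned model and tokenizer to a local directory (example path)
model.save_pretrained("./sentiment-bert")
tokenizer.save_pretrained("./sentiment-bert")
# Reload them later for inference
model = BertForSequenceClassification.from_pretrained("./sentiment-bert")
tokenizer = BertTokenizer.from_pretrained("./sentiment-bert")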
Real-World Application: Text Summarization
Transformers excel at text generation tasks like summarization. Let's see how to implement text summarization using a pre-trained T5 model:
from transformers import T5Tokenizer, T5ForConditionalGeneration
# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Example article to summarize
article = """
Artificial intelligence (AI) is rapidly transforming various industries and aspects of daily life.
From healthcare to finance, transportation to entertainment, AI applications are becoming increasingly
common. Machine learning algorithms can analyze large datasets to identify patterns and make predictions,
while natural language processing enables computers to understand and generate human language.
Despite these advances, concerns about ethical implications, job displacement, and privacy issues remain.
Many experts agree that responsible AI development requires careful consideration of these challenges.
"""
# T5 requires a "summarize: " prefix for summarization tasks
input_text = "summarize: " + article
# Tokenize and generate summary
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)
summary_ids = model.generate(
inputs,
max_length=150,
min_length=40,
length_penalty=2.0,
num_beams=4,
early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Original Text Length:", len(article.split()))
print("Summary Length:", len(summary.split()))
print("\nSummary:")
print(summary)
Example output:
Original Text Length: 87
Summary Length: 37
Summary:
AI is rapidly transforming various industries and aspects of daily life. Machine learning algorithms can analyze large datasets to identify patterns and make predictions. Concerns about ethical implications, job displacement, and privacy issues remain. Responsible AI development requires careful consideration of these challenges.
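If you don't need fine-grained control over the generation parameters, the same result can be obtained more concisely with the high-level pipeline API. The sketch below assumes the same t5-small checkpoint and reuses the article variable from above:
from transformers import pipeline
# The summarization pipeline wraps tokenization, generation, and decoding
summarizer = pipeline("summarization", model="t5-small")
result = summarizer(article, max_length=150, min_length=40)
print(result[0]["summary_text"])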
Named Entity Recognition with Transformers
Another practical application of transformers is named entity recognition (NER). Let's use a pre-trained BERT model for this task:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
# Create NER pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text for entity recognition
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 in Cupertino, California."
# Perform NER
entities = ner(text)
print("Named Entities:")
for entity in entities:
print(f"{entity['word']} - {entity['entity_group']} (Score: {entity['score']:.4f})")
# Extract and display tagged text
tagged_text = list(text)
offsets = []
for entity in entities:
start = entity["start"]
end = entity["end"]
entity_type = entity["entity_group"]
offsets.append((start, f"[{entity_type}:"))
offsets.append((end, f"]"))
# Sort offsets in reverse order to avoid index shifting when inserting tags
offsets.sort(key=lambda x: x[0], reverse=True)
for offset, tag in offsets:
tagged_text.insert(offset, tag)
print("\nTagged Text:")
print("".join(tagged_text))
Example output:
Named Entities:
Apple Inc. - ORG (Score: 0.9971)
Steve Jobs - PER (Score: 0.9996)
Steve Wozniak - PER (Score: 0.9992)
Ronald Wayne - PER (Score: 0.9989)
April 1976 - MISC (Score: 0.9717)
Cupertino - LOC (Score: 0.9991)
California - LOC (Score: 0.9992)
Tagged Text:
[ORG:Apple Inc.] was founded by [PER:Steve Jobs], [PER:Steve Wozniak], and [PER:Ronald Wayne] in [MISC:April 1976] in [LOC:Cupertino], [LOC:California].
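To see what the pipeline is doing under the hood, here is a rough sketch of the raw token-classification workflow. It reuses the tokenizer, model, and text from above; note that this simplified version prints one label per subword token and does not merge entities the way aggregation_strategy="simple" does:
# Manual token classification without the pipeline helper
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [1, seq_len, num_labels]
predictions = torch.argmax(logits, dim=-1)[0]  # predicted label ID for each token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    label = model.config.id2label[label_id.item()]
    if label != "O":  # skip tokens that are not part of an entity
        print(f"{token}: {label}")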
Summary
In this tutorial, we've explored transformer models in PyTorch, covering:
- The fundamental components of transformer architectures (attention mechanisms, positional encoding, etc.)
- Implementation of a basic transformer encoder from scratch
- Using pre-trained transformer models from the Hugging Face library
- Fine-tuning transformers for text classification
- Practical applications like text summarization and named entity recognition
Transformers have become the backbone of modern NLP, powering models like BERT, GPT, and T5 that achieve state-of-the-art results across a wide range of language tasks. With the knowledge gained from this tutorial, you're now equipped to apply these powerful models to your own NLP projects.
Additional Resources
- The original "Attention Is All You Need" paper
- Hugging Face Transformers Documentation
- The Illustrated Transformer by Jay Alammar
- PyTorch Transformer Tutorial
Exercises
- Basic: Modify the text classification example to work with a different pre-trained model (e.g., RoBERTa or DistilBERT).
- Intermediate: Implement a simple chatbot using a pre-trained GPT-2 model from Hugging Face.
- Advanced: Create a translation system using a transformer model, fine-tuning it on a small dataset for a specific language pair.
- Challenge: Implement the decoder portion of the transformer architecture and combine it with the encoder to create a full sequence-to-sequence model.
By completing these exercises, you'll gain hands-on experience with transformer models and deepen your understanding of their capabilities and applications in NLP.