PyTorch Transformers
Introduction
Transformers have revolutionized the field of Natural Language Processing (NLP) since their introduction in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike earlier sequence models such as RNNs and LSTMs, which process data one token at a time, transformers process entire sequences simultaneously, enabling efficient parallelization and capturing long-range dependencies in text.
In this tutorial, we'll explore how to use transformer models in PyTorch for NLP tasks. We'll cover:
- The basic architecture of transformers
- How to implement a simple transformer model in PyTorch
- Using pre-trained transformers from the Hugging Face library
- Fine-tuning transformer models for specific NLP tasks
By the end of this tutorial, you'll have a solid understanding of how transformers work and how to apply them to your own NLP projects.
Prerequisites
Before diving into transformers, you should have:
- Basic knowledge of PyTorch
- An understanding of neural network fundamentals
- Familiarity with NLP concepts like tokenization and embeddings
Let's make sure we have all the necessary libraries installed:
# Install required libraries
!pip install torch transformers datasets sentencepiece  # sentencepiece is needed by the T5 tokenizer used later
Now, let's import the libraries we'll need:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
Transformer Architecture: Building Blocks
The transformer architecture consists of several key components:
- Embedding Layer: Converts input tokens to vectors
- Positional Encoding: Adds information about token positions
- Multi-Head Attention: Allows the model to focus on different parts of the input
- Feed-Forward Networks: Process the attention outputs
- Layer Normalization & Residual Connections: Stabilize training
Let's understand each component by implementing them in PyTorch.
1. Embedding and Positional Encoding
The embedding layer maps tokens to vectors, but doesn't contain information about their position in the sequence. Positional encoding adds this information:
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_seq_length=5000):
super(PositionalEncoding, self).__init__()
# Create positional encoding matrix
pe = torch.zeros(max_seq_length, d_model)
position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
# Apply sine to even indices
pe[:, 0::2] = torch.sin(position * div_term)
# Apply cosine to odd indices
pe[:, 1::2] = torch.cos(position * div_term)
# Add batch dimension and register as buffer (not a parameter)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
# Add positional encoding to the embedding
# x shape: [batch_size, seq_len, embedding_dim]
return x + self.pe[:, :x.size(1), :]
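Before moving on, here is a minimal, hypothetical shape check for this module; the batch size, sequence length, and model dimension below are arbitrary values chosen only for illustration:
# Quick sanity check with arbitrary dimensions
pos_encoder = PositionalEncoding(d_model=512)
dummy_embeddings = torch.zeros(2, 10, 512)  # [batch_size, seq_len, embedding_dim]
encoded = pos_encoder(dummy_embeddings)
print(encoded.shape)  # torch.Size([2, 10, 512])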
2. Multi-Head Attention
The attention mechanism is the heart of the transformer. It allows the model to focus on relevant parts of the input sequence when making predictions. Multi-head attention runs this mechanism multiple times in parallel:
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
self.d_model = d_model
self.num_heads = num_heads
self.head_dim = d_model // num_heads
# Check if dimensions are compatible
assert self.head_dim * num_heads == d_model, "d_model must be divisible by num_heads"
# Linear projections
self.q_linear = nn.Linear(d_model, d_model)
self.k_linear = nn.Linear(d_model, d_model)
self.v_linear = nn.Linear(d_model, d_model)
self.out_linear = nn.Linear(d_model, d_model)
def forward(self, query, key, value, mask=None):
batch_size = query.size(0)
# Linear projections and reshape for multi-head attention
q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
k = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
v = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
# Compute attention scores
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
# Apply mask if provided (used in decoder for causal attention)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
# Apply softmax to get attention weights
attention_weights = F.softmax(scores, dim=-1)
# Apply attention weights to values
output = torch.matmul(attention_weights, v)
# Reshape and apply final linear projection
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
output = self.out_linear(output)
return output
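As a quick, illustrative sketch (the tensor shapes and the causal mask below are assumptions chosen purely for demonstration), you can exercise the module like this:
# Self-attention over a batch of 2 sequences, 5 tokens each, with 8 heads
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 5, 512)  # [batch_size, seq_len, d_model]
# Lower-triangular causal mask so position i only attends to positions <= i
causal_mask = torch.tril(torch.ones(5, 5)).unsqueeze(0).unsqueeze(0)  # [1, 1, 5, 5]
out = mha(x, x, x, mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 512])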
3. Feed-Forward Network
Each transformer block includes a feed-forward network that processes the outputs of the attention layer:
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff=2048, dropout=0.1):
super(FeedForward, self).__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.linear2 = nn.Linear(d_ff, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
x = F.relu(self.linear1(x))
x = self.dropout(x)
x = self.linear2(x)
return x
4. Transformer Encoder Layer
Now, let's combine these components to create a complete encoder layer:
class EncoderLayer(nn.Module):
def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
super(EncoderLayer, self).__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.feed_forward = FeedForward(d_model, d_ff, dropout)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Self-attention with residual connection and layer normalization
attn_output = self.self_attn(x, x, x, mask)
x = self.norm1(x + self.dropout(attn_output))
# Feed-forward with residual connection and layer normalization
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout(ff_output))
return x
Building a Complete Transformer Encoder
Let's assemble a full transformer encoder using the components we've built:
class TransformerEncoder(nn.Module):
def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff=2048, dropout=0.1, max_seq_len=5000):
super(TransformerEncoder, self).__init__()
self.embedding = nn.Embedding(vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model, max_seq_len)
self.dropout = nn.Dropout(dropout)
# Stack of encoder layers
self.layers = nn.ModuleList([
EncoderLayer(d_model, num_heads, d_ff, dropout)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(d_model)
def forward(self, x, mask=None):
# Convert input indices to embeddings
x = self.embedding(x)
# Add positional encoding
x = self.positional_encoding(x)
# Apply dropout
x = self.dropout(x)
# Pass through each encoder layer
for layer in self.layers:
x = layer(x, mask)
# Apply final normalization
x = self.norm(x)
return x
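To confirm the pieces fit together, here is a small, hypothetical forward pass; the vocabulary size, hyperparameters, and padding mask shape below are illustrative assumptions rather than recommended settings:
# A small encoder with arbitrary hyperparameters
encoder = TransformerEncoder(vocab_size=10000, d_model=512, num_heads=8, num_layers=6)
# Batch of 2 sequences of 20 token IDs each
tokens = torch.randint(0, 10000, (2, 20))
# Padding mask: 1 for real tokens, 0 for padding, broadcast to [batch_size, 1, 1, seq_len]
padding_mask = torch.ones(2, 20).unsqueeze(1).unsqueeze(2)
output = encoder(tokens, mask=padding_mask)
print(output.shape)  # torch.Size([2, 20, 512])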
Using Hugging Face Transformers Library
While building transformers from scratch is educational, in practice it's usually better to rely on established libraries. Hugging Face's Transformers library provides pre-trained models that can be fine-tuned for specific tasks.
Let's see how to use pre-trained transformers for text classification:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim
# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Example input texts and labels (sentiment analysis)
texts = [
"I love this movie, it's amazing!",
"This film is terrible, I hated it.",
"The acting was great but the plot was confusing.",
"A masterpiece of modern cinema."
]
# Binary labels (0: negative, 1: positive)
labels = torch.tensor([1, 0, 0, 1])
# Tokenize and prepare inputs
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=128)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
# Create dataset and dataloader
dataset = TensorDataset(input_ids, attention_mask, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
# Configure training
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Fine-tuning loop
model.train()
for epoch in range(3): # 3 epochs
for batch in dataloader:
batch_input_ids, batch_attention_mask, batch_labels = [b.to(device) for b in batch]
# Zero gradients
optimizer.zero_grad()
# Forward pass
outputs = model(
input_ids=batch_input_ids,
attention_mask=batch_attention_mask,
labels=batch_labels
)
loss = outputs.loss
# Backward pass and optimization
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}, Loss: {loss.item()}")
# Inference example
model.eval()
with torch.no_grad():
test_text = "This movie exceeded all my expectations!"
inputs = tokenizer(test_text, return_tensors="pt", padding=True, truncation=True).to(device)
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()
print(f"Text: {test_text}")
print(f"Sentiment: {'Positive' if prediction == 1 else 'Negative'}")
Example output:
Epoch 1, Loss: 0.6931
Epoch 2, Loss: 0.6537
Epoch 3, Loss: 0.5814
Text: This movie exceeded all my expectations!
Sentiment: Positive
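Once fine-tuning finishes, you will usually want to save the model so it can be reloaded later without retraining. Here is a brief sketch using the library's standard save/load methods (the directory name is just an example):
# Save the fine-tuned model and tokenizer to a local directory (example path)
model.save_pretrained("./sentiment-bert")
tokenizer.save_pretrained("./sentiment-bert")
# Reload them later for inference
model = BertForSequenceClassification.from_pretrained("./sentiment-bert")
tokenizer = BertTokenizer.from_pretrained("./sentiment-bert")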
Real-World Application: Text Summarization
Transformers excel at text generation tasks like summarization. Let's see how to implement text summarization using a pre-trained T5 model:
from transformers import T5Tokenizer, T5ForConditionalGeneration
# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Example article to summarize
article = """
Artificial intelligence (AI) is rapidly transforming various industries and aspects of daily life.
From healthcare to finance, transportation to entertainment, AI applications are becoming increasingly
common. Machine learning algorithms can analyze large datasets to identify patterns and make predictions,
while natural language processing enables computers to understand and generate human language.
Despite these advances, concerns about ethical implications, job displacement, and privacy issues remain.
Many experts agree that responsible AI development requires careful consideration of these challenges.
"""
# T5 requires a "summarize: " prefix for summarization tasks
input_text = "summarize: " + article
# Tokenize and generate summary
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)
summary_ids = model.generate(
inputs,
max_length=150,
min_length=40,
length_penalty=2.0,
num_beams=4,
early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Original Text Length:", len(article.split()))
print("Summary Length:", len(summary.split()))
print("\nSummary:")
print(summary)
Example output:
Original Text Length: 87
Summary Length: 37
Summary:
AI is rapidly transforming various industries and aspects of daily life. Machine learning algorithms can analyze large datasets to identify patterns and make predictions. Concerns about ethical implications, job displacement, and privacy issues remain. Responsible AI development requires careful consideration of these challenges.
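If you don't need fine-grained control over the generation parameters, the same result can be obtained more concisely with the high-level pipeline API. The sketch below assumes the same t5-small checkpoint and reuses the article variable from above:
from transformers import pipeline
# The summarization pipeline wraps tokenization, generation, and decoding
summarizer = pipeline("summarization", model="t5-small")
result = summarizer(article, max_length=150, min_length=40)
print(result[0]["summary_text"])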
Named Entity Recognition with Transformers
Another practical application of transformers is named entity recognition (NER). Let's use a pre-trained BERT model for this task:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
# Create NER pipeline
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text for entity recognition
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 in Cupertino, California."
# Perform NER
entities = ner(text)
print("Named Entities:")
for entity in entities:
print(f"{entity['word']} - {entity['entity_group']} (Score: {entity['score']:.4f})")
# Extract and display tagged text
tagged_text = list(text)
offsets = []
for entity in entities:
start = entity["start"]
end = entity["end"]
entity_type = entity["entity_group"]
offsets.append((start, f"[{entity_type}:"))
offsets.append((end, f"]"))
# Sort offsets in reverse order to avoid index shifting when inserting tags
offsets.sort(key=lambda x: x[0], reverse=True)
for offset, tag in offsets:
tagged_text.insert(offset, tag)
print("\nTagged Text:")
print("".join(tagged_text))
Example output:
Named Entities:
Apple Inc. - ORG (Score: 0.9971)
Steve Jobs - PER (Score: 0.9996)
Steve Wozniak - PER (Score: 0.9992)
Ronald Wayne - PER (Score: 0.9989)
April 1976 - MISC (Score: 0.9717)
Cupertino - LOC (Score: 0.9991)
California - LOC (Score: 0.9992)
Tagged Text:
[ORG:Apple Inc.] was founded by [PER:Steve Jobs], [PER:Steve Wozniak], and [PER:Ronald Wayne] in [MISC:April 1976] in [LOC:Cupertino], [LOC:California].
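To see what the pipeline is doing under the hood, here is a rough sketch of the raw token-classification workflow. It reuses the tokenizer, model, and text from above; note that this simplified version prints one label per subword token and does not merge entities the way aggregation_strategy="simple" does:
# Manual token classification without the pipeline helper
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [1, seq_len, num_labels]
predictions = torch.argmax(logits, dim=-1)[0]  # predicted label ID for each token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    label = model.config.id2label[label_id.item()]
    if label != "O":  # skip tokens that are not part of an entity
        print(f"{token}: {label}")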
Summary
In this tutorial, we've explored transformer models in PyTorch, covering:
- The fundamental components of transformer architectures (attention mechanisms, positional encoding, etc.)
- Implementation of a basic transformer encoder from scratch
- Using pre-trained transformer models from the Hugging Face library
- Fine-tuning transformers for text classification
- Practical applications like text summarization and named entity recognition
Transformers have become the backbone of modern NLP, powering models like BERT, GPT, and T5 that achieve state-of-the-art results across a wide range of language tasks. With the knowledge gained from this tutorial, you're now equipped to apply these powerful models to your own NLP projects.
Additional Resources
- The original "Attention Is All You Need" paper
- Hugging Face Transformers Documentation
- The Illustrated Transformer by Jay Alammar
- PyTorch Transformer Tutorial
Exercises
- Basic: Modify the text classification example to work with a different pre-trained model (e.g., RoBERTa or DistilBERT).
- Intermediate: Implement a simple chatbot using a pre-trained GPT-2 model from Hugging Face.
- Advanced: Create a translation system using a transformer model, fine-tuning it on a small dataset for a specific language pair.
- Challenge: Implement the decoder portion of the transformer architecture and combine it with the encoder to create a full sequence-to-sequence model.
By completing these exercises, you'll gain hands-on experience with transformer models and deepen your understanding of their capabilities and applications in NLP.