PyTorch Word Embeddings
Word embeddings are a fundamental concept in Natural Language Processing (NLP). They allow us to represent words as dense vectors of real numbers, capturing semantic relationships between words. In this tutorial, we'll learn how to work with word embeddings in PyTorch, a popular deep learning framework.
What are Word Embeddings?
Word embeddings are dense vector representations of words where words with similar meanings have similar vector representations. Unlike traditional one-hot encoding (where each word is represented by a sparse vector with mostly zeros), word embeddings:
- Capture semantic relationships between words
- Reduce dimensionality
- Allow meaningful vector arithmetic (e.g., "king" - "man" + "woman" ≈ "queen")
- Improve the performance of NLP models significantly
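To make the size difference concrete, here is a minimal sketch (using an arbitrary 10,000-word vocabulary and 100-dimensional embeddings) that compares a one-hot vector with a dense embedding for the same word:

import torch
import torch.nn as nn

vocab_size = 10000   # arbitrary example vocabulary size
embedding_dim = 100  # arbitrary example embedding size
word_idx = 42        # hypothetical index of some word

# One-hot: a sparse vector as long as the vocabulary, with a single 1
one_hot = torch.zeros(vocab_size)
one_hot[word_idx] = 1.0

# Dense embedding: a short, learned vector of real numbers
embedding = nn.Embedding(vocab_size, embedding_dim)
dense = embedding(torch.tensor([word_idx])).squeeze(0)

print(one_hot.shape)  # torch.Size([10000]) - mostly zeros
print(dense.shape)    # torch.Size([100])   - every entry carries information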
Types of Word Embeddings in PyTorch
PyTorch provides several ways to create and use word embeddings:
- nn.Embedding: PyTorch's built-in embedding layer
- Pre-trained embeddings: GloVe, Word2Vec, FastText
- Contextual embeddings: From models like BERT, GPT, etc.
Let's explore each of these approaches.
1. Using nn.Embedding in PyTorch
The nn.Embedding module in PyTorch creates an embedding lookup table. It's one of the simplest ways to create embeddings.
Basic Example
import torch
import torch.nn as nn
# Define vocabulary size and embedding dimension
vocab_size = 10000 # Example: vocabulary contains 10,000 words
embedding_dim = 100 # Each word will be represented by a 100-dimensional vector
# Create embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
# Example: Get embedding for word with index 250
word_idx = torch.tensor([250])
word_embedding = embedding(word_idx)
print(f"Shape of embedding for a single word: {word_embedding.shape}")
# Get embeddings for a sentence (sequence of word indices)
sentence = torch.tensor([250, 56, 789, 3254])
sentence_embeddings = embedding(sentence)
print(f"Shape of embeddings for a sentence: {sentence_embeddings.shape}")
Output:
Shape of embedding for a single word: torch.Size([1, 100])
Shape of embeddings for a sentence: torch.Size([4, 100])
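Under the hood, nn.Embedding stores a learnable weight matrix of shape (vocab_size, embedding_dim); it starts out randomly initialized and is updated by the optimizer like any other parameter. A short sketch (re-creating the layer from the example above) shows how to inspect it, and how the optional padding_idx argument reserves a zero vector for padding:

import torch
import torch.nn as nn

vocab_size, embedding_dim = 10000, 100
embedding = nn.Embedding(vocab_size, embedding_dim)

# The lookup table is a learnable (vocab_size, embedding_dim) weight matrix
print(embedding.weight.shape)          # torch.Size([10000, 100])
print(embedding.weight.requires_grad)  # True - updated by the optimizer like any other parameter

# Optional: reserve an index for padding; its row is initialized to zeros and not updated
padded_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
print(padded_embedding.weight[0].sum())  # tensor(0., ...)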
Creating a Simple Text Classification Model with Embeddings
Let's create a basic sentiment analysis model using word embeddings:
import torch
import torch.nn as nn
import torch.optim as optim
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, text):
        # text shape: [batch_size, sentence_length]
        embedded = self.embedding(text)  # [batch_size, sentence_length, embedding_dim]
        # Average the embeddings for the entire sentence
        embedded = torch.mean(embedded, dim=1)  # [batch_size, embedding_dim]
        hidden = self.fc(embedded)
        hidden = self.relu(hidden)
        output = self.output(hidden)
        return output
# Initialize model
vocab_size = 10000
embedding_dim = 100
hidden_dim = 256
output_dim = 2 # Binary classification (positive/negative)
model = SentimentClassifier(vocab_size, embedding_dim, hidden_dim, output_dim)
# Example input
batch_size = 16
max_sentence_length = 20
text_batch = torch.randint(0, vocab_size, (batch_size, max_sentence_length))
# Forward pass
predictions = model(text_batch)
print(f"Model input shape: {text_batch.shape}")
print(f"Model output shape: {predictions.shape}")
Output:
Model input shape: torch.Size([16, 20])
Model output shape: torch.Size([16, 2])
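During training, these logits would be passed to a loss function such as nn.CrossEntropyLoss together with integer class labels. Continuing the example above, with random labels purely for illustration:

# Hypothetical labels (0 = negative, 1 = positive), random here just for illustration
labels = torch.randint(0, output_dim, (batch_size,))

criterion = nn.CrossEntropyLoss()  # expects raw logits and applies softmax internally
loss = criterion(predictions, labels)
print(f"Loss on the random batch: {loss.item():.4f}")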
2. Using Pre-trained Word Embeddings
Pre-trained word embeddings like GloVe or Word2Vec have been trained on large corpora and capture rich semantic information. Let's see how to use them in PyTorch.
Loading and Using GloVe Embeddings
import torch
import torch.nn as nn
import numpy as np
import os
def load_glove_embeddings(glove_path, word_to_idx, embedding_dim=100):
    # Create a random embedding matrix for all words in our vocabulary
    vocab_size = len(word_to_idx)
    embedding_matrix = np.random.randn(vocab_size, embedding_dim)
    # Load GloVe embeddings from file
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            # Only update embeddings for words in our vocabulary
            if word in word_to_idx:
                vector = np.array([float(val) for val in parts[1:]])
                embedding_matrix[word_to_idx[word]] = vector
    return torch.FloatTensor(embedding_matrix)
# Example usage:
# Assume we have a vocabulary mapping from words to indices
word_to_idx = {"hello": 0, "world": 1, "python": 2, "pytorch": 3, "nlp": 4}
vocab_size = len(word_to_idx)
embedding_dim = 100
# Path to GloVe embeddings (you would need to download them)
glove_path = "glove.6B.100d.txt" # This is just an example path
# For demonstration only (since we don't have actual GloVe file in this example)
if not os.path.exists(glove_path):
    print("This is a demonstration - in practice, download GloVe from https://nlp.stanford.edu/projects/glove/")
    # Create a model with random embeddings instead
    embedding_layer = nn.Embedding(vocab_size, embedding_dim)
else:
    # Load GloVe embeddings
    glove_embeddings = load_glove_embeddings(glove_path, word_to_idx, embedding_dim)
    # Create embedding layer with pre-trained weights
    embedding_layer = nn.Embedding.from_pretrained(glove_embeddings, freeze=False)
    # Setting freeze=False allows the embeddings to be fine-tuned during training
print(f"Created embedding layer with shape: {embedding_layer.weight.shape}")
Output:
This is a demonstration - in practice, download GloVe from https://nlp.stanford.edu/projects/glove/
Created embedding layer with shape: torch.Size([5, 100])
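If the embedding layer lives inside a larger model (such as the SentimentClassifier defined earlier), a common pattern is to copy the pre-trained matrix into that layer after the model is created. A rough sketch, assuming glove_embeddings is a tensor whose shape matches the model's embedding weight:

# Sketch: copy pre-trained vectors into an existing model's embedding layer.
# Assumes `glove_embeddings` is a FloatTensor of shape [vocab_size, embedding_dim]
# (e.g. produced by load_glove_embeddings above).
model = SentimentClassifier(vocab_size, embedding_dim, hidden_dim=256, output_dim=2)
with torch.no_grad():
    model.embedding.weight.copy_(glove_embeddings)

# Optionally freeze the embeddings so they are not updated during training
model.embedding.weight.requires_grad = False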
Word Similarity with Pre-trained Embeddings
With word embeddings, we can calculate semantic similarity between words:
def cosine_similarity(vec1, vec2):
    return torch.nn.functional.cosine_similarity(vec1, vec2, dim=0)

# Demonstration with our toy embeddings
# In practice, you'd use pre-trained embeddings
with torch.no_grad():
    # Get embeddings for two words
    idx1 = torch.tensor([word_to_idx["python"]])
    idx2 = torch.tensor([word_to_idx["pytorch"]])
    emb1 = embedding_layer(idx1).squeeze(0)
    emb2 = embedding_layer(idx2).squeeze(0)
    similarity = cosine_similarity(emb1, emb2)
    print(f"Similarity between 'python' and 'pytorch': {similarity.item():.4f}")
# With meaningful pre-trained embeddings, similar words would have higher similarity scores
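The same idea extends to finding the words most similar to a query word: compare the query vector against every row of the embedding matrix and keep the top-k matches. A small sketch using the toy word_to_idx and embedding_layer from above:

import torch.nn.functional as F

def most_similar(word, word_to_idx, embedding_layer, k=3):
    idx_to_word = {idx: w for w, idx in word_to_idx.items()}
    with torch.no_grad():
        query = embedding_layer.weight[word_to_idx[word]]          # [embedding_dim]
        sims = F.cosine_similarity(query.unsqueeze(0),
                                   embedding_layer.weight, dim=1)  # [vocab_size]
        sims[word_to_idx[word]] = -1.0  # exclude the query word itself
        top = torch.topk(sims, k)
    return [(idx_to_word[i.item()], s.item()) for s, i in zip(top.values, top.indices)]

print(most_similar("python", word_to_idx, embedding_layer))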
3. Working with Contextual Embeddings
Contextual embeddings like BERT generate different embeddings for the same word based on context. Let's see a basic implementation using the Hugging Face transformers library:
# pip install transformers
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Example sentences
sentences = [
"I love PyTorch for deep learning tasks.",
"PyTorch is my favorite deep learning framework."
]
# Tokenize input
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Get contextual embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Get the embeddings for the entire sequence
sequence_embeddings = outputs.last_hidden_state
# Get the pooled output: the [CLS] hidden state passed through BERT's pooler layer (often used as a sentence representation)
sentence_embeddings = outputs.pooler_output
print(f"Shape of all token embeddings: {sequence_embeddings.shape}")
print(f"Shape of sentence embeddings: {sentence_embeddings.shape}")
Output:
Shape of all token embeddings: torch.Size([2, 13, 768])
Shape of sentence embeddings: torch.Size([2, 768])
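To see that these embeddings really are contextual, we can compare the vectors BERT produces for the same surface word in different sentences. The sketch below (the sentences are arbitrary examples, reusing the tokenizer and model loaded above) locates the token "bank" in each sentence and compares the resulting vectors:

def bank_vector(sentence):
    # Find the position of the token "bank" (offset by 1 for the leading [CLS] token)
    tokens = tokenizer.tokenize(sentence)
    position = tokens.index("bank") + 1
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state  # [1, seq_len, 768]
    return hidden[0, position]

vec_money = bank_vector("She deposited the money at the bank.")
vec_money2 = bank_vector("He opened an account at the bank yesterday.")
vec_river = bank_vector("They had a picnic on the bank of the river.")

cos = torch.nn.functional.cosine_similarity
print(cos(vec_money, vec_money2, dim=0))  # typically higher: same sense of "bank"
print(cos(vec_money, vec_river, dim=0))   # typically lower: different sense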
Real-world Application: Sentiment Analysis
Let's build a more complete sentiment analysis model using word embeddings and LSTM (Long Short-Term Memory) networks:
import torch
import torch.nn as nn
import torch.optim as optim
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout,
                            batch_first=True)
        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, sentence_length]
        # Embedded: [batch_size, sentence_length, embedding_dim]
        embedded = self.dropout(self.embedding(text))
        # LSTM output: [batch_size, sentence_length, hidden_dim * n_directions]
        output, (hidden, cell) = self.lstm(embedded)
        # If bidirectional, concat the last hidden state from both directions
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]
        # Apply dropout to hidden state
        hidden = self.dropout(hidden)
        # Return logits
        return self.fc(hidden)
# Model parameters
vocab_size = 25000
embedding_dim = 300
hidden_dim = 256
output_dim = 2 # binary sentiment (positive/negative)
n_layers = 2
bidirectional = True
dropout = 0.5
pad_idx = 1 # Assuming 1 is the padding index in your vocabulary
# Initialize model
model = SentimentLSTM(vocab_size, embedding_dim, hidden_dim, output_dim,
n_layers, bidirectional, dropout, pad_idx)
# Example input (batch of 8 sentences, each with max 50 tokens)
batch = torch.randint(0, vocab_size, (8, 50))
# Forward pass
predictions = model(batch)
print(f"Input shape: {batch.shape}")
print(f"Output shape: {predictions.shape}")
Output:
Input shape: torch.Size([8, 50])
Output shape: torch.Size([8, 2])
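One detail glossed over above: since every sentence in a batch is padded to the same length, the LSTM also runs over the padding tokens. A common refinement is to pack the padded batch so the LSTM skips them. Here is a rough sketch of how the forward pass could be adapted, assuming the true sentence lengths are available alongside the token indices:

import torch.nn.utils.rnn as rnn_utils

def forward_with_packing(model, text, lengths):
    # text: [batch_size, sentence_length], lengths: true (unpadded) length of each sentence
    embedded = model.dropout(model.embedding(text))
    packed = rnn_utils.pack_padded_sequence(embedded, lengths.cpu(),
                                            batch_first=True, enforce_sorted=False)
    packed_output, (hidden, cell) = model.lstm(packed)
    if model.lstm.bidirectional:
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
    else:
        hidden = hidden[-1, :, :]
    return model.fc(model.dropout(hidden))

lengths = torch.randint(5, 50, (8,))  # hypothetical true lengths for the batch above
print(forward_with_packing(model, batch, lengths).shape)  # torch.Size([8, 2])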
Training a Sentiment Analysis Model
Let's see how you would train this model:
def train_model(model, train_iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for batch in train_iterator:
        # Assuming batch.text contains the tokenized texts and batch.label contains the labels
        text = batch.text
        labels = batch.label
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        predictions = model(text)
        # Calculate loss
        loss = criterion(predictions, labels)
        # Backward pass
        loss.backward()
        # Update parameters
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(train_iterator)
# Example usage (this is just for illustration):
# optimizer = optim.Adam(model.parameters())
# criterion = nn.CrossEntropyLoss()
# train_model(model, train_iterator, optimizer, criterion)
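An evaluation loop looks almost the same, except the model is switched to eval mode (which disables dropout) and gradients are turned off. A sketch that mirrors the training function above and also tracks accuracy:

def evaluate_model(model, iterator, criterion):
    model.eval()  # disable dropout
    epoch_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for batch in iterator:
            text = batch.text
            labels = batch.label
            predictions = model(text)
            loss = criterion(predictions, labels)
            epoch_loss += loss.item()
            # Accuracy: compare the highest-scoring class with the true label
            correct += (predictions.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return epoch_loss / len(iterator), correct / total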
Custom Word Embeddings vs. Pre-trained Embeddings
When should you use custom embeddings versus pre-trained ones?
Custom embeddings (nn.Embedding):
- When you have domain-specific vocabulary
- When your dataset is large enough
- When you want the embeddings to be optimized specifically for your task
Pre-trained embeddings:
- When you have limited training data
- When you want to leverage semantic knowledge from large corpora
- When you need faster training time
Contextual embeddings:
- When you need state-of-the-art performance
- When the same word can have different meanings based on context
- When you have enough computational resources
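Whichever option you choose, switching between fixed and trainable embeddings is a one-line change in PyTorch. A short sketch with an arbitrary vocabulary size and dimension:

import torch
import torch.nn as nn

pretrained = torch.randn(5000, 100)  # stand-in for a real pre-trained matrix

# Trainable: the embedding vectors are updated during training
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)

# Fixed: the embedding vectors stay exactly as loaded
fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Equivalently, freeze an existing layer by turning off its gradient
layer = nn.Embedding(5000, 100)
layer.weight.requires_grad = False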
Summary
In this tutorial, we've covered:
- What word embeddings are and why they're important
- How to use PyTorch's nn.Embedding module
- How to incorporate pre-trained embeddings like GloVe
- How to work with contextual embeddings from models like BERT
- Building an LSTM-based sentiment analysis model with embeddings
Word embeddings are a fundamental building block of modern NLP systems. They allow us to represent words as dense vectors that capture semantic relationships, which greatly improves the performance of NLP tasks.
Exercises
- Build a simple text classifier using nn.Embedding and a feed-forward neural network.
- Load pre-trained GloVe embeddings and use them in a model.
- Try different embedding dimensions (50, 100, 300) and observe their effect on model performance.
- Implement a word analogy task using embeddings (e.g., "king" - "man" + "woman" = ?).
- Compare the performance of fixed vs. trainable embeddings on a text classification task.