PyTorch Word Embeddings
Word embeddings are a fundamental concept in Natural Language Processing (NLP). They allow us to represent words as dense vectors of real numbers, capturing semantic relationships between words. In this tutorial, we'll learn how to work with word embeddings in PyTorch, a popular deep learning framework.
What are Word Embeddings?
Word embeddings are dense vector representations of words where words with similar meanings have similar vector representations. Unlike traditional one-hot encoding (where each word is represented by a sparse vector with mostly zeros), word embeddings:
- Capture semantic relationships between words
- Reduce dimensionality
- Allow meaningful vector arithmetic (e.g., "king" - "man" + "woman" ≈ "queen")
- Improve the performance of NLP models significantly
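To make the size difference concrete, here is a minimal sketch (using an arbitrary 10,000-word vocabulary and 100-dimensional embeddings) that compares a one-hot vector with a dense embedding for the same word:

import torch
import torch.nn as nn

vocab_size = 10000   # arbitrary example vocabulary size
embedding_dim = 100  # arbitrary example embedding size
word_idx = 42        # hypothetical index of some word

# One-hot: a sparse vector as long as the vocabulary, with a single 1
one_hot = torch.zeros(vocab_size)
one_hot[word_idx] = 1.0

# Dense embedding: a short, learned vector of real numbers
embedding = nn.Embedding(vocab_size, embedding_dim)
dense = embedding(torch.tensor([word_idx])).squeeze(0)

print(one_hot.shape)  # torch.Size([10000]) - mostly zeros
print(dense.shape)    # torch.Size([100])   - every entry carries information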
Types of Word Embeddings in PyTorch
PyTorch provides several ways to create and use word embeddings:
- nn.Embedding: PyTorch's built-in embedding layer
- Pre-trained embeddings: GloVe, Word2Vec, FastText
- Contextual embeddings: From models like BERT, GPT, etc.
Let's explore each of these approaches.
1. Using nn.Embedding in PyTorch
The nn.Embedding module in PyTorch creates an embedding lookup table. It's one of the simplest ways to create embeddings.
Basic Example
import torch
import torch.nn as nn
# Define vocabulary size and embedding dimension
vocab_size = 10000 # Example: vocabulary contains 10,000 words
embedding_dim = 100 # Each word will be represented by a 100-dimensional vector
# Create embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)
# Example: Get embedding for word with index 250
word_idx = torch.tensor([250])
word_embedding = embedding(word_idx)
print(f"Shape of embedding for a single word: {word_embedding.shape}")
# Get embeddings for a sentence (sequence of word indices)
sentence = torch.tensor([250, 56, 789, 3254])
sentence_embeddings = embedding(sentence)
print(f"Shape of embeddings for a sentence: {sentence_embeddings.shape}")
Output:
Shape of embedding for a single word: torch.Size([1, 100])
Shape of embeddings for a sentence: torch.Size([4, 100])
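Under the hood, nn.Embedding stores a learnable weight matrix of shape (vocab_size, embedding_dim); it starts out randomly initialized and is updated by the optimizer like any other parameter. A short sketch (re-creating the layer from the example above) shows how to inspect it, and how the optional padding_idx argument reserves a zero vector for padding:

import torch
import torch.nn as nn

vocab_size, embedding_dim = 10000, 100
embedding = nn.Embedding(vocab_size, embedding_dim)

# The lookup table is a learnable (vocab_size, embedding_dim) weight matrix
print(embedding.weight.shape)          # torch.Size([10000, 100])
print(embedding.weight.requires_grad)  # True - updated by the optimizer like any other parameter

# Optional: reserve an index for padding; its row is initialized to zeros and not updated
padded_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
print(padded_embedding.weight[0].sum())  # tensor(0., ...)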
Creating a Simple Text Classification Model with Embeddings
Let's create a basic sentiment analysis model using word embeddings:
import torch
import torch.nn as nn
import torch.optim as optim
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, text):
        # text shape: [batch_size, sentence_length]
        embedded = self.embedding(text)  # [batch_size, sentence_length, embedding_dim]
        # Average the embeddings for the entire sentence
        embedded = torch.mean(embedded, dim=1)  # [batch_size, embedding_dim]
        hidden = self.fc(embedded)
        hidden = self.relu(hidden)
        output = self.output(hidden)
        return output
# Initialize model
vocab_size = 10000
embedding_dim = 100
hidden_dim = 256
output_dim = 2 # Binary classification (positive/negative)
model = SentimentClassifier(vocab_size, embedding_dim, hidden_dim, output_dim)
# Example input
batch_size = 16
max_sentence_length = 20
text_batch = torch.randint(0, vocab_size, (batch_size, max_sentence_length))
# Forward pass
predictions = model(text_batch)
print(f"Model input shape: {text_batch.shape}")
print(f"Model output shape: {predictions.shape}")
Output:
Model input shape: torch.Size([16, 20])
Model output shape: torch.Size([16, 2])
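During training, these logits would be passed to a loss function such as nn.CrossEntropyLoss together with integer class labels. Continuing the example above, with random labels purely for illustration:

# Hypothetical labels (0 = negative, 1 = positive), random here just for illustration
labels = torch.randint(0, output_dim, (batch_size,))

criterion = nn.CrossEntropyLoss()  # expects raw logits and applies softmax internally
loss = criterion(predictions, labels)
print(f"Loss on the random batch: {loss.item():.4f}")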
2. Using Pre-trained Word Embeddings
Pre-trained word embeddings like GloVe or Word2Vec have been trained on large corpora and capture rich semantic information. Let's see how to use them in PyTorch.
Loading and Using GloVe Embeddings
import torch
import torch.nn as nn
import numpy as np
import os
def load_glove_embeddings(glove_path, word_to_idx, embedding_dim=100):
    # Create a random embedding matrix for all words in our vocabulary
    vocab_size = len(word_to_idx)
    embedding_matrix = np.random.randn(vocab_size, embedding_dim)
    # Load GloVe embeddings from file
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            # Only update embeddings for words in our vocabulary
            if word in word_to_idx:
                vector = np.array([float(val) for val in parts[1:]])
                embedding_matrix[word_to_idx[word]] = vector
    return torch.FloatTensor(embedding_matrix)
# Example usage:
# Assume we have a vocabulary mapping from words to indices
word_to_idx = {"hello": 0, "world": 1, "python": 2, "pytorch": 3, "nlp": 4}
vocab_size = len(word_to_idx)
embedding_dim = 100
# Path to GloVe embeddings (you would need to download them)
glove_path = "glove.6B.100d.txt" # This is just an example path
# For demonstration only (since we don't have actual GloVe file in this example)
if not os.path.exists(glove_path):
    print("This is a demonstration - in practice, download GloVe from https://nlp.stanford.edu/projects/glove/")
    # Create a model with random embeddings instead
    embedding_layer = nn.Embedding(vocab_size, embedding_dim)
else:
    # Load GloVe embeddings
    glove_embeddings = load_glove_embeddings(glove_path, word_to_idx, embedding_dim)
    # Create embedding layer with pre-trained weights
    embedding_layer = nn.Embedding.from_pretrained(glove_embeddings, freeze=False)
    # Setting freeze=False allows the embeddings to be fine-tuned during training
print(f"Created embedding layer with shape: {embedding_layer.weight.shape}")
Output:
This is a demonstration - in practice, download GloVe from https://nlp.stanford.edu/projects/glove/
Created embedding layer with shape: torch.Size([5, 100])
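If the embedding layer lives inside a larger model (such as the SentimentClassifier defined earlier), a common pattern is to copy the pre-trained matrix into that layer after the model is created. A rough sketch, assuming glove_embeddings is a tensor whose shape matches the model's embedding weight:

# Sketch: copy pre-trained vectors into an existing model's embedding layer.
# Assumes `glove_embeddings` is a FloatTensor of shape [vocab_size, embedding_dim]
# (e.g. produced by load_glove_embeddings above).
model = SentimentClassifier(vocab_size, embedding_dim, hidden_dim=256, output_dim=2)
with torch.no_grad():
    model.embedding.weight.copy_(glove_embeddings)

# Optionally freeze the embeddings so they are not updated during training
model.embedding.weight.requires_grad = False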
Word Similarity with Pre-trained Embeddings
With word embeddings, we can calculate semantic similarity between words:
def cosine_similarity(vec1, vec2):
    return torch.nn.functional.cosine_similarity(vec1, vec2, dim=0)

# Demonstration with our toy embeddings
# In practice, you'd use pre-trained embeddings
with torch.no_grad():
    # Get embeddings for two words
    idx1 = torch.tensor([word_to_idx["python"]])
    idx2 = torch.tensor([word_to_idx["pytorch"]])
    emb1 = embedding_layer(idx1).squeeze(0)
    emb2 = embedding_layer(idx2).squeeze(0)
    similarity = cosine_similarity(emb1, emb2)
    print(f"Similarity between 'python' and 'pytorch': {similarity.item():.4f}")
# With meaningful pre-trained embeddings, similar words would have higher similarity scores
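The same idea extends to finding the words most similar to a query word: compare the query vector against every row of the embedding matrix and keep the top-k matches. A small sketch using the toy word_to_idx and embedding_layer from above:

import torch.nn.functional as F

def most_similar(word, word_to_idx, embedding_layer, k=3):
    idx_to_word = {idx: w for w, idx in word_to_idx.items()}
    with torch.no_grad():
        query = embedding_layer.weight[word_to_idx[word]]          # [embedding_dim]
        sims = F.cosine_similarity(query.unsqueeze(0),
                                   embedding_layer.weight, dim=1)  # [vocab_size]
        sims[word_to_idx[word]] = -1.0  # exclude the query word itself
        top = torch.topk(sims, k)
    return [(idx_to_word[i.item()], s.item()) for s, i in zip(top.values, top.indices)]

print(most_similar("python", word_to_idx, embedding_layer))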
3. Working with Contextual Embeddings
Contextual embeddings like BERT generate different embeddings for the same word based on context. Let's see a basic implementation using the Hugging Face transformers library:
# pip install transformers
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Example sentences
sentences = [
"I love PyTorch for deep learning tasks.",
"PyTorch is my favorite deep learning framework."
]
# Tokenize input
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Get contextual embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Get the embeddings for the entire sequence
sequence_embeddings = outputs.last_hidden_state
# Get the pooled output: the [CLS] hidden state passed through BERT's pooler layer (often used as a sentence representation)
sentence_embeddings = outputs.pooler_output
print(f"Shape of all token embeddings: {sequence_embeddings.shape}")
print(f"Shape of sentence embeddings: {sentence_embeddings.shape}")
Output:
Shape of all token embeddings: torch.Size([2, 13, 768])
Shape of sentence embeddings: torch.Size([2, 768])
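To see that these embeddings really are contextual, we can compare the vectors BERT produces for the same surface word in different sentences. The sketch below (the sentences are arbitrary examples, reusing the tokenizer and model loaded above) locates the token "bank" in each sentence and compares the resulting vectors:

def bank_vector(sentence):
    # Find the position of the token "bank" (offset by 1 for the leading [CLS] token)
    tokens = tokenizer.tokenize(sentence)
    position = tokens.index("bank") + 1
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state  # [1, seq_len, 768]
    return hidden[0, position]

vec_money = bank_vector("She deposited the money at the bank.")
vec_money2 = bank_vector("He opened an account at the bank yesterday.")
vec_river = bank_vector("They had a picnic on the bank of the river.")

cos = torch.nn.functional.cosine_similarity
print(cos(vec_money, vec_money2, dim=0))  # typically higher: same sense of "bank"
print(cos(vec_money, vec_river, dim=0))   # typically lower: different sense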
Real-world Application: Sentiment Analysis
Let's build a more complete sentiment analysis model using word embeddings and LSTM (Long Short-Term Memory) networks:
import torch
import torch.nn as nn
import torch.optim as optim
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout,
                            batch_first=True)
        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text shape: [batch_size, sentence_length]
        # Embedded: [batch_size, sentence_length, embedding_dim]
        embedded = self.dropout(self.embedding(text))
        # LSTM output: [batch_size, sentence_length, hidden_dim * n_directions]
        output, (hidden, cell) = self.lstm(embedded)
        # If bidirectional, concat the last hidden state from both directions
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]
        # Apply dropout to hidden state
        hidden = self.dropout(hidden)
        # Return logits
        return self.fc(hidden)
# Model parameters
vocab_size = 25000
embedding_dim = 300
hidden_dim = 256
output_dim = 2 # binary sentiment (positive/negative)
n_layers = 2
bidirectional = True
dropout = 0.5
pad_idx = 1 # Assuming 1 is the padding index in your vocabulary
# Initialize model
model = SentimentLSTM(vocab_size, embedding_dim, hidden_dim, output_dim,
n_layers, bidirectional, dropout, pad_idx)
# Example input (batch of 8 sentences, each with max 50 tokens)
batch = torch.randint(0, vocab_size, (8, 50))
# Forward pass
predictions = model(batch)
print(f"Input shape: {batch.shape}")
print(f"Output shape: {predictions.shape}")
Output:
Input shape: torch.Size([8, 50])
Output shape: torch.Size([8, 2])
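One detail glossed over above: since every sentence in a batch is padded to the same length, the LSTM also runs over the padding tokens. A common refinement is to pack the padded batch so the LSTM skips them. Here is a rough sketch of how the forward pass could be adapted, assuming the true sentence lengths are available alongside the token indices:

import torch.nn.utils.rnn as rnn_utils

def forward_with_packing(model, text, lengths):
    # text: [batch_size, sentence_length], lengths: true (unpadded) length of each sentence
    embedded = model.dropout(model.embedding(text))
    packed = rnn_utils.pack_padded_sequence(embedded, lengths.cpu(),
                                            batch_first=True, enforce_sorted=False)
    packed_output, (hidden, cell) = model.lstm(packed)
    if model.lstm.bidirectional:
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
    else:
        hidden = hidden[-1, :, :]
    return model.fc(model.dropout(hidden))

lengths = torch.randint(5, 50, (8,))  # hypothetical true lengths for the batch above
print(forward_with_packing(model, batch, lengths).shape)  # torch.Size([8, 2])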
Training a Sentiment Analysis Model
Let's see how you would train this model:
def train_model(model, train_iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for batch in train_iterator:
        # Assuming batch.text contains the tokenized texts and batch.label contains the labels
        text = batch.text
        labels = batch.label
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        predictions = model(text)
        # Calculate loss
        loss = criterion(predictions, labels)
        # Backward pass
        loss.backward()
        # Update parameters
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(train_iterator)
# Example usage (this is just for illustration):
# optimizer = optim.Adam(model.parameters())
# criterion = nn.CrossEntropyLoss()
# train_model(model, train_iterator, optimizer, criterion)
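An evaluation loop looks almost the same, except the model is switched to eval mode (which disables dropout) and gradients are turned off. A sketch that mirrors the training function above and also tracks accuracy:

def evaluate_model(model, iterator, criterion):
    model.eval()  # disable dropout
    epoch_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for batch in iterator:
            text = batch.text
            labels = batch.label
            predictions = model(text)
            loss = criterion(predictions, labels)
            epoch_loss += loss.item()
            # Accuracy: compare the highest-scoring class with the true label
            correct += (predictions.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return epoch_loss / len(iterator), correct / total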
Custom Word Embeddings vs. Pre-trained Embeddings
When should you use custom embeddings versus pre-trained ones?
Custom embeddings (nn.Embedding):
- When you have domain-specific vocabulary
- When your dataset is large enough
- When you want the embeddings to be optimized specifically for your task
Pre-trained embeddings:
- When you have limited training data
- When you want to leverage semantic knowledge from large corpora
- When you need faster training time
Contextual embeddings:
- When you need state-of-the-art performance
- When the same word can have different meanings based on context
- When you have enough computational resources
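Whichever option you choose, switching between fixed and trainable embeddings is a one-line change in PyTorch. A short sketch with an arbitrary vocabulary size and dimension:

import torch
import torch.nn as nn

pretrained = torch.randn(5000, 100)  # stand-in for a real pre-trained matrix

# Trainable: the embedding vectors are updated during training
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)

# Fixed: the embedding vectors stay exactly as loaded
fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Equivalently, freeze an existing layer by turning off its gradient
layer = nn.Embedding(5000, 100)
layer.weight.requires_grad = False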
Summary
In this tutorial, we've covered:
- What word embeddings are and why they're important
- How to use PyTorch's nn.Embedding module
- How to incorporate pre-trained embeddings like GloVe
- How to work with contextual embeddings from models like BERT
- Building an LSTM-based sentiment analysis model with embeddings
Word embeddings are a fundamental building block of modern NLP systems. They allow us to represent words as dense vectors that capture semantic relationships, which greatly improves the performance of NLP tasks.
Exercises
- Build a simple text classifier using nn.Embedding and a feed-forward neural network.
- Load pre-trained GloVe embeddings and use them in a model.
- Try different embedding dimensions (50, 100, 300) and observe their effect on model performance.
- Implement a word analogy task using embeddings (e.g., "king" - "man" + "woman" = ?).
- Compare the performance of fixed vs. trainable embeddings on a text classification task.