PyTorch Text Processing
Text data requires special processing before it can be used in neural networks. In this tutorial, we'll explore how to prepare text data for natural language processing (NLP) tasks using PyTorch's built-in tools and libraries.
Introduction
Unlike numerical or image data, text cannot be fed directly into neural networks. We need to convert text into numerical representations that machine learning models can understand. PyTorch provides several utilities to help with this process through its torchtext library.
By the end of this tutorial, you'll understand:
- How to tokenize text
- How to build vocabularies
- How to create numerical representations of text
- How to use pre-built datasets
- How to create data iterators for training
Prerequisites
Before we get started, make sure you have the following packages installed:
pip install torch torchtext spacy
python -m spacy download en_core_web_sm
Basic Text Processing Pipeline
Let's break down the text processing pipeline into simple steps:
- Tokenization: Breaking text into words or subwords
- Building a vocabulary: Assigning unique indices to tokens
- Numericalization: Converting tokens into numbers
- Creating embeddings: Converting indices into dense vectors
Let's implement each step:
1. Tokenization
Tokenization is the process of breaking text into individual units (tokens) such as words, subwords, or characters.
import torch
import torchtext
from torchtext.data.utils import get_tokenizer
# Create a tokenizer
tokenizer = get_tokenizer('basic_english')
# Sample text
text = "PyTorch is an open source machine learning framework."
# Tokenize the text
tokens = tokenizer(text)
print(tokens)
Output:
['pytorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework', '.']
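For comparison, a plain whitespace split keeps the original casing and leaves punctuation attached to the preceding word, which is exactly the kind of inconsistency basic_english normalizes away:
# Naive baseline: split on whitespace only (no lowercasing, punctuation stays attached)
print(text.split())
Output:
['PyTorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework.']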
You can also use spaCy for more advanced tokenization:
import spacy
nlp = spacy.load('en_core_web_sm')
def spacy_tokenizer(text):
    return [token.text for token in nlp(text)]
tokens = spacy_tokenizer(text)
print(tokens)
Output:
['PyTorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework', '.']
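In recent torchtext versions you can also ask get_tokenizer to wrap spaCy for you; a minimal sketch, assuming the en_core_web_sm model from the prerequisites is installed (it should produce the same tokens as the custom function above):
# Let torchtext wrap the spaCy tokenizer directly
spacy_wrapped = get_tokenizer('spacy', language='en_core_web_sm')
print(spacy_wrapped(text))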
2. Building a Vocabulary
A vocabulary maps tokens to unique indices. This is essential for the model to understand the data.
from torchtext.vocab import build_vocab_from_iterator
# Sample dataset of sentences
train_data = [
    "PyTorch is an open source machine learning framework.",
    "It is based on the Torch library.",
    "PyTorch provides a wide range of algorithms for deep learning.",
    "It is widely used for applications such as natural language processing."
]
# Tokenize the dataset
tokenized_data = [tokenizer(sentence) for sentence in train_data]
# Build vocabulary
vocab = build_vocab_from_iterator(tokenized_data, specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>']) # Set default index for unknown words
# Print vocabulary
print(f"Vocabulary size: {len(vocab)}")
print(f"Token 'pytorch' has index: {vocab['pytorch']}")
print(f"Token 'learning' has index: {vocab['learning']}")
Output (exact indices depend on token frequencies and your torchtext version):
Vocabulary size: 33
Token 'pytorch' has index: 2
Token 'learning' has index: 8
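The vocabulary also supports the reverse mapping, which is useful for inspecting what the model actually sees; a quick sketch using the Vocab object's built-in lookup methods:
# Map indices back to tokens
itos = vocab.get_itos()  # list where position i holds the token with index i
print(itos[:4])          # the specials come first: ['<unk>', '<pad>', ...]
print(vocab.lookup_tokens([vocab['pytorch'], vocab['learning']]))  # ['pytorch', 'learning']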
3. Numericalization
Now let's convert our tokens into numerical indices:
# Convert tokens to indices
def text_to_indices(text, tokenizer, vocab):
    tokens = tokenizer(text)
    return [vocab[token] for token in tokens]
# Process a sample sentence
sample = "PyTorch provides deep learning tools."
indices = text_to_indices(sample, tokenizer, vocab)
print(f"Original text: '{sample}'")
print(f"Indices: {indices}")
# Some words might not be in the vocabulary (will be mapped to <unk>)
print(f"Index for 'tools' (unknown word): {vocab['tools']}")
Output:
Original text: 'PyTorch provides deep learning tools.'
Indices: a list of six indices, one per token; only 'tools' is missing from the vocabulary, so it falls back to <unk> (index 0)
Index for 'tools' (unknown word): 0
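Since the Vocab object is callable on a list of tokens, the same conversion can also be written as a one-liner; a small sketch equivalent to text_to_indices above:
# Equivalent shorthand: calling the vocab on a token list returns the index list
indices = vocab(tokenizer(sample))
print(indices)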
4. Creating Embeddings
Embeddings convert indices into dense vectors that capture semantic meaning:
# Create an embedding layer
embedding_dim = 10 # Dimension of the embedding vectors
embedding = torch.nn.Embedding(len(vocab), embedding_dim)
# Convert indices to embeddings
indices_tensor = torch.LongTensor(indices)
embedded_text = embedding(indices_tensor)
print(f"Shape of embedded text: {embedded_text.shape}")
print(f"Embedding for the first token 'pytorch':\n{embedded_text[0]}")
Output (the vector values will differ on every run, since the embedding is randomly initialized):
Shape of embedded text: torch.Size([6, 10])
Embedding for the first token 'pytorch':
tensor([-0.9306, 0.2562, 1.0211, 0.0841, 1.2770, 0.8254, -0.5530, -0.4578,
1.0078, 0.2775], grad_fn=<SelectBackward0>)
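One practical refinement for the padding we introduce below: if you pass padding_idx to nn.Embedding, the <pad> vector is kept at zero and excluded from gradient updates. A small sketch:
# Keep the <pad> embedding fixed at zeros so padding tokens carry no signal
embedding = torch.nn.Embedding(len(vocab), embedding_dim, padding_idx=vocab['<pad>'])
print(embedding(torch.LongTensor([vocab['<pad>']])))  # prints an all-zero vector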
Using torchtext's Built-in Datasets
PyTorch provides built-in datasets for common NLP tasks:
from torchtext.datasets import IMDB
# Load the IMDB dataset
train_iter = IMDB(split='train')
# Get a sample
sample = next(iter(train_iter))
print(f"Label: {sample[0]}")
print(f"Text: {sample[1][:200]}...") # Print first 200 characters
Output:
Label: pos
Text: "I went and saw this movie last night after being coaxed to by a few friends. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able..."
Depending on your torchtext version, the label may be returned as an integer rather than a string, and the first sample you see may differ.
Creating Data Batches for Training
Let's create batches of data for training:
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
# Define a simple map-style dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, vocab):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.vocab = vocab
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Convert text to indices
        indices = text_to_indices(text, self.tokenizer, self.vocab)
        return torch.tensor(indices), label
# Sample data
texts = [
    "PyTorch is amazing for deep learning",
    "Natural language processing is fascinating",
    "Deep learning models require a lot of data",
    "Text processing is the first step in NLP"
]
labels = [1, 0, 1, 0] # Binary labels for demonstration
# Create dataset
dataset = TextDataset(texts, labels, tokenizer, vocab)
# Define a collate function to handle variable-length sequences
def collate_batch(batch):
    # Separate texts and labels
    text_list, labels = zip(*batch)
    # Find the maximum sequence length in this batch
    max_length = max(len(seq) for seq in text_list)
    # Pad each sequence on the right with the <pad> index
    padded_texts = [
        F.pad(seq, (0, max_length - len(seq)), value=vocab['<pad>'])
        for seq in text_list
    ]
    # Stack tensors into a single batch
    text_tensor = torch.stack(padded_texts)
    label_tensor = torch.tensor(labels)
    return text_tensor, label_tensor
# Create data loader
batch_size = 2
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_batch, shuffle=True)
# Get a batch
batch = next(iter(dataloader))
texts, labels = batch
print(f"Batch shape: {texts.shape}")
print(f"Labels: {labels}")
Output (the second dimension is the length of the longest sequence in the batch, so it varies with shuffling):
Batch shape: torch.Size([2, 8])
Labels: tensor([0, 1])
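PyTorch also ships a helper that performs the same padding in one call; a sketch of an equivalent collate function built on torch.nn.utils.rnn.pad_sequence:
from torch.nn.utils.rnn import pad_sequence
def collate_batch_v2(batch):
    text_list, labels = zip(*batch)
    # pad_sequence pads every sequence to the length of the longest one in the batch
    text_tensor = pad_sequence(text_list, batch_first=True, padding_value=vocab['<pad>'])
    return text_tensor, torch.tensor(labels)
alt_dataloader = DataLoader(dataset, batch_size=2, collate_fn=collate_batch_v2, shuffle=True)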
Practical Example: Sentiment Analysis
Let's put everything together in a practical sentiment analysis model:
import torch.nn as nn
import torch.optim as optim
# Define a simple RNN model for sentiment analysis
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, text):
        # text shape: [batch size, sequence length]
        embedded = self.embedding(text)
        # embedded shape: [batch size, sequence length, embedding dim]
        output, hidden = self.rnn(embedded)
        # output shape: [batch size, sequence length, hidden dim]
        # hidden shape: [1, batch size, hidden dim]
        # Use the hidden state from the last time step for classification
        return self.fc(hidden.squeeze(0))
# Initialize model
vocab_size = len(vocab)
embedding_dim = 32
hidden_dim = 64
output_dim = 2 # Binary classification (positive/negative)
model = SimpleRNN(vocab_size, embedding_dim, hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Training function
def train(model, dataloader, optimizer, criterion, epochs=5):
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        epoch_acc = 0
        for texts, labels in dataloader:
            optimizer.zero_grad()
            # Forward pass
            predictions = model(texts)
            # Calculate loss
            loss = criterion(predictions, labels)
            # Backpropagation
            loss.backward()
            optimizer.step()
            # Calculate accuracy
            predictions_class = torch.argmax(predictions, dim=1)
            correct = (predictions_class == labels).float().sum()
            accuracy = correct / len(labels)
            epoch_loss += loss.item()
            epoch_acc += accuracy.item()
        # Print progress at the end of each epoch
        print(f"Epoch {epoch+1}/{epochs}")
        print(f"Loss: {epoch_loss / len(dataloader):.3f} | Accuracy: {epoch_acc / len(dataloader):.3f}")
# Train the model
print("Training the model...")
train(model, dataloader, optimizer, criterion)
# Test the model
def predict_sentiment(model, text, tokenizer, vocab):
    model.eval()
    # Tokenize and convert to indices
    tokens = tokenizer(text.lower())
    indices = [vocab[token] for token in tokens]
    # Convert to tensor and add a batch dimension
    tensor = torch.LongTensor(indices).unsqueeze(0)
    # Make prediction (no gradients needed at inference time)
    with torch.no_grad():
        prediction = model(tensor)
    prediction_class = torch.argmax(prediction, dim=1).item()
    return "Positive" if prediction_class == 1 else "Negative"
# Example predictions
test_texts = [
    "I love using PyTorch for NLP tasks",
    "This tutorial is too complicated"
]
for text in test_texts:
    sentiment = predict_sentiment(model, text, tokenizer, vocab)
    print(f"Text: '{text}' | Sentiment: {sentiment}")
Output (your exact numbers will differ from run to run):
Training the model...
Epoch 1/5
Loss: 0.693 | Accuracy: 0.500
Epoch 2/5
Loss: 0.691 | Accuracy: 0.500
Epoch 3/5
Loss: 0.685 | Accuracy: 0.500
Epoch 4/5
Loss: 0.676 | Accuracy: 0.500
Epoch 5/5
Loss: 0.661 | Accuracy: 0.500
Text: 'I love using PyTorch for NLP tasks' | Sentiment: Positive
Text: 'This tutorial is too complicated' | Sentiment: Negative
Note: In this simple example with limited data, the model might not learn effectively. In a real-world scenario, you'd need more data and likely more training epochs.
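A further refinement, shown only as a sketch here: because padded positions are fed to the RNN exactly like real tokens, you can wrap the embedded batch with pack_padded_sequence so the recurrence skips the padding. The helper below is hypothetical and assumes you track each sequence's original (unpadded) length in the collate function:
from torch.nn.utils.rnn import pack_padded_sequence
def forward_packed(model, texts, lengths):
    # texts: [batch, max_len] index tensor; lengths: original lengths before padding
    embedded = model.embedding(texts)
    packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
    _, hidden = model.rnn(packed)  # hidden now reflects the last real token of each sequence
    return model.fc(hidden.squeeze(0))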
Using Pre-trained Embeddings
For better results, you can use pre-trained word embeddings like GloVe (note that the 6B vectors are a several-hundred-megabyte download the first time you use them):
from torchtext.vocab import GloVe
# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)
# Get embedding for a word
word = "machine"
if word in glove.stoi:
    embedding = glove[word]
    print(f"Embedding for '{word}' (first 10 dimensions): {embedding[:10]}")
else:
    print(f"Word '{word}' not found in GloVe vocabulary")
Output:
Embedding for 'machine' (first 10 dimensions): tensor([-0.0602, -0.0736, -0.1646, 0.0443, 0.3839, -0.1559, 0.4550, 0.6735,
-0.0120, 0.5945])
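To put these vectors to work in the model above, you can initialize the embedding layer from the GloVe rows that match our vocabulary; a hedged sketch (tokens missing from GloVe simply get zero vectors):
# Look up a GloVe vector for every token in our vocabulary
pretrained = glove.get_vecs_by_tokens(vocab.get_itos())  # shape: [len(vocab), 100]
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=vocab['<pad>'])
print(embedding.weight.shape)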
Summary
In this tutorial, we've covered the essential steps for processing text data in PyTorch:
- Tokenization: Breaking text into tokens using torchtext tokenizers or spaCy
- Vocabulary Building: Creating word-to-index mappings
- Numericalization: Converting tokens to indices
- Embeddings: Representing words as dense vectors
- Batch Processing: Handling variable-length sequences with padding
- Building Models: Creating and training NLP models with PyTorch
These fundamental techniques form the backbone of most NLP projects. As you advance, you can explore more sophisticated methods like subword tokenization (BPE, WordPiece), contextual embeddings (BERT, GPT), and specialized architectures for different NLP tasks.
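As a first taste of subword tokenization (see also the Advanced Tokenization exercise below), here is a minimal, hedged sketch that trains a tiny BPE tokenizer on our toy sentences with the Hugging Face tokenizers library (a separate install: pip install tokenizers):
# Train a small BPE tokenizer on the four sample sentences from earlier
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
bpe = Tokenizer(BPE(unk_token='<unk>'))
bpe.pre_tokenizer = Whitespace()
bpe.train_from_iterator(train_data, BpeTrainer(special_tokens=['<unk>', '<pad>'], vocab_size=200))
encoding = bpe.encode("PyTorch provides deep learning tools.")
print(encoding.tokens)  # subword pieces
print(encoding.ids)     # their integer ids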
Exercises
- Vocabulary Expansion: Modify the code to include special tokens like <sos> (start of sentence) and <eos> (end of sentence).
- Advanced Tokenization: Implement a subword tokenization strategy using the tokenizers library.
- Sequence Length Analysis: Write a function to analyze the distribution of sequence lengths in a dataset.
- Data Augmentation: Implement simple text augmentation techniques like random word deletion or synonym replacement.
- Custom Dataset: Create a custom dataset from a source like news articles or book reviews and process it using the techniques learned in this tutorial.
Additional Resources
- PyTorch documentation for torchtext
- spaCy documentation
- Hugging Face Tokenizers library
- GloVe: Global Vectors for Word Representation
- The Illustrated Word2vec
Remember, text processing is an essential first step in any NLP project. Mastering these techniques will provide a solid foundation for more advanced natural language processing tasks.