PyTorch TorchText
TorchText is a powerful library in the PyTorch ecosystem designed to make text processing and data handling easier for Natural Language Processing (NLP) tasks. Whether you're building sentiment analysis models, machine translation systems, or text classification solutions, TorchText provides the tools to streamline your data preparation workflow.
Introduction to TorchText
TorchText provides utilities for creating datasets that can be easily iterated through for model training, as well as popular text datasets. It also offers common text processing functions that prepare raw text for modeling.
TorchText has undergone significant changes in recent versions. This guide focuses on TorchText 0.12.0+, which introduced a new, more efficient data processing pipeline.
Installing TorchText
Before we begin, let's make sure you have TorchText installed:
pip install torchtext
You'll also need PyTorch installed. If you don't have it yet:
pip install torch
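To confirm that both packages are available, you can check the installed versions (the exact version numbers will depend on your environment; this guide assumes TorchText 0.12.0 or newer):
import torch
import torchtext

# Print the installed versions as a quick sanity check
print(torch.__version__)
print(torchtext.__version__)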
Key Components of TorchText
TorchText consists of several important components:
- Datasets: Pre-built datasets and tools to create custom datasets
- Vocab: Tools for building vocabulary from text data
- Transforms: Functions to process and transform text data
- Data: Utilities for data loading and batching
Let's explore each of these components with practical examples.
Building a Text Processing Pipeline
Step 1: Loading Data
Let's start by creating a simple text classification pipeline. We'll use a dataset of movie reviews for sentiment analysis:
import torch
from torchtext.datasets import IMDB
# Load the IMDB dataset
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')
# Let's look at the first example
first_example = next(iter(train_iter))
print(f"Label: {first_example[0]}")
print(f"Text: {first_example[1][:200]}...") # Print first 200 chars
Output:
Label: pos
Text: If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small adventures of 8 year old kids in the 3rd grade. With the main character Stan, who has an insane but well thought...
Step 2: Creating Tokenizers and Vocabularies
Next, we'll need to tokenize our text and build a vocabulary:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, Iterator, List
# Define the tokenizer
tokenizer = get_tokenizer('basic_english')
# Function to yield tokens
def yield_tokens(data_iter: Iterable) -> Iterator[List[str]]:
    for _, text in data_iter:
        yield tokenizer(text)
# Build vocabulary
train_iter = IMDB(split='train')
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=["<unk>"],
    min_freq=10
)
vocab.set_default_index(vocab["<unk>"])
# Vocabulary size
print(f"Vocabulary size: {len(vocab)}")
# Let's check some word indices
print(f"Index for 'movie': {vocab['movie']}")
print(f"Index for 'excellent': {vocab['excellent']}")
print(f"Index for 'terrible': {vocab['terrible']}")
Output:
Vocabulary size: 25002
Index for 'movie': 128
Index for 'excellent': 980
Index for 'terrible': 1290
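Because we called set_default_index, any token that never made it into the vocabulary (for example, one pruned by min_freq) maps to the <unk> index instead of raising an error. A quick check (the exact indices depend on how the vocabulary was built):
# Out-of-vocabulary tokens fall back to the default index, i.e. "<unk>"
print(f"Index for '<unk>': {vocab['<unk>']}")
print(f"Index for a made-up word: {vocab['qwertyuiopasdf']}")

# Map indices back to tokens
print(vocab.lookup_tokens([vocab['movie'], vocab['excellent']]))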
Step 3: Text Processing Pipeline
Now, let's create the complete text processing pipeline:
def text_pipeline(text: str) -> List[int]:
    """Convert text to a list of token indices."""
    return [vocab[token] for token in tokenizer(text)]

def label_pipeline(label) -> int:
    """Convert a label to an integer (0 = negative, 1 = positive)."""
    # Note: depending on your TorchText version, IMDB may yield string labels
    # ('pos'/'neg') or integer labels (1 = neg, 2 = pos); adjust this mapping if needed.
    return 1 if label == 'pos' else 0
# Let's test our pipeline
text = "This movie was excellent"
processed_text = text_pipeline(text)
print(f"Original text: '{text}'")
print(f"Processed text: {processed_text}")
Output:
Original text: 'This movie was excellent'
Processed text: [21, 128, 10, 980]
Step 4: Creating Data Batches
To efficiently train our model, we need to batch our data:
from torch.utils.data import DataLoader
from typing import Tuple
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
    # Stack labels and pad every text sequence to the length of the longest one in the batch
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Note: index 0 is "<unk>" in our vocabulary; for real projects, adding a dedicated
    # "<pad>" special token and padding with its index is cleaner.
    text_list = torch.nn.utils.rnn.pad_sequence(text_list,
                                                batch_first=True,
                                                padding_value=0)
    return label_list.to(device), text_list.to(device)
# Create DataLoader
train_iter = IMDB(split='train')
batch_size = 16
train_dataloader = DataLoader(
    list(train_iter),
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_batch
)
# Let's check a batch
first_batch = next(iter(train_dataloader))
labels, texts = first_batch
print(f"Batch of labels shape: {labels.shape}")
print(f"Batch of texts shape: {texts.shape}")
Output:
Batch of labels shape: torch.Size([16])
Batch of texts shape: torch.Size([16, 842])
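Note that the text tensor is padded to the longest review in the batch (842 tokens here), so a single long review inflates the whole batch. One common mitigation, shown as a rough sketch on top of the pipeline above, is to sort the examples by token length before batching so that reviews of similar length end up together:
# Sort examples by tokenized length so per-batch padding stays small
train_list = list(IMDB(split='train'))
train_list.sort(key=lambda item: len(tokenizer(item[1])))

sorted_dataloader = DataLoader(
    train_list,
    batch_size=batch_size,
    shuffle=False,  # keep the length ordering
    collate_fn=collate_batch
)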
Building a Simple Text Classification Model
Now that we have our data pipeline, let's build a simple text classification model:
import torch.nn as nn
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text):
        # Average the token embeddings to get one vector per sentence
        embedded = self.embedding(text)
        embedded = embedded.mean(dim=1)
        return self.fc(embedded)
# Model parameters
vocab_size = len(vocab)
embed_dim = 64
num_class = 2
# Initialize model
model = TextClassificationModel(vocab_size, embed_dim, num_class).to(device)
print(model)
Output:
TextClassificationModel(
  (embedding): Embedding(25002, 64)
  (fc): Linear(in_features=64, out_features=2, bias=True)
)
Training the Model
Let's train our model:
import time
def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 100
    start_time = time.time()
    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(f'| epoch {epoch} | {idx}/{len(dataloader)} batches '
                  f'| accuracy: {total_acc/total_count:.3f}')
            total_acc, total_count = 0, 0
            start_time = time.time()
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
# Train for 2 epochs (just as an example)
num_epochs = 2
for epoch in range(1, num_epochs + 1):
    train_iter = IMDB(split='train')
    train_dataloader = DataLoader(
        list(train_iter),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_batch
    )
    print(f'Epoch {epoch}')
    print('-' * 59)
    train(train_dataloader)
    scheduler.step()
This code demonstrates the training loop for our text classification model.
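A matching evaluation pass on the test split follows the same pattern, just without gradient updates. A minimal sketch, reusing collate_batch from above:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for label, text in dataloader:
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

test_dataloader = DataLoader(
    list(IMDB(split='test')),
    batch_size=batch_size,
    shuffle=False,
    collate_fn=collate_batch
)
print(f"Test accuracy: {evaluate(test_dataloader):.3f}")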
Working with Pretrained Word Embeddings
TorchText also makes it easy to work with pretrained word embeddings:
from torchtext.vocab import GloVe
# Load GloVe embeddings
glove = GloVe(name='6B', dim=100)
# Get the embedding for a word
word = "movie"
if word in glove.stoi:
    word_index = glove.stoi[word]
    embedding = glove.vectors[word_index]
    print(f"Embedding for '{word}' (first 10 dimensions):")
    print(embedding[:10])
Output:
Embedding for 'movie' (first 10 dimensions):
tensor([ 0.3214, -0.1688, 0.1518, -0.4629, 0.1204, 0.1473, 0.1530, 0.0927,
-0.1215, 0.7950])
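A common way to use these vectors is to copy them into the model's embedding layer for every word in our vocabulary. A short sketch, assuming the vocab and TextClassificationModel defined earlier (the embedding dimension must match GloVe's 100 dimensions rather than the 64 used before):
# Look up a GloVe vector for every token in our vocabulary (unknown tokens get zeros),
# then copy the resulting matrix into the model's embedding layer
pretrained = glove.get_vecs_by_tokens(vocab.get_itos())
glove_model = TextClassificationModel(len(vocab), glove.dim, num_class=2).to(device)
glove_model.embedding.weight.data.copy_(pretrained)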
Advanced TorchText Features
Custom Datasets
Let's create a custom text dataset:
from torch.utils.data import Dataset
class CustomTextDataset(Dataset):
    def __init__(self, texts, labels, text_transform=None, label_transform=None):
        self.texts = texts
        self.labels = labels
        self.text_transform = text_transform
        self.label_transform = label_transform

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        if self.text_transform:
            text = self.text_transform(text)
        if self.label_transform:
            label = self.label_transform(label)
        return label, text
# Example data
example_texts = [
    "I loved this movie!",
    "This was a terrible waste of time.",
    "Great acting and beautiful scenery.",
    "Boring plot and bad acting."
]
example_labels = ["pos", "neg", "pos", "neg"]
# Create dataset
custom_dataset = CustomTextDataset(
    example_texts,
    example_labels,
    text_transform=text_pipeline,
    label_transform=label_pipeline
)

# Create dataloader
# The dataset already applies text_pipeline and label_pipeline in __getitem__, so the
# collate function only needs to convert the processed examples to tensors and pad them
# (reusing collate_batch here would run the pipelines a second time and fail).
def collate_processed_batch(batch):
    labels = torch.tensor([label for label, _ in batch], dtype=torch.int64)
    texts = [torch.tensor(text, dtype=torch.int64) for _, text in batch]
    texts = torch.nn.utils.rnn.pad_sequence(texts, batch_first=True, padding_value=0)
    return labels.to(device), texts.to(device)

custom_dataloader = DataLoader(
    custom_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collate_processed_batch
)
# Print a batch
sample_batch = next(iter(custom_dataloader))
print("Labels:", sample_batch[0])
print("Text indices:", sample_batch[1])
Using Transformations
TorchText provides various text transformations that can be chained together:
from torchtext.transforms import ToTensor, VocabTransform, Sequential
# Define transformations
# torchtext's Sequential chains nn.Module transforms; the basic_english tokenizer is a
# plain Python callable, so we apply it separately and chain the remaining steps
text_transform = Sequential(
    VocabTransform(vocab),
    ToTensor(padding_value=0)
)
# Apply the transformation to tokenized text
sample_text = "This is a sample text for transformation"
transformed = text_transform(tokenizer(sample_text))
print(f"Original: '{sample_text}'")
print(f"Transformed: {transformed}")
Real-World Application: News Classification
Let's build a simplified version of a news article classifier:
from torchtext.datasets import AG_NEWS
# Load AG_NEWS dataset
train_iter = AG_NEWS(split='train')
# Check one example
example = next(iter(train_iter))
print(f"Class: {example[0]}, Text: {example[1][:100]}...")
# Class mapping
AG_NEWS_LABEL = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
# Rebuild our pipeline for this dataset
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
# Create vocabulary
train_iter = AG_NEWS(split='train')
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=["<unk>"],
    min_freq=5
)
vocab.set_default_index(vocab["<unk>"])
# Create text transform pipeline
def text_transform(text):
    return [vocab[token] for token in tokenizer(text)]

def label_transform(label):
    return int(label) - 1  # AG_NEWS labels are 1-indexed
# Let's prepare our model for this 4-class classification task
model = TextClassificationModel(
    vocab_size=len(vocab),
    embed_dim=64,
    num_class=4
).to(device)
# The model can be trained with the same training loop as above, after swapping
# text_pipeline/label_pipeline in the collate function for the AG_NEWS transforms
This example demonstrates how to adapt our pipeline for a different dataset with multiple classes.
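Once the model is trained, predictions can be mapped back to human-readable class names with the AG_NEWS_LABEL dictionary. A small helper, assuming the 4-class model and transforms defined just above:
def predict_news(text: str) -> str:
    model.eval()
    with torch.no_grad():
        # Shape [1, seq_len] so the model sees a batch of one article
        ids = torch.tensor(text_transform(text), dtype=torch.int64).unsqueeze(0).to(device)
        logits = model(ids)
        return AG_NEWS_LABEL[logits.argmax(1).item() + 1]  # shift back to 1-indexed labels

print(predict_news("The stock market rallied after the quarterly earnings report."))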
Summary
In this guide, we've explored TorchText, PyTorch's powerful library for text data processing. We covered:
- Installing and setting up TorchText
- Creating text processing pipelines with tokenizers and vocabularies
- Building and training a text classification model
- Working with pre-trained word embeddings
- Creating custom datasets
- Using transformations for text processing
- A real-world application with news classification
TorchText simplifies many of the complex tasks involved in preparing text data for deep learning models, allowing you to focus more on model architecture and less on data wrangling.
Exercises
- Basic: Modify the sentiment analysis example to work with the Yelp review dataset.
- Intermediate: Implement a text processing pipeline that uses character-level tokenization instead of word-level tokenization.
- Advanced: Extend the news classification example to include evaluation metrics like precision, recall, and F1-score.
- Challenge: Create a bidirectional LSTM model for text classification using TorchText data processing.
By mastering TorchText, you'll be well-equipped to handle a wide range of NLP tasks efficiently and effectively with PyTorch.