
PyTorch TorchText

TorchText is a powerful library in the PyTorch ecosystem designed to make text processing and data handling easier for Natural Language Processing (NLP) tasks. Whether you're building sentiment analysis models, machine translation systems, or text classification solutions, TorchText provides the tools to streamline your data preparation workflow.

Introduction to TorchText

TorchText provides utilities for creating datasets that can be easily iterated through for model training, as well as popular text datasets. It also offers common text processing functions that prepare raw text for modeling.

Note: TorchText's API has changed significantly across recent releases. This guide targets TorchText 0.12.0+, which removed the legacy Field and BucketIterator classes in favor of a new, more efficient data processing pipeline.

Installing TorchText

Before we begin, let's make sure you have TorchText installed:

bash
pip install torchtext

You'll also need PyTorch installed. If you don't have it yet:

bash
pip install torch
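
Because the API differs between releases, it is worth confirming which versions you ended up with. A quick check (the printed versions will depend on your environment):

python
import torch
import torchtext

# Print the installed versions to confirm compatibility
print(torch.__version__)
print(torchtext.__version__)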

Key Components of TorchText

TorchText consists of several important components:

  1. Datasets: Pre-built datasets and tools to create custom datasets
  2. Vocab: Tools for building vocabulary from text data
  3. Transforms: Functions to process and transform text data
  4. Data: Utilities for data loading and batching

Let's explore each of these components with practical examples.

Building a Text Processing Pipeline

Step 1: Loading Data

Let's start by creating a simple text classification pipeline. We'll use a dataset of movie reviews for sentiment analysis:

python
import torch
from torchtext.datasets import IMDB

# Load the IMDB dataset
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

# Let's look at the first example
first_example = next(iter(train_iter))
print(f"Label: {first_example[0]}")
print(f"Text: {first_example[1][:200]}...") # Print first 200 chars

Output:

Label: pos
Text: If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small adventures of 8 year old kids in the 3rd grade. With the main character Stan, who has an insane but well thought...

Step 2: Creating Tokenizers and Vocabularies

Next, we'll need to tokenize our text and build a vocabulary:

python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import List, Iterable

# Define the tokenizer
tokenizer = get_tokenizer('basic_english')

# Function to yield tokens
def yield_tokens(data_iter: Iterable) -> List[str]:
    for _, text in data_iter:
        yield tokenizer(text)

# Build vocabulary
train_iter = IMDB(split='train')
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=["<unk>"],
    min_freq=10
)
vocab.set_default_index(vocab["<unk>"])

# Vocabulary size
print(f"Vocabulary size: {len(vocab)}")

# Let's check some word indices
print(f"Index for 'movie': {vocab['movie']}")
print(f"Index for 'excellent': {vocab['excellent']}")
print(f"Index for 'terrible': {vocab['terrible']}")

Output:

Vocabulary size: 25002
Index for 'movie': 128
Index for 'excellent': 980
Index for 'terrible': 1290
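
The Vocab object also supports lookups in the reverse direction, which is handy for checking what a sequence of indices decodes to. A small sketch (the exact indices depend on how the vocabulary was built):

python
# Map indices back to tokens, and token lists to indices
print(vocab.lookup_tokens([128, 980]))            # e.g. ['movie', 'excellent']
print(vocab.lookup_indices(['movie', 'excellent']))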

Step 3: Text Processing Pipeline

Now, let's create the complete text processing pipeline:

python
def text_pipeline(text: str) -> List[int]:
    """Convert text to a list of integers."""
    return [vocab[token] for token in tokenizer(text)]

def label_pipeline(label: str) -> int:
    """Convert label to integer."""
    return 1 if label == 'pos' else 0

# Let's test our pipeline
text = "This movie was excellent!"
processed_text = text_pipeline(text)
print(f"Original text: '{text}'")
print(f"Processed text: {processed_text}")

Output:

Original text: 'This movie was excellent!'
Processed text: [21, 128, 10, 980]
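
Because we called set_default_index earlier, tokens that never made it into the vocabulary are mapped to the <unk> index instead of raising an error. For example (indices will vary with your vocabulary):

python
# Made-up tokens are not in the vocabulary, so they map to the <unk> index (0)
print(text_pipeline("qwertyasdf movie"))  # e.g. [0, 128]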

Step 4: Creating Data Batches

To efficiently train our model, we need to batch our data:

python
from torch.utils.data import DataLoader
from typing import Tuple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)

    # Stack and pad sequences
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.nn.utils.rnn.pad_sequence(text_list,
                                                batch_first=True,
                                                padding_value=0)
    return label_list.to(device), text_list.to(device)

# Create DataLoader
train_iter = IMDB(split='train')
batch_size = 16
train_dataloader = DataLoader(
    list(train_iter),
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_batch
)

# Let's check a batch
first_batch = next(iter(train_dataloader))
labels, texts = first_batch
print(f"Batch of labels shape: {labels.shape}")
print(f"Batch of texts shape: {texts.shape}")

Output:

Batch of labels shape: torch.Size([16])
Batch of texts shape: torch.Size([16, 842])

Building a Simple Text Classification Model

Now that we have our data pipeline, let's build a simple text classification model:

python
import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text):
        # Average the token embeddings to get one vector per review
        # (note: padding tokens are included in this simple mean)
        embedded = self.embedding(text)
        embedded = embedded.mean(dim=1)
        return self.fc(embedded)

# Model parameters
vocab_size = len(vocab)
embed_dim = 64
num_class = 2

# Initialize model
model = TextClassificationModel(vocab_size, embed_dim, num_class).to(device)
print(model)

Output:

TextClassificationModel(
  (embedding): Embedding(25002, 64)
  (fc): Linear(in_features=64, out_features=2, bias=True)
)

Training the Model

Let's train our model:

python
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 100
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(f'| epoch {epoch} | {idx}/{len(dataloader)} batches '
                  f'| accuracy: {total_acc/total_count:.3f} '
                  f'| {elapsed:.1f}s elapsed')
            total_acc, total_count = 0, 0
            start_time = time.time()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

# Train for 2 epochs (just as an example)
num_epochs = 2

for epoch in range(1, num_epochs + 1):
    train_iter = IMDB(split='train')
    train_dataloader = DataLoader(
        list(train_iter),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_batch
    )

    print(f'Epoch {epoch}')
    print('-' * 59)
    train(train_dataloader)
    scheduler.step()

This code demonstrates the training loop for our text classification model.
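
The loop above only tracks training accuracy. A minimal evaluation pass over the held-out test split could look like the following sketch, which reuses the same collate_batch function (the accuracy you see will depend on how long the model was trained):

python
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for label, text in dataloader:
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

test_iter = IMDB(split='test')
test_dataloader = DataLoader(
    list(test_iter),
    batch_size=batch_size,
    shuffle=False,
    collate_fn=collate_batch
)
print(f"Test accuracy: {evaluate(test_dataloader):.3f}")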

Working with Pretrained Word Embeddings

TorchText also makes it easy to work with pretrained word embeddings:

python
from torchtext.vocab import GloVe

# Load GloVe embeddings
glove = GloVe(name='6B', dim=100)

# Get the embedding for a word
word = "movie"
if word in glove.stoi:
    word_index = glove.stoi[word]
    embedding = glove.vectors[word_index]
    print(f"Embedding for '{word}' (first 10 dimensions):")
    print(embedding[:10])

Output:

Embedding for 'movie' (first 10 dimensions):
tensor([ 0.3214, -0.1688,  0.1518, -0.4629,  0.1204,  0.1473,  0.1530,  0.0927,
        -0.1215,  0.7950])
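
A common next step is to initialize a model's embedding layer from these pretrained vectors instead of random weights. A hedged sketch, aligning GloVe vectors with the vocabulary built earlier (tokens missing from GloVe get zero vectors, and the model's embed_dim would need to match the GloVe dimension, 100 here):

python
# Build a weight matrix whose rows follow our vocabulary order;
# tokens not found in GloVe receive all-zero rows.
pretrained_vectors = glove.get_vecs_by_tokens(vocab.get_itos())

# Create an embedding layer initialized from the GloVe weights
pretrained_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
print(pretrained_embedding.weight.shape)  # (len(vocab), 100)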

Advanced TorchText Features

Custom Datasets

Let's create a custom text dataset:

python
from torch.utils.data import Dataset

class CustomTextDataset(Dataset):
    def __init__(self, texts, labels, text_transform=None, label_transform=None):
        self.texts = texts
        self.labels = labels
        self.text_transform = text_transform
        self.label_transform = label_transform

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        if self.text_transform:
            text = self.text_transform(text)
        if self.label_transform:
            label = self.label_transform(label)

        return label, text

# Example data
example_texts = [
    "I loved this movie!",
    "This was a terrible waste of time.",
    "Great acting and beautiful scenery.",
    "Boring plot and bad acting."
]
example_labels = ["pos", "neg", "pos", "neg"]

# Create dataset
custom_dataset = CustomTextDataset(
    example_texts,
    example_labels,
    text_transform=text_pipeline,
    label_transform=label_pipeline
)

# The dataset already applies the pipelines in __getitem__, so the collate
# function here only needs to convert the examples to tensors and pad them.
def collate_transformed(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(_label)
        text_list.append(torch.tensor(_text, dtype=torch.int64))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.nn.utils.rnn.pad_sequence(text_list,
                                                batch_first=True,
                                                padding_value=0)
    return label_list.to(device), text_list.to(device)

# Create dataloader
custom_dataloader = DataLoader(
    custom_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collate_transformed
)

# Print a batch
sample_batch = next(iter(custom_dataloader))
print("Labels:", sample_batch[0])
print("Text indices:", sample_batch[1])

Using Transformations

TorchText provides various text transformations that can be chained together:

python
from torchtext.transforms import ToTensor, VocabTransform, Sequential

# Define transformations. Sequential chains nn.Module-style transforms,
# so the basic_english tokenizer (a plain function) is applied separately first.
text_transform = Sequential(
    VocabTransform(vocab),
    ToTensor(padding_value=0)
)

# Apply the transformation to tokenized text
sample_text = "This is a sample text for transformation"
transformed = text_transform(tokenizer(sample_text))
print(f"Original: '{sample_text}'")
print(f"Transformed: {transformed}")

Real-World Application: News Classification

Let's build a simplified version of a news article classifier:

python
from torchtext.datasets import AG_NEWS

# Load AG_NEWS dataset
train_iter = AG_NEWS(split='train')

# Check one example
example = next(iter(train_iter))
print(f"Class: {example[0]}, Text: {example[1][:100]}...")

# Class mapping
AG_NEWS_LABEL = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Rebuild our pipeline for this dataset
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Create vocabulary
train_iter = AG_NEWS(split='train')
vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    specials=["<unk>"],
    min_freq=5
)
vocab.set_default_index(vocab["<unk>"])

# Create text transform pipeline
def text_transform(text):
    return [vocab[token] for token in tokenizer(text)]

def label_transform(label):
    return int(label) - 1  # AG_NEWS labels are 1-indexed

# Let's prepare our model for this 4-class classification task
model = TextClassificationModel(
    vocab_size=len(vocab),
    embed_dim=64,
    num_class=4
).to(device)

# Model could be trained using the same training loop as above

This example demonstrates how to adapt our pipeline for a different dataset with multiple classes.
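
Once trained, the same pipeline can be reused at inference time to classify a new headline. A minimal sketch (the prediction is only meaningful after the model has actually been trained):

python
def predict(text):
    model.eval()
    with torch.no_grad():
        indices = torch.tensor(text_transform(text), dtype=torch.int64)
        output = model(indices.unsqueeze(0).to(device))  # add a batch dimension
        return AG_NEWS_LABEL[output.argmax(1).item() + 1]

print(predict("The championship match went to extra time after a late goal"))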

Summary

In this guide, we've explored TorchText, PyTorch's powerful library for text data processing. We covered:

  • Installing and setting up TorchText
  • Creating text processing pipelines with tokenizers and vocabularies
  • Building and training a text classification model
  • Working with pre-trained word embeddings
  • Creating custom datasets
  • Using transformations for text processing
  • A real-world application with news classification

TorchText simplifies many of the complex tasks involved in preparing text data for deep learning models, allowing you to focus more on model architecture and less on data wrangling.

Additional Resources and Exercises

Exercises

  1. Basic: Modify the sentiment analysis example to work with the Yelp review dataset.
  2. Intermediate: Implement a text processing pipeline that uses character-level tokenization instead of word-level tokenization.
  3. Advanced: Extend the news classification example to include evaluation metrics like precision, recall, and F1-score.
  4. Challenge: Create a bidirectional LSTM model for text classification using TorchText data processing.

By mastering TorchText, you'll be well-equipped to handle a wide range of NLP tasks efficiently and effectively with PyTorch.


