PyTorch Text Processing
Text data requires special processing before it can be used in neural networks. In this tutorial, we'll explore how to prepare text data for natural language processing (NLP) tasks using PyTorch's built-in tools and libraries.
Introduction
Unlike numerical or image data, text cannot be fed directly into neural networks. We need to convert text into numerical representations that machine learning models can understand. PyTorch provides several utilities to help with this process through its torchtext library.
By the end of this tutorial, you'll understand:
- How to tokenize text
- How to build vocabularies
- How to create numerical representations of text
- How to use pre-built datasets
- How to create data iterators for training
Prerequisites
Before we get started, make sure you have the following packages installed:
pip install torch torchtext spacy
python -m spacy download en_core_web_sm
Basic Text Processing Pipeline
Let's break down the text processing pipeline into simple steps:
- Tokenization: Breaking text into words or subwords
- Building a vocabulary: Assigning unique indices to tokens
- Numericalization: Converting tokens into numbers
- Creating embeddings: Converting indices into dense vectors
Let's implement each step:
1. Tokenization
Tokenization is the process of breaking text into individual units (tokens) such as words, subwords, or characters.
import torch
import torchtext
from torchtext.data.utils import get_tokenizer
# Create a tokenizer
tokenizer = get_tokenizer('basic_english')
# Sample text
text = "PyTorch is an open source machine learning framework."
# Tokenize the text
tokens = tokenizer(text)
print(tokens)
Output:
['pytorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework', '.']
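For comparison, a plain whitespace split keeps the original casing and leaves punctuation attached to the preceding word, which is exactly the kind of inconsistency basic_english normalizes away:
# Naive baseline: split on whitespace only (no lowercasing, punctuation stays attached)
print(text.split())
Output:
['PyTorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework.']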
You can also use spaCy for more advanced tokenization:
import spacy
nlp = spacy.load('en_core_web_sm')
def spacy_tokenizer(text):
    return [token.text for token in nlp(text)]
tokens = spacy_tokenizer(text)
print(tokens)
Output:
['PyTorch', 'is', 'an', 'open', 'source', 'machine', 'learning', 'framework', '.']
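In recent torchtext versions you can also ask get_tokenizer to wrap spaCy for you; a minimal sketch, assuming the en_core_web_sm model from the prerequisites is installed (it should produce the same tokens as the custom function above):
# Let torchtext wrap the spaCy tokenizer directly
spacy_wrapped = get_tokenizer('spacy', language='en_core_web_sm')
print(spacy_wrapped(text))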
2. Building a Vocabulary
A vocabulary maps tokens to unique indices. This is essential for the model to understand the data.
from torchtext.vocab import build_vocab_from_iterator
# Sample dataset of sentences
train_data = [
    "PyTorch is an open source machine learning framework.",
    "It is based on the Torch library.",
    "PyTorch provides a wide range of algorithms for deep learning.",
    "It is widely used for applications such as natural language processing."
]
# Tokenize the dataset
tokenized_data = [tokenizer(sentence) for sentence in train_data]
# Build vocabulary
vocab = build_vocab_from_iterator(tokenized_data, specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>']) # Set default index for unknown words
# Print vocabulary
print(f"Vocabulary size: {len(vocab)}")
print(f"Token 'pytorch' has index: {vocab['pytorch']}")
print(f"Token 'learning' has index: {vocab['learning']}")
Output (exact indices depend on token frequencies and your torchtext version):
Vocabulary size: 33
Token 'pytorch' has index: 2
Token 'learning' has index: 8
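The vocabulary also supports the reverse mapping, which is useful for inspecting what the model actually sees; a quick sketch using the Vocab object's built-in lookup methods:
# Map indices back to tokens
itos = vocab.get_itos()  # list where position i holds the token with index i
print(itos[:4])          # the specials come first: ['<unk>', '<pad>', ...]
print(vocab.lookup_tokens([vocab['pytorch'], vocab['learning']]))  # ['pytorch', 'learning']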
3. Numericalization
Now let's convert our tokens into numerical indices:
# Convert tokens to indices
def text_to_indices(text, tokenizer, vocab):
    tokens = tokenizer(text)
    return [vocab[token] for token in tokens]
# Process a sample sentence
sample = "PyTorch provides deep learning tools."
indices = text_to_indices(sample, tokenizer, vocab)
print(f"Original text: '{sample}'")
print(f"Indices: {indices}")
# Some words might not be in the vocabulary (will be mapped to <unk>)
print(f"Index for 'tools' (unknown word): {vocab['tools']}")
Output:
Original text: 'PyTorch provides deep learning tools.'
Indices: a list of six indices, one per token; only 'tools' is missing from the vocabulary, so it falls back to <unk> (index 0)
Index for 'tools' (unknown word): 0
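Since the Vocab object is callable on a list of tokens, the same conversion can also be written as a one-liner; a small sketch equivalent to text_to_indices above:
# Equivalent shorthand: calling the vocab on a token list returns the index list
indices = vocab(tokenizer(sample))
print(indices)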
4. Creating Embeddings
Embeddings convert indices into dense vectors that capture semantic meaning:
# Create an embedding layer
embedding_dim = 10 # Dimension of the embedding vectors
embedding = torch.nn.Embedding(len(vocab), embedding_dim)
# Convert indices to embeddings
indices_tensor = torch.LongTensor(indices)
embedded_text = embedding(indices_tensor)
print(f"Shape of embedded text: {embedded_text.shape}")
print(f"Embedding for the first token 'pytorch':\n{embedded_text[0]}")
Output (the vector values will differ on every run, since the embedding is randomly initialized):
Shape of embedded text: torch.Size([6, 10])
Embedding for the first token 'pytorch':
tensor([-0.9306, 0.2562, 1.0211, 0.0841, 1.2770, 0.8254, -0.5530, -0.4578,
1.0078, 0.2775], grad_fn=<SelectBackward0>)
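One practical refinement for the padding we introduce below: if you pass padding_idx to nn.Embedding, the <pad> vector is kept at zero and excluded from gradient updates. A small sketch:
# Keep the <pad> embedding fixed at zeros so padding tokens carry no signal
embedding = torch.nn.Embedding(len(vocab), embedding_dim, padding_idx=vocab['<pad>'])
print(embedding(torch.LongTensor([vocab['<pad>']])))  # prints an all-zero vector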
Using torchtext's Built-in Datasets
PyTorch provides built-in datasets for common NLP tasks:
from torchtext.datasets import IMDB
# Load the IMDB dataset
train_iter = IMDB(split='train')
# Get a sample
sample = next(iter(train_iter))
print(f"Label: {sample[0]}")
print(f"Text: {sample[1][:200]}...") # Print first 200 characters
Output:
Label: pos
Text: "I went and saw this movie last night after being coaxed to by a few friends. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able..."
Depending on your torchtext version, the label may be returned as an integer rather than a string, and the first sample you see may differ.
Creating Data Batches for Training
Let's create batches of data for training:
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
# Define a simple map-style dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, vocab):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.vocab = vocab
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Convert text to indices
        indices = text_to_indices(text, self.tokenizer, self.vocab)
        return torch.tensor(indices), label
# Sample data
texts = [
    "PyTorch is amazing for deep learning",
    "Natural language processing is fascinating",
    "Deep learning models require a lot of data",
    "Text processing is the first step in NLP"
]
labels = [1, 0, 1, 0] # Binary labels for demonstration
# Create dataset
dataset = TextDataset(texts, labels, tokenizer, vocab)
# Define a collate function to handle variable-length sequences
def collate_batch(batch):
    # Separate texts and labels
    text_list, labels = zip(*batch)
    # Find the maximum sequence length in this batch
    max_length = max(len(seq) for seq in text_list)
    # Pad each sequence on the right with the <pad> index
    padded_texts = [
        F.pad(seq, (0, max_length - len(seq)), value=vocab['<pad>'])
        for seq in text_list
    ]
    # Stack tensors into a single batch
    text_tensor = torch.stack(padded_texts)
    label_tensor = torch.tensor(labels)
    return text_tensor, label_tensor
# Create data loader
batch_size = 2
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate_batch, shuffle=True)
# Get a batch
batch = next(iter(dataloader))
texts, labels = batch
print(f"Batch shape: {texts.shape}")
print(f"Labels: {labels}")
Output (the second dimension is the length of the longest sequence in the batch, so it varies with shuffling):
Batch shape: torch.Size([2, 8])
Labels: tensor([0, 1])
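PyTorch also ships a helper that performs the same padding in one call; a sketch of an equivalent collate function built on torch.nn.utils.rnn.pad_sequence:
from torch.nn.utils.rnn import pad_sequence
def collate_batch_v2(batch):
    text_list, labels = zip(*batch)
    # pad_sequence pads every sequence to the length of the longest one in the batch
    text_tensor = pad_sequence(text_list, batch_first=True, padding_value=vocab['<pad>'])
    return text_tensor, torch.tensor(labels)
alt_dataloader = DataLoader(dataset, batch_size=2, collate_fn=collate_batch_v2, shuffle=True)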
Practical Example: Sentiment Analysis
Let's put everything together in a practical sentiment analysis model:
import torch.nn as nn
import torch.optim as optim
# Define a simple RNN model for sentiment analysis
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, text):
        # text shape: [batch size, sequence length]
        embedded = self.embedding(text)
        # embedded shape: [batch size, sequence length, embedding dim]
        output, hidden = self.rnn(embedded)
        # output shape: [batch size, sequence length, hidden dim]
        # hidden shape: [1, batch size, hidden dim]
        # Use the hidden state from the last time step for classification
        return self.fc(hidden.squeeze(0))
# Initialize model
vocab_size = len(vocab)
embedding_dim = 32
hidden_dim = 64
output_dim = 2 # Binary classification (positive/negative)
model = SimpleRNN(vocab_size, embedding_dim, hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Training function
def train(model, dataloader, optimizer, criterion, epochs=5):
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        epoch_acc = 0
        for texts, labels in dataloader:
            optimizer.zero_grad()
            # Forward pass
            predictions = model(texts)
            # Calculate loss
            loss = criterion(predictions, labels)
            # Backpropagation
            loss.backward()
            optimizer.step()
            # Calculate accuracy
            predictions_class = torch.argmax(predictions, dim=1)
            correct = (predictions_class == labels).float().sum()
            accuracy = correct / len(labels)
            epoch_loss += loss.item()
            epoch_acc += accuracy.item()
        # Print progress at the end of each epoch
        print(f"Epoch {epoch+1}/{epochs}")
        print(f"Loss: {epoch_loss / len(dataloader):.3f} | Accuracy: {epoch_acc / len(dataloader):.3f}")
# Train the model
print("Training the model...")
train(model, dataloader, optimizer, criterion)
# Test the model
def predict_sentiment(model, text, tokenizer, vocab):
    model.eval()
    # Tokenize and convert to indices
    tokens = tokenizer(text.lower())
    indices = [vocab[token] for token in tokens]
    # Convert to tensor and add a batch dimension
    tensor = torch.LongTensor(indices).unsqueeze(0)
    # Make prediction (no gradients needed at inference time)
    with torch.no_grad():
        prediction = model(tensor)
    prediction_class = torch.argmax(prediction, dim=1).item()
    return "Positive" if prediction_class == 1 else "Negative"
# Example predictions
test_texts = [
    "I love using PyTorch for NLP tasks",
    "This tutorial is too complicated"
]
for text in test_texts:
    sentiment = predict_sentiment(model, text, tokenizer, vocab)
    print(f"Text: '{text}' | Sentiment: {sentiment}")
Output (your exact numbers will differ from run to run):
Training the model...
Epoch 1/5
Loss: 0.693 | Accuracy: 0.500
Epoch 2/5
Loss: 0.691 | Accuracy: 0.500
Epoch 3/5
Loss: 0.685 | Accuracy: 0.500
Epoch 4/5
Loss: 0.676 | Accuracy: 0.500
Epoch 5/5
Loss: 0.661 | Accuracy: 0.500
Text: 'I love using PyTorch for NLP tasks' | Sentiment: Positive
Text: 'This tutorial is too complicated' | Sentiment: Negative
Note: In this simple example with limited data, the model might not learn effectively. In a real-world scenario, you'd need more data and likely more training epochs.
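A further refinement, shown only as a sketch here: because padded positions are fed to the RNN exactly like real tokens, you can wrap the embedded batch with pack_padded_sequence so the recurrence skips the padding. The helper below is hypothetical and assumes you track each sequence's original (unpadded) length in the collate function:
from torch.nn.utils.rnn import pack_padded_sequence
def forward_packed(model, texts, lengths):
    # texts: [batch, max_len] index tensor; lengths: original lengths before padding
    embedded = model.embedding(texts)
    packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
    _, hidden = model.rnn(packed)  # hidden now reflects the last real token of each sequence
    return model.fc(hidden.squeeze(0))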
Using Pre-trained Embeddings
For better results, you can use pre-trained word embeddings like GloVe (note that the 6B vectors are a several-hundred-megabyte download the first time you use them):
from torchtext.vocab import GloVe
# Load pre-trained GloVe embeddings
glove = GloVe(name='6B', dim=100)
# Get embedding for a word
word = "machine"
if word in glove.stoi:
    embedding = glove[word]
    print(f"Embedding for '{word}' (first 10 dimensions): {embedding[:10]}")
else:
    print(f"Word '{word}' not found in GloVe vocabulary")
Output:
Embedding for 'machine' (first 10 dimensions): tensor([-0.0602, -0.0736, -0.1646, 0.0443, 0.3839, -0.1559, 0.4550, 0.6735,
-0.0120, 0.5945])
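To put these vectors to work in the model above, you can initialize the embedding layer from the GloVe rows that match our vocabulary; a hedged sketch (tokens missing from GloVe simply get zero vectors):
# Look up a GloVe vector for every token in our vocabulary
pretrained = glove.get_vecs_by_tokens(vocab.get_itos())  # shape: [len(vocab), 100]
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=vocab['<pad>'])
print(embedding.weight.shape)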
Summary
In this tutorial, we've covered the essential steps for processing text data in PyTorch:
- Tokenization: Breaking text into tokens using torchtext tokenizers or spaCy
- Vocabulary Building: Creating word-to-index mappings
- Numericalization: Converting tokens to indices
- Embeddings: Representing words as dense vectors
- Batch Processing: Handling variable-length sequences with padding
- Building Models: Creating and training NLP models with PyTorch
These fundamental techniques form the backbone of most NLP projects. As you advance, you can explore more sophisticated methods like subword tokenization (BPE, WordPiece), contextual embeddings (BERT, GPT), and specialized architectures for different NLP tasks.
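As a first taste of subword tokenization (see also the Advanced Tokenization exercise below), here is a minimal, hedged sketch that trains a tiny BPE tokenizer on our toy sentences with the Hugging Face tokenizers library (a separate install: pip install tokenizers):
# Train a small BPE tokenizer on the four sample sentences from earlier
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
bpe = Tokenizer(BPE(unk_token='<unk>'))
bpe.pre_tokenizer = Whitespace()
bpe.train_from_iterator(train_data, BpeTrainer(special_tokens=['<unk>', '<pad>'], vocab_size=200))
encoding = bpe.encode("PyTorch provides deep learning tools.")
print(encoding.tokens)  # subword pieces
print(encoding.ids)     # their integer ids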
Exercises
- Vocabulary Expansion: Modify the code to include special tokens like <sos> (start of sentence) and <eos> (end of sentence).
- Advanced Tokenization: Implement a subword tokenization strategy using the tokenizers library.
- Sequence Length Analysis: Write a function to analyze the distribution of sequence lengths in a dataset.
- Data Augmentation: Implement simple text augmentation techniques like random word deletion or synonym replacement.
- Custom Dataset: Create a custom dataset from a source like news articles or book reviews and process it using the techniques learned in this tutorial.
Additional Resources
- PyTorch documentation for torchtext
- spaCy documentation
- Hugging Face Tokenizers library
- GloVe: Global Vectors for Word Representation
- The Illustrated Word2vec
Remember, text processing is an essential first step in any NLP project. Mastering these techniques will provide a solid foundation for more advanced natural language processing tasks.