TensorFlow Text Processing
Text processing is a fundamental step in working with natural language data in TensorFlow, especially when building recurrent neural networks (RNNs) for tasks like sentiment analysis, text generation, or machine translation. In this tutorial, we'll explore how to process text data effectively using TensorFlow tools.
Introduction to Text Processing in TensorFlow
Before we can feed text into our RNN models, we need to convert human-readable text into numerical representations that deep learning models can understand. TensorFlow offers several utilities to help with this conversion, primarily through the tf.keras.preprocessing.text module and the more recent tensorflow_text library.
Text processing typically involves:
- Tokenizing the text (splitting into words, subwords, or characters)
- Building a vocabulary (mapping tokens to integers)
- Converting text to sequences of integers
- Padding sequences to a uniform length
Let's explore these steps with practical examples.
Basic Text Processing with Keras
Tokenization and Sequence Generation
The most straightforward way to process text in TensorFlow is with the Tokenizer class:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample text data
texts = [
    "TensorFlow is an open source machine learning framework",
    "It is designed for both research and production",
    "Processing text with TensorFlow is powerful and flexible"
]
# Create a tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on our texts
tokenizer.fit_on_texts(texts)
# Convert texts to sequences (lists of integers)
sequences = tokenizer.texts_to_sequences(texts)
print("Vocabulary size:", len(tokenizer.word_index) + 1)
print("Vocabulary:", tokenizer.word_index)
print("Sequences:", sequences)
Output:
Vocabulary size: 21
Vocabulary: {'is': 1, 'tensorflow': 2, 'and': 3, 'an': 4, 'open': 5, 'source': 6, 'machine': 7, 'learning': 8, 'framework': 9, 'it': 10, 'designed': 11, 'for': 12, 'both': 13, 'research': 14, 'production': 15, 'processing': 16, 'text': 17, 'with': 18, 'powerful': 19, 'flexible': 20}
Sequences: [[2, 1, 4, 5, 6, 7, 8, 9], [10, 1, 11, 12, 13, 14, 3, 15], [16, 17, 18, 2, 1, 19, 3, 20]]
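Indices are assigned in order of decreasing word frequency, and by default any word the tokenizer did not see during fit_on_texts is simply dropped when you later call texts_to_sequences. If you expect unseen words at inference time, you can reserve an explicit out-of-vocabulary token; a minimal sketch (the "<OOV>" string below is just a conventional choice):
# Reserve an out-of-vocabulary token; it is assigned index 1
oov_tokenizer = Tokenizer(oov_token="<OOV>")
oov_tokenizer.fit_on_texts(texts)
# "keras" was never seen during fitting, so it maps to the <OOV> index
print(oov_tokenizer.texts_to_sequences(["TensorFlow is keras"]))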
Padding Sequences
Since neural networks require fixed-size inputs, we need to make all sequences the same length through padding:
# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, maxlen=10, padding='post', truncating='post')
print("Padded sequences:")
for seq in padded_sequences:
    print(seq)
Output:
Padded sequences:
[2 1 4 5 6 7 8 9 0 0]
[10  1 11 12 13 14  3 15  0  0]
[16 17 18  2  1 19  3 20  0  0]
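The zeros introduced by padding are ordinary inputs as far as a downstream model is concerned unless you mask them. With Keras, one way to handle this is the Embedding layer's mask_zero argument, which treats index 0 as padding and propagates a mask so that RNN layers skip those timesteps. A minimal sketch (the embedding size of 8 is just an illustrative choice):
from tensorflow.keras.layers import Embedding
# mask_zero=True reserves index 0 for padding; input_dim matches the vocabulary size above
embedding_layer = Embedding(input_dim=21, output_dim=8, mask_zero=True)
embedded = embedding_layer(padded_sequences)
print(embedding_layer.compute_mask(padded_sequences))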
Advanced Text Processing with TensorFlow Text
For more advanced text processing needs, TensorFlow provides the tensorflow_text library, which offers additional functionality such as subword tokenization.
Installation
First, you need to install the library:
pip install tensorflow-text
Using Subword Tokenization
Subword tokenization breaks words into meaningful subunits, which is useful for handling rare words and morphologically rich languages:
import tensorflow as tf
import tensorflow_text as text
# Wordpiece vocabulary: continuation pieces carry the '##' prefix
vocab_list = ['tensor', '##flow', 'pro', '##cess', '##ing', 'text', 'is', 'an', 'open',
              'source', 'machine', 'learn', 'frame', '##work', 'it', 'design', '##ed',
              'for', 'both', 'research', 'and', 'production', 'with', 'power', '##ful',
              'flex', '##ible']
# Build a lookup table mapping each vocabulary entry to an integer id (0 is kept free for padding)
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        vocab_list,
        list(range(1, len(vocab_list) + 1)),
        key_dtype=tf.string, value_dtype=tf.int64
    ),
    num_oov_buckets=1
)
# Create the WordpieceTokenizer; token_out_type=tf.string returns the subword strings instead of ids
tokenizer = text.WordpieceTokenizer(vocab_table, token_out_type=tf.string)
# WordpieceTokenizer expects lowercased, whitespace-split words as input
words = tf.strings.split(tf.strings.lower(['TensorFlow is processing text']))
tokens = tokenizer.tokenize(words).merge_dims(-2, -1)
print("Tokenized result:", tokens.to_list())
Output:
Tokenized result: [[b'tensor', b'##flow', b'is', b'pro', b'##cess', b'##ing', b'text']]
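For feeding a model you usually want integer ids rather than strings, plus a dense, padded tensor. Since token_out_type defaults to tf.int64, a second tokenizer over the same table returns ids, and the ragged result can be padded with to_tensor(). A minimal sketch continuing the example above:
# The default token_out_type (tf.int64) returns ids from vocab_table instead of strings
id_tokenizer = text.WordpieceTokenizer(vocab_table)
ids = id_tokenizer.tokenize(words).merge_dims(-2, -1)
# Pad the ragged ids into a dense tensor; id 0 is unused in the vocabulary, so it can mark padding
print(ids.to_tensor(default_value=0))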
Building an RNN Text Classifier
Let's now apply text processing to create a simple sentiment analysis model using an RNN:
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Load the IMDB dataset with the top 10,000 words
max_features = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(f"Training samples: {len(x_train)}, Test samples: {len(x_test)}")
# Print a sample review
print("Sample review (as integer indices):")
print(x_train[0][:50], "...")
# Convert indices back to words for better understanding
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
def decode_review(encoded_review):
    # The IMDB indices are offset by 3 (0 = padding, 1 = start, 2 = unknown)
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])
print("\nSample review (decoded):")
print(decode_review(x_train[0][:50]), "...")
# Pad sequences for uniform length
max_len = 200
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)
print(f"Input shape: {x_train.shape}")
# Build a simple RNN model for sentiment analysis
embedding_dim = 128
model = Sequential([
    Embedding(max_features, embedding_dim, input_length=max_len),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.summary()
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the model (commented out for brevity)
# history = model.fit(x_train, y_train,
# batch_size=32,
# epochs=5,
# validation_split=0.2)
Output:
Training samples: 25000, Test samples: 25000
Sample review (as integer indices):
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447] ...
Sample review (decoded):
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? ...
Input shape: (25000, 200)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 200, 128) 1280000
lstm (LSTM) (None, 128) 131584
dense (Dense) (None, 1) 129
=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
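To classify a new review with this model, the raw string has to go through the same preprocessing as the training data: split into words, mapped through the IMDB word_index with the same offset of 3 used in decode_review (1 marks the start of a review, 2 marks unknown or out-of-range words), and padded to max_len. A hedged sketch, assuming the model has been trained:
def encode_review(review):
    # Mirror the IMDB encoding: 1 = start marker, 2 = unknown, word ranks shifted by 3
    ids = [word_index.get(word, max_features) + 3 for word in review.lower().split()]
    ids = [1] + [i if i < max_features else 2 for i in ids]
    return pad_sequences([ids], maxlen=max_len)
# After training:
# print(model.predict(encode_review("one of the most moving films I have seen")))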
Word Embeddings
Word embeddings are dense vector representations of words that capture semantic relationships. Instead of training embeddings from scratch, we often use pre-trained embeddings like GloVe or Word2Vec:
import os
import numpy as np
# Assume we have downloaded GloVe embeddings
glove_dir = "/path/to/glove.6B"
embeddings_index = {}
# Load the GloVe vectors
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")
# Create an embedding matrix for our vocabulary
embedding_dim = 100
embedding_matrix = np.zeros((max_features, embedding_dim))
for word, i in word_index.items():
    # Shift by 3 so the rows line up with the IMDB token ids (0=padding, 1=start, 2=unknown)
    if i + 3 < max_features:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i + 3] = embedding_vector
# Use the pre-trained embeddings in our model
model_with_pretrained = Sequential([
    Embedding(max_features, embedding_dim,
              weights=[embedding_matrix],
              input_length=max_len,
              trainable=False),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model_with_pretrained.compile(optimizer='adam',
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
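A common follow-up is to train with the embeddings frozen first, then unfreeze them and continue training with a lower learning rate so the pre-trained vectors are only nudged slightly. A sketch of that second phase (the epoch counts and learning rate are illustrative choices):
# Phase 1: train the LSTM and Dense layers with the embeddings frozen
# model_with_pretrained.fit(x_train, y_train, epochs=3, validation_split=0.2)
# Phase 2: unfreeze the embedding layer and fine-tune with a small learning rate
model_with_pretrained.layers[0].trainable = True
model_with_pretrained.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
# model_with_pretrained.fit(x_train, y_train, epochs=2, validation_split=0.2)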
Text Generation with RNNs
Another common text processing task is generating text. Let's see a simple example of character-level text generation:
# Sample text data
text = """TensorFlow is a free and open-source software library for machine learning and artificial intelligence.
It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks."""
# Create character mapping
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
print(f"Total characters: {len(chars)}")
# Create training data
seq_length = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - seq_length, step):
    sentences.append(text[i:i + seq_length])
    next_chars.append(text[i + seq_length])
print(f"Number of sequences: {len(sentences)}")
# One-hot encode the data
x = np.zeros((len(sentences), seq_length, len(chars)), dtype=np.bool_)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1
# Build a simple character-level RNN model
model = Sequential([
    LSTM(128, input_shape=(seq_length, len(chars))),
    Dense(len(chars), activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example of generating text (after training)
def generate_text(seed_text, length=200):
    generated = seed_text
    for i in range(length):
        # One-hot encode the current window of characters
        x_pred = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(seed_text):
            x_pred[0, t, char_to_idx[char]] = 1.
        preds = model.predict(x_pred, verbose=0)[0].astype('float64')
        # Renormalize so the probabilities sum to exactly 1 before sampling
        preds = preds / preds.sum()
        next_index = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_index]
        generated += next_char
        # Slide the seed window forward by one character
        seed_text = seed_text[1:] + next_char
    return generated
# After training, you would generate text like this:
# print(generate_text("TensorFlow is"))
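Sampling directly from the raw softmax output tends to produce either very repetitive or very noisy text. A common refinement is temperature sampling, where the predicted distribution is sharpened (temperature below 1) or flattened (temperature above 1) before sampling. A small helper you could drop into generate_text (the temperature value is illustrative):
def sample_with_temperature(preds, temperature=0.5):
    # Rescale the log-probabilities by the temperature, then renormalize
    preds = np.asarray(preds, dtype='float64')
    preds = np.log(preds + 1e-10) / temperature
    exp_preds = np.exp(preds)
    probs = exp_preds / np.sum(exp_preds)
    return np.random.choice(len(probs), p=probs)
# Inside generate_text, replace the np.random.choice line with:
# next_index = sample_with_temperature(preds, temperature=0.5)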
Real-world Application: News Classification
Here's an example of how to use text processing in a real-world application of news classification:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
# Sample news data (title, category)
news_data = [
    ("US President addresses the nation about the economy", "politics"),
    ("New quantum computing breakthrough announced by researchers", "technology"),
    ("Stock market reaches all-time high despite pandemic concerns", "business"),
    ("Scientists discover potential new treatment for cancer", "health"),
    ("Tech company releases new smartphone with advanced AI features", "technology"),
    ("Government announces new tax policies for the upcoming fiscal year", "politics"),
    ("Global health organization warns about new virus strain", "health"),
    ("Major merger between two banking giants approved", "business"),
]
# Separate texts and labels
texts = [item[0] for item in news_data]
labels = [item[1] for item in news_data]
# Create label mapping
unique_labels = list(set(labels))
label_to_idx = {label: i for i, label in enumerate(unique_labels)}
idx_to_label = {i: label for i, label in enumerate(unique_labels)}
# Convert labels to integers
y = [label_to_idx[label] for label in labels]
# Tokenize the text
max_words = 1000
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
x = tokenizer.texts_to_sequences(texts)
# Pad sequences
max_len = 20
x = pad_sequences(x, maxlen=max_len, padding='post')
# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Build the model
embedding_dim = 16
model = Sequential([
    Embedding(max_words, embedding_dim, input_length=max_len),
    LSTM(32),
    Dense(len(unique_labels), activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
# Train the model (commented out since this is a small example)
# history = model.fit(
# x_train, y_train,
# epochs=10,
# validation_data=(x_test, y_test),
# verbose=1
# )
# Function to predict the category of a new article
def predict_category(text):
    # Tokenize and pad the same way as the training data
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_len, padding='post')
    # Get prediction
    pred = model.predict(padded)[0]
    pred_class = np.argmax(pred)
    return idx_to_label[pred_class], pred[pred_class]
# Example prediction (after training)
new_article = "New legislation proposed by Senate committee on healthcare"
# category, confidence = predict_category(new_article)
# print(f"Predicted category: {category} with {confidence:.2f} confidence")
Summary
In this tutorial, we've covered the essential aspects of text processing in TensorFlow:
- Basic text processing with the Keras preprocessing utilities
- Advanced tokenization using TensorFlow Text
- Building a sentiment analysis model with RNNs
- Using pre-trained word embeddings for better text representation
- Character-level text generation with RNNs
- News classification as a real-world application
Text processing is a crucial step in natural language processing tasks, and TensorFlow provides robust tools to handle text data efficiently for recurrent neural networks.
Additional Resources
For further exploration:
- TensorFlow Text Documentation
- Keras Text Preprocessing API
- RNN Text Classification Tutorial
- Text Generation with RNNs
Exercises
- Vocabulary Exploration: Modify the tokenizer in the first example to limit the vocabulary to the top 50 words. How does this affect the sequences?
- Embedding Visualization: Extend the word embedding example to visualize word embeddings using TensorBoard's Projector tool.
- Multi-class Classification: Build a text classifier that categorizes texts into more than two classes (e.g., news categories).
- Sequence Length Analysis: Experiment with different sequence lengths in the padding examples and observe how they affect model performance.
- Language Model: Train a language model on a dataset of your choice (e.g., Shakespeare's works) and generate new text in that style.