TensorFlow Text Processing
Text processing is a fundamental step in working with natural language data in TensorFlow, especially when building recurrent neural networks (RNNs) for tasks like sentiment analysis, text generation, or machine translation. In this tutorial, we'll explore how to process text data effectively using TensorFlow tools.
Introduction to Text Processing in TensorFlow
Before we can feed text into our RNN models, we need to convert human-readable text into numerical representations that deep learning models can understand. TensorFlow offers several utilities to help with this conversion, primarily through the tf.keras.preprocessing.text module and the more recent tensorflow_text library.
Text processing typically involves:
- Tokenizing the text (splitting into words, subwords, or characters)
- Building a vocabulary (mapping tokens to integers)
- Converting text to sequences of integers
- Padding sequences to a uniform length
Let's explore these steps with practical examples.
Basic Text Processing with Keras
Tokenization and Sequence Generation
The most straightforward way to process text in TensorFlow is with the Tokenizer class:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample text data
texts = [
    "TensorFlow is an open source machine learning framework",
    "It is designed for both research and production",
    "Processing text with TensorFlow is powerful and flexible"
]
# Create a tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on our texts
tokenizer.fit_on_texts(texts)
# Convert texts to sequences (lists of integers)
sequences = tokenizer.texts_to_sequences(texts)
print("Vocabulary size:", len(tokenizer.word_index) + 1)
print("Vocabulary:", tokenizer.word_index)
print("Sequences:", sequences)
Output:
Vocabulary size: 21
Vocabulary: {'is': 1, 'tensorflow': 2, 'and': 3, 'an': 4, 'open': 5, 'source': 6, 'machine': 7, 'learning': 8, 'framework': 9, 'it': 10, 'designed': 11, 'for': 12, 'both': 13, 'research': 14, 'production': 15, 'processing': 16, 'text': 17, 'with': 18, 'powerful': 19, 'flexible': 20}
Sequences: [[2, 1, 4, 5, 6, 7, 8, 9], [10, 1, 11, 12, 13, 14, 3, 15], [16, 17, 18, 2, 1, 19, 3, 20]]
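Indices are assigned in order of decreasing word frequency, and by default any word the tokenizer did not see during fit_on_texts is simply dropped when you later call texts_to_sequences. If you expect unseen words at inference time, you can reserve an explicit out-of-vocabulary token; a minimal sketch (the "<OOV>" string below is just a conventional choice):
# Reserve an out-of-vocabulary token; it is assigned index 1
oov_tokenizer = Tokenizer(oov_token="<OOV>")
oov_tokenizer.fit_on_texts(texts)
# "keras" was never seen during fitting, so it maps to the <OOV> index
print(oov_tokenizer.texts_to_sequences(["TensorFlow is keras"]))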
Padding Sequences
Since neural networks require fixed-size inputs, we need to make all sequences the same length through padding:
# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, maxlen=10, padding='post', truncating='post')
print("Padded sequences:")
for seq in padded_sequences:
    print(seq)
Output:
Padded sequences:
[2 1 4 5 6 7 8 9 0 0]
[10  1 11 12 13 14  3 15  0  0]
[16 17 18  2  1 19  3 20  0  0]
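The zeros introduced by padding are ordinary inputs as far as a downstream model is concerned unless you mask them. With Keras, one way to handle this is the Embedding layer's mask_zero argument, which treats index 0 as padding and propagates a mask so that RNN layers skip those timesteps. A minimal sketch (the embedding size of 8 is just an illustrative choice):
from tensorflow.keras.layers import Embedding
# mask_zero=True reserves index 0 for padding; input_dim matches the vocabulary size above
embedding_layer = Embedding(input_dim=21, output_dim=8, mask_zero=True)
embedded = embedding_layer(padded_sequences)
print(embedding_layer.compute_mask(padded_sequences))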
Advanced Text Processing with TensorFlow Text
For more advanced text processing needs, TensorFlow provides the tensorflow_text library, which offers additional functionality such as subword tokenization.
Installation
First, you need to install the library:
pip install tensorflow-text
Using Subword Tokenization
Subword tokenization breaks words into meaningful subunits, which is useful for handling rare words and morphologically rich languages:
import tensorflow as tf
import tensorflow_text as text
# Wordpiece vocabulary: continuation pieces carry the '##' prefix
vocab_list = ['tensor', '##flow', 'pro', '##cess', '##ing', 'text', 'is', 'an', 'open',
              'source', 'machine', 'learn', 'frame', '##work', 'it', 'design', '##ed',
              'for', 'both', 'research', 'and', 'production', 'with', 'power', '##ful',
              'flex', '##ible']
# Build a lookup table mapping each vocabulary entry to an integer id (0 is kept free for padding)
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        vocab_list,
        list(range(1, len(vocab_list) + 1)),
        key_dtype=tf.string, value_dtype=tf.int64
    ),
    num_oov_buckets=1
)
# Create the WordpieceTokenizer; token_out_type=tf.string returns the subword strings instead of ids
tokenizer = text.WordpieceTokenizer(vocab_table, token_out_type=tf.string)
# WordpieceTokenizer expects lowercased, whitespace-split words as input
words = tf.strings.split(tf.strings.lower(['TensorFlow is processing text']))
tokens = tokenizer.tokenize(words).merge_dims(-2, -1)
print("Tokenized result:", tokens.to_list())
Output:
Tokenized result: [[b'tensor', b'##flow', b'is', b'pro', b'##cess', b'##ing', b'text']]
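For feeding a model you usually want integer ids rather than strings, plus a dense, padded tensor. Since token_out_type defaults to tf.int64, a second tokenizer over the same table returns ids, and the ragged result can be padded with to_tensor(). A minimal sketch continuing the example above:
# The default token_out_type (tf.int64) returns ids from vocab_table instead of strings
id_tokenizer = text.WordpieceTokenizer(vocab_table)
ids = id_tokenizer.tokenize(words).merge_dims(-2, -1)
# Pad the ragged ids into a dense tensor; id 0 is unused in the vocabulary, so it can mark padding
print(ids.to_tensor(default_value=0))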
Building an RNN Text Classifier
Let's now apply text processing to create a simple sentiment analysis model using an RNN:
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Load the IMDB dataset with the top 10,000 words
max_features = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(f"Training samples: {len(x_train)}, Test samples: {len(x_test)}")
# Print a sample review
print("Sample review (as integer indices):")
print(x_train[0][:50], "...")
# Convert indices back to words for better understanding
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
def decode_review(encoded_review):
    # The IMDB indices are offset by 3 (0 = padding, 1 = start, 2 = unknown)
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])
print("\nSample review (decoded):")
print(decode_review(x_train[0][:50]), "...")
# Pad sequences for uniform length
max_len = 200
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)
print(f"Input shape: {x_train.shape}")
# Build a simple RNN model for sentiment analysis
embedding_dim = 128
model = Sequential([
    Embedding(max_features, embedding_dim, input_length=max_len),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.summary()
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the model (commented out for brevity)
# history = model.fit(x_train, y_train,
# batch_size=32,
# epochs=5,
# validation_split=0.2)
Output:
Training samples: 25000, Test samples: 25000
Sample review (as integer indices):
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447] ...
Sample review (decoded):
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? ...
Input shape: (25000, 200)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 200, 128) 1280000
lstm (LSTM) (None, 128) 131584
dense (Dense) (None, 1) 129
=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
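To classify a new review with this model, the raw string has to go through the same preprocessing as the training data: split into words, mapped through the IMDB word_index with the same offset of 3 used in decode_review (1 marks the start of a review, 2 marks unknown or out-of-range words), and padded to max_len. A hedged sketch, assuming the model has been trained:
def encode_review(review):
    # Mirror the IMDB encoding: 1 = start marker, 2 = unknown, word ranks shifted by 3
    ids = [word_index.get(word, max_features) + 3 for word in review.lower().split()]
    ids = [1] + [i if i < max_features else 2 for i in ids]
    return pad_sequences([ids], maxlen=max_len)
# After training:
# print(model.predict(encode_review("one of the most moving films I have seen")))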
Word Embeddings
Word embeddings are dense vector representations of words that capture semantic relationships. Instead of training embeddings from scratch, we often use pre-trained embeddings like GloVe or Word2Vec:
import os
import numpy as np
# Assume we have downloaded GloVe embeddings
glove_dir = "/path/to/glove.6B"
embeddings_index = {}
# Load the GloVe vectors
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")
# Create an embedding matrix for our vocabulary
embedding_dim = 100
embedding_matrix = np.zeros((max_features, embedding_dim))
for word, i in word_index.items():
    # Shift by 3 so the rows line up with the IMDB token ids (0=padding, 1=start, 2=unknown)
    if i + 3 < max_features:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i + 3] = embedding_vector
# Use the pre-trained embeddings in our model
model_with_pretrained = Sequential([
    Embedding(max_features, embedding_dim,
              weights=[embedding_matrix],
              input_length=max_len,
              trainable=False),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model_with_pretrained.compile(optimizer='adam',
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
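A common follow-up is to train with the embeddings frozen first, then unfreeze them and continue training with a lower learning rate so the pre-trained vectors are only nudged slightly. A sketch of that second phase (the epoch counts and learning rate are illustrative choices):
# Phase 1: train the LSTM and Dense layers with the embeddings frozen
# model_with_pretrained.fit(x_train, y_train, epochs=3, validation_split=0.2)
# Phase 2: unfreeze the embedding layer and fine-tune with a small learning rate
model_with_pretrained.layers[0].trainable = True
model_with_pretrained.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
# model_with_pretrained.fit(x_train, y_train, epochs=2, validation_split=0.2)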
Text Generation with RNNs
Another common text processing task is generating text. Let's see a simple example of character-level text generation:
# Sample text data
text = """TensorFlow is a free and open-source software library for machine learning and artificial intelligence.
It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks."""
# Create character mapping
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
print(f"Total characters: {len(chars)}")
# Create training data
seq_length = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - seq_length, step):
    sentences.append(text[i:i + seq_length])
    next_chars.append(text[i + seq_length])
print(f"Number of sequences: {len(sentences)}")
# One-hot encode the data
x = np.zeros((len(sentences), seq_length, len(chars)), dtype=np.bool_)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1
# Build a simple character-level RNN model
model = Sequential([
    LSTM(128, input_shape=(seq_length, len(chars))),
    Dense(len(chars), activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example of generating text (after training)
def generate_text(seed_text, length=200):
    generated = seed_text
    for i in range(length):
        # One-hot encode the current window of characters
        x_pred = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(seed_text):
            x_pred[0, t, char_to_idx[char]] = 1.
        preds = model.predict(x_pred, verbose=0)[0].astype('float64')
        # Renormalize so the probabilities sum to exactly 1 before sampling
        preds = preds / preds.sum()
        next_index = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_index]
        generated += next_char
        # Slide the seed window forward by one character
        seed_text = seed_text[1:] + next_char
    return generated
# After training, you would generate text like this:
# print(generate_text("TensorFlow is"))
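Sampling directly from the raw softmax output tends to produce either very repetitive or very noisy text. A common refinement is temperature sampling, where the predicted distribution is sharpened (temperature below 1) or flattened (temperature above 1) before sampling. A small helper you could drop into generate_text (the temperature value is illustrative):
def sample_with_temperature(preds, temperature=0.5):
    # Rescale the log-probabilities by the temperature, then renormalize
    preds = np.asarray(preds, dtype='float64')
    preds = np.log(preds + 1e-10) / temperature
    exp_preds = np.exp(preds)
    probs = exp_preds / np.sum(exp_preds)
    return np.random.choice(len(probs), p=probs)
# Inside generate_text, replace the np.random.choice line with:
# next_index = sample_with_temperature(preds, temperature=0.5)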
Real-world Application: News Classification
Here's an example of how to use text processing in a real-world application of news classification:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
# Sample news data (title, category)
news_data = [
    ("US President addresses the nation about the economy", "politics"),
    ("New quantum computing breakthrough announced by researchers", "technology"),
    ("Stock market reaches all-time high despite pandemic concerns", "business"),
    ("Scientists discover potential new treatment for cancer", "health"),
    ("Tech company releases new smartphone with advanced AI features", "technology"),
    ("Government announces new tax policies for the upcoming fiscal year", "politics"),
    ("Global health organization warns about new virus strain", "health"),
    ("Major merger between two banking giants approved", "business"),
]
# Separate texts and labels
texts = [item[0] for item in news_data]
labels = [item[1] for item in news_data]
# Create label mapping
unique_labels = list(set(labels))
label_to_idx = {label: i for i, label in enumerate(unique_labels)}
idx_to_label = {i: label for i, label in enumerate(unique_labels)}
# Convert labels to integers
y = [label_to_idx[label] for label in labels]
# Tokenize the text
max_words = 1000
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
x = tokenizer.texts_to_sequences(texts)
# Pad sequences
max_len = 20
x = pad_sequences(x, maxlen=max_len, padding='post')
# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Build the model
embedding_dim = 16
model = Sequential([
    Embedding(max_words, embedding_dim, input_length=max_len),
    LSTM(32),
    Dense(len(unique_labels), activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
# Train the model (commented out since this is a small example)
# history = model.fit(
# x_train, y_train,
# epochs=10,
# validation_data=(x_test, y_test),
# verbose=1
# )
# Function to predict the category of a new article
def predict_category(text):
    # Tokenize and pad the same way as the training data
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_len, padding='post')
    # Get prediction
    pred = model.predict(padded)[0]
    pred_class = np.argmax(pred)
    return idx_to_label[pred_class], pred[pred_class]
# Example prediction (after training)
new_article = "New legislation proposed by Senate committee on healthcare"
# category, confidence = predict_category(new_article)
# print(f"Predicted category: {category} with {confidence:.2f} confidence")
Summary
In this tutorial, we've covered the essential aspects of text processing in TensorFlow:
- Basic text processing with the Keras preprocessing utilities
- Advanced tokenization using TensorFlow Text
- Building a sentiment analysis model with RNNs
- Using pre-trained word embeddings for better text representation
- Character-level text generation with RNNs
- News classification as a real-world application
Text processing is a crucial step in natural language processing tasks, and TensorFlow provides robust tools to handle text data efficiently for recurrent neural networks.
Additional Resources
For further exploration:
- TensorFlow Text Documentation
- Keras Text Preprocessing API
- RNN Text Classification Tutorial
- Text Generation with RNNs
Exercises
- Vocabulary Exploration: Modify the tokenizer in the first example to limit the vocabulary to the top 50 words. How does this affect the sequences?
- Embedding Visualization: Extend the word embedding example to visualize word embeddings using TensorBoard's Projector tool.
- Multi-class Classification: Build a text classifier that categorizes texts into more than two classes (e.g., news categories).
- Sequence Length Analysis: Experiment with different sequence lengths in the padding examples and observe how they affect model performance.
- Language Model: Train a language model on a dataset of your choice (e.g., Shakespeare's works) and generate new text in that style.