TensorFlow Embeddings
Introduction
Embeddings are a fundamental concept in deep learning, especially when working with text, categorical data, or any discrete objects. At their core, embeddings are dense vector representations of discrete objects, allowing these objects to be used in neural networks in a meaningful way.
In this tutorial, we'll explore what embeddings are, why they're crucial for working with recurrent neural networks (RNNs), and how to implement them using TensorFlow. By the end, you'll understand how to create, use, and visualize embeddings in your deep learning models.
What Are Embeddings?
Before diving into the code, let's understand the concept of embeddings.
The Need for Embeddings
Traditional machine learning approaches represent categorical variables using one-hot encoding, where each category gets a binary vector with a single "1" and the rest "0"s. For example:
- Dog: [1, 0, 0, 0]
- Cat: [0, 1, 0, 0]
- Horse: [0, 0, 1, 0]
- Cow: [0, 0, 0, 1]
However, this approach has several limitations:
- Dimensionality: With large vocabularies (e.g., thousands of words), one-hot vectors become very large and sparse
- Semantic meaning: One-hot encodings don't capture relationships between items
- Generalization: One-hot codes treat every category as equally unrelated, so models can't transfer what they learn about one item to similar items
Embeddings solve these problems by representing each category as a dense vector in a continuous vector space, where similar items are located close to each other.
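To make the dimensionality point concrete, here is a small sketch using the four-animal vocabulary above (the integer indices and the 10,000-word comparison are illustrative):

import tensorflow as tf

# Four animal categories, encoded as integer indices: dog, cat, horse, cow
indices = tf.constant([0, 1, 2, 3])

# One-hot: each item becomes a sparse 4-dimensional vector
one_hot = tf.one_hot(indices, depth=4)
print(one_hot.numpy())
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]

# With a 10,000-word vocabulary, each one-hot vector would have 10,000 entries,
# almost all zero. An embedding replaces it with a short dense vector whose
# values are learned, which is what the next section builds.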
Creating Embeddings in TensorFlow
Let's implement a basic embedding layer in TensorFlow:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Define vocabulary size and embedding dimension
vocab_size = 10000 # Number of words in vocabulary
embedding_dim = 64 # Size of the embedding vector
# Create an embedding layer
embedding_layer = tf.keras.layers.Embedding(
input_dim=vocab_size, # Size of the vocabulary
output_dim=embedding_dim, # Dimension of the dense embedding
name="embedding"
)
# Let's see what happens when we pass integer indices to the embedding layer
word_indices = tf.constant([42, 1337, 7])
embedded_words = embedding_layer(word_indices)
print(f"Input shape: {word_indices.shape}")
print(f"Output shape: {embedded_words.shape}")
print(f"Example embedding vector:\n{embedded_words[0].numpy()[:10]}...") # Show first 10 dimensions
Output:
Input shape: (3,)
Output shape: (3, 64)
Example embedding vector:
[-0.01117208 0.03457389 0.04560256 -0.04160505 0.00942195 -0.01789882
-0.04011846 -0.03130924 -0.01321477 0.01470375]...
As you can see, the embedding layer converts each integer index (representing a word or category) into a dense vector of size embedding_dim. Initially, these vectors are randomly initialized, but they will be learned during training.
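Under the hood, the embedding layer stores a single trainable weight matrix of shape (vocab_size, embedding_dim), and a lookup is just a row selection from that matrix. A minimal sketch of the equivalence, reusing the objects defined above:

# The layer's only weight is a (vocab_size, embedding_dim) matrix
weights = embedding_layer.get_weights()[0]
print(weights.shape)  # (10000, 64)

# Looking up indices is equivalent to gathering rows of that matrix
manual_lookup = tf.gather(weights, word_indices)
print(np.allclose(manual_lookup.numpy(), embedded_words.numpy()))  # True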
Using Embeddings with RNNs
Embeddings are particularly useful for RNNs when processing sequences of words or tokens. Here's how to use them together:
# Create a simple RNN model with embeddings
model = tf.keras.Sequential([
# Embedding layer converts integer indices to dense vectors
tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
# RNN layer processes the sequence
tf.keras.layers.LSTM(128, return_sequences=False),
# Output layer
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 640000
lstm (LSTM) (None, 128) 98816
dense (Dense) (None, 1) 129
=================================================================
Total params: 738,945
Trainable params: 738,945
Non-trainable params: 0
_________________________________________________________________
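Because the embedding layer was created with mask_zero=True, index 0 is reserved for padding: the layer emits a mask that the LSTM uses to skip padded timesteps. A quick way to inspect that mask on a toy padded batch (the token values below are arbitrary):

# Two post-padded sequences; index 0 marks padding positions
padded_batch = tf.constant([[5, 8, 3, 0, 0],
                            [2, 9, 0, 0, 0]])
mask = model.layers[0].compute_mask(padded_batch)
print(mask.numpy())
# [[ True  True  True False False]
#  [ True  True False False False]]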
Practical Example: Sentiment Analysis
Let's implement a practical example: sentiment analysis on movie reviews using the IMDB dataset:
# Load the IMDB dataset
imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# Print example
print(f"Example review (encoded): {x_train[0][:20]}...")
print(f"Example label: {y_train[0]} (1 = positive, 0 = negative)")
# Pad sequences to ensure uniform length
max_length = 250
x_train = tf.keras.preprocessing.sequence.pad_sequences(
x_train, maxlen=max_length, padding='post'
)
x_test = tf.keras.preprocessing.sequence.pad_sequences(
x_test, maxlen=max_length, padding='post'
)
print(f"Training data shape: {x_train.shape}")
Output:
Example review (encoded): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25]...
Example label: 1 (1 = positive, 0 = negative)
Training data shape: (25000, 250)
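For reference, post-padding simply appends zeros until each sequence reaches maxlen (longer sequences are truncated). A toy example with made-up token lists:

toy = tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3], [4, 5]], maxlen=4, padding='post'
)
print(toy)
# [[1 2 3 0]
#  [4 5 0 0]]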
Now let's build and train our model:
# Build the model
embedding_dim = 32
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
# Train the model
history = model.fit(
x_train, y_train,
epochs=5,
batch_size=128,
validation_split=0.2
)
# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")
Output:
Epoch 1/5
157/157 [==============================] - 28s 155ms/step - loss: 0.5006 - accuracy: 0.7538 - val_loss: 0.3923 - val_accuracy: 0.8270
Epoch 2/5
157/157 [==============================] - 24s 153ms/step - loss: 0.3028 - accuracy: 0.8762 - val_loss: 0.3367 - val_accuracy: 0.8590
Epoch 3/5
157/157 [==============================] - 24s 153ms/step - loss: 0.2267 - accuracy: 0.9122 - val_loss: 0.3375 - val_accuracy: 0.8568
Epoch 4/5
157/157 [==============================] - 24s 152ms/step - loss: 0.1770 - accuracy: 0.9376 - val_loss: 0.3809 - val_accuracy: 0.8496
Epoch 5/5
157/157 [==============================] - 24s 153ms/step - loss: 0.1260 - accuracy: 0.9591 - val_loss: 0.4103 - val_accuracy: 0.8464
782/782 [==============================] - 11s 14ms/step - loss: 0.3731 - accuracy: 0.8435
Test accuracy: 0.8435
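To classify a raw review with the trained model, the text has to be encoded exactly like the IMDB data: look up each word in the IMDB word index, add the dataset's offset of 3 (indices 0, 1, and 2 are reserved for padding, start, and unknown tokens), and pad. A rough sketch, where the helper and sample sentence are hypothetical and out-of-vocabulary words are simply dropped:

word_index = imdb.get_word_index()

def encode_review(text, maxlen=max_length):
    # Shift word indices by 3 to match the encoding used by imdb.load_data()
    tokens = [word_index[w] + 3 for w in text.lower().split()
              if w in word_index and word_index[w] + 3 < vocab_size]
    return tf.keras.preprocessing.sequence.pad_sequences(
        [[1] + tokens], maxlen=maxlen, padding='post'  # 1 is the start token
    )

sample = encode_review("this movie was absolutely wonderful")
print(model.predict(sample))  # a value near 1.0 indicates positive sentiment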
Visualizing Embeddings
One of the most powerful aspects of embeddings is that they capture semantic relationships between words or categories. We can visualize these relationships using dimensionality reduction techniques like t-SNE:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Get the embedding weights
embedding_weights = model.layers[0].get_weights()[0]
# Get the word index from the IMDB dataset
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Select a subset of words to visualize
num_words_to_visualize = 100
word_embeddings = embedding_weights[:num_words_to_visualize]
# Perform t-SNE dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
reduced_embeddings = tsne.fit_transform(word_embeddings)
# Plot the embeddings
plt.figure(figsize=(12, 8))
for i in range(num_words_to_visualize):
    # IMDB tokens are offset by 3 (0 = padding, 1 = start, 2 = unknown),
    # so embedding row i corresponds to entry i - 3 in the word index
    if i - 3 in reverse_word_index:
        word = reverse_word_index[i - 3]
        x, y = reduced_embeddings[i]
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
plt.title("t-SNE visualization of word embeddings")
plt.show()
The output would be a 2D scatter plot showing how words are related in the embedding space.
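Another way to probe the learned space is to look up a word's nearest neighbours by cosine similarity. A rough sketch reusing the objects above (the query word is arbitrary, and the neighbours you get will vary from run to run):

def most_similar(word, top_k=5):
    # Map the word to its embedding row (IMDB tokens are offset by 3)
    query = embedding_weights[word_index[word] + 3]
    # Cosine similarity between the query and every row of the embedding matrix
    sims = embedding_weights @ query / (
        np.linalg.norm(embedding_weights, axis=1) * np.linalg.norm(query) + 1e-8
    )
    best = np.argsort(-sims)[1:top_k + 1]  # skip the query word itself
    return [reverse_word_index.get(i - 3, '?') for i in best]

print(most_similar("great"))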
Pre-trained Embeddings
Instead of learning embeddings from scratch, we can use pre-trained embeddings like GloVe or Word2Vec:
# Download GloVe embeddings (this would typically be done outside your code)
# !wget https://nlp.stanford.edu/data/glove.6B.zip
# !unzip -q glove.6B.zip
# Load pre-trained embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")
# Create an embedding matrix for our vocabulary
embedding_dim = 100  # The glove.6B.100d vectors are 100-dimensional
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    # Shift by 3 so rows line up with the IMDB encoding (0 = padding, 1 = start, 2 = unknown)
    if i + 3 < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the GloVe index remain all-zeros
            embedding_matrix[i + 3] = embedding_vector
# Create a model using the pre-trained embeddings
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size,
embedding_dim,
weights=[embedding_matrix], # Use pre-trained weights
trainable=False, # Freeze the embeddings
mask_zero=True
),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
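Freezing the layer keeps the GloVe vectors fixed while the rest of the network trains. A common follow-up, sketched below under the assumption that the model above has already been fitted, is to unfreeze the embeddings and fine-tune them with a small learning rate:

# Phase 2 (optional): unfreeze the pre-trained embeddings and fine-tune gently
model.layers[0].trainable = True
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(1e-5),  # small learning rate to avoid destroying the GloVe structure
    metrics=['accuracy']
)
# model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.2)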
Advanced: Contextual Embeddings
Modern NLP uses contextual embeddings that change based on the context of a word. TensorFlow Hub provides access to these models:
import tensorflow_hub as hub
# Load a BERT model from TensorFlow Hub
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)
# Function to preprocess text for BERT
def bert_encode(texts, tokenizer, max_len=128):
all_tokens = []
all_masks = []
all_segments = []
for text in texts:
text = tokenizer.tokenize(text)
text = text[:max_len-2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
pad_len = max_len - len(input_sequence)
tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
pad_masks = [1] * len(input_sequence) + [0] * pad_len
segment_ids = [0] * max_len
all_tokens.append(tokens)
all_masks.append(pad_masks)
all_segments.append(segment_ids)
return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
# Example of using BERT embeddings (simplified)
def create_bert_model(max_len=128):
input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
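    # Note: depending on the version of this TF Hub model, the layer may expect a
    # dict of inputs and return a dict of outputs rather than the list/tuple
    # interface shown on the next line.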
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
clf_output = sequence_output[:, 0, :]
net = tf.keras.layers.Dense(64, activation='relu')(clf_output)
net = tf.keras.layers.Dropout(0.2)(net)
net = tf.keras.layers.Dense(1, activation='sigmoid')(net)
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=net)
return model
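Putting the pieces together might look like the sketch below; the tokenizer, raw texts, and labels are placeholders you would supply yourself (for example, a WordPiece tokenizer matching the BERT checkpoint):

max_len = 128
bert_model = create_bert_model(max_len=max_len)
bert_model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(2e-5),  # a small learning rate is customary with BERT
    metrics=['accuracy']
)

# tokens, masks, segments = bert_encode(raw_texts, tokenizer, max_len=max_len)
# bert_model.fit([tokens, masks, segments], labels, epochs=2, batch_size=16)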
Summary
In this tutorial, we covered:
- What embeddings are: Dense vector representations of discrete objects
- Why embeddings matter: They provide meaningful representations that capture semantic relationships
- How to create embeddings in TensorFlow: Using the tf.keras.layers.Embedding layer
- Using embeddings with RNNs: Combining embeddings with recurrent layers for sequence processing
- Real-world application: Sentiment analysis on movie reviews
- Visualizing embeddings: Understanding relationships between words
- Pre-trained embeddings: Leveraging existing knowledge with GloVe or Word2Vec
- Contextual embeddings: Using advanced models like BERT
Embeddings are a crucial concept in deep learning for NLP and other areas dealing with categorical data. They allow neural networks to work effectively with discrete objects by placing them in a continuous vector space where similar items are close to each other.
Additional Resources
- TensorFlow Embedding Layer Documentation
- TensorFlow Text Classification Tutorial
- TensorFlow Hub for Pre-trained Embeddings
- Word2Vec Paper
- GloVe: Global Vectors for Word Representation
Exercises
- Build a text classifier using embeddings for a different dataset (e.g., news classification)
- Experiment with different embedding dimensions and observe the effects on model performance
- Implement a model that uses pre-trained GloVe embeddings and compare it with a model using randomly initialized embeddings
- Create a function to find the most similar words to a given word using the learned embeddings
- Apply embeddings to a non-text problem (e.g., representing users and products in a recommendation system)