TensorFlow Embeddings
Introduction
Embeddings are a fundamental concept in deep learning, especially when working with text, categorical data, or any discrete objects. At their core, embeddings are dense vector representations of discrete objects, allowing these objects to be used in neural networks in a meaningful way.
In this tutorial, we'll explore what embeddings are, why they're crucial for working with recurrent neural networks (RNNs), and how to implement them using TensorFlow. By the end, you'll understand how to create, use, and visualize embeddings in your deep learning models.
What Are Embeddings?
Before diving into the code, let's understand the concept of embeddings.
The Need for Embeddings
Traditional machine learning approaches represent categorical variables using one-hot encoding, where each category gets a binary vector with a single "1" and the rest "0"s. For example:
- Dog: [1, 0, 0, 0]
- Cat: [0, 1, 0, 0]
- Horse: [0, 0, 1, 0]
- Cow: [0, 0, 0, 1]
However, this approach has several limitations:
- Dimensionality: With large vocabularies (e.g., thousands of words), one-hot vectors become very large and sparse
- Semantic meaning: One-hot encodings don't capture relationships between items
- Generalization: One-hot codes treat every category as equally unrelated, so models can't transfer what they learn about one item to similar items
Embeddings solve these problems by representing each category as a dense vector in a continuous vector space, where similar items are located close to each other.
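To make the dimensionality point concrete, here is a small sketch using the four-animal vocabulary above (the integer indices and the 10,000-word comparison are illustrative):

import tensorflow as tf

# Four animal categories, encoded as integer indices: dog, cat, horse, cow
indices = tf.constant([0, 1, 2, 3])

# One-hot: each item becomes a sparse 4-dimensional vector
one_hot = tf.one_hot(indices, depth=4)
print(one_hot.numpy())
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]

# With a 10,000-word vocabulary, each one-hot vector would have 10,000 entries,
# almost all zero. An embedding replaces it with a short dense vector whose
# values are learned, which is what the next section builds.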
Creating Embeddings in TensorFlow
Let's implement a basic embedding layer in TensorFlow:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Define vocabulary size and embedding dimension
vocab_size = 10000 # Number of words in vocabulary
embedding_dim = 64 # Size of the embedding vector
# Create an embedding layer
embedding_layer = tf.keras.layers.Embedding(
input_dim=vocab_size, # Size of the vocabulary
output_dim=embedding_dim, # Dimension of the dense embedding
name="embedding"
)
# Let's see what happens when we pass integer indices to the embedding layer
word_indices = tf.constant([42, 1337, 7])
embedded_words = embedding_layer(word_indices)
print(f"Input shape: {word_indices.shape}")
print(f"Output shape: {embedded_words.shape}")
print(f"Example embedding vector:\n{embedded_words[0].numpy()[:10]}...") # Show first 10 dimensions
Output:
Input shape: (3,)
Output shape: (3, 64)
Example embedding vector:
[-0.01117208 0.03457389 0.04560256 -0.04160505 0.00942195 -0.01789882
-0.04011846 -0.03130924 -0.01321477 0.01470375]...
As you can see, the embedding layer converts each integer index (representing a word or category) into a dense vector of size embedding_dim. Initially, these vectors are randomly initialized, but they will be learned during training.
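Under the hood, the embedding layer stores a single trainable weight matrix of shape (vocab_size, embedding_dim), and a lookup is just a row selection from that matrix. A minimal sketch of the equivalence, reusing the objects defined above:

# The layer's only weight is a (vocab_size, embedding_dim) matrix
weights = embedding_layer.get_weights()[0]
print(weights.shape)  # (10000, 64)

# Looking up indices is equivalent to gathering rows of that matrix
manual_lookup = tf.gather(weights, word_indices)
print(np.allclose(manual_lookup.numpy(), embedded_words.numpy()))  # True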
Using Embeddings with RNNs
Embeddings are particularly useful for RNNs when processing sequences of words or tokens. Here's how to use them together:
# Create a simple RNN model with embeddings
model = tf.keras.Sequential([
# Embedding layer converts integer indices to dense vectors
tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
# RNN layer processes the sequence
tf.keras.layers.LSTM(128, return_sequences=False),
# Output layer
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 640000
lstm (LSTM) (None, 128) 98816
dense (Dense) (None, 1) 129
=================================================================
Total params: 738,945
Trainable params: 738,945
Non-trainable params: 0
_________________________________________________________________
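Because the embedding layer was created with mask_zero=True, index 0 is reserved for padding: the layer emits a mask that the LSTM uses to skip padded timesteps. A quick way to inspect that mask on a toy padded batch (the token values below are arbitrary):

# Two post-padded sequences; index 0 marks padding positions
padded_batch = tf.constant([[5, 8, 3, 0, 0],
                            [2, 9, 0, 0, 0]])
mask = model.layers[0].compute_mask(padded_batch)
print(mask.numpy())
# [[ True  True  True False False]
#  [ True  True False False False]]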
Practical Example: Sentiment Analysis
Let's implement a practical example: sentiment analysis on movie reviews using the IMDB dataset:
# Load the IMDB dataset
imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# Print example
print(f"Example review (encoded): {x_train[0][:20]}...")
print(f"Example label: {y_train[0]} (1 = positive, 0 = negative)")
# Pad sequences to ensure uniform length
max_length = 250
x_train = tf.keras.preprocessing.sequence.pad_sequences(
x_train, maxlen=max_length, padding='post'
)
x_test = tf.keras.preprocessing.sequence.pad_sequences(
x_test, maxlen=max_length, padding='post'
)
print(f"Training data shape: {x_train.shape}")
Output:
Example review (encoded): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25]...
Example label: 1 (1 = positive, 0 = negative)
Training data shape: (25000, 250)
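For reference, post-padding simply appends zeros until each sequence reaches maxlen (longer sequences are truncated). A toy example with made-up token lists:

toy = tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3], [4, 5]], maxlen=4, padding='post'
)
print(toy)
# [[1 2 3 0]
#  [4 5 0 0]]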
Now let's build and train our model:
# Build the model
embedding_dim = 32
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
# Train the model
history = model.fit(
x_train, y_train,
epochs=5,
batch_size=128,
validation_split=0.2
)
# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")
Output:
Epoch 1/5
157/157 [==============================] - 28s 155ms/step - loss: 0.5006 - accuracy: 0.7538 - val_loss: 0.3923 - val_accuracy: 0.8270
Epoch 2/5
157/157 [==============================] - 24s 153ms/step - loss: 0.3028 - accuracy: 0.8762 - val_loss: 0.3367 - val_accuracy: 0.8590
Epoch 3/5
157/157 [==============================] - 24s 153ms/step - loss: 0.2267 - accuracy: 0.9122 - val_loss: 0.3375 - val_accuracy: 0.8568
Epoch 4/5
157/157 [==============================] - 24s 152ms/step - loss: 0.1770 - accuracy: 0.9376 - val_loss: 0.3809 - val_accuracy: 0.8496
Epoch 5/5
157/157 [==============================] - 24s 153ms/step - loss: 0.1260 - accuracy: 0.9591 - val_loss: 0.4103 - val_accuracy: 0.8464
782/782 [==============================] - 11s 14ms/step - loss: 0.3731 - accuracy: 0.8435
Test accuracy: 0.8435
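To classify a raw review with the trained model, the text has to be encoded exactly like the IMDB data: look up each word in the IMDB word index, add the dataset's offset of 3 (indices 0, 1, and 2 are reserved for padding, start, and unknown tokens), and pad. A rough sketch, where the helper and sample sentence are hypothetical and out-of-vocabulary words are simply dropped:

word_index = imdb.get_word_index()

def encode_review(text, maxlen=max_length):
    # Shift word indices by 3 to match the encoding used by imdb.load_data()
    tokens = [word_index[w] + 3 for w in text.lower().split()
              if w in word_index and word_index[w] + 3 < vocab_size]
    return tf.keras.preprocessing.sequence.pad_sequences(
        [[1] + tokens], maxlen=maxlen, padding='post'  # 1 is the start token
    )

sample = encode_review("this movie was absolutely wonderful")
print(model.predict(sample))  # a value near 1.0 indicates positive sentiment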
Visualizing Embeddings
One of the most powerful aspects of embeddings is that they capture semantic relationships between words or categories. We can visualize these relationships using dimensionality reduction techniques like t-SNE:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Get the embedding weights
embedding_weights = model.layers[0].get_weights()[0]
# Get the word index from the IMDB dataset
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Select a subset of words to visualize
num_words_to_visualize = 100
word_embeddings = embedding_weights[:num_words_to_visualize]
# Perform t-SNE dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
reduced_embeddings = tsne.fit_transform(word_embeddings)
# Plot the embeddings
plt.figure(figsize=(12, 8))
for i in range(num_words_to_visualize):
    # IMDB tokens are offset by 3 (0 = padding, 1 = start, 2 = unknown),
    # so embedding row i corresponds to entry i - 3 in the word index
    if i - 3 in reverse_word_index:
        word = reverse_word_index[i - 3]
        x, y = reduced_embeddings[i]
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
plt.title("t-SNE visualization of word embeddings")
plt.show()
The output would be a 2D scatter plot showing how words are related in the embedding space.
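Another way to probe the learned space is to look up a word's nearest neighbours by cosine similarity. A rough sketch reusing the objects above (the query word is arbitrary, and the neighbours you get will vary from run to run):

def most_similar(word, top_k=5):
    # Map the word to its embedding row (IMDB tokens are offset by 3)
    query = embedding_weights[word_index[word] + 3]
    # Cosine similarity between the query and every row of the embedding matrix
    sims = embedding_weights @ query / (
        np.linalg.norm(embedding_weights, axis=1) * np.linalg.norm(query) + 1e-8
    )
    best = np.argsort(-sims)[1:top_k + 1]  # skip the query word itself
    return [reverse_word_index.get(i - 3, '?') for i in best]

print(most_similar("great"))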
Pre-trained Embeddings
Instead of learning embeddings from scratch, we can use pre-trained embeddings like GloVe or Word2Vec:
# Download GloVe embeddings (this would typically be done outside your code)
# !wget https://nlp.stanford.edu/data/glove.6B.zip
# !unzip -q glove.6B.zip
# Load pre-trained embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")
# Create an embedding matrix for our vocabulary
embedding_dim = 100  # The glove.6B.100d vectors are 100-dimensional
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    # Shift by 3 so rows line up with the IMDB encoding (0 = padding, 1 = start, 2 = unknown)
    if i + 3 < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the GloVe index remain all-zeros
            embedding_matrix[i + 3] = embedding_vector
# Create a model using the pre-trained embeddings
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size,
embedding_dim,
weights=[embedding_matrix], # Use pre-trained weights
trainable=False, # Freeze the embeddings
mask_zero=True
),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
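Freezing the layer keeps the GloVe vectors fixed while the rest of the network trains. A common follow-up, sketched below under the assumption that the model above has already been fitted, is to unfreeze the embeddings and fine-tune them with a small learning rate:

# Phase 2 (optional): unfreeze the pre-trained embeddings and fine-tune gently
model.layers[0].trainable = True
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(1e-5),  # small learning rate to avoid destroying the GloVe structure
    metrics=['accuracy']
)
# model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.2)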
Advanced: Contextual Embeddings
Modern NLP uses contextual embeddings that change based on the context of a word. TensorFlow Hub provides access to these models:
import tensorflow_hub as hub
# Load a BERT model from TensorFlow Hub
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)
# Function to preprocess text for BERT
def bert_encode(texts, tokenizer, max_len=128):
all_tokens = []
all_masks = []
all_segments = []
for text in texts:
text = tokenizer.tokenize(text)
text = text[:max_len-2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
pad_len = max_len - len(input_sequence)
tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
pad_masks = [1] * len(input_sequence) + [0] * pad_len
segment_ids = [0] * max_len
all_tokens.append(tokens)
all_masks.append(pad_masks)
all_segments.append(segment_ids)
return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
# Example of using BERT embeddings (simplified)
def create_bert_model(max_len=128):
input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
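    # Note: depending on the version of this TF Hub model, the layer may expect a
    # dict of inputs and return a dict of outputs rather than the list/tuple
    # interface shown on the next line.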
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
clf_output = sequence_output[:, 0, :]
net = tf.keras.layers.Dense(64, activation='relu')(clf_output)
net = tf.keras.layers.Dropout(0.2)(net)
net = tf.keras.layers.Dense(1, activation='sigmoid')(net)
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=net)
return model
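Putting the pieces together might look like the sketch below; the tokenizer, raw texts, and labels are placeholders you would supply yourself (for example, a WordPiece tokenizer matching the BERT checkpoint):

max_len = 128
bert_model = create_bert_model(max_len=max_len)
bert_model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(2e-5),  # a small learning rate is customary with BERT
    metrics=['accuracy']
)

# tokens, masks, segments = bert_encode(raw_texts, tokenizer, max_len=max_len)
# bert_model.fit([tokens, masks, segments], labels, epochs=2, batch_size=16)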
Summary
In this tutorial, we covered:
- What embeddings are: Dense vector representations of discrete objects
- Why embeddings matter: They provide meaningful representations that capture semantic relationships
- How to create embeddings in TensorFlow: Using the tf.keras.layers.Embedding layer
- Using embeddings with RNNs: Combining embeddings with recurrent layers for sequence processing
- Real-world application: Sentiment analysis on movie reviews
- Visualizing embeddings: Understanding relationships between words
- Pre-trained embeddings: Leveraging existing knowledge with GloVe or Word2Vec
- Contextual embeddings: Using advanced models like BERT
Embeddings are a crucial concept in deep learning for NLP and other areas dealing with categorical data. They allow neural networks to work effectively with discrete objects by placing them in a continuous vector space where similar items are close to each other.
Additional Resources
- TensorFlow Embedding Layer Documentation
- TensorFlow Text Classification Tutorial
- TensorFlow Hub for Pre-trained Embeddings
- Word2Vec Paper
- GloVe: Global Vectors for Word Representation
Exercises
- Build a text classifier using embeddings for a different dataset (e.g., news classification)
- Experiment with different embedding dimensions and observe the effects on model performance
- Implement a model that uses pre-trained GloVe embeddings and compare it with a model using randomly initialized embeddings
- Create a function to find the most similar words to a given word using the learned embeddings
- Apply embeddings to a non-text problem (e.g., representing users and products in a recommendation system)