TensorFlow Activation Functions
Introduction
Activation functions are a crucial component of neural networks: they introduce non-linearity into the model. Without them, a network collapses into a single linear transformation, no matter how many layers it has. In this tutorial, we'll explore the various activation functions available in TensorFlow, understand their characteristics, and learn how to implement them in your neural network models.
What are Activation Functions?
Activation functions determine the output of a neural network node given an input or set of inputs. They "activate" based on whether the input to the node is relevant for the model's prediction or not.
Here's a simple representation of where activation functions fit in a neural network:
Input → Weights → Weighted Sum → Activation Function → Output
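To make that flow concrete, here's a minimal sketch of a single neuron computed by hand (the input, weight, and bias values are arbitrary, chosen just for illustration):
import tensorflow as tf

# Arbitrary example values for a single neuron with three inputs
inputs = tf.constant([0.5, -1.2, 3.0])
weights = tf.constant([0.8, 0.1, -0.4])
bias = tf.constant(0.2)

# Weighted sum (pre-activation), then the activation function
z = tf.reduce_sum(inputs * weights) + bias
output = tf.nn.relu(z)  # try tf.nn.sigmoid, tf.nn.tanh, etc.

print(z.numpy(), output.numpy())  # approximately -0.72 and 0.0 with these values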
Common Activation Functions in TensorFlow
TensorFlow provides a variety of activation functions through its tf.keras.activations module and its tf.nn module. Let's explore the most common ones:
1. ReLU (Rectified Linear Unit)
ReLU is one of the most widely used activation functions in deep learning due to its simplicity and effectiveness.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Creating input data
x = np.linspace(-10, 10, 100)
# ReLU activation function
relu_output = tf.nn.relu(x).numpy()
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, relu_output)
plt.grid(True)
plt.title('ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()
Mathematical representation: f(x) = max(0, x)
Characteristics:
- Simple and computationally efficient
- Helps mitigate the vanishing gradient problem
- Not zero-centered, which can cause zig-zagging dynamics in gradient descent
- Can lead to the "dying ReLU" problem, where neurons become permanently inactive (see the quick check below)
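As a quick check of both points above, here's a minimal sketch (the input values are made up) showing that tf.nn.relu matches max(0, x) elementwise and that its gradient is zero for negative inputs, which is what makes "dead" neurons possible:
import tensorflow as tf

x = tf.constant([-3.0, -0.5, 0.0, 2.0])

with tf.GradientTape() as tape:
    tape.watch(x)  # x is a constant tensor, so we must watch it explicitly
    y = tf.nn.relu(x)

print(y.numpy())                    # [0. 0. 0. 2.] == max(0, x)
print(tape.gradient(y, x).numpy())  # [0. 0. 0. 1.] -- no gradient flows for x < 0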
2. Sigmoid
The sigmoid function transforms input values into a range between 0 and 1, making it useful for models that predict probability.
# Sigmoid activation function
sigmoid_output = tf.nn.sigmoid(x).numpy()
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, sigmoid_output)
plt.grid(True)
plt.title('Sigmoid Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()
Mathematical representation: f(x) = 1 / (1 + e^(-x))
Characteristics:
- Output range is bounded between 0 and 1
- Smooth gradient
- Suffers from the vanishing gradient problem for extreme inputs (see the gradient demo after this list)
- Not zero-centered
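To see the vanishing gradient numerically, here's a minimal sketch (input values are arbitrary) that computes the derivative of the sigmoid with tf.GradientTape; it peaks at 0.25 at x = 0 and is nearly zero for large |x|:
import tensorflow as tf

x = tf.constant([-10.0, -2.0, 0.0, 2.0, 10.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.sigmoid(x)

# d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))
print(tape.gradient(y, x).numpy())  # roughly [0.00005 0.105 0.25 0.105 0.00005]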
3. Tanh (Hyperbolic Tangent)
Tanh is similar to the sigmoid function but maps inputs to a range from -1 to 1.
# Tanh activation function
tanh_output = tf.nn.tanh(x).numpy()
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, tanh_output)
plt.grid(True)
plt.title('Tanh Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()
Mathematical representation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Characteristics:
- Output range is bounded between -1 and 1
- Zero-centered, which helps in convergence
- Still suffers from vanishing gradient problem for extreme values
4. Leaky ReLU
Leaky ReLU is a variant of ReLU that allows a small gradient when the unit is not active.
# Leaky ReLU activation function
alpha = 0.1
leaky_relu_output = tf.nn.leaky_relu(x, alpha=alpha).numpy()
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, leaky_relu_output)
plt.grid(True)
plt.title(f'Leaky ReLU Activation Function (alpha={alpha})')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()
Mathematical representation: f(x) = max(αx, x), where α is a small constant
Characteristics:
- Prevents the "dying ReLU" problem
- Allows for a small gradient when the unit is not active
- Not zero-centered
5. Softmax
Softmax is often used in the output layer of a classifier to normalize the output to a probability distribution.
# Sample logits (raw output from last neural network layer)
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.1, 1.0, 3.0]])
# Apply softmax
softmax_output = tf.nn.softmax(logits).numpy()
print("Logits (raw output):")
print(logits.numpy())
print("\nSoftmax probabilities:")
print(softmax_output)
print("\nSum of probabilities for each example:")
print(np.sum(softmax_output, axis=1)) # Should be close to 1 for each example
Output:
Logits (raw output):
[[2. 1. 0.1]
[0.1 1. 3. ]]
Softmax probabilities:
[[0.65900114 0.24243298 0.09856589]
 [0.04622408 0.11369288 0.84008306]]
Sum of probabilities for each example:
[1. 1.]
Mathematical representation: f(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over all classes j
Characteristics:
- Converts logits to probabilities (values between 0 and 1 that sum to 1)
- Useful for multi-class classification problems
- Highlights the largest values and suppresses significantly smaller ones
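A common alternative is to keep the model's output as raw logits and let the loss apply the softmax internally via from_logits=True, which is more numerically stable. Here's a minimal sketch reusing the logits from above (the one-hot labels are made up for illustration):
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.1, 1.0, 3.0]])
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])  # made-up one-hot targets

# Option 1: apply softmax yourself, then a probability-based loss
probs = tf.nn.softmax(logits)
loss_from_probs = tf.keras.losses.CategoricalCrossentropy()(labels, probs)

# Option 2: pass raw logits and let the loss apply softmax internally
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, logits)

print(loss_from_probs.numpy(), loss_from_logits.numpy())  # the two values should match closely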
Implementing Activation Functions in TensorFlow Models
There are three main ways to add activation functions to your TensorFlow models:
1. Directly in Layer Definition
import tensorflow as tf
from tensorflow import keras
# Creating a simple model with activation functions
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),  # ReLU activation
    keras.layers.Dense(64, activation='relu'),                       # ReLU activation
    keras.layers.Dense(10, activation='softmax')                     # Softmax activation
])
model.summary()
2. As a Separate Layer
model = keras.Sequential([
    keras.layers.Dense(128, input_shape=(784,)),
    keras.layers.Activation('relu'),    # Explicit activation layer
    keras.layers.Dense(64),
    keras.layers.Activation('relu'),    # Explicit activation layer
    keras.layers.Dense(10),
    keras.layers.Activation('softmax')  # Explicit activation layer
])
model.summary()
3. Using the Functional API
input_layer = keras.layers.Input(shape=(784,))
hidden_1 = keras.layers.Dense(128)(input_layer)
activation_1 = keras.layers.Activation('relu')(hidden_1)
hidden_2 = keras.layers.Dense(64)(activation_1)
activation_2 = keras.layers.Activation('relu')(hidden_2)
output_layer = keras.layers.Dense(10)(activation_2)
activation_output = keras.layers.Activation('softmax')(output_layer)
model = keras.Model(inputs=input_layer, outputs=activation_output)
model.summary()
Advanced Activation Functions
TensorFlow also provides advanced activation functions as standalone layers in the tf.keras.layers module:
from tensorflow.keras.layers import LeakyReLU, PReLU, ELU, ThresholdedReLU
model = keras.Sequential([
    keras.layers.Dense(128, input_shape=(784,)),
    LeakyReLU(alpha=0.1),        # Leaky ReLU with alpha=0.1
    keras.layers.Dense(64),
    PReLU(),                     # Parametric ReLU
    keras.layers.Dense(32),
    ELU(alpha=1.0),              # Exponential Linear Unit
    keras.layers.Dense(16),
    ThresholdedReLU(theta=1.0),  # Thresholded ReLU
    keras.layers.Dense(10, activation='softmax')
])
model.summary()
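Beyond these built-in layers, you can also pass any Python function as an activation. As a minimal sketch, here's a hand-written swish activation, swish(x) = x * sigmoid(x) (recent TensorFlow versions also ship this as tf.keras.activations.swish, but the manual definition keeps the example self-contained):
import tensorflow as tf
from tensorflow import keras

# A custom activation: swish(x) = x * sigmoid(x)
def swish(x):
    return x * tf.nn.sigmoid(x)

model = keras.Sequential([
    keras.layers.Dense(128, activation=swish, input_shape=(784,)),
    keras.layers.Dense(64, activation=swish),
    keras.layers.Dense(10, activation='softmax')
])

model.summary()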
Practical Example: MNIST Classification
Let's build a simple neural network to classify MNIST digits and observe how different activation functions affect performance:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from time import time
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Preprocess the data
x_train = x_train.reshape(-1, 28*28).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28*28).astype('float32') / 255.0
# Convert labels to categorical
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
# Function to create and train model with different activation functions
def create_and_train_model(activation):
    model = keras.Sequential([
        keras.layers.Dense(128, activation=activation, input_shape=(784,)),
        keras.layers.Dense(64, activation=activation),
        keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    start_time = time()
    history = model.fit(
        x_train, y_train,
        batch_size=128,
        epochs=5,
        verbose=0,
        validation_split=0.1
    )
    training_time = time() - start_time
    # Evaluate the model on the held-out test set
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    return history, test_acc, training_time
# List of activation functions to test
activations = ['relu', 'tanh', 'sigmoid']
results = {}
# Train models with different activation functions
for activation in activations:
    print(f"Training with {activation}...")
    history, test_acc, training_time = create_and_train_model(activation)
    results[activation] = {
        'history': history,
        'test_acc': test_acc,
        'time': training_time
    }
    print(f"{activation} - Test accuracy: {test_acc:.4f}, Training time: {training_time:.2f}s")
# Plotting training histories
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
for activation in activations:
    plt.plot(results[activation]['history'].history['accuracy'], label=f'{activation}')
plt.title('Training Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
for activation in activations:
    plt.plot(results[activation]['history'].history['val_accuracy'], label=f'{activation}')
plt.title('Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
This example demonstrates how different activation functions affect the training process and final model performance. You might observe that ReLU typically trains faster and achieves better performance for this task.
Choosing the Right Activation Function
When deciding which activation function to use (see the sketch after this list for the output-layer cases):
- Hidden layers:
  - Start with ReLU (the default choice for most cases)
  - If you encounter "dying ReLU" problems, try Leaky ReLU or ELU
  - For recurrent neural networks, consider tanh
- Output layer:
  - For binary classification: sigmoid
  - For multi-class classification: softmax
  - For regression: linear (no activation)
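Here's a minimal sketch of those output-layer choices (the hidden layer size, input shape, and losses are illustrative assumptions, not fixed rules):
from tensorflow import keras

# Binary classification: a single sigmoid unit with binary cross-entropy
binary = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(1, activation='sigmoid')
])
binary.compile(optimizer='adam', loss='binary_crossentropy')

# Multi-class classification: one softmax unit per class with categorical cross-entropy
multiclass = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(10, activation='softmax')
])
multiclass.compile(optimizer='adam', loss='categorical_crossentropy')

# Regression: a linear output (no activation) with mean squared error
regression = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(1)
])
regression.compile(optimizer='adam', loss='mse')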
Summary
Activation functions are crucial components in neural networks that introduce non-linearity, allowing models to learn complex patterns:
- ReLU is widely used and computationally efficient, but may suffer from dying neurons
- Sigmoid and Tanh are bounded functions useful for specific scenarios but can suffer from vanishing gradients
- Leaky ReLU, PReLU, and ELU are improved variants that address some of ReLU's limitations
- Softmax is specifically designed for multi-class classification outputs
When building neural networks, the choice of activation function can significantly impact model performance, training speed, and convergence. Experimentation is often necessary to find the best activation function for a specific task.
Additional Resources and Exercises
Further Reading
- TensorFlow Keras Activation Functions Documentation
- Understanding Activation Functions in Deep Learning
Exercises
- Visualization Exercise: Create a function that visualizes all the activation functions we covered, plotting them side by side for comparison.
- Performance Comparison: Implement a neural network for a dataset of your choice (e.g., CIFAR-10) and compare the performance of different activation functions.
- Custom Activation Function: Create a custom activation function in TensorFlow and use it in a neural network. Compare its performance with standard activation functions.
- Dying ReLU Investigation: Create a deep network with ReLU activations and visualize the activations in each layer. Look for evidence of the "dying ReLU" problem and try to mitigate it using Leaky ReLU.
- Gradient Flow Analysis: Implement a simple network with different activation functions and visualize how gradients flow through the network during backpropagation.