TensorFlow Activation Functions

Introduction

Activation functions are a crucial component of neural networks that introduce non-linearity into the model. Without activation functions, neural networks would simply be linear regression models, regardless of how many layers they have. In this tutorial, we'll explore the various activation functions available in TensorFlow, understand their characteristics, and learn how to implement them in your neural network models.

What are Activation Functions?

Activation functions determine the output of a neural network node given an input or set of inputs. They control how strongly a node "activates" (fires) for a given input, passing relevant signals forward and suppressing others.

Here's a simple representation of where activation functions fit in a neural network:

Input → Weight → Sum → Activation Function → Output
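
To make this concrete, here's a minimal sketch of a single neuron computed by hand: a weighted sum of the inputs plus a bias, followed by an activation. The input values, weights, and bias below are arbitrary numbers chosen purely for illustration.

python
import tensorflow as tf

# Arbitrary example inputs, weights, and bias for one neuron
inputs = tf.constant([0.5, -1.2, 3.0])
weights = tf.constant([0.8, 0.1, -0.4])
bias = tf.constant(0.2)

# Sum step: weighted sum of the inputs plus the bias
z = tf.reduce_sum(inputs * weights) + bias

# Activation step: apply a non-linearity (ReLU here)
output = tf.nn.relu(z)

print("Pre-activation:", z.numpy())
print("Post-activation:", output.numpy())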

Common Activation Functions in TensorFlow

TensorFlow provides a variety of activation functions through its tf.keras.activations module and the tf.nn module. Let's explore the most common ones:
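
The two modules largely expose the same functions: tf.nn provides the low-level ops, while tf.keras.activations provides the Keras-facing wrappers (and lets you refer to activations by string name in layers). Here's a quick sketch showing that both interfaces agree on a few arbitrary sample values:

python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.5, 3.0])

relu_nn = tf.nn.relu(x)                    # low-level op in tf.nn
relu_keras = tf.keras.activations.relu(x)  # Keras-facing wrapper

print(relu_nn.numpy())     # [0.  0.  0.  1.5 3. ]
print(relu_keras.numpy())  # same result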

1. ReLU (Rectified Linear Unit)

ReLU is one of the most widely used activation functions in deep learning due to its simplicity and effectiveness.

python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Creating input data
x = np.linspace(-10, 10, 100)

# ReLU activation function
relu_output = tf.nn.relu(x).numpy()

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, relu_output)
plt.grid(True)
plt.title('ReLU Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()

Mathematical representation: f(x) = max(0, x)

Characteristics:

  • Simple and computationally efficient
  • Helps mitigate the vanishing gradient problem
  • Not zero-centered, which can cause zig-zagging dynamics in gradient descent
  • Can lead to "dying ReLU" problem where neurons become inactive
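
To see the last point in action, the short sketch below uses tf.GradientTape to show that ReLU's gradient is 0 for negative inputs and 1 for positive inputs; a neuron whose pre-activation stays negative therefore receives no gradient signal and stops learning.

python
import tensorflow as tf

x = tf.Variable([-3.0, -0.5, 0.5, 3.0])

with tf.GradientTape() as tape:
    y = tf.nn.relu(x)

# Gradient of ReLU with respect to each input element
grads = tape.gradient(y, x)
print(grads.numpy())  # [0. 0. 1. 1.] -- zero gradient where the input is negative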

2. Sigmoid

The sigmoid function transforms input values into a range between 0 and 1, making it useful for models that predict probability.

python
# Sigmoid activation function
sigmoid_output = tf.nn.sigmoid(x).numpy()

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, sigmoid_output)
plt.grid(True)
plt.title('Sigmoid Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()

Mathematical representation: f(x) = 1 / (1 + e^(-x))

Characteristics:

  • Output range is bounded between 0 and 1
  • Smooth gradient
  • Suffers from vanishing gradient problem for extreme inputs
  • Not zero-centered
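
The vanishing-gradient behaviour is easy to verify: the sigmoid's derivative is f(x)(1 − f(x)), which peaks at 0.25 at x = 0 and approaches 0 for large |x|. A minimal sketch using tf.GradientTape:

python
import tensorflow as tf

x = tf.Variable([-10.0, -2.0, 0.0, 2.0, 10.0])

with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x)

grads = tape.gradient(y, x)
print(grads.numpy())
# Roughly [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05] -- nearly zero at the extremes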

3. Tanh (Hyperbolic Tangent)

Tanh is similar to the sigmoid function but maps inputs to a range from -1 to 1.

python
# Tanh activation function
tanh_output = tf.nn.tanh(x).numpy()

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, tanh_output)
plt.grid(True)
plt.title('Tanh Activation Function')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()

Mathematical representation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Characteristics:

  • Output range is bounded between -1 and 1
  • Zero-centered, which helps in convergence
  • Still suffers from vanishing gradient problem for extreme values
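
A useful identity is that tanh is a rescaled sigmoid, tanh(x) = 2·sigmoid(2x) − 1, which explains why it shares the sigmoid's saturation behaviour while being zero-centered. A quick numerical check:

python
import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

tanh_direct = tf.nn.tanh(x)
tanh_via_sigmoid = 2.0 * tf.nn.sigmoid(2.0 * x) - 1.0

print(tanh_direct.numpy())
print(tanh_via_sigmoid.numpy())  # matches up to floating-point precision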

4. Leaky ReLU

Leaky ReLU is a variant of ReLU that allows a small gradient when the unit is not active.

python
# Leaky ReLU activation function
alpha = 0.1
leaky_relu_output = tf.nn.leaky_relu(x, alpha=alpha).numpy()

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(x, leaky_relu_output)
plt.grid(True)
plt.title(f'Leaky ReLU Activation Function (alpha={alpha})')
plt.xlabel('Input')
plt.ylabel('Output')
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.show()

Mathematical representation: f(x) = max(αx, x) where α is a small constant

Characteristics:

  • Prevents the "dying ReLU" problem
  • Allows for a small gradient when the unit is not active
  • Not zero-centered
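
Because α < 1, the formula above amounts to returning x for positive inputs and αx for negative ones. Here's a small sketch comparing a hand-rolled version with tf.nn.leaky_relu:

python
import tensorflow as tf

x = tf.constant([-4.0, -1.0, 0.0, 2.0])
alpha = 0.1

manual = tf.maximum(alpha * x, x)           # max(αx, x), valid for 0 < α < 1
builtin = tf.nn.leaky_relu(x, alpha=alpha)

print(manual.numpy())   # [-0.4 -0.1  0.   2. ]
print(builtin.numpy())  # same values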

5. Softmax

Softmax is often used in the output layer of a classifier to normalize the output to a probability distribution.

python
# Sample logits (raw output from last neural network layer)
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.1, 1.0, 3.0]])

# Apply softmax
softmax_output = tf.nn.softmax(logits).numpy()

print("Logits (raw output):")
print(logits.numpy())
print("\nSoftmax probabilities:")
print(softmax_output)
print("\nSum of probabilities for each example:")
print(np.sum(softmax_output, axis=1)) # Should be close to 1 for each example

Output:

Logits (raw output):
[[2.  1.  0.1]
 [0.1 1.  3. ]]

Softmax probabilities:
[[0.65900114 0.24243298 0.09856589]
 [0.04622407 0.11369288 0.84008306]]

Sum of probabilities for each example:
[1. 1.]

Mathematical representation: f(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over all classes j

Characteristics:

  • Converts logits to probabilities (values between 0 and 1 that sum to 1)
  • Useful for multi-class classification problems
  • Highlights the largest values and suppresses significantly smaller ones
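
To tie the formula back to the code, here is a hand-rolled softmax (with the usual subtract-the-row-max trick for numerical stability) compared against tf.nn.softmax on the same logits:

python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.1, 1.0, 3.0]])

# Subtract each row's max before exponentiating for numerical stability
shifted = logits - tf.reduce_max(logits, axis=1, keepdims=True)
exps = tf.exp(shifted)
manual_softmax = exps / tf.reduce_sum(exps, axis=1, keepdims=True)

print(manual_softmax.numpy())
print(tf.nn.softmax(logits).numpy())  # matches the built-in version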

Implementing Activation Functions in TensorFlow Models

There are three main ways to add activation functions to your TensorFlow models:

1. Directly in Layer Definition

python
import tensorflow as tf
from tensorflow import keras

# Creating a simple model with activation functions
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),  # ReLU activation
    keras.layers.Dense(64, activation='relu'),                       # ReLU activation
    keras.layers.Dense(10, activation='softmax')                     # Softmax activation
])

model.summary()

2. As a Separate Layer

python
model = keras.Sequential([
    keras.layers.Dense(128, input_shape=(784,)),
    keras.layers.Activation('relu'),     # Explicit activation layer
    keras.layers.Dense(64),
    keras.layers.Activation('relu'),     # Explicit activation layer
    keras.layers.Dense(10),
    keras.layers.Activation('softmax')   # Explicit activation layer
])

model.summary()

3. Using the Functional API

python
input_layer = keras.layers.Input(shape=(784,))
hidden_1 = keras.layers.Dense(128)(input_layer)
activation_1 = keras.layers.Activation('relu')(hidden_1)
hidden_2 = keras.layers.Dense(64)(activation_1)
activation_2 = keras.layers.Activation('relu')(hidden_2)
output_layer = keras.layers.Dense(10)(activation_2)
activation_output = keras.layers.Activation('softmax')(output_layer)

model = keras.Model(inputs=input_layer, outputs=activation_output)
model.summary()

Advanced Activation Functions

TensorFlow also provides advanced activation functions through the tf.keras.layers module:

python
from tensorflow.keras.layers import LeakyReLU, PReLU, ELU, ThresholdedReLU

model = keras.Sequential([
    keras.layers.Dense(128, input_shape=(784,)),
    LeakyReLU(alpha=0.1),          # Leaky ReLU with alpha=0.1
    keras.layers.Dense(64),
    PReLU(),                       # Parametric ReLU
    keras.layers.Dense(32),
    ELU(alpha=1.0),                # Exponential Linear Unit
    keras.layers.Dense(16),
    ThresholdedReLU(theta=1.0),    # Thresholded ReLU
    keras.layers.Dense(10, activation='softmax')
])

model.summary()
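
The activation argument of a layer also accepts a plain Python callable, so the functions in tf.nn can be passed directly when no string alias or dedicated layer fits. Below is a minimal sketch (the alpha value in the lambda is an arbitrary choice); note that lambdas can complicate model serialization, so named functions or activation layers are usually preferable for saved models.

python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation=tf.nn.relu, input_shape=(784,)),
    keras.layers.Dense(64, activation=lambda t: tf.nn.leaky_relu(t, alpha=0.1)),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.summary()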

Practical Example: MNIST Classification

Let's build a simple neural network to classify MNIST digits and observe how different activation functions affect performance:

python
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from time import time

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Preprocess the data
x_train = x_train.reshape(-1, 28*28).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28*28).astype('float32') / 255.0

# Convert labels to categorical
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

# Function to create and train model with different activation functions
def create_and_train_model(activation, name):
    model = keras.Sequential([
        keras.layers.Dense(128, activation=activation, input_shape=(784,)),
        keras.layers.Dense(64, activation=activation),
        keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    start_time = time()
    history = model.fit(
        x_train, y_train,
        batch_size=128,
        epochs=5,
        verbose=0,
        validation_split=0.1
    )
    training_time = time() - start_time

    # Evaluate the model
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)

    return history, test_acc, training_time

# List of activation functions to test
activations = ['relu', 'tanh', 'sigmoid']
results = {}

# Train models with different activation functions
for activation in activations:
    print(f"Training with {activation}...")
    history, test_acc, training_time = create_and_train_model(activation, activation)
    results[activation] = {
        'history': history,
        'test_acc': test_acc,
        'time': training_time
    }
    print(f"{activation} - Test accuracy: {test_acc:.4f}, Training time: {training_time:.2f}s")

# Plotting training histories
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
for activation in activations:
    plt.plot(results[activation]['history'].history['accuracy'], label=f'{activation}')
plt.title('Training Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
for activation in activations:
    plt.plot(results[activation]['history'].history['val_accuracy'], label=f'{activation}')
plt.title('Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

This example demonstrates how different activation functions affect the training process and final model performance. You might observe that ReLU typically trains faster and achieves better performance for this task.

Choosing the Right Activation Function

When deciding which activation function to use:

  1. Hidden layers:

    • Start with ReLU (default choice for most cases)
    • If you encounter "dying ReLU" problems, try Leaky ReLU or ELU
    • For recurrent neural networks, consider tanh
  2. Output layer:

    • For binary classification: sigmoid
    • For multi-class classification: softmax
    • For regression: linear (no activation)
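
As a quick reference for the output-layer choices above, here is a minimal sketch; the hidden-layer size and input shape are arbitrary placeholders.

python
import tensorflow as tf
from tensorflow import keras

# Binary classification: one output unit with sigmoid (probability of the positive class)
binary_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Multi-class classification: one unit per class with softmax
multiclass_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax')
])

# Regression: linear output, i.e. no activation on the last layer
regression_model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    keras.layers.Dense(1)
])

for m in (binary_model, multiclass_model, regression_model):
    m.summary()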

Summary

Activation functions are crucial components in neural networks that introduce non-linearity, allowing models to learn complex patterns:

  • ReLU is widely used and computationally efficient, but may suffer from dying neurons
  • Sigmoid and Tanh are bounded functions useful for specific scenarios but can suffer from vanishing gradients
  • Leaky ReLU, PReLU, and ELU are improved variants that address some of ReLU's limitations
  • Softmax is specifically designed for multi-class classification outputs

When building neural networks, the choice of activation function can significantly impact model performance, training speed, and convergence. Experimentation is often necessary to find the best activation function for a specific task.

Additional Resources and Exercises

Exercises

  1. Visualization Exercise: Create a function that visualizes all the activation functions we covered, plotting them side by side for comparison.

  2. Performance Comparison: Implement a neural network for a dataset of your choice (e.g., CIFAR-10) and compare the performance of different activation functions.

  3. Custom Activation Function: Create a custom activation function in TensorFlow and use it in a neural network. Compare its performance with standard activation functions.

  4. Dying ReLU Investigation: Create a deep network with ReLU activations and visualize the activations in each layer. Look for evidence of the "dying ReLU" problem and try to mitigate it using Leaky ReLU.

  5. Gradient Flow Analysis: Implement a simple network with different activation functions and visualize how gradients flow through the network during backpropagation.


