How to choose an activation function
Category: AI & Machine Learning
Short answer
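Activation functions introduce the non-linearity that lets a neural network learn complex mappings from inputs to outputs. As a rule of thumb, use ReLU or a close variant in hidden layers and pick the output activation to match the task: softmax for multi-class classification, sigmoid for binary classification, and linear for regression.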
Steps
- Use ReLU or Leaky ReLU as the default for hidden layers in most architectures.
- Choose sigmoid or tanh only when outputs must be bounded, such as in gating mechanisms.
- Use softmax in the final layer for multi-class classification.
- Use linear activation for regression output layers.
- Experiment with Swish or GELU for very deep networks if training stability is an issue.
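For reference, the activations named in the steps above can be written out in a few lines of plain Python. This is a minimal sketch for intuition, not how a framework implements them; the function names and the 0.01 leaky slope are illustrative choices rather than anything prescribed above.

```python
import math

def relu(x):
    # max(0, x): passes positive inputs, zeroes out negative ones
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope (0.01 is an assumed default)
    return x if x > 0 else slope * x

def sigmoid(x):
    # Squashes any real input into (0, 1); branch on sign to avoid overflow in exp
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    # Squashes any real input into (-1, 1)
    return math.tanh(x)

def swish(x):
    # x * sigmoid(x), also known as SiLU
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Sigmoid and tanh are the only bounded scalar functions here, which is why they suit gates; softmax is the only one that operates on a whole vector and returns values that sum to 1.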
Tips
- ReLU is computationally cheap and mitigates vanishing gradients compared to sigmoid.
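- Leaky ReLU and ELU avoid the dying ReLU problem by keeping a small, non-zero gradient for negative inputs.
- Match the output activation to the loss function. Many frameworks fuse the two for numerical stability; in PyTorch, for example, nn.CrossEntropyLoss expects raw logits and applies log-softmax internally.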
- Plot activations during training to detect saturation or dead neurons.
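One way to act on the last tip is to record, per batch, how many ReLU units output zero for every sample. The sketch below uses PyTorch forward hooks; the helper name `attach_dead_unit_monitor` and the "zero for the whole batch" criterion are assumptions made for illustration, and a unit that is silent on one batch is not necessarily dead for good.

```python
import torch
import torch.nn as nn

def attach_dead_unit_monitor(model):
    """Hypothetical helper: track inactive ReLU units via forward hooks.

    Returns (stats, handles). `stats` maps each nn.ReLU module name to the
    fraction of its units that were zero for every sample in the most
    recent forward pass.
    """
    stats = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # A unit counts as inactive if it outputs 0 for the whole batch
            inactive = (output == 0).all(dim=0)
            stats[name] = inactive.float().mean().item()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            handles.append(module.register_forward_hook(make_hook(name)))
    return stats, handles

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
stats, handles = attach_dead_unit_monitor(model)

with torch.no_grad():
    model(torch.randn(64, 784))  # random data stands in for a real batch

print(stats)  # keyed by module name; the ReLU in this Sequential is "1"

for h in handles:
    h.remove()  # detach the hooks once monitoring is done
```

Logging these fractions once per epoch and plotting them over time gives a simple view of the trend. A fraction that stays high across many batches suggests dying ReLU units and may justify a lower learning rate or a switch to Leaky ReLU.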
Common issues
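- Sigmoid and tanh saturate for inputs of large magnitude, where their gradients approach zero and cause vanishing gradients.
- ReLU neurons can die if a weight update pushes their pre-activations negative for all inputs; the gradient is then zero and the neuron stops updating.
- Using a two-output softmax for binary classification instead of a single sigmoid output adds redundant parameters.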
- Mismatched activation and loss functions lead to poor convergence.
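Two of the issues above can be checked numerically. The sketch below, in plain Python, compares the derivative of sigmoid with that of ReLU as the input grows, then confirms that a two-class softmax yields the same probability as a sigmoid applied to the difference of the two logits. The helper names and the sample logits are illustrative.

```python
import math

def sigmoid(x):
    # Branch on sign to avoid overflow in exp
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU has gradient 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0

# Saturation: the sigmoid gradient shrinks toward 0 as x grows, ReLU's stays at 1
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

def softmax2(z0, z1):
    # Two-class softmax, shifted by the max for numerical stability
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

# Redundancy: P(class 1) from softmax equals sigmoid(z1 - z0)
z0, z1 = 0.3, 1.7  # arbitrary example logits
p_softmax = softmax2(z0, z1)[1]
p_sigmoid = sigmoid(z1 - z0)
print(p_softmax, p_sigmoid)
```

Because only the difference z1 - z0 affects the result, the second output unit adds weights without adding expressive power, which is why a single sigmoid output is the usual choice for binary classification.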
Example
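import torch
import torch.nn as nn

# Two-layer classifier for 784 input features and 10 classes
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),           # hidden-layer activation
    nn.Dropout(0.2),     # regularization, active only in training mode
    nn.Linear(256, 10),  # outputs raw logits; no softmax here
)

# CrossEntropyLoss applies log-softmax internally, so it takes logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)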
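This snippet defines a simple PyTorch classifier with a ReLU hidden layer and dropout for regularization, a cross-entropy loss, and the Adam optimizer. The final layer has no explicit softmax because nn.CrossEntropyLoss applies log-softmax to the raw logits internally.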
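To show how the pieces fit together, here is a hedged sketch of a single training step and an inference call for the same model. The random tensors are placeholders for real data (the batch size of 32 is an assumption, and the 784 features match the first layer). The loss receives raw logits, and softmax is applied only when probabilities are needed.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch: 32 random inputs and random class labels in [0, 10)
inputs = torch.randn(32, 784)
targets = torch.randint(0, 10, (32,))

# One training step: the loss is computed on raw logits
model.train()
optimizer.zero_grad()
logits = model(inputs)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# Inference: disable dropout, then convert logits to probabilities
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(inputs), dim=1)
    predictions = probs.argmax(dim=1)
```

Applying softmax inside the model and then passing the result to nn.CrossEntropyLoss would apply softmax twice, which is a common instance of the activation and loss mismatch listed under Common issues.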