How to choose an activation function

Short answer

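The activation function introduces non-linearity into a neural network, enabling it to learn complex mappings from inputs to outputs. As a rule of thumb, use ReLU or one of its variants in hidden layers and pick the output activation to match the task: softmax for multi-class classification, sigmoid for binary classification, and a linear output for regression.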
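
To see why the non-linearity matters, here is a minimal plain-Python sketch. The scalar weights and the function names are illustrative assumptions, not part of any library. Two stacked linear layers with no activation between them collapse into a single linear layer, while a ReLU in between produces a function whose slope changes.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical scalar weights, chosen only for illustration.
W1, B1 = 2.0, -1.0
W2, B2 = -3.0, 0.5

def two_linear_layers(x):
    """Two linear layers with no activation in between."""
    return W2 * (W1 * x + B1) + B2

def collapsed(x):
    """The single linear layer that computes exactly the same function."""
    return (W2 * W1) * x + (W2 * B1 + B2)

def with_relu(x):
    """The same two layers with a ReLU between them."""
    return W2 * relu(W1 * x + B1) + B2

# Without an activation, depth adds nothing: both forms agree everywhere.
for x in (-2.0, 0.0, 1.0, 3.0):
    assert abs(two_linear_layers(x) - collapsed(x)) < 1e-12

# A linear function has one constant slope; the ReLU version does not.
slope_left = with_relu(-1.0) - with_relu(-2.0)   # ReLU inactive here
slope_right = with_relu(3.0) - with_relu(2.0)    # ReLU active here
print(slope_left, slope_right)
```

The two slopes differ, so the ReLU network is no longer a single linear map.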

Steps

  1. Use ReLU or Leaky ReLU as the default for hidden layers in most architectures.
  2. Choose sigmoid or tanh only when outputs must be bounded, such as in gating mechanisms.
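  3. Use softmax in the final layer for multi-class classification. In PyTorch, nn.CrossEntropyLoss applies log-softmax itself, so the model should output raw logits.
  4. Use a linear (identity) activation, meaning no activation at all, for regression output layers.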
  5. Experiment with Swish or GELU for very deep networks if training stability is an issue.
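
For reference, the functions named in the steps above can be written out in a few lines of plain Python. This is a sketch for scalar inputs, not how a framework implements them. The OUTPUT_ACTIVATION lookup and its keys are hypothetical names introduced here only to summarize the steps, and Swish is shown with beta fixed to 1 (the variant also called SiLU).

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Keeps a small slope for negative inputs instead of a flat zero.
    return x if x > 0 else negative_slope * x

def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    # Swish with beta = 1, also known as SiLU.
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    # Subtract the max before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical lookup summarizing the output-layer choices in the steps.
OUTPUT_ACTIVATION = {
    "multiclass": softmax,
    "binary": sigmoid,
    "regression": lambda x: x,  # linear / identity
}

probs = OUTPUT_ACTIVATION["multiclass"]([1.0, 2.0, 3.0])
```

math.tanh from the standard library covers the remaining function mentioned in step 2.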

Tips

  • ReLU is computationally cheap and, because its gradient is exactly 1 for positive inputs, mitigates the vanishing gradients seen with sigmoid.
  • Leaky ReLU and ELU avoid the dying ReLU problem by keeping a small non-zero gradient for negative inputs.
  • Match the output activation to the loss function for numerical stability; for example, feed raw logits to a loss that applies the softmax or sigmoid internally.
  • Plot activations during training to detect saturation or dead neurons.
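
Before plotting, the last tip can be reduced to two numbers per layer. The sketch below assumes you have already collected a layer's outputs as nested Python lists; the helper names, the 0.99 threshold, and the sample values are all hypothetical. Note that a neuron that is zero on one batch is only a candidate for being dead; it counts as dead if it stays at zero across many batches.

```python
def dead_fraction(activations_per_neuron):
    """Fraction of ReLU neurons that output zero for every sample given."""
    dead = sum(
        1 for acts in activations_per_neuron if all(a == 0.0 for a in acts)
    )
    return dead / len(activations_per_neuron)

def saturated_fraction(values, threshold=0.99):
    """Fraction of tanh outputs whose magnitude exceeds the threshold."""
    return sum(1 for v in values if abs(v) > threshold) / len(values)

# Hypothetical post-ReLU outputs: 3 neurons, 4 samples each.
relu_outputs = [
    [0.0, 1.2, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0],  # never fires on this batch
    [2.1, 0.0, 0.3, 0.0],
]
# Hypothetical tanh outputs from one layer.
tanh_outputs = [0.999, -0.2, 0.5, -0.995]

print(dead_fraction(relu_outputs))       # share of neurons silent on the batch
print(saturated_fraction(tanh_outputs))  # share of outputs pinned near +/-1
```

Tracking these two fractions over training steps gives an early warning before the problem shows up in the loss curve.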

Common issues

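  • Sigmoid and tanh saturate for inputs of large magnitude, where their gradients approach zero and cause vanishing gradients.
  • A ReLU neuron can die if a weight update pushes its pre-activation negative for every input: its gradient is then zero, so it stops updating and may never recover.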
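  • A two-output softmax for binary classification is equivalent to a single sigmoid output but carries a redundant set of output weights.
  • Mismatched activation and loss functions lead to poor convergence, for example applying softmax to the model output before a loss that already applies it internally.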
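
The saturation issue is easy to check numerically. The sketch below evaluates the derivative of each activation at a few inputs; the helper names are assumptions made for this example. The depth calculation looks only at the activation derivatives and ignores the weight terms that also appear in the chain rule.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Gradients shrink quickly for sigmoid and tanh as the input grows,
# while ReLU keeps a gradient of 1 for any positive input.
for x in (0.0, 5.0, 10.0):
    print(
        f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
        f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}"
    )

# Even at its maximum (x = 0) the sigmoid derivative is only 0.25, so the
# product of activation derivatives shrinks geometrically with depth.
depth = 10
best_case = sigmoid_grad(0.0) ** depth
print(f"best-case product over {depth} sigmoid layers: {best_case:.2e}")
```

The flip side is visible in relu_grad: for negative inputs the gradient is exactly zero, which is the mechanism behind dead ReLU neurons.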

Example

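import torch
import torch.nn as nn

# 784 inputs (e.g. a flattened 28x28 image) -> 256 hidden units -> 10 class logits
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),           # default hidden-layer activation
    nn.Dropout(0.2),     # regularization; active only in training mode
    nn.Linear(256, 10),  # raw logits: no softmax layer here
)
# CrossEntropyLoss applies log-softmax internally, so it expects logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)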

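This snippet defines a small PyTorch classifier with a ReLU hidden layer and dropout for regularization, trained with cross-entropy loss and the Adam optimizer. The final layer has no explicit softmax because nn.CrossEntropyLoss applies log-softmax to the logits internally.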
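
To make the link between the output layer and the loss concrete, here is a plain-Python sketch of the per-sample quantity a log-softmax cross-entropy computes from raw logits. It illustrates the math only and is not PyTorch's implementation; the function names and sample logits are assumptions, and nn.CrossEntropyLoss additionally averages over the batch by default.

```python
import math

def log_softmax(logits):
    # Log-sum-exp trick: shift by the max so math.exp cannot overflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class for one sample."""
    return -log_softmax(logits)[target]

# Hypothetical logits for a 3-class problem; class 0 is the true label.
logits = [2.0, 0.5, -1.0]
loss = cross_entropy_from_logits(logits, target=0)
print(f"loss = {loss:.4f}")

# Working in log space stays finite even for extreme logits, where
# computing softmax first and taking the log afterwards can break down.
big_loss = cross_entropy_from_logits([1000.0, 0.0], target=0)
print(f"big_loss = {big_loss}")
```

This is why the model above ends in a plain nn.Linear layer: applying softmax in the model and again inside the loss would distort the loss and its gradients.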