How to choose an activation function

QA Hub Editorial · Accepted Answer

Short answer

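The activation function introduces non-linearity into a neural network, enabling it to learn complex mappings from inputs to outputs. Choose it by layer role: ReLU or a variant for hidden layers, and an output activation that matches the task (softmax for multi-class classification, sigmoid for binary classification, linear for regression).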

Steps

Use ReLU or Leaky ReLU as the default for hidden layers in most architectures.
Choose sigmoid or tanh only when outputs must be bounded, such as in gating mechanisms.
Use softmax in the final layer for multi-class classification.
Use linear activation for regression output layers.
Experiment with Swish or GELU for very deep networks if training stability is an issue.
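
As a rough sketch of the steps above, the same small network can be built with different hidden activations and output heads. The helper name make_mlp and the layer sizes are illustrative assumptions, not part of the original answer.

```python
import torch
import torch.nn as nn

def make_mlp(in_features, hidden, out_features, hidden_act=nn.ReLU):
    """Hypothetical helper: one hidden layer with a swappable activation."""
    return nn.Sequential(
        nn.Linear(in_features, hidden),
        hidden_act(),                     # ReLU by default
        nn.Linear(hidden, out_features),  # linear output (raw scores)
    )

x = torch.randn(8, 20)  # dummy batch: 8 samples, 20 features

# Multi-class classification: softmax turns the logits into probabilities.
clf = make_mlp(20, 64, 5)
probs = torch.softmax(clf(x), dim=1)

# Regression: keep the output linear, here with Leaky ReLU in the hidden layer.
reg = make_mlp(20, 64, 1, hidden_act=nn.LeakyReLU)
pred = reg(x)

# Trying GELU (or nn.SiLU, PyTorch's name for Swish) is a one-argument change.
gelu_clf = make_mlp(20, 64, 5, hidden_act=nn.GELU)
gelu_logits = gelu_clf(x)
```

Passing the activation as a class keeps the comparison between ReLU, Leaky ReLU, and GELU down to a single argument, which makes it easier to run the experiments the last step suggests.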

Tips

ReLU is computationally cheap and mitigates vanishing gradients compared to sigmoid.
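Leaky ReLU and ELU avoid the dying ReLU problem by keeping a small non-zero gradient for negative inputs.
Match the output activation to the loss function for numerical stability. In PyTorch, for example, nn.CrossEntropyLoss and nn.BCEWithLogitsLoss expect raw logits and apply the softmax or sigmoid internally.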
Plot activations during training to detect saturation or dead neurons.
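
One way to monitor activations, sketched here under assumptions, is a forward hook that records the fraction of ReLU units that output zero for every sample in a batch. The function name record_dead_fraction and the stats dictionary are hypothetical; the resulting number could be logged or plotted over training steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

stats = {}  # illustrative container for the latest measurement

def record_dead_fraction(module, inputs, output):
    # A unit counts as "dead" on this batch if it is zero for every sample.
    dead = (output == 0).all(dim=0)
    stats["dead_fraction"] = dead.float().mean().item()

# Attach the hook to the ReLU layer (index 1 in the Sequential).
handle = model[1].register_forward_hook(record_dead_fraction)

with torch.no_grad():
    model(torch.randn(64, 784))  # dummy batch standing in for real data

handle.remove()  # detach the hook when monitoring is done
print(f"dead ReLU units on this batch: {stats['dead_fraction']:.1%}")
```

A unit that is zero on one batch is not necessarily dead for good, so the figure is more informative when tracked across many batches.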

Common issues

Sigmoid and tanh saturate for large inputs, causing vanishing gradients.
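ReLU neurons can die if a weight update pushes their pre-activations negative for all inputs; the gradient is then zero, so the neuron stops updating.
Using a two-unit softmax for binary classification instead of a single sigmoid unit wastes parameters, since the two formulations are equivalent.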
Mismatched activation and loss functions lead to poor convergence.
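
The saturation issue can be seen directly with autograd. This minimal sketch compares the gradients of sigmoid and ReLU at a few positive inputs; the input values are arbitrary choices for illustration.

```python
import torch

# Gradient of sigmoid at increasingly large inputs.
x_sig = torch.tensor([1.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x_sig).sum().backward()
sig_grad = x_sig.grad

# Gradient of ReLU at the same inputs.
x_relu = torch.tensor([1.0, 5.0, 10.0], requires_grad=True)
torch.relu(x_relu).sum().backward()
relu_grad = x_relu.grad

print("sigmoid gradients:", sig_grad)   # shrink toward 0 as the input grows
print("relu gradients:   ", relu_grad)  # stay at 1 for positive inputs
```

Multiplying many such small sigmoid gradients across layers is what makes the overall gradient vanish in deep networks.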

Example

import torch
import torch.nn as nn

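model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),          # default hidden-layer activation
    nn.Dropout(0.2),    # regularization
    nn.Linear(256, 10)  # raw logits for 10 classes; no softmax layer
)
# CrossEntropyLoss applies log-softmax to the logits internally.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)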

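This snippet defines a simple PyTorch classifier with 784 inputs and 10 classes: a ReLU hidden layer, dropout for regularization, a cross-entropy loss, and the Adam optimizer. The final layer outputs raw logits with no explicit softmax because nn.CrossEntropyLoss applies log-softmax internally.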
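
For the binary case mentioned under Common issues, a variant might look like the following. It is a sketch under assumptions: the name binary_model, the batch size, and the random data are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

binary_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 1),  # a single logit instead of a two-unit softmax
)
# BCEWithLogitsLoss applies the sigmoid internally, so the model has no sigmoid layer.
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(16, 784)                  # dummy inputs
y = torch.randint(0, 2, (16, 1)).float()  # dummy 0/1 labels as floats

logits = binary_model(x)
loss = criterion(logits, y)

# Apply sigmoid explicitly only when probabilities are needed, e.g. at inference.
probs = torch.sigmoid(logits)
```

Keeping the sigmoid inside the loss rather than in the model follows the tip about matching activation and loss for numerical stability.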
