How to debug a neural network that won't converge
Category: AI & Machine Learning
Short answer
A neural network that fails to converge usually suffers from incorrect architecture, poor initialization, unsuitable hyperparameters, or data problems.
Steps
- Verify data preprocessing by checking ranges, missing values, and label correctness.
- Start with a tiny dataset and overfit it to confirm the model can learn at all.
- Check gradients for vanishing or exploding values by logging per-layer gradient norms or histograms; apply gradient clipping if they explode.
- Experiment with learning rates across several orders of magnitude.
- Simplify the architecture to a known working baseline and add complexity incrementally.
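The overfitting and gradient checks above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive recipe: the random tensors `x` and `y` are hypothetical stand-ins for a tiny slice of real data, `grad_norms` is an illustrative helper name, and the 200-step budget is arbitrary. Dropout is left out on purpose, since regularization works against deliberate overfitting.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in data: 32 random inputs with random labels.
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))

# Same layer sizes as a simple baseline, but without dropout.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def grad_norms(model):
    """Return the L2 norm of each parameter's gradient, keyed by name."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }


initial_loss = None
first_norms = {}
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    if step == 0:
        # Record the starting point before any update is applied.
        initial_loss = loss.item()
        first_norms = grad_norms(model)
    optimizer.step()

final_loss = criterion(model(x), y).item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
for name, norm in first_norms.items():
    print(f"{name}: grad norm {norm:.4f}")
```

If the loss on these few samples does not fall sharply, the fault is more likely in the model, loss, or training loop than in the amount of data. Gradient norms that differ by many orders of magnitude between layers hint at vanishing or exploding gradients.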
Tips
- Use Xavier (Glorot) initialization for tanh or sigmoid activations and He (Kaiming) initialization for ReLU-family activations.
- Ensure the loss function matches the task and output activation.
- Visualize intermediate activations to detect saturation or dead layers.
- Enable detailed logging to compare training and validation curves.
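The initialization and activation tips can be tried out with a short sketch like the one below. It assumes a ReLU network, so it uses He initialization. `init_weights`, `make_hook`, and `dead_fraction` are illustrative names, and the random input batch is a placeholder for real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))


def init_weights(module):
    # He (Kaiming) init suits ReLU; nn.init.xavier_uniform_ is the usual
    # choice for tanh or sigmoid layers.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)


model.apply(init_weights)

dead_fraction = {}


def make_hook(name):
    def hook(module, inputs, output):
        # Fraction of units that output zero for every sample in the batch.
        dead_fraction[name] = (output == 0).all(dim=0).float().mean().item()
    return hook


# Attach a forward hook to each ReLU so its output can be inspected.
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(64, 784))  # placeholder batch

print(dead_fraction)
```

A dead fraction that climbs toward 1.0 during training suggests that most ReLU units have stopped responding, which often points to a learning rate that is too high or to poor initialization.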
Common issues
- Learning rate too high causes loss to oscillate or diverge.
- Learning rate too low causes extremely slow progress, so the loss appears stuck on a plateau.
- Batch normalization placed incorrectly, such as on the final output layer, or run on very small batches destabilizes training.
- Data loading bugs shuffle labels independently of their inputs or return empty batches.
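Data loading problems like these are cheap to catch with a per-batch sanity check. The sketch below assumes a classification task with integer class labels. `check_batch` and its messages are hypothetical, not part of PyTorch.

```python
import torch


def check_batch(inputs, labels, num_classes):
    """Return a list of human-readable problems found in one batch."""
    problems = []
    if inputs.shape[0] == 0 or labels.numel() == 0:
        problems.append("empty batch")
        return problems
    if inputs.shape[0] != labels.shape[0]:
        problems.append("inputs and labels differ in length")
    if not torch.isfinite(inputs).all():
        problems.append("inputs contain NaN or Inf")
    if labels.min() < 0 or labels.max() >= num_classes:
        problems.append("labels outside [0, num_classes)")
    return problems


# Example: one healthy batch and one batch with an invalid label.
good = check_batch(torch.randn(4, 784), torch.tensor([0, 1, 2, 9]), 10)
bad = check_batch(torch.randn(4, 784), torch.tensor([0, 1, 2, 10]), 10)
print(good)
print(bad)
```

This kind of check cannot detect labels that are valid but paired with the wrong inputs. For that, inspecting a few samples by hand remains the most reliable test.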
Example
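import torch
import torch.nn as nn

# Small classifier: 784 input features, one hidden layer, 10 output classes.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # regularization; disable while overfitting a tiny dataset
    nn.Linear(256, 10),  # raw logits, which nn.CrossEntropyLoss expects
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)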
This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch.
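The learning rate of 1e-3 in the snippet is only a starting point. A short sweep across several orders of magnitude, as suggested in the steps, might look like the sketch below. It assumes random placeholder data and a fixed 50-step budget, and `loss_after_training` is an illustrative helper name. Dropout is omitted so that runs differ only in learning rate.

```python
import torch
import torch.nn as nn


def loss_after_training(lr, steps=50):
    """Train a fresh model for a few steps and return its final loss."""
    torch.manual_seed(0)  # identical data and initial weights for every run
    x = torch.randn(64, 784)  # placeholder inputs
    y = torch.randint(0, 10, (64,))  # placeholder labels
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return criterion(model(x), y).item()


results = {lr: loss_after_training(lr) for lr in (1e-5, 1e-4, 1e-3, 1e-2, 1e-1)}
for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.4f}")
```

Rates that leave the loss almost unchanged are too low, and rates that produce a very large or non-finite loss are too high. The useful range usually lies between those two extremes.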