How to optimize GPU memory for deep learning training
Category: AI & Machine Learning
Short answer
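Reduce the memory each training step needs with mixed precision, gradient checkpointing, and smaller micro-batches combined with gradient accumulation. The memory this frees lets you train larger models or use bigger effective batch sizes, which affects both model capacity and convergence stability.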
Steps
- Enable automatic mixed precision to use float16 tensors where numerically safe.
- Apply gradient checkpointing to trade compute for memory by recomputing activations during backpropagation.
- Reduce the batch size or use gradient accumulation to simulate larger batches.
- Delete references to tensors you no longer need, and use in-place operations where autograd allows them.
- Consider model parallelism or pipeline parallelism for architectures that exceed single-GPU memory.
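The first three steps can be combined in one training step. The following is a minimal sketch, not a tuned recipe: the two-part toy model, the synthetic data, and the names `block`, `head`, `accum_steps`, and `micro_batch` are assumptions made for illustration. It falls back to full precision on CPU so it runs without a GPU.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # this sketch only uses float16 on a GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

# Hypothetical toy model, split in two so the first part can be checkpointed.
block = nn.Sequential(nn.Linear(784, 256), nn.ReLU()).to(device)
head = nn.Linear(256, 10).to(device)
params = list(block.parameters()) + list(head.parameters())

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params, lr=1e-3)

# torch.amp.GradScaler is the newer spelling; fall back for older releases.
GradScaler = getattr(torch.amp, "GradScaler", torch.cuda.amp.GradScaler)
scaler = GradScaler(enabled=use_amp)

accum_steps = 4   # 4 micro-batches of 16 stand in for one batch of 64
micro_batch = 16

def train_step():
    """Run one optimizer update built from several micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    mean_loss = 0.0
    for _ in range(accum_steps):
        # Synthetic inputs and labels, for illustration only.
        x = torch.randn(micro_batch, 784, device=device)
        y = torch.randint(0, 10, (micro_batch,), device=device)
        with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
            # Activations inside `block` are not stored; they are
            # recomputed when backward() reaches this point.
            hidden = checkpoint(block, x, use_reentrant=False)
            # Divide so the accumulated gradient matches one large batch.
            loss = criterion(head(hidden), y) / accum_steps
        scaler.scale(loss).backward()  # gradients add up across micro-batches
        mean_loss += loss.item()
    scaler.step(optimizer)  # unscales gradients, skips the step on inf/NaN
    scaler.update()
    return mean_loss

print(train_step())
```

Checkpointing pays off mainly on deep models where stored activations dominate memory; on a network this small it only demonstrates the call pattern.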
Tips
- Use memory profilers to identify which layers consume the most VRAM.
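- When accumulating gradients, divide each micro-batch loss by the number of accumulation steps so the summed gradient matches that of one large batch.
- Choose efficient implementations such as fused optimizers, and enable cuDNN benchmarking (torch.backends.cudnn.benchmark) for workloads with fixed input shapes.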
- Offload optimizer states to CPU when using very large models.
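As a starting point for the profiling tip, PyTorch's built-in CUDA memory counters can report the peak allocation around each layer call. This is a rough sketch: the helper `peak_cuda_mib` and the three-layer list are hypothetical, and the reported peak includes tensors that were already alive before the call, so treat the numbers as relative rather than exact. Without a GPU the helper returns `None` for the peak.

```python
import torch
import torch.nn as nn

def peak_cuda_mib(fn, *args):
    """Run fn(*args) and return (result, peak allocated MiB during the call).

    The peak is None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return fn(*args), None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # peak restarts from current usage
    out = fn(*args)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 2**20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layers = [nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)]
x = torch.randn(64, 784, device=device)

report = []
for layer in layers:
    layer.to(device)
    x, peak = peak_cuda_mib(layer, x)  # feed each output into the next layer
    report.append((layer.__class__.__name__, peak))

for name, peak in report:
    print(name, "n/a (no GPU)" if peak is None else f"{peak:.2f} MiB")
```

For a fuller breakdown, `torch.cuda.memory_summary()` prints the allocator's statistics as a table.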
Common issues
- Out-of-memory errors when increasing image resolution or sequence length.
- Gradient underflow or overflow in float16 without proper loss scaling.
- Slower training when gradient checkpointing adds significant recomputation overhead.
- Datasets or preprocessing buffers kept on the GPU competing with the model for memory.
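One way to avoid the last issue is to keep the dataset in CPU memory and move only the current batch to the GPU. The sketch below assumes a synthetic in-memory dataset as a stand-in for real data; the tensor sizes and batch size are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in for a real dataset; these tensors stay in CPU memory.
features = torch.randn(256, 784)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(features, labels)

# Pinned host memory lets the copy to the GPU run asynchronously.
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=use_cuda)

batches_seen = 0
for x, y in loader:
    # Only one batch at a time occupies GPU memory.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    batches_seen += 1

print(batches_seen)
```

With this pattern the GPU footprint of the input data is bounded by the batch size rather than by the dataset size.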
Example
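import torch
import torch.nn as nn

# Small fully connected classifier: 784 inputs -> 256 hidden units -> 10 classes
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # dropout for regularization
    nn.Linear(256, 10),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)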
This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch.
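To connect this network to the memory question, a back-of-the-envelope estimate of its training state can be computed from the layer shapes alone. This sketch assumes float32 values and counts weights, gradients, and the two moment estimates Adam keeps per parameter; it ignores activations, which usually dominate at large batch sizes. The function name `training_state_bytes` is hypothetical.

```python
def training_state_bytes(layer_shapes, bytes_per_value=4, optimizer_slots=2):
    """Estimate memory for weights, gradients, and optimizer state.

    layer_shapes: (in_features, out_features) pairs for linear layers with bias.
    optimizer_slots: extra values stored per parameter (2 for Adam).
    """
    n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)
    copies = 1 + 1 + optimizer_slots  # weights + gradients + optimizer state
    return n_params, n_params * copies * bytes_per_value

# The two linear layers of the network above.
n_params, total_bytes = training_state_bytes([(784, 256), (256, 10)])
print(f"{n_params} parameters, {total_bytes / 2**20:.2f} MiB of training state")
```

For this model the estimate is 203,530 parameters and roughly 3.1 MiB, which shows that the training state is about four times the size of the weights when Adam is used in full precision.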