How to optimize GPU memory for deep learning training

Category: AI & Machine Learning

Short answer

Reduce GPU memory use with mixed precision, gradient checkpointing, and gradient accumulation. The memory you free lets you train larger models or use bigger batch sizes on the same hardware, which can also make convergence more stable.

Steps

  1. Enable automatic mixed precision to use float16 tensors where numerically safe.
  2. Apply gradient checkpointing to trade compute for memory by recomputing activations during backpropagation.
  3. Reduce the per-step batch size and use gradient accumulation to keep the effective batch size large.
  4. Delete references to tensors you no longer need and use in-place operations where autograd permits them.
  5. Consider model parallelism or pipeline parallelism for architectures that exceed single-GPU memory.
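
Steps 1 and 3 can be combined in a single training step. The following is a minimal sketch, not a definitive recipe: the model, the random data, and the values of accum_steps and micro_batch are assumptions chosen for illustration. Mixed precision is only switched on when a CUDA device is present, so the same code also runs on CPU in full precision.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # float16 autocast is only enabled on a GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Loss scaling guards float16 gradients; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

accum_steps = 4   # assumed value: micro-batches per optimizer step
micro_batch = 16  # assumed value: effective batch is 4 * 16 = 64

optimizer.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    # Random stand-in data; a real loop would read from a DataLoader.
    x = torch.randn(micro_batch, 784, device=device)
    y = torch.randint(0, 10, (micro_batch,), device=device)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
        # Divide so the accumulated gradient is the mean over all micro-batches.
        loss = criterion(model(x), y) / accum_steps
    scaler.scale(loss).backward()  # gradients add up across iterations

scaler.step(optimizer)  # unscales gradients, skips the step on inf/NaN
scaler.update()
optimizer.zero_grad(set_to_none=True)  # release gradient tensors
```

Only one micro-batch of activations is alive at a time, so peak activation memory follows micro_batch rather than the effective batch size.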

Tips

  • Use memory profilers to identify which layers consume the most VRAM.
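  • When accumulating gradients, divide each micro-batch loss by the number of accumulation steps so the summed gradient matches that of one large batch.
  • Choose efficient implementations such as fused optimizers; cuDNN benchmarking can also help throughput, though the algorithms it selects may need extra workspace memory.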
  • Offload optimizer states to CPU when using very large models.
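
As a rough starting point for the profiling tip, the sketch below reports parameter memory per layer and, when a GPU is available, the peak memory allocated during one forward and backward pass. The helper name parameter_bytes and the model are hypothetical, chosen for illustration. Parameter size is only a proxy: activations often dominate, and a dedicated profiler gives a fuller picture.

```python
import torch
import torch.nn as nn

def parameter_bytes(module: nn.Module) -> int:
    """Bytes occupied by a module's parameters (activations excluded)."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Sequential names its children "0", "1", "2", ...
per_layer = {name: parameter_bytes(layer) for name, layer in model.named_children()}
for name, size in per_layer.items():
    print(f"layer {name}: {size / 1024:.1f} KiB of parameters")

if torch.cuda.is_available():
    model = model.cuda()
    torch.cuda.reset_peak_memory_stats()
    out = model(torch.randn(64, 784, device="cuda"))
    out.sum().backward()
    # Peak includes parameters, activations, and gradients for this pass.
    peak = torch.cuda.max_memory_allocated()
    print(f"peak allocated: {peak / 1024**2:.1f} MiB")
```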

Common issues

  • Out-of-memory errors when increasing image resolution or sequence length.
  • Gradient underflow, or overflow to inf/NaN, in float16 without proper loss scaling.
  • Slower training when gradient checkpointing adds significant recomputation overhead.
  • Data loading on the GPU competing with model memory.
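
The recomputation trade-off mentioned above can be seen in a small sketch of gradient checkpointing with torch.utils.checkpoint. The layer sizes and the split into block and head are assumptions for illustration. The checkpointed block does not keep its intermediate activations; they are recomputed during the backward pass, which costs one extra forward pass through that block.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
head = nn.Linear(256, 10)
x = torch.randn(32, 256, requires_grad=True)

# Reference pass: all intermediate activations are stored for backward.
head(block(x)).sum().backward()
ref_grad = block[0].weight.grad.clone()

# Reset gradients before the checkpointed pass.
block.zero_grad(set_to_none=True)
head.zero_grad(set_to_none=True)
x.grad = None

# Checkpointed pass: activations inside `block` are dropped after the
# forward pass and recomputed when backward reaches this segment.
hidden = checkpoint(block, x, use_reentrant=False)
head(hidden).sum().backward()
ckpt_grad = block[0].weight.grad.clone()
```

Both passes should produce the same gradients; only the memory and compute profile differs. The saving grows with the depth of the checkpointed segment, so a model this small shows the mechanics rather than a meaningful reduction.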

Example

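import torch
import torch.nn as nn

# Small fully connected classifier: 784 inputs (e.g. a flattened 28x28 image), 10 classes.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # regularization; active only in training mode
    nn.Linear(256, 10)
)
criterion = nn.CrossEntropyLoss()
# Adam keeps two extra state tensors per parameter, which adds to memory use.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)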

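This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch. It is a baseline to which the steps above can be applied; on its own it does not reduce memory use.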
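
To illustrate step 4 on a model of this shape, the sketch below runs a few training steps with memory-conscious habits. The random data, the batch size of 32, and the running_loss variable are assumptions for illustration. The key point is that logging uses loss.item(), a plain Python float, so the loss tensor and its autograd graph are not kept alive between steps.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2), nn.Linear(256, 10)
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

running_loss = 0.0
for _ in range(3):
    # Random stand-in data; a real loop would read from a DataLoader.
    inputs = torch.randn(32, 784)
    targets = torch.randint(0, 10, (32,))

    # set_to_none=True frees gradient tensors instead of filling them with zeros.
    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # .item() copies the value out, so no reference to the graph is retained.
    running_loss += loss.item()
    del loss  # drop the last reference to this step's loss tensor

if torch.cuda.is_available():
    # Returns cached, unused blocks to the driver; it does not free live tensors.
    torch.cuda.empty_cache()
```

Accumulating the tensor itself (running_loss += loss) would instead keep every step's graph reachable, a common cause of memory growing steadily over an epoch.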