How to handle imbalanced datasets in ML

· Category: AI & Machine Learning

Short answer

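Models trained on imbalanced datasets tend to favor the majority class, because that class dominates the training loss. Address this by resampling the data, adjusting class weights, or using algorithms designed for imbalance, and evaluate with metrics that reflect minority-class performance.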

Steps

  1. Measure the class distribution and quantify the imbalance with the imbalance ratio (majority-class count divided by minority-class count).
  2. Apply SMOTE or ADASYN to synthetically oversample the minority class.
  3. Use random undersampling or Tomek links to reduce the majority class size.
  4. Set class weights inversely proportional to class frequencies in the loss function.
  5. Choose evaluation metrics like F1, precision-recall AUC, or balanced accuracy.
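
Steps 1 and 4 can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the helper names `imbalance_ratio` and `inverse_frequency_weights` and the toy labels are assumptions made for the example, and the weighting formula n_samples / (n_classes * class_count) is one common "balanced" heuristic among several.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels, illustrative only: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
ratio = imbalance_ratio(labels)
weights = inverse_frequency_weights(labels)
print(f"imbalance ratio: {ratio:.1f}")
print(f"class weights: {weights}")
```

With these toy labels the ratio is 9, and the minority class receives a weight nine times larger than the majority class. The resulting weights can be passed to any loss function or estimator that accepts per-class weights.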

Tips

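  • Combining oversampling and undersampling (for example, SMOTE followed by Tomek links) often works better than either alone.
  • Use ensemble methods such as imbalanced-learn's BalancedRandomForestClassifier, which balances each bootstrap sample by undersampling.
  • Resample inside each cross-validation fold, on the training portion only. Oversampling before splitting leaks copies or near-copies of validation samples into the training data.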
  • Collect more minority class data whenever feasible.
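
The leakage tip above can be made concrete with a small standard-library sketch. It uses plain random oversampling (duplicating minority indices) as a stand-in for SMOTE or ADASYN, and the helpers `stratified_folds` and `random_oversample` are hypothetical names written for this example. The point is only the ordering: split first, then resample the training portion of each fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(group) for group in by_class.values())
    resampled = []
    for group in by_class.values():
        resampled.extend(group)
        resampled.extend(rng.choices(group, k=target - len(group)))
    return resampled


# Toy labels, illustrative only: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)

splits = []
for i, val_idx in enumerate(folds):
    # Build the training indices from every fold except the validation fold.
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Resample the training portion only; the validation fold keeps its
    # original, imbalanced distribution.
    train_idx = random_oversample(train_idx, labels, seed=i)
    splits.append((train_idx, val_idx))
```

Each validation fold stays untouched, so the scores it produces reflect the real class distribution. Libraries such as imbalanced-learn provide a pipeline that applies samplers during fitting only, which achieves the same separation with less code.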

Common issues

  • Oversampling can cause overfitting to synthetic examples.
  • Undersampling discards potentially useful majority class information.
  • Plain accuracy is misleading: on a 95:5 split, a model that always predicts the majority class scores 95% accuracy while never detecting the minority class.
  • Algorithms that minimize overall error implicitly favor the majority class.
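
To see why accuracy misleads, the sketch below scores a trivial classifier that always predicts the majority class. The metric functions are written by hand for illustration (in practice a library such as scikit-learn offers equivalents), and the 95:5 label split is an assumed toy dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


def f1(y_true, y_pred, positive=1):
    """F1 score for the positive class; 0.0 when nothing is predicted or present."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data, illustrative only: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
f1_minority = f1(y_true, y_pred, positive=1)
print(f"accuracy={acc:.2f} balanced_accuracy={bal_acc:.2f} f1={f1_minority:.2f}")
```

Accuracy comes out at 0.95, while balanced accuracy is 0.50 and minority-class F1 is 0.00, which matches how useless the classifier actually is.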

Example

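import torch
import torch.nn as nn

# Hypothetical per-class sample counts for a 10-class problem (placeholder values).
class_counts = torch.tensor([5000., 5000., 5000., 5000., 5000., 5000., 5000., 5000., 500., 100.])
# Inverse-frequency weights: n_samples / (n_classes * count), so rare classes weigh more.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 10),
)
# The weight argument rescales each class's contribution to the loss.
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

This snippet defines a simple PyTorch network with dropout for regularization, a cross-entropy loss weighted inversely to class frequency (step 4), and the Adam optimizer. The class counts are placeholder values; replace them with the counts from your own training set.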