How to handle imbalanced datasets in ML
Category: AI & Machine Learning
Short answer
Imbalanced datasets bias models toward majority classes. Address this by resampling data, adjusting class weights, or using algorithms designed for imbalance.
Steps
- Measure the class distribution and quantify the imbalance with the imbalance ratio (majority class count divided by minority class count).
- Apply SMOTE or ADASYN to synthetically oversample the minority class.
- Use random undersampling or Tomek links to reduce the majority class size.
- Set class weights inversely proportional to class frequencies in the loss function.
- Choose evaluation metrics like F1, precision-recall AUC, or balanced accuracy.
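As a minimal sketch of the first and fourth steps, the snippet below computes the imbalance ratio and inverse-frequency class weights in plain Python. The label list `y` is made-up data, and the formula `n_samples / (n_classes * count)` is one common convention (it matches scikit-learn's `class_weight="balanced"` heuristic), not the only reasonable choice.

```python
from collections import Counter

# Hypothetical labels: 950 negatives, 50 positives.
y = [0] * 950 + [1] * 50

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Imbalance ratio: majority class size divided by minority class size.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency weights: rarer classes get proportionally larger weights.
class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(imbalance_ratio)  # 19.0
print(class_weights)
```

The resulting weights can then be passed to whatever loss function or estimator is in use, provided it accepts per-class weights.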
Tips
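- Combine oversampling and undersampling (for example, SMOTE followed by Tomek link removal); this often works better than either alone.
- Use ensemble methods such as imbalanced-learn's BalancedRandomForestClassifier, which rebalances each bootstrap sample internally.
- Resample inside each cross-validation training fold, never before splitting; otherwise duplicated or synthetic minority samples leak into the validation folds.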
- Collect more minority class data whenever feasible.
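The leakage tip above can be sketched in plain Python. This is an illustration under stated assumptions: it uses random oversampling (duplicating minority rows) as a simple stand-in for SMOTE or ADASYN, and the helpers `oversample_minority` and `stratified_kfold_indices`, along with the toy data, are hypothetical names invented for this example rather than part of any library.

```python
import random

def oversample_minority(X, y, seed=0):
    """Random oversampling: duplicate rows of smaller classes until every
    class matches the largest one (a simple stand-in for SMOTE/ADASYN)."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

def stratified_kfold_indices(y, k, seed=0):
    """Yield (train_idx, val_idx) pairs, keeping class proportions per fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    # Stable sort groups indices by class while keeping the shuffled order.
    idx.sort(key=lambda j: y[j])
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

# Hypothetical toy data: 90 majority rows, 10 minority rows.
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

fold_summaries = []
for train_idx, val_idx in stratified_kfold_indices(y, k=5):
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Resample the training fold only; the validation fold keeps its
    # original distribution, so no duplicated row leaks into evaluation.
    X_res, y_res = oversample_minority(X_train, y_train)
    fold_summaries.append((y_res.count(0), y_res.count(1), len(val_idx)))
    # A model would be fit on (X_res, y_res) and scored on the val_idx rows.

print(fold_summaries)
```

Each training fold ends up balanced, while the validation rows are untouched. With imbalanced-learn, placing the sampler inside its pipeline and passing that pipeline to the cross-validation routine is the usual way to get the same guarantee.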
Common issues
- Oversampling can cause overfitting to duplicated or synthetic minority examples.
- Undersampling discards potentially useful majority class information.
- Standard accuracy becomes misleading and overly optimistic.
- Algorithms that minimize overall error implicitly favor the majority class.
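To make the accuracy issue concrete, here is a small sketch with made-up labels: a degenerate "model" that always predicts the majority class on a hypothetical 95:5 test set. The metrics are computed by hand so that nothing beyond the standard library is assumed.

```python
# Hypothetical test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
recall_pos = tp / (tp + fn)  # recall on the minority class
recall_neg = tn / (tn + fp)  # recall on the majority class
balanced_accuracy = (recall_pos + recall_neg) / 2
# F1 for the positive class; taken as 0 when there are no true positives.
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy, balanced_accuracy, f1)  # 0.95 0.5 0.0
```

Accuracy looks strong even though the model never finds a single positive, while balanced accuracy and F1 expose the failure.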
Example
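import torch
import torch.nn as nn

# Placeholder per-class training counts for 10 classes; replace with your own.
class_counts = torch.tensor([5000.0] * 8 + [500.0, 50.0])
# Inverse-frequency weights: n_samples / (n_classes * count).
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 10),
)
# Weighted cross-entropy penalizes errors on rare classes more heavily.
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
This snippet defines a simple PyTorch network with dropout for regularization and the Adam optimizer, and passes inverse-frequency class weights to the cross-entropy loss so that errors on rare classes cost more. The class counts are placeholder values; substitute the counts from your own training set.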