How to handle imbalanced datasets in ML

Question

QA Hub Editorial · Accepted Answer

Short answer

Imbalanced datasets bias models toward majority classes. Address this by resampling data, adjusting class weights, or using algorithms designed for imbalance.

Steps

Measure the class distribution and quantify the imbalance, for example with the imbalance ratio (majority-class count divided by minority-class count).
Apply SMOTE or ADASYN to synthetically oversample the minority class, using the training split only.
Use random undersampling or Tomek-link removal to reduce the majority class size.
Set class weights inversely proportional to class frequencies in the loss function.
Choose evaluation metrics that reflect minority-class performance, such as F1, precision-recall AUC, or balanced accuracy.
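
The first and fourth steps can be sketched with the standard library alone. This is a minimal illustration, not a fixed recipe: the label list `y` is made-up data, the helper names `imbalance_ratio` and `inverse_frequency_weights` are hypothetical, and the weighting formula `n_samples / (n_classes * count)` is one common heuristic (the same one scikit-learn uses for `class_weight="balanced"`).

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority count divided by minority count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up labels: 90 negatives, 10 positives.
y = [0] * 90 + [1] * 10

ratio = imbalance_ratio(y)
weights = inverse_frequency_weights(y)
print(ratio)    # 9.0
print(weights)  # class 1 gets a weight nine times larger than class 0
```

The resulting dictionary can be passed to any loss or estimator that accepts per-class weights; whether inverse-frequency weighting is the right strength for a given problem is something to validate rather than assume.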

Tips

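Combining oversampling and undersampling (for example, SMOTE followed by Tomek-link removal) can work better than either alone.
Use ensemble methods built for imbalance, such as BalancedRandomForestClassifier from imbalanced-learn.
Resample inside each cross-validation training fold, never before splitting, so synthetic or duplicated samples do not leak into validation data.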
Collect more minority class data whenever feasible.
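
The leakage tip can be made concrete with a small sketch. It assumes plain random oversampling (duplicating minority rows) as a stand-in for SMOTE, and the helpers `stratified_folds` and `oversample` are hypothetical names written for this illustration, not library functions. The point is only the ordering: split first, then resample the training part of each fold, and leave the validation part untouched.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, rng):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until every class matches the largest one."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


rng = random.Random(0)
y = [0] * 90 + [1] * 10  # made-up labels
folds = stratified_folds(y, k=5, rng=rng)

for i, val_idx in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Resample only the training indices; the validation fold keeps its real distribution.
    train_balanced = oversample(train_idx, y, rng)
    assert not set(train_balanced) & set(val_idx)  # no validation row appears in training
    print(i, Counter(y[idx] for idx in train_balanced), Counter(y[idx] for idx in val_idx))
```

Each training set comes out balanced while every validation fold keeps the original 9:1 split, which is what the reported metrics should be measured against.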

Common issues

Oversampling can cause overfitting to synthetic examples.
Undersampling discards potentially useful majority class information.
Standard accuracy becomes misleading: with a 99:1 class split, always predicting the majority class scores 99% accuracy while finding no minority examples.
Algorithms that minimize overall error implicitly favor the majority class.
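
The accuracy issue is easy to reproduce. The sketch below scores a trivial classifier that always predicts the majority class on made-up labels (95 negatives, 5 positives); the metric helpers are hand-written for illustration rather than taken from a library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall_for(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
minority_recall = recall_for(y_true, y_pred, 1)
print(acc, bal_acc, minority_recall)  # 0.95 0.5 0.0
```

Accuracy looks strong at 0.95, yet balanced accuracy sits at chance level and minority recall is zero, which is why the metrics listed under Steps are preferred.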

Example

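import torch
import torch.nn as nn

# Placeholder per-class training counts for a 10-class problem; replace with real counts.
class_counts = torch.tensor([5000., 5000., 5000., 5000., 5000., 5000., 5000., 5000., 500., 50.])
# Inverse-frequency weights: total / (num_classes * count), so rare classes weigh more.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 10)
)
# Weighted cross-entropy: mistakes on rare classes contribute more to the loss.
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

This snippet defines a simple PyTorch network with dropout for regularization and the Adam optimizer, and passes inverse-frequency class weights to the cross-entropy loss so that errors on rare classes cost more. The class counts shown are placeholders, not real data.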
