How to use TF-IDF for feature extraction

Category: AI & Machine Learning

Short answer

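TF-IDF scores each term by multiplying its frequency within a document (term frequency) by a measure of its rarity across the corpus (inverse document frequency). The resulting per-document score vectors serve as features for downstream models.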

Steps

  1. Compute term frequency for each word in each document.
  2. Calculate inverse document frequency as the logarithm of the total number of documents divided by the number of documents containing the term.
  3. Multiply TF and IDF to obtain a weighted score for each term-document pair.
  4. Normalize vectors to unit length to ensure comparability across documents.
  5. Select the top-k features or apply dimensionality reduction for efficiency.
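
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the toy corpus, the whitespace tokenizer, and the "largest total weight" rule used for step 5 are all assumptions made for the example. Libraries often use smoothed IDF variants, so their values will not match these exactly.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased; tokenized on whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Step 1: raw term counts per document.
tf = [Counter(tokens) for tokens in tokenized]

# Step 2: IDF = log(N / df), where df counts documents containing the term.
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# Step 3: TF-IDF weight for each term-document pair.
tfidf = [{t: c * idf[t] for t, c in counts.items()} for counts in tf]

# Step 4: L2-normalize each document vector (all-zero vectors are left as-is).
def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

normalized = [l2_normalize(v) for v in tfidf]

# Step 5: keep the k terms with the largest total weight across the corpus.
# This is one simple heuristic; other selection criteria are possible.
totals = Counter()
for vec in normalized:
    totals.update(vec)  # adds each term's weight to its running total
k = 5
selected = [term for term, _ in totals.most_common(k)]
print(selected)
```

With this unsmoothed formula, a term that appears in every document gets an IDF of zero and drops out of the vectors entirely.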

Tips

  • Use sublinear TF scaling to dampen the effect of very frequent words.
  • Filter extremely rare terms that appear in fewer than a minimum number of documents.
  • Combine TF-IDF with n-grams to capture short phrases.
  • Store the vectorizer vocabulary to guarantee consistent transformation at inference time.
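
As a rough sketch, the tips above map onto scikit-learn's TfidfVectorizer as follows. The training sentences and the min_df threshold are made up for illustration, and pickle is only one way to persist the fitted object (an assumption here, not a recommendation).

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training texts.
train_docs = [
    "machine learning improves search ranking",
    "search ranking uses machine learning features",
    "deep learning improves image search",
    "feature extraction for text search",
]

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # use 1 + log(tf) instead of raw tf
    min_df=2,            # drop terms found in fewer than 2 documents
    ngram_range=(1, 2),  # unigrams and bigrams
)
X_train = vectorizer.fit_transform(train_docs)

# Persist the fitted vectorizer (vocabulary and IDF weights) for inference.
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# At inference time, call transform only, never fit_transform.
X_new = restored.transform(["machine learning for search"])
print(X_train.shape, X_new.shape)
```

Because the restored vectorizer carries the same vocabulary, the new vector has the same number of columns as the training matrix.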

Common issues

  • Sparse high-dimensional vectors causing memory and compute bottlenecks.
  • Failure to account for out-of-vocabulary words during inference.
  • Common words dominating the vectors when raw term frequency is used without IDF weighting.
  • Treating TF-IDF as a semantic representation rather than a statistical weighting scheme.
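
The out-of-vocabulary issue is easy to miss because nothing fails. The sketch below, using scikit-learn with made-up sentences, shows an unseen document turning into an all-zero vector, plus a small hypothetical helper (coverage) for checking how much of a new text the vocabulary actually covers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training texts.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Terms unseen during fit are silently dropped at transform time,
# so a document made only of new words has no stored non-zero entries.
X_oov = vectorizer.transform(["quantum entanglement"])
print(X_oov.nnz)

# Hypothetical helper: fraction of a text's tokens found in the vocabulary.
analyzer = vectorizer.build_analyzer()

def coverage(text):
    tokens = analyzer(text)
    known = sum(1 for t in tokens if t in vectorizer.vocabulary_)
    return known / len(tokens) if tokens else 0.0

print(coverage("the cat chased a quantum mouse"))
```

Tracking a coverage figure like this at inference time is one way to notice when incoming text has drifted away from the training vocabulary.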

Example

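from sklearn.metrics import classification_report

# Assumes `model` is a classifier already fitted on TF-IDF features, and that
# X_test was produced by the same fitted vectorizer (transform, not fit_transform).
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

This example evaluates a classifier trained on TF-IDF features. The report lists precision, recall, F1-score, and support for each class, which shows whether the extracted features separate the classes well.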
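
For a version that starts from raw text, a sketch along these lines fits the vectorizer and a classifier in one pipeline and prints the same kind of report. The texts, labels, and the choice of LogisticRegression are assumptions for illustration only; with so little data the scores themselves mean nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Hypothetical labeled texts.
train_texts = [
    "great product works well",
    "excellent quality very happy",
    "love it works great",
    "terrible product broke quickly",
    "poor quality very disappointed",
    "hate it broke on arrival",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["works great very happy", "poor product broke"]
y_test = ["pos", "neg"]

# The pipeline fits TF-IDF on training text only, then reuses it for test text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

y_pred = model.predict(test_texts)
print(classification_report(y_test, y_pred, zero_division=0))
```

Wrapping the vectorizer in the pipeline keeps test documents out of the fitted vocabulary and IDF weights.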