How to use TF-IDF for feature extraction
Category: AI & Machine Learning
Short answer
TF-IDF measures word importance by combining term frequency in a document with inverse document frequency across the corpus.
Steps
- Compute term frequency for each word in each document.
- Calculate inverse document frequency as the logarithm of the total number of documents divided by the number of documents containing the term.
- Multiply TF and IDF to obtain a weighted score for each term-document pair.
- Normalize each document vector to unit length so that documents of different lengths are comparable.
- Select the top-k features or apply dimensionality reduction for efficiency.
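The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy corpus, whitespace tokenization, and the helper names `l2_normalize` and `top_k` are assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]  # naive whitespace tokenization
n_docs = len(tokenized)

# Step 1: term frequency = count of the term / number of tokens in the document.
tf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf.append({t: c / len(tokens) for t, c in counts.items()})

# Step 2: IDF = log(N / df), where df is the number of documents containing the term.
df = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log(n_docs / n) for t, n in df.items()}

# Step 3: multiply TF by IDF for each term-document pair.
tfidf = [{t: w * idf[t] for t, w in doc_tf.items()} for doc_tf in tf]

# Step 4: L2-normalize each document vector (all-zero vectors are left unchanged).
def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else dict(vec)

tfidf = [l2_normalize(vec) for vec in tfidf]

# Step 5: keep the k highest-weighted terms of a document.
def top_k(vec, k):
    return sorted(vec.items(), key=lambda item: item[1], reverse=True)[:k]

print(top_k(tfidf[0], 3))
```

With this unsmoothed formula, a term that appears in every document gets an IDF of zero. Libraries often use a smoothed variant instead; scikit-learn's default, for instance, is ln((1 + N) / (1 + df)) + 1, so its scores will differ from the ones computed here.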
Tips
- Use sublinear TF scaling (1 + log(tf) in place of the raw count) to dampen the effect of terms repeated many times within a document.
- Filter extremely rare terms that appear in fewer than a minimum number of documents.
- Combine TF-IDF with n-grams to capture short phrases.
- Store the fitted vectorizer (its vocabulary and IDF weights) to guarantee consistent transformation at inference time.
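The tips above map onto parameters of scikit-learn's `TfidfVectorizer`. The following is a sketch under assumptions: the training sentences are invented, `min_df=2` is an arbitrary threshold, and in-memory `pickle` stands in for whatever persistence mechanism a real project would use.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
train_docs = [
    "machine learning with text data",
    "text data needs feature extraction",
    "feature extraction for machine learning",
    "deep learning for text",
]

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # replace tf with 1 + log(tf)
    min_df=2,            # drop terms found in fewer than 2 documents
    ngram_range=(1, 2),  # use unigrams and bigrams
)
X_train = vectorizer.fit_transform(train_docs)

# Persist the fitted vectorizer (vocabulary and IDF weights) for inference.
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# The restored vectorizer maps new text into the same feature space.
X_new = restored.transform(["feature extraction for text data"])
print(sorted(restored.vocabulary_))
print(X_new.shape)
```

Pickled objects should only be loaded from trusted sources, and preferably with the same scikit-learn version that produced them.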
Common issues
- Sparse high-dimensional vectors causing memory and compute bottlenecks.
- Failure to account for out-of-vocabulary words during inference, which are dropped from the vector without warning.
- Common words receiving inflated scores when IDF weighting is disabled or stop words are not filtered.
- Treating TF-IDF as a semantic representation rather than a statistical weighting scheme.
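The first two issues can be observed directly with scikit-learn. This is an illustrative sketch: the documents are made up, and `vocabulary_coverage` is a hypothetical helper, not part of the library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# The result is a SciPy sparse matrix; only nonzero entries are stored.
print(type(X_train).__name__, X_train.shape, X_train.nnz)

# No token here is in the fitted vocabulary, so the row has no nonzero entries.
X_oov = vectorizer.transform(["quantum entanglement experiments"])
print(X_oov.nnz)

# Hypothetical helper: share of a document's tokens covered by the vocabulary.
analyzer = vectorizer.build_analyzer()

def vocabulary_coverage(text):
    tokens = analyzer(text)
    known = sum(t in vectorizer.vocabulary_ for t in tokens)
    return known / len(tokens) if tokens else 0.0

print(vocabulary_coverage("the cat chased a quantum mouse"))
```

Monitoring a coverage figure like this at inference time is one way to notice when incoming text has drifted away from the training vocabulary.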
Example
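from sklearn.metrics import classification_report
# Assumes `model` is a classifier already fitted on TF-IDF features,
# `X_test` holds the TF-IDF vectors of held-out documents, and `y_test` their labels.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))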
This example prints a classification report (precision, recall, F1-score, and support per class). Assuming the model was trained on TF-IDF features, the report is one way to check how useful those features are for the downstream task.
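The example relies on `model`, `X_test`, and `y_test` already existing. One way they could be produced from raw text is sketched below; the toy sentences, the labels, and the choice of `LogisticRegression` are assumptions for illustration, not something the example prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical toy data: 1 = sports, 0 = cooking.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "bake the cake in a hot oven",
    "stir the sauce and add salt",
]
y_train = [1, 1, 0, 0]
test_texts = ["the goal decided the match", "add salt to the cake"]
y_test = [1, 0]

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train a simple linear classifier on the TF-IDF features.
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary and document frequencies from leaking into the features.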