What is tokenization in NLP?

Category: AI & Machine Learning

Short answer

Tokenization is the process of splitting text into smaller units called tokens, which are then mapped to integer IDs that a model can process.

How it works

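Word tokenization splits text on whitespace and punctuation, so every token is a complete word and any word missing from the vocabulary is out-of-vocabulary. Subword methods such as Byte Pair Encoding (BPE) and WordPiece start from characters and iteratively merge adjacent symbols into larger units, balancing vocabulary size against coverage: BPE merges the most frequent pair, while WordPiece picks the merge that most improves the likelihood of the training data. Character tokenization uses individual characters, which avoids out-of-vocabulary issues but produces longer sequences. SentencePiece operates on raw text without first splitting it into words, treats whitespace as an ordinary symbol, and learns subword units unsupervised, which makes it language-agnostic.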
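
The merge loop behind Byte Pair Encoding can be sketched in a few lines. This is a toy version under stated assumptions, not a production tokenizer: the function name `learn_bpe_merges`, the sample words, and the tie-breaking rule (lexicographically smallest pair) are illustrative choices, and real implementations also handle word boundaries and byte-level fallbacks.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE training: returns the learned merges and the segmented corpus."""
    # Each distinct word is a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically smallest pair
        # (an assumption made here only to keep the output deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
        merges.append(best)
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('l', 'o'), ('lo', 'w')]
print(dict(corpus))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes are still spelled out character by character. More merges grow the vocabulary and shorten sequences.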

Example

The word "unhappiness" might be tokenized as ["un", "##happiness"] by a WordPiece tokenizer, where the "##" prefix marks a piece that continues the current word. This lets the model represent the word as a composition of a prefix and a root word even if the full word is rare in the training data.
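
A segmentation like this can be reproduced with the greedy longest-match-first lookup commonly used for WordPiece at inference time. The sketch below is illustrative only: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the tiny vocabulary is hand-picked rather than learned from data.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

# Hypothetical, hand-picked vocabulary for illustration.
toy_vocab = {"un", "happy", "##happy", "##happiness", "##ness"}

print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happiness']
print(wordpiece_tokenize("unhappy", toy_vocab))      # ['un', '##happy']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

Because the lookup is greedy, the result depends entirely on which pieces the vocabulary contains. A different vocabulary could split the same word differently.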

Why it matters

Tokenization directly affects a model's vocabulary size, its sequence lengths, and its ability to handle rare or misspelled words. Modern large language models rely heavily on subword tokenization to achieve broad multilingual coverage without an excessively large vocabulary, which would inflate the embedding table and memory use.
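
The trade-off between sequence length and rare-word handling shows up even on a single sentence. The comparison below is a minimal sketch: the helper names, the regular expression, and the four-entry vocabulary are assumptions made for illustration.

```python
import re

def word_tokenize(text):
    # A word is a run of word characters; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

text = "Tokenizers split text."
words = word_tokenize(text)
chars = char_tokenize(text)
print(len(words), words)  # 4 ['Tokenizers', 'split', 'text', '.']
print(len(chars))         # 22

# With a fixed word-level vocabulary, one typo produces an unknown token.
vocab = {"tokenizers", "split", "text", "."}
misspelled = word_tokenize("Tokenizers splitt text.")
unknown = [w for w in misspelled if w.lower() not in vocab]
print(unknown)            # ['splitt']
```

Word-level tokens give the shortest sequence but fail on the misspelling. Character-level tokens never fail but make the sequence several times longer. Subword methods sit between the two.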

Example

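import torch
import torch.nn as nn

# Feed-forward classifier: 784 input features, 256 hidden units, 10 output classes.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # randomly zeroes 20% of activations during training
    nn.Linear(256, 10)
)
# CrossEntropyLoss expects raw logits and applies log-softmax internally.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)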

This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch.