What is tokenization in NLP

Question

QA Hub Editorial · Accepted Answer

Short answer

Tokenization is the process of splitting text into smaller units called tokens, which are then mapped to integer IDs that a model can process numerically.

How it works

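Word tokenization splits text on whitespace and punctuation, so every distinct word becomes its own vocabulary entry. Subword methods such as Byte Pair Encoding (BPE) and WordPiece start from characters and iteratively merge adjacent symbols to balance vocabulary size against coverage: BPE merges the most frequent pair, while WordPiece merges the pair that most improves the likelihood of the training data. Character tokenization uses individual characters, which avoids out-of-vocabulary issues but produces longer sequences. SentencePiece works directly on raw text, treating whitespace as an ordinary symbol instead of requiring pre-split words, and learns subword units without supervision, which makes it largely language-agnostic.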
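The three approaches above can be sketched in a few lines of plain Python. This is a toy illustration, not how any production tokenizer is implemented: the function names (word_tokenize, char_tokenize, learn_bpe_merges) are hypothetical, the regular expression is one of many possible splitting rules, and the BPE loop skips refinements such as end-of-word markers and byte-level fallback.

```python
import re
from collections import Counter

def word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

def learn_bpe_merges(words, num_merges):
    # Toy BPE training: each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

if __name__ == "__main__":
    print(word_tokenize("Hello, world!"))
    print(char_tokenize("hi"))
    print(learn_bpe_merges(["aab", "aab", "aac"], 2))
```

Each learned merge becomes a new vocabulary entry, so the number of merges is the knob that trades vocabulary size against sequence length.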

Example

The word "unhappiness" might be tokenized as ["un", "##happiness"] by a WordPiece tokenizer, where the "##" prefix marks a piece that continues the current word. This lets the model treat the word as a composition of a prefix and a root word even if the full word is rare in training data.
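A minimal sketch of how a WordPiece-style tokenizer could arrive at that split, assuming greedy longest-match-first lookup against a fixed vocabulary. The function name wordpiece_tokenize and the tiny toy_vocab are made up for illustration; a real vocabulary is learned from data and would likely split the word differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position, take the longest
    # vocabulary piece that matches. Pieces that do not start the word
    # carry the "##" continuation prefix.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk]
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary, chosen so the example from the text works.
toy_vocab = {"un", "happy", "##happiness", "##happy", "##ness"}

if __name__ == "__main__":
    print(wordpiece_tokenize("unhappiness", toy_vocab))
    print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "unhappiness" yields ["un", "##happiness"], while a word with no matching pieces falls back to the unknown token.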

Why it matters

Tokenization directly affects a model's vocabulary size, the length of the sequences it must process, and its ability to handle rare or misspelled words. Modern large language models rely heavily on subword tokenization to achieve broad multilingual coverage without an excessively large vocabulary, which would inflate the memory needed for the embedding table.
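The trade-off can be seen on a toy corpus. The sketch below uses hypothetical helpers (build_vocab, encode) and a made-up three-word training set; it only illustrates the unknown-token and sequence-length effects, since vocabulary sizes on a corpus this small are not representative of real data.

```python
def build_vocab(tokenized_texts):
    # Assign each new token the next free integer ID; ID 0 is reserved for unknowns.
    vocab = {"[UNK]": 0}
    for tokens in tokenized_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Tokens missing from the vocabulary map to the [UNK] ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

train = ["the cat sat", "the dog sat"]
word_vocab = build_vocab(t.split() for t in train)
char_vocab = build_vocab(list(t) for t in train)

test_text = "the cot sat"  # "cot" is a misspelling never seen in training
word_ids = encode(test_text.split(), word_vocab)
char_ids = encode(list(test_text), char_vocab)

if __name__ == "__main__":
    print("word-level:", word_ids, "length", len(word_ids))
    print("char-level:", char_ids, "length", len(char_ids))
```

The word-level encoding is short but loses the misspelled word to the unknown ID, while the character-level encoding keeps every character at the cost of a much longer sequence. Subword tokenization sits between these two extremes.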

Example

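import torch
import torch.nn as nn

# Feed-forward classifier: 784 input features, one hidden layer, 10 output classes.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # randomly zeroes 20% of activations during training
    nn.Linear(256, 10)  # outputs raw logits, as CrossEntropyLoss expects
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)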

This snippet defines a simple neural network with dropout for regularization, a cross-entropy loss, and the Adam optimizer in PyTorch.
