What are transformers in deep learning?

Category: AI & Machine Learning

Short answer

Transformers are deep learning models that replace recurrence with self-attention, so they can process every position of a sequence in parallel and relate distant positions directly instead of passing information along step by step.

How it works

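Self-attention projects each position into query, key, and value vectors, scores every query against every key, and uses the normalized scores to take a weighted sum of the values, so all pairwise relationships in the sequence are computed at once. Multi-head attention runs several such attention operations in parallel, each with its own projections, to capture different relational patterns. Because attention by itself ignores order, positional encodings are added to inject sequence-order information. Each transformer block pairs the attention sublayer with a position-wise feed-forward sublayer and wraps both in residual connections and layer normalization. The encoder-decoder architecture maps input sequences to output sequences; encoder-only models such as BERT focus on representation learning, while decoder-only models such as GPT focus on generation.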
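The attention and positional-encoding steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it uses a single attention head, random weights in place of learned projections, and the fixed sinusoidal encoding as one common choice of positional encoding. The names `sinusoidal_positions` and `self_attention` and the sizes used are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model).

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even dimensions
    encoding[:, 1::2] = np.cos(angles)             # odd dimensions
    return encoding


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended values and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8                 # illustrative sizes
tokens = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
x = tokens + sinusoidal_positions(seq_len, d_model)  # inject order information
# Random matrices stand in for learned query/key/value projections.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Row i of `weights` shows how strongly position i attends to every other position. A multi-head layer would repeat `self_attention` with separate projection matrices per head and concatenate the outputs.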

Example

In machine translation, a transformer encoder turns the source sentence into context-aware embeddings. The decoder then generates the target sentence one token at a time, attending both to the tokens it has already generated and to the encoder outputs, which keeps the translation aligned with the source.
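The restriction that the decoder attends only to already-generated positions is usually enforced with a causal mask. The sketch below covers only that masking step, assuming raw attention scores are already available; it leaves out the attention over encoder outputs. The function name `causal_attention_weights` and the random scores are hypothetical, chosen for illustration.

```python
import numpy as np


def causal_attention_weights(scores):
    """Softmax over attention scores with future positions masked out.

    scores: (seq_len, seq_len) raw scores, where row i is query position i.
    """
    seq_len = scores.shape[0]
    # True above the diagonal marks positions j > i, i.e. tokens not yet generated.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                           # exp(-inf) becomes 0
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                   # stand-in for q @ k.T / sqrt(d)
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

Every entry above the diagonal of `weights` is zero, so position i can only draw on positions 0 through i. The first row therefore puts all of its weight on the first token.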

Why it matters

Transformers enabled the era of large pretrained language models such as GPT and BERT, setting new benchmarks across NLP and increasingly in vision and multimodal domains. Because they process sequences in parallel, they train efficiently on GPUs and other accelerators, and pretraining followed by fine-tuning has become the dominant paradigm for transfer learning.

Code example

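import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model; the classification head added on top of BERT is newly initialized.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer('Hello world', return_tensors='pt')  # return PyTorch tensors
with torch.no_grad():  # gradients are not needed for a plain forward pass
    outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels)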

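This example loads a pretrained BERT model and tokenizer, then runs a forward pass on sample text using PyTorch tensors. The bert-base-uncased checkpoint does not include a trained classification head, so the head is randomly initialized and the resulting logits are not meaningful until the model is fine-tuned on a labeled task.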