What are transformers in deep learning?
Category: AI & Machine Learning
Short answer
Transformers are deep learning models that replace recurrence with self-attention, allowing them to process entire sequences in parallel and capture long-range dependencies efficiently.
How it works
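For each position in a sequence, self-attention compares that position's query vector with the key vectors of all positions and uses the resulting weights to combine their value vectors, so every position is related to every other in a single step. Multi-head attention runs several such attention operations in parallel, each with its own projections, to capture diverse relational patterns. Because attention by itself ignores token order, positional encodings inject sequence order information. Feed-forward layers, residual connections, and layer normalization complete each transformer block. The encoder-decoder architecture maps input sequences to output sequences, while encoder-only models like BERT focus on representation learning and decoder-only models like GPT focus on text generation.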
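The attention step described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full transformer block: the function names, the random projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes are assumptions made for the example, and the sinusoidal scheme shown is only one common choice of positional encoding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    # Fixed sine/cosine positional encodings (assumes d_model is even).
    positions = np.arange(seq_len)[:, None]       # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)            # even dimensions
    encoding[:, 1::2] = np.cos(angles)            # odd dimensions
    return encoding

def self_attention(x, w_q, w_k, w_v):
    # Project the inputs to queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Score every position against every other, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)            # each row sums to 1
    # Each output row is a weighted mix of all value vectors.
    return weights @ v, weights

# Toy input: 4 positions, model width 8 (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Multi-head attention would repeat `self_attention` with separate, smaller projections per head and concatenate the results. In a trained model the projection matrices are learned rather than random.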
Example
In machine translation, a transformer encoder processes the source sentence into context-aware embeddings. The decoder then generates the target sentence one token at a time, attending to both the previously generated tokens and the encoder outputs to maintain semantic alignment.
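The step-by-step generation in this example can be sketched as follows. The causal mask shows how a decoder position is restricted to earlier positions, and the loop shows one simple decoding strategy (greedy selection, which is an assumption here, since the text does not name one). `step_fn` and `toy_step` are hypothetical stand-ins for a trained decoder, not part of any library.

```python
import numpy as np

def causal_mask(size):
    # True where position i may attend to position j, i.e. j <= i.
    return np.tril(np.ones((size, size), dtype=bool))

def masked_attention_weights(scores, mask):
    # Disallowed positions get -inf so softmax gives them zero weight.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

def greedy_decode(step_fn, encoder_outputs, bos_id, eos_id, max_len=10):
    # step_fn is a hypothetical stand-in for a trained decoder: it maps
    # (tokens so far, encoder outputs) to scores over the vocabulary.
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(tokens, encoder_outputs)
        next_id = int(np.argmax(logits))   # pick the highest-scoring token
        tokens.append(next_id)
        if next_id == eos_id:              # stop at end-of-sequence
            break
    return tokens

def toy_step(tokens, encoder_outputs):
    # Toy rule, not a real model: always prefer the id after the last token.
    logits = np.zeros(5)
    logits[min(tokens[-1] + 1, 4)] = 1.0
    return logits

# With uniform scores, each row spreads its weight over allowed positions only.
weights = masked_attention_weights(np.zeros((3, 3)), causal_mask(3))
print(weights)

decoded = greedy_decode(toy_step, encoder_outputs=None, bos_id=0, eos_id=3)
print(decoded)  # [0, 1, 2, 3]
```

A real decoder would also apply cross-attention over `encoder_outputs` at every step; the toy function ignores them to keep the sketch self-contained.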
Why it matters
Transformers enabled the era of large language models such as GPT and BERT, setting new state-of-the-art results across NLP and increasingly in vision and multimodal domains. Because all positions in a sequence can be processed in parallel during training, they make efficient use of GPUs and other accelerators, and pretraining followed by fine-tuning has become the dominant paradigm for transfer learning.
Example
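import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the pretrained tokenizer and BERT weights; the classification head is newly initialized
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
inputs = tokenizer('Hello world', return_tensors='pt')  # return PyTorch tensors
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one sequence, two labels by default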
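This example loads a pretrained BERT encoder and tokenizer, then runs a forward pass on sample text using PyTorch tensors. The sequence-classification head added on top of bert-base-uncased is newly initialized, so the resulting logits are not meaningful until the model is fine-tuned on a labeled task.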