A transformer is a neural network architecture built around self-attention, designed to model relationships between all elements of a sequence in parallel (rather than step by step, as an RNN does).
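The parallel, all-pairs computation can be sketched as scaled dot-product self-attention. This is a minimal NumPy illustration with made-up dimensions and random weights (`w_q`, `w_k`, `w_v` are assumed projection matrices, not from any specific model): every token's output is computed from every other token in one matrix product, with no sequential recurrence.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the whole sequence at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per token
    return weights @ v                            # context-weighted mix of value vectors

# Illustrative sizes: 4 tokens, 8-dimensional embeddings (arbitrary choices)
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Note that `scores` is a full `seq_len × seq_len` matrix, which is what gives attention its all-pairs view of the sequence and its quadratic cost in sequence length.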

- **Core idea**: Self-attention lets each token weigh every other token to build a context-aware representation.
- **Architecture**: Stacks of layers with multi-head attention, position-wise feed-forward networks, residual connections, and layer normalization.
- **Positional info**: Uses positional encodings to inject order into otherwise order-agnostic attention.
- **Variants**: Encoder–decoder (e.g., T5), encoder-only (e.g., BERT), decoder-only (e.g., GPT).
- **Why it’s powerful**: Training parallelizes across sequence positions, attention captures long-range dependencies directly, and performance scales predictably with data and compute.
- **Applications**: NLP (translation, summarization, QA), vision (ViT), speech, protein modeling, recommendation, RL.
- **Notable models**: BERT, GPT series, T5, ViT, Llama, PaLM.
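The architecture and positional-info points above can be combined into one sketch of a single encoder layer: sinusoidal positional encodings added to the input, multi-head attention, residual connections, layer normalization, and a position-wise feed-forward network. All dimensions, weight names (`heads`, `w_o`, `w1`, `w2`), and the ReLU activation are illustrative assumptions, not taken from a particular published model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos positional encodings: inject order into order-agnostic attention."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def encoder_layer(x, heads, w_o, w1, w2):
    """One transformer encoder layer: multi-head attention + FFN, each with residual + norm."""
    head_outs = []
    for w_q, w_k, w_v in heads:                   # each head attends independently
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        s = q @ k.T / np.sqrt(k.shape[-1])
        a = np.exp(s - s.max(-1, keepdims=True))
        a /= a.sum(-1, keepdims=True)
        head_outs.append(a @ v)
    attn = np.concatenate(head_outs, axis=-1) @ w_o  # concatenate heads, project back
    x = layer_norm(x + attn)                      # residual connection + layer norm
    ffn = np.maximum(0, x @ w1) @ w2              # position-wise feed-forward (ReLU)
    return layer_norm(x + ffn)                    # residual connection + layer norm

# Illustrative sizes: 4 tokens, model width 8, 2 heads of width 4, FFN width 16
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head, d_ff = 4, 8, 2, 4, 16
x = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model))
w1, w2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
y = encoder_layer(x, heads, w_o, w1, w2)
print(y.shape)  # (4, 8): same shape in and out, so layers stack
```

Because each layer maps a `seq_len × d_model` array to the same shape, these blocks stack into the deep encoder (or, with causal masking and cross-attention, decoder) stacks used by the variants listed above.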