## What is a Transformer?

A **Transformer** is a type of neural network architecture that was introduced in the paper *“Attention Is All You Need”* (Vaswani et al., 2017).  It is now the backbone of almost every state‑of‑the‑art language model (BERT, GPT‑3, T5, etc.) and is increasingly used in vision, audio, and multimodal tasks.

At a high level, a Transformer replaces recurrent layers (RNNs, LSTMs) and convolutional layers with **self‑attention** mechanisms plus simple feed‑forward networks.  This change gives the model:

- **Full parallelism**: All tokens in a sequence are processed at once, not sequentially.
- **Long‑range dependencies**: Each token can attend to every other token directly, no matter how far apart they are.
- **Scalable training**: The architecture can be parallelized over thousands of GPUs or TPUs.

Let’s unpack the core ideas.

---

## 1. Core Building Blocks

| Component | What it does | Why it matters |
|-----------|--------------|----------------|
| **Multi‑head Self‑Attention** | For each token, it computes a weighted sum over all tokens in the sequence, using multiple sets of learnable projections (heads). | Captures *multiple* relational patterns (syntax, semantics, etc.) in parallel. |
| **Positional Encoding** | Injects an ordering signal because self‑attention alone is permutation‑invariant. | Allows the model to distinguish “the first word” from “the last word” even though all tokens are processed simultaneously. |
| **Feed‑Forward Network (FFN)** | A simple two‑layer MLP applied position‑wise (same weights for every token). | Adds nonlinearity and capacity beyond pure attention. |
| **Residual Connections + Layer Normalization** | `y = LayerNorm(x + Sublayer(x))` | Facilitates gradient flow and stabilizes training. |
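As a concrete illustration, here is a minimal NumPy sketch of single‑head scaled dot‑product attention, the core operation inside the multi‑head mechanism above (all names and shapes are illustrative, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query/key/value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights         # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 16))            # 7 tokens, d_model = 16
out, attn = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V
print(out.shape, attn.shape)            # (7, 16) (7, 7)
```

Multi‑head attention simply runs several of these in parallel, each with its own learned projections of Q, K, and V, and concatenates the results.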

A single *Transformer block* (or *layer*) therefore consists of:

```
Input x
↓
Multi-head Self-Attention
↓
Add & LayerNorm
↓
Feed-Forward Network      (same weights applied at every position)
↓
Add & LayerNorm
↓
Output
```
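The `Add & LayerNorm` steps can be sketched as a generic wrapper around any sublayer. This NumPy sketch uses a toy position‑wise FFN as the sublayer; all sizes and weight initializations are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # The post-norm pattern from the diagram: y = LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def ffn(x):
    # Position-wise feed-forward network: the same weights for every token.
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU nonlinearity

x = rng.normal(size=(7, d_model))         # 7 tokens
y = residual_block(x, ffn)
print(y.shape)                            # (7, 16)
```

(Learnable scale and bias parameters of LayerNorm are omitted here for brevity.)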

When stacking many of these blocks (e.g., 12, 24, 96 layers), the network learns hierarchical representations of language or other data modalities.

---

## 2. Three General Architectures

1. **Encoder‑only** (e.g., BERT)  
   *Used mainly for “understanding” tasks*: masked language modeling, classification, question answering, etc.

2. **Decoder‑only** (e.g., GPT)  
   *Used mainly for “generation” tasks*: next‑token prediction, translation, summarization, etc.

3. **Encoder–Decoder** (e.g., T5, or the original Transformer)  
   *Used for sequence‑to‑sequence tasks*: machine translation, summarization, code generation, etc.

The encoder and decoder blocks are almost identical, except that the decoder adds a *causal* (autoregressive) mask so that position *i* can only attend to positions *< i*, and, in encoder–decoder models, a cross‑attention sublayer that attends to the encoder’s outputs.
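A causal mask is typically built as a matrix that blocks everything above the diagonal; the blocked entries are set to negative infinity in the attention scores so the softmax assigns them zero weight. A minimal sketch:

```python
import numpy as np

def causal_mask(n):
    # True above the diagonal = future positions a query must NOT attend to.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)  # block future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))
weights = masked_softmax(scores, causal_mask(n))
# Row i now has nonzero weight only on positions 0..i.
print(np.round(weights, 2))
```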

---

## 3. Why Transformers Outperform RNNs/CNNs

| Issue | RNN | CNN | Transformer |
|-------|-----|-----|-------------|
| **Sequential processing** | Inference & training are inherently sequential → slower | Sliding windows, limited context | Fully parallel across sequence length |
| **Long‑range dependence** | Vanishing gradients, requires many steps | Strided convolutions, large receptive field | Direct attention between any pair of tokens (path length O(1)) |
| **Parameter efficiency** | Large hidden states needed | Requires many layers/filters | Reuses attention heads + FFN across all positions |

Note that self‑attention scales **quadratically** with sequence length in memory and time. Unlike RNNs, however, it involves no sequential dependency across positions, so the work parallelizes well in practice, and several *efficient transformer* variants (Linformer, Performer, Reformer, etc.) reduce this cost to linear or near‑linear.
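To make the quadratic cost concrete, here is a back‑of‑the‑envelope calculation of the attention‑score memory for a single layer at fp32 (the head count and sequence lengths are illustrative):

```python
def attention_matrix_bytes(seq_len, num_heads, bytes_per_elem=4):
    # One (seq_len x seq_len) score matrix per head.
    return num_heads * seq_len * seq_len * bytes_per_elem

for n in (512, 4096, 32768):
    gib = attention_matrix_bytes(n, num_heads=12) / 2**30
    print(f"seq_len={n:>6}: {gib:9.3f} GiB per layer")
```

Doubling the sequence length quadruples this memory, which is why long‑context models need efficient‑attention tricks.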

---

## 4. Training a Transformer

1. **Pretraining (self‑supervised)**  
   - **Masked Language Modeling (MLM)** – Randomly mask tokens and train the model to predict them (BERT).  
   - **Causal Language Modeling** – Predict next token (GPT).  
   - **Sequence‑to‑sequence objectives** – e.g., T5’s “text‑to‑text” pretraining.  

2. **Fine‑tuning (supervised)**  
   Add a task‑specific head (classification layer, sequence‑labeling head) and train on labeled data. Transfer learning works remarkably well.
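The MLM objective from step 1 can be sketched with BERT’s standard 80/10/10 corruption recipe; the token IDs, vocabulary size, and `MASK_ID` below are illustrative placeholders:

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30000    # illustrative ids (BERT uses 103 for [MASK])

def mlm_mask(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is needed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)        # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                     # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))   # 10%: random token
            else:
                corrupted.append(tok)                         # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(-100)       # conventionally ignored by the loss
    return corrupted, labels

ids = list(range(1000, 1020))
corrupted, labels = mlm_mask(ids)
print(corrupted)
print(labels)
```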

---

## 5. Practical Example: BERT

- 12 encoder layers, 12 attention heads, ~110M parameters.  
- Masked language modeling (plus a next‑sentence prediction objective) pretrained on billions of tokens.  
- Fine‑tuned on GLUE, SQuAD, etc., with linear classification heads.  

Outcome: state‑of‑the‑art scores on many NLP benchmarks.

---

## 6. Common Transformer Variants

| Model | Purpose | Key Differences |
|-------|---------|-----------------|
| **GPT‑n** | Generation | Decoder‑only, causal mask. |
| **BERT** | Understanding | Encoder‑only, MLM objective. |
| **RoBERTa** | Improved BERT | Larger batch, longer training, no next‑sentence objective. |
| **T5** | Unified text tasks | Encoder–decoder, “text‑to‑text” formulation. |
| **Vision Transformers (ViT)** | Vision tasks | Splits images into patches, uses encoder only. |
| **Efficient Transformers** (Linformer, Performer, Reformer, etc.) | Lower compute | Approximate self‑attention, sparse attention. |

---

## 7. What a Transformer *Actually* Looks Like

```text
Sequence:  ["The", "cat", "sat", "on", "the", "mat", "<SEP>"]
Positional encodings added to each token embedding.
Each of the 12 attention heads computes a weight distribution over the tokens:
  head 1: [0.1, 0.4, 0.2, 0.2, ...]  (focus on nearby tokens)
  head 2: [0.0, 0.1, 0.8, 0.05, ...] (focus on long‑range “the” ~ “mat”)
Feed‑forward networks applied to each position.
Stacked through many layers → final contextual embeddings.
```
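The positional encodings mentioned in the sketch above are, in the original paper, fixed sinusoids of varying frequency. A minimal NumPy version (dimensions illustrative; many modern models learn positional embeddings instead):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=7, d_model=16)
print(pe.shape)     # (7, 16)
# These vectors are added to the token embeddings before the first layer.
```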

---

## 8. Take‑aways

- **Self‑Attention** is the hallmark of Transformers: each token “sees” every other token in one step.  
- **Parallelism** allows massive sequence lengths and batch sizes, making them trainable on powerful GPUs/TPUs.  
- **Pretrain‑and‑fine‑tune** paradigm gives incredible performance across language, vision, speech, and multimodal tasks.  
- Many *efficient* variants keep the core idea while reducing memory/time, opening the door to edge‑device deployment.

So, when you hear “transformer” in a machine‑learning context, think of a self‑attentive neural net that can process an entire sequence all at once and learns rich, context‑aware representations from huge unlabeled corpora.