# 🧠 NeurIPS 2025 — Impact of Layer Norm on Memorization and Generalization in Transformers

This repository contains the official implementation for the paper: **"Impact of Layer Norm on Memorization and Generalization in Transformers"**, accepted at **NeurIPS 2025**.

---

## 🧩 Overview

This work studies how **Layer Normalization (LN)** governs the balance between **memorization** and **generalization** in transformer architectures.  

We show that removing LN parameters **mitigates memorization in Post-LN models (e.g., BERT)** but **impairs learning and generalization in Pre-LN models (e.g., GPT-Neo)**. We also show that early LNs play the most significant role, and we complement these results with a gradient norm analysis.

These findings reveal intrinsic differences in how layer normalization influences the memorization and learning dynamics of modern large-scale transformers.
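To make the Post-LN / Pre-LN distinction concrete, here is a minimal pure-Python sketch of where the normalization sits relative to the residual connection. It is illustrative only and is not taken from the repository: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and modelling "LN parameter removal" as dropping the learnable gain (`gamma`) and bias (`beta`) is an assumption about what the scripts do.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean and unit variance.

    gamma/beta are the optional learnable gain and bias. Passing None for
    both models an LN without learnable parameters (an assumption about
    what "LN parameter removal" means, not the repository's code).
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is None and beta is None:
        return normed
    gamma = gamma if gamma is not None else [1.0] * len(x)
    beta = beta if beta is not None else [0.0] * len(x)
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v + 1.0 for v in x]


def post_ln_block(x, gamma=None, beta=None):
    """Post-LN: normalize after the residual addition, LN(x + F(x))."""
    residual = [a + b for a, b in zip(x, toy_sublayer(x))]
    return layer_norm(residual, gamma, beta)


def pre_ln_block(x, gamma=None, beta=None):
    """Pre-LN: normalize only the sublayer input, x + F(LN(x))."""
    branch = toy_sublayer(layer_norm(x, gamma, beta))
    return [a + b for a, b in zip(x, branch)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("post-LN:", post_ln_block(x))
    print("pre-LN: ", pre_ln_block(x))
```

In the Post-LN block every output passes through LN, while in the Pre-LN block the residual stream `x` bypasses LN entirely and only the sublayer branch is normalized.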

---

## ⚙️ How to Run Experiments

All experiments are launched through the dataset-specific script:
`<dataset_name>_dataset_lm_random_noisy.py`

### ▶️ 1. Run experiments **without LN parameter removal**
```bash
python3.8 <dataset_name>_dataset_lm_random_noisy.py \
  --model_name <model_name> \
  --epochs <num_epochs> \
  --device cuda:0 \
  --learning_rate <learning_rate> \
  --batch_size <batch_size> \
  --remove none
```

### ▶️ 2. Run experiments **with LN parameter removal (all layers)**
```bash
python3.8 <dataset_name>_dataset_lm_random_noisy.py \
  --model_name <model_name> \
  --epochs <num_epochs> \
  --device cuda:0 \
  --learning_rate <learning_rate> \
  --batch_size <batch_size> \
  --remove layer_norm
```

### ▶️ 3. Run experiments **for early / middle / late layer LN removal**
```bash
python3.8 <dataset_name>_dataset_lm_random_noisy.py \
  --model_name <model_name> \
  --epochs <num_epochs> \
  --device cuda:0 \
  --learning_rate <learning_rate> \
  --batch_size <batch_size> \
  --remove layer_norm \
  --multiple_layer_analysis
```
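The three training invocations above differ only in the `--remove` value and the optional `--multiple_layer_analysis` flag, so they can be assembled programmatically. The helper below is a hypothetical convenience wrapper and not part of the repository. It uses only the flags shown above, takes its defaults from the Training Details section, and assumes `none` and `layer_norm` are the only `--remove` values.

```python
def build_train_command(dataset_name, model_name, epochs,
                        learning_rate=2e-5, batch_size=16,
                        device="cuda:0", remove="none",
                        multiple_layer_analysis=False):
    """Return the argument list for one training run (hypothetical helper)."""
    # Only the two values documented in this README are accepted here.
    if remove not in ("none", "layer_norm"):
        raise ValueError("remove must be 'none' or 'layer_norm'")
    cmd = [
        "python3.8", f"{dataset_name}_dataset_lm_random_noisy.py",
        "--model_name", model_name,
        "--epochs", str(epochs),
        "--device", device,
        "--learning_rate", str(learning_rate),
        "--batch_size", str(batch_size),
        "--remove", remove,
    ]
    if multiple_layer_analysis:
        cmd.append("--multiple_layer_analysis")
    return cmd


if __name__ == "__main__":
    # Print the command for the early / middle / late removal analysis.
    print(" ".join(build_train_command(
        "emotions", "bert-base-uncased", 40,
        remove="layer_norm", multiple_layer_analysis=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.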

### ▶️ 4. Compute **gradient norms** after training
```bash
python3.8 <dataset_name>_dataset_lm_random_noisy.py \
  --model_name <model_name> \
  --model_path <saved_model_path>
```
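As a rough illustration of what a per-stage gradient norm summary could look like, the sketch below computes L2 norms over early, middle, and late layer groups. It is not the script's implementation. It assumes the gradients have already been extracted as one flat list of floats per layer, and the equal-thirds split into early / middle / late is an assumption as well.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat sequence of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def grad_norms_by_stage(layer_grads):
    """Group per-layer gradients into early / middle / late and take L2 norms.

    layer_grads: list indexed by layer, each entry a flat list of floats.
    The equal-thirds split is an assumption, not the repository's rule.
    """
    names = ("early", "middle", "late")
    stages = {name: [] for name in names}
    n = len(layer_grads)
    for idx, grads in enumerate(layer_grads):
        # Integer bucket 0, 1, or 2 depending on the layer's position.
        stages[names[idx * 3 // n]].extend(grads)
    return {name: l2_norm(vals) for name, vals in stages.items()}


if __name__ == "__main__":
    # Toy gradients for a hypothetical 6-layer model.
    toy = [[0.3, 0.4], [0.1], [0.05], [0.02], [0.01], [0.0]]
    print(grad_norms_by_stage(toy))
```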

---

## 🧠 Dataset–Model Configurations

This study evaluates the impact of LayerNorm parameter removal across a diverse set of **textual and visual datasets**, using both **Post-LN** and **Pre-LN** transformer architectures.

---

### 🟩 Post-LN Models

| **Dataset** | **Model** | **Description** |
|:-------------|:-----------|:----------------|
| **Emotions** | **BERT** (Devlin et al., 2019) | 12-layer bidirectional transformer for masked language modeling and next sentence prediction. |
| **Emotions** | **DeBERTa** (He et al., 2020) | 12-layer model with disentangled position/content embeddings and decoding-enhanced attention. |
| **Emotions** | **DistilBERT** (Sanh et al., 2019) | 6-layer distilled BERT that is smaller and faster while retaining strong performance. |
| **News** | **ELECTRA** (Clark et al., 2020) | 12-layer model using replaced token detection for sample-efficient pre-training. |
| **News** | **Longformer** (Beltagy et al., 2020) | 12-layer model with sparse attention for efficient long-sequence processing. |
| **Tweets** | **RoBERTa** (Liu et al., 2019) | 12-layer robustly optimized BERT variant trained with dynamic masking and more data. |

---

### 🟦 Pre-LN Models

| **Dataset** | **Model** | **Description** |
|:-------------|:-----------|:----------------|
| **Emotions** | **GPT-Neo 125M** (Black et al., 2022) | 12-layer open-source causal language model trained on The Pile dataset. |
| **News** | **Qwen2 0.5B** (Yang et al., 2024) | 24-layer efficient LLM optimized for generative tasks, using RMSNorm (Zhang & Sennrich, 2019). |
| **Tweets** | **GPT-2 Medium** (Radford et al., 2019) | 24-layer unidirectional transformer trained for causal language modeling. |
| **Tweets** | **RoBERTa-PreLayerNorm** (Ott et al., 2019) | 24-layer variant of RoBERTa with Pre-LN configuration for improved training stability. |
| **CIFAR-10** | **ViT-B** (Dosovitskiy et al., 2020) | 12-layer Vision Transformer (Base) for image classification. |
| **UTK-Face** | **DeiT** (Touvron et al., 2021) | 12-layer Data-efficient Image Transformer trained with distillation and no external data. |
| **NICO++** | **ViT-S** (Assran et al., 2022) | 12-layer smaller ViT variant trained with Masked Siamese Networks (MSN). |

---

### 📊 Summary

- **Post-LN architectures (6 models):** BERT, DeBERTa, DistilBERT, ELECTRA, Longformer, RoBERTa  
- **Pre-LN architectures (7 models):** GPT-Neo, GPT-2, Qwen2, RoBERTa-PreLayerNorm, ViT-B, ViT-S, DeiT  
- **Datasets covered:** Emotions, News, Tweets, CIFAR-10, UTK-Face, NICO++  

These configurations allow a **comprehensive cross-domain analysis** of how removing LayerNorm parameters impacts **memorization** and **generalization** in both **encoder-style** and **decoder-style** transformer models.

---



## 📈 Training Details

- **Learning Rate:** `2e-5`  
- **Batch Size:** `16`  
- **Epochs:**  
  - Post-LN models → 40 epochs  
  - Pre-LN models → 70 epochs  
- **Optimizer:** Adam  
- **Data Augmentation:** None  

We train Pre-LN models for longer (70 epochs) because LN removal reduces their initial learning capability. The extra epochs let us assess whether the model can recover accuracy over time.
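The settings above can be captured in a small lookup that picks the epoch count from the architecture family. This is a hypothetical helper, not repository code. The model names are the short display names from the Summary section rather than the identifiers passed to `--model_name`, and the returned dictionary keys are illustrative.

```python
# Short display names from the Summary section (not Hugging Face model IDs).
POST_LN_MODELS = {"BERT", "DeBERTa", "DistilBERT", "ELECTRA",
                  "Longformer", "RoBERTa"}
PRE_LN_MODELS = {"GPT-Neo", "GPT-2", "Qwen2", "RoBERTa-PreLayerNorm",
                 "ViT-B", "ViT-S", "DeiT"}


def training_config(model):
    """Return the README's training settings for a listed model."""
    if model in POST_LN_MODELS:
        epochs = 40
    elif model in PRE_LN_MODELS:
        epochs = 70
    else:
        raise ValueError(f"unknown model: {model}")
    return {
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": epochs,
        "optimizer": "Adam",
    }


if __name__ == "__main__":
    print(training_config("BERT"))
    print(training_config("GPT-Neo"))
```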

---

## 📘 Glossary

| Term | Description |
|------|--------------|
| `<dataset_name>` | One of: `emotions`, `news`, `tweets`, `cfar10`, `utk_face`, `nico` |
| `<model_name>` | Model backbone name (e.g., `bert-base-uncased`, `google/vit-base-patch16-224-in21k`, `google/electra-base-discriminator`, etc.) |
| `<num_epochs>` | Training epochs (40 for Post-LN, 70 for Pre-LN) |
| `<learning_rate>` | Learning rate (default `2e-5`) |
| `<batch_size>` | Batch size (default `16`) |
| `<saved_model_path>` | Path to checkpoint for analysis |
| `--remove none` | Standard training with LN parameters intact |
| `--remove layer_norm` | Remove LN parameters across all layers |
| `--multiple_layer_analysis` | Evaluate early / middle / late layer LN removal |

---

## 🧭 Citation

If you use this repository, please cite our paper:

```bibtex
@inproceedings{singhal2025NeurIPS,
  title={Impact of Layer Norm on Memorization and Generalization in Transformers},
  author={Rishi Singhal and Jung-Eun Kim},
  booktitle={NeurIPS},
  year={2025}
}
```

---

## 🔗 Resources

- 📄 **Paper:** [NeurIPS 2025 (accepted)](https://neurips.cc/virtual/2025/poster/119800)  
- 💻 **Code:** [https://github.com/JEKimLab/NeurIPS2025_LayernormMemorization](https://github.com/JEKimLab/NeurIPS2025_LayernormMemorization)

---
