This is the implementation for the submission "Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training".

## LANTON Pretraining on Openwebtext (GPT-style)

Train a GPT-style causal LM on **Openwebtext-100k** using the **LANTON** optimizer. The training script uses Hugging Face streaming data, PyTorch DDP, and BF16 autocast.

## Contents
- `train_gpt.py` – training script (GPT tokenizer, checkpoints, W&B logging)
- `run.sh` – example launcher (4 GPUs, cosine LR, warmup, LANTON)
- `lanton.py` – the optimizer implementation, including the parameter update and the learning-rate rescaling by the noise ratio

## Quick Start

### 1) Environment
```bash
conda create -n lanton python=3.10 -y
conda activate lanton

# Pick wheels matching your CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

pip install transformers datasets accelerate wandb

```

### 2) Launch the training (4 or 8 GPUs)
`run.sh` launches training on 4 GPUs; for 8 GPUs, set `--nproc_per_node=8`:
```bash
WANDB_NAME="gpt_medium_openwebtext"
lrs=(5e-3)
for lr in "${lrs[@]}"; do
    echo "▶ running lr=$lr" 
    torchrun --standalone --nproc_per_node=4 train_gpt.py \
    --model gpt2-medium \
    --optimizer LANTON \
    --dataset openwebtext-100k \
    --batch_size 16 \
    --num_epochs 1 \
    --hidden_size 768 \
    --max_lr $lr \
    --warmup_steps=300 \
    --wandb_log \
    --wandb_project "${WANDB_NAME}" \
    --val_interval 50 \
    --val_max_batches 100 \
    --save_interval 500 \
    --noise_momentum=0.9 \
    --scale1 300 \
    --scale2 1.0
done

```
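
The launcher above uses a cosine learning-rate schedule with warmup (`--max_lr`, `--warmup_steps`). As a rough illustration of what such a schedule typically looks like, here is a minimal sketch that assumes linear warmup to `max_lr` followed by cosine decay to `min_lr`. The function name `lr_at_step` and the `total_steps` and `min_lr` parameters are hypothetical and not taken from `train_gpt.py`, whose actual schedule may differ.

```python
import math


def lr_at_step(step, max_lr=5e-3, warmup_steps=300, total_steps=10_000, min_lr=0.0):
    """Learning rate at a given optimizer step (assumed schedule shape).

    total_steps and min_lr are illustrative assumptions; max_lr and
    warmup_steps mirror the values passed in run.sh.
    """
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these assumed defaults the rate rises during the first 300 steps, peaks at `max_lr`, and then decays smoothly toward `min_lr` by `total_steps`.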



### Reproducibility

Seeds are set (PyTorch/CUDA/cuDNN deterministic). Hyperparameters are logged to W&B. Periodic checkpoints and logs allow resuming and replotting.
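
For readers adapting the code, seeding of this kind is commonly done along the following lines. This is a minimal sketch, not the code in `train_gpt.py`: the helper name `seed_everything` is hypothetical, and the NumPy and PyTorch imports are guarded so the snippet also runs where those packages are absent.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the common RNGs (hypothetical helper, not from train_gpt.py)."""
    random.seed(seed)
    # Only affects hash randomization in subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Prefer deterministic cuDNN kernels over autotuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Under DDP, a helper like this would typically be called once per process before the model and data loaders are built. Deterministic cuDNN settings can reduce throughput.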

### Licenses / Data

Models: GPT-style architecture implemented locally (no OpenAI weights included).

Tokenizer: GPT (MIT license).

Dataset: Openwebtext via Hugging Face; follow the dataset card’s license/terms.