This is the implementation for the submission "Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training".

## LANTON Pretraining on C4 (TinyLlama/LLaMA-style)

Train a LLaMA-style causal LM on **C4 (en)** using the **LANTON** optimizer. Uses Hugging Face streaming data, PyTorch DDP, and BF16 autocast.

## Contents
- `train_c4.py` – training script (DDP, streaming C4, T5 tokenizer, checkpoints, W&B logging)
- `run.sh` – example launcher (4 GPUs, cosine LR, warmup, LANTON)
- `modelling_llama_new` – loads LLaMA models of a specified size
- `lanton.py` – the optimizer implementation, including the parameter update and the learning rate rescale by the noise ratio.

## Quick Start

### Environment
```bash
conda create -n lanton python=3.10 -y
conda activate lanton

# Pick wheels matching your CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

pip install transformers datasets accelerate wandb
```
### Launch the training (4 or 8 GPUs)
`run.sh` launches a 4-GPU run; for 8 GPUs, change `--nproc_per_node` to 8:
```bash
lr=5e-3
WANDB_NAME="llama_1B"
SEED=42

echo "=== Running lr=${lr} ==="
python -m torch.distributed.run --standalone --nproc_per_node=4 train_c4.py \
    --batch_size=64 \
    --grad_micro_steps=16 \
    --total_bs=1024 \
    --max_lr=$lr \
    --weight_decay=0.1 \
    --warmup_iters=1000 \
    --max_iters=10000 \
    --eval_interval=200 \
    --model_name=TinyLlama \
    --n_layer=12 \
    --save_interval=100 \
    --seed="${SEED}" \
    --wandb_run_name="llama_C4" \
    --wandb_project="${WANDB_NAME}" \
    --wandb_log \
    --optimizer=LANTON \
    --noise_momentum=0.9 \
    --scale1=300 \
    --scale2=1.0
echo "=== Done lr=${lr} ==="
```
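The launcher combines `--max_lr`, `--warmup_iters`, and `--max_iters` with a cosine schedule. The sketch below shows one common way such a schedule is computed; it is not taken from `train_c4.py`. The function name `lr_at`, the linear warmup shape, and the `min_lr` floor of 0 are assumptions made for illustration.

```python
import math


def lr_at(it, max_lr=5e-3, warmup_iters=1000, max_iters=10000, min_lr=0.0):
    """Learning rate at iteration `it` (hypothetical helper, not from the repo).

    Assumes linear warmup to `max_lr`, then cosine decay down to `min_lr`.
    """
    # Linear warmup: ramp from max_lr / warmup_iters up to max_lr.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # Past the end of training, stay at the floor.
    if it >= max_iters:
        return min_lr
    # Cosine decay: progress runs from 0 to 1 over the post-warmup iterations.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    # Print the schedule at a few iterations using the values from run.sh.
    for step in (0, 500, 1000, 5500, 10000):
        print(step, lr_at(step))
```

With the values from `run.sh`, this schedule peaks at `5e-3` when warmup ends (iteration 1000) and reaches the floor at iteration 10000. The actual script may use a different warmup shape or a nonzero minimum.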
### Resuming the training (4 or 8 GPUs)
To resume automatically when a matching checkpoint exists, add `--resume_from_checkpoint auto` to the launch command in `run.sh`:

```bash
python -m torch.distributed.run --standalone --nproc_per_node=4 train_c4.py \
  ... --resume_from_checkpoint auto
```
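As a rough illustration of what an `auto` resume mode might do, the sketch below picks the checkpoint with the highest step number from a directory. The file-name pattern `ckpt_<step>.pt` and the helper `find_latest_checkpoint` are hypothetical. The matching rule in `train_c4.py` may differ, for example by also comparing model name and seed.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme, assumed only for this sketch: ckpt_<step>.pt
CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pt$")


def find_latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint with the largest step, or None if none exists."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = CKPT_PATTERN.match(path.name)
        # Compare steps numerically so ckpt_1000.pt beats ckpt_200.pt.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing the step as an integer avoids the lexicographic ordering problem, where `ckpt_200.pt` would otherwise sort after `ckpt_1000.pt`.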

### Reproducibility

Seeds are set for PyTorch and CUDA, and cuDNN runs in deterministic mode. Hyperparameters are logged to W&B. Periodic checkpoints and logs allow resuming and replotting.
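A minimal sketch of seeding along these lines is shown below. The helper name `set_seed` is hypothetical and the code is not copied from `train_c4.py`. The guarded import is there only so the snippet can be read and run on a machine without PyTorch.

```python
import random

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def set_seed(seed: int = 42) -> None:
    """Seed Python and, if available, PyTorch/CUDA; make cuDNN deterministic."""
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)
    # Documented as safe to call when CUDA is unavailable (silently ignored).
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for repeatable cuDNN kernel selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Even with these settings, some GPU operations can remain nondeterministic, so runs with the same seed may still differ slightly across hardware or library versions.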

### Licenses / Data

Models: LLaMA-style architecture implemented locally (no Meta weights included).

Tokenizer: t5-small (Apache-2.0).

Dataset: allenai/c4 (English) via Hugging Face; follow the dataset card’s license/terms.