# AdaMuon

This is the official repository for the paper "AdaMuon: Adaptive Muon Optimizer".

## Introduction

AdaMuon is an optimizer built on Muon. It can achieve more than 40% higher training efficiency than AdamW.

## Quick Start

This repository contains two projects: the GPT-2 experiments and a copy of the open-source Megatron-LM code, included to facilitate large-scale experiments.

To use AdaMuon in your own training pipeline with other architectures and datasets, adapt the following pseudocode:

```python
from opt_config import configure_optimizers

# Model
model = Model()

# Optimizer
optimizer = configure_optimizers(model.parameters(), weight_decay=0.1, learning_rate=6e-4)

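# Training
for epoch in range(epochs):
    for X, Y in data_loader:
        # Standard training step: forward pass, backward pass, parameter update
        logits, loss = model(X, Y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        # ...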
```
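
For a version that runs end to end without this repository installed, the loop above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the local `configure_optimizers` is a hypothetical stand-in that returns PyTorch's `AdamW`, `TinyModel` and the synthetic regression data are made up for illustration, and the update step (`optimizer.step()` followed by `optimizer.zero_grad()`) assumes the returned object follows the standard `torch.optim.Optimizer` interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def configure_optimizers(params, weight_decay, learning_rate):
    # Hypothetical stand-in for opt_config.configure_optimizers: it returns
    # plain AdamW so the sketch runs without this repository.
    return torch.optim.AdamW(params, lr=learning_rate, weight_decay=weight_decay)


class TinyModel(nn.Module):
    """Toy regression model whose forward pass returns (predictions, loss)."""

    def __init__(self, in_features=4):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)

    def forward(self, X, Y):
        pred = self.linear(X)
        loss = F.mse_loss(pred, Y)
        return pred, loss


# Synthetic data: targets are a fixed linear function of the inputs.
torch.manual_seed(0)
X_all = torch.randn(64, 4)
Y_all = X_all @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
data_loader = DataLoader(TensorDataset(X_all, Y_all), batch_size=16, shuffle=True)

# Model
model = TinyModel()

# Optimizer (same call shape as in the snippet above)
optimizer = configure_optimizers(model.parameters(), weight_decay=0.1, learning_rate=6e-4)

# Full-dataset loss before training, for comparison afterwards.
with torch.no_grad():
    initial_loss = model(X_all, Y_all)[1].item()

# Training
epochs = 20
for epoch in range(epochs):
    for X, Y in data_loader:
        logits, loss = model(X, Y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

with torch.no_grad():
    final_loss = model(X_all, Y_all)[1].item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

To try AdaMuon itself, delete the stand-in function and import `configure_optimizers` from `opt_config` as shown earlier; the rest of the loop should carry over if the returned optimizer exposes the same `step()` and `zero_grad()` methods.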
