# NeurIPS2025_codeAppendix

This repository contains code to reproduce the experiments presented in **FAL: First Attentions Last**, an efficient Transformer architecture designed to reduce communication overhead in Tensor Parallelism.

FAL bypasses per-layer MHA-MLP communication by reusing the first layer’s attention output as the input to every MLP. This lets MHA and MLP execute in parallel and eliminates the costly All-Reduce operations between them. We also introduce FAL+, which improves model quality by augmenting attention outputs with the normalized first attention signal.
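
The dataflow described above can be illustrated with a small, framework-free sketch. This is not the repository's implementation: the function names (`baseline_block`, `fal_block`, `fal_plus_block`), the residual wiring, and the simplified `normalize` stand-in for layer normalization are all assumptions made for illustration, and `attn` / `mlp` are arbitrary callables on plain Python lists.

```python
import math
from typing import Callable, List

Vector = List[float]
Fn = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def normalize(v: Vector, eps: float = 1e-5) -> Vector:
    """Simplified stand-in for layer normalization (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((t - mean) ** 2 for t in v) / len(v)
    return [(t - mean) / math.sqrt(var + eps) for t in v]


def baseline_block(x: Vector, attn: Fn, mlp: Fn) -> Vector:
    """Standard block: the MLP consumes this layer's attention output."""
    h = add(x, attn(x))
    return add(h, mlp(h))


def fal_block(x: Vector, first_attn_out: Vector, attn: Fn, mlp: Fn) -> Vector:
    """Assumed FAL wiring: the MLP input uses only the first layer's attention
    output, so the two branches below do not depend on each other."""
    attn_out = attn(x)                     # branch 1
    mlp_out = mlp(add(x, first_attn_out))  # branch 2, independent of attn_out
    return add(add(x, attn_out), mlp_out)


def fal_plus_block(x: Vector, first_attn_out: Vector, attn: Fn, mlp: Fn) -> Vector:
    """Assumed FAL+ wiring: this layer's attention output is augmented with
    the normalized first attention output before entering the MLP."""
    h = add(x, attn(x))
    return add(h, mlp(add(h, normalize(first_attn_out))))
```

In this sketch, `fal_block` never feeds `attn_out` into `mlp`, which is the property that would let the two branches run concurrently. `fal_plus_block` reintroduces the dependency on the current layer's attention in exchange for the augmented MLP input.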

The repository includes PyTorch code, training scripts, and configuration files used in our experiments.

## Project Structure

The project is organized into several main components:

- **Benchmark/**  
  Scripts and policies for benchmarking model inference and training, including All-Reduce and various policy modules.

- **Fast_Train/**  
  Experiments and architectures for fast pretraining with different model variants (e.g., FAL, FAL+, GQA, MHA, MoE).  
  See [Fast_Train/README.md](Fast_Train/README.md) for details and instructions.

- **Motivation_Analysis/**  
  Analysis scripts for ablation, gradient, and similarity studies on language models.

- **Train_from_Scratch/**  
  Scripts and modules for training models from scratch, including FAL and FAL+ variants.

## Key Features

- **FAL and FAL+ Attention**:  
  Integration and comparison of First Attentions Last (FAL) and FAL+ mechanisms across multiple architectures.

- **Benchmarking**:  
  Tools for distributed training and inference benchmarking, including All-Reduce strategies.

- **Analysis**:  
  Scripts for ablation studies, gradient analysis, and representational similarity (CKA) analysis.

- **Pre-commit Hooks**:  
  Code formatting and linting are enforced via pre-commit configuration files in each subproject.

## Getting Started

1. **Install Dependencies**  
   Make sure you have Python 3.8+, then install the required packages at the versions listed below:

   #### PyTorch and related libraries (CUDA 12.1)

   ```bash
   pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
   pip install torchmetrics==1.4.0.post0
   conda install datasets -y
   ```

   #### Colossal-AI and other required packages

   ```bash
   pip install wandb
   pip install triton==2.2.0
   pip install transformers==4.39.3
   pip install tqdm==4.66.4
   pip install colossalai==0.4.0
   pip install scipy==1.12.0
   ```

2. **Run Experiments**  
   - For fast pretraining experiments, see [Fast_Train/README.md](Fast_Train/README.md).
   - For training from scratch, use scripts in [Train_from_Scratch/](Train_from_Scratch/).

3. **Benchmarking**  
   Use the scripts in [Benchmark/Train/](Benchmark/Train/) and [Benchmark/Inference/](Benchmark/Inference/) to benchmark distributed training and inference.

4. **Analysis**  
   Explore the [Motivation_Analysis/](Motivation_Analysis/) folder for ablation and similarity analysis scripts.
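
After completing step 1, it can be useful to confirm that the installed packages match the pinned versions before launching an experiment. The following is a minimal sketch using only the standard library. The helper `check_versions` and the `PINNED` table are hypothetical names and not part of this repository; the versions are copied from the install commands above.

```python
from importlib import metadata

# Versions copied from the install commands in step 1.
PINNED = {
    "torch": "2.2.2",
    "torchvision": "0.17.2",
    "torchaudio": "2.2.2",
    "torchmetrics": "1.4.0.post0",
    "triton": "2.2.0",
    "transformers": "4.39.3",
    "tqdm": "4.66.4",
    "colossalai": "0.4.0",
    "scipy": "1.12.0",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: problem} for every missing or mismatched package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Wheels from the cu121 index carry a local suffix such as "+cu121";
        # drop it before comparing against the pinned version.
        if installed.split("+")[0] != expected:
            problems[name] = f"found {installed}, expected {expected}"
    return problems


if __name__ == "__main__":
    issues = check_versions(PINNED)
    for name, message in issues.items():
        print(f"{name}: {message}")
    if not issues:
        print("All pinned packages match.")
```

`wandb` and `datasets` are left out of the table because the install commands do not pin them.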


---

For more details, refer to the documentation and comments in each subfolder.