# Memory-Efficient Training with In-Place FFT Implementation

This repository is the official implementation of *Memory-Efficient Training with In-Place FFT Implementation*.

## Requirements

This project requires PyTorch compiled with the **same CUDA version as your local environment**.

To install this package in editable mode:

1. Make sure your system CUDA version matches the CUDA version your PyTorch build was compiled with.  
   For example, if your system has CUDA 12.1 installed:

    ```bash
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    ```
2. Then install this package:

    ```bash
    pip install --no-build-isolation -e .
    ```

❗ If you see an error like `RuntimeError: The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.1)`, reinstall PyTorch with the correct CUDA version as shown above.
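
Before installing, it can help to see which two versions are being compared. The following is a minimal sketch, not part of this repository: the helper names (`major_minor`, `system_cuda_version`, `torch_cuda_version`) are hypothetical, it assumes `nvcc` is on your `PATH`, and comparing on major and minor version is a conservative assumption rather than the exact rule PyTorch applies.

```python
import re
import shutil
import subprocess


def major_minor(version):
    """Return (major, minor) from a string such as '12.1' or '12.1.105', else None."""
    match = re.search(r"(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def system_cuda_version():
    """Ask nvcc for the toolkit release; returns None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None


def torch_cuda_version():
    """CUDA version PyTorch was built with; None for CPU-only builds or if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    system, built = system_cuda_version(), torch_cuda_version()
    print(f"system CUDA: {system}, PyTorch built with CUDA: {built}")
    if system and built and major_minor(system) != major_minor(built):
        print("Mismatch: reinstall PyTorch from the index URL matching your system CUDA.")
```

If the two versions differ, reinstalling PyTorch from the matching wheel index (step 1) is the fix described above.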

## Usage

To measure the memory usage of our proposed in-place FFT method, run:

```bash
python test/circulant_layer_rdfft.py
```

This script benchmarks the memory consumption step-by-step **only for our method**.

To test other methods (e.g., full fine-tuning, LoRA, standard FFT), please refer to the corresponding scripts under the `test/` directory or modify `circulant_layer_rdfft.py` accordingly.
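
When adapting a script to another method, peak memory for a single training step can be measured with PyTorch's built-in CUDA memory statistics. The sketch below is an illustration under assumptions and may not match how the scripts in `test/` take their measurements: `peak_memory_mb` is a hypothetical helper, the `torch.nn.Linear` layer is only a stand-in for the layer under test, and bytes are converted with 1 MB = 2^20 bytes.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def peak_memory_mb(step_fn):
    """Run step_fn once and return the extra peak GPU memory it needed, in MB.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    baseline = torch.cuda.memory_allocated()  # memory already held before the step
    step_fn()
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()  # high-water mark since the reset
    return (peak - baseline) / 2**20


if __name__ == "__main__":
    if torch is not None and torch.cuda.is_available():
        # Stand-in workload with D=1024, B=16 (an assumption for illustration).
        layer = torch.nn.Linear(1024, 1024).cuda()
        x = torch.randn(16, 1024, device="cuda")

        def step():
            layer(x).sum().backward()

        print(f"peak extra memory: {peak_memory_mb(step):.2f} MB")
    else:
        print("CUDA not available; nothing measured.")
```

Subtracting the baseline reports only the memory added by the step, so weights and inputs allocated beforehand are excluded. Whether that matches the accounting used for the table in Results is not confirmed here.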


## Results

We report the peak GPU memory usage (in MB) for different fine-tuning methods under various settings of hidden dimension `D` and batch size `B`. For each entry, the value in parentheses is the reduction factor relative to full fine-tuning at the same `D` and `B`. A factor below ×1 means the method uses more memory than full fine-tuning.

| Method      | D=4096, B=1 | D=4096, B=16 | D=4096, B=256 | D=1024, B=1 | D=1024, B=16 | D=1024, B=256 |
|-------------|-------------|--------------|---------------|-------------|--------------|----------------|
| full-finetune | 144.33     | 145.50      | 164.25        | 24.27       | 24.56       | 29.25          |
| lora          | 20.31 (×7.11) | 21.25 (×6.85) | 39.38 (×4.17) | 16.77 (×1.45) | 17.00 (×1.44) | 21.69 (×1.35) |
| fftp=128      | 3.65 (×39.55) | 35.88 (×4.06) | 551.50 (×0.30) | 0.25 (×95.22) | 2.66 (×9.22) | 41.22 (×0.71) |
| rfftp=128     | 3.14 (×45.93) | 35.14 (×4.14) | 547.13 (×0.30) | 0.22 (×111.20) | 2.53 (×9.72) | 40.30 (×0.73) |
| **ours p=128** | **1.06 (×135.78)** | **2.00 (×72.73)** | **20.50 (×8.01)** | **0.08 (×308.73)** | **0.34 (×71.35)** | **5.03 (×5.81)** |
| fftp=256      | 1.89 (×76.24) | 19.03 (×7.65) | 293.25 (×0.56) | 0.15 (×166.80) | 1.61 (×15.24) | 25.08 (×1.17) |
| rfftp=256     | 1.62 (×89.17) | 18.35 (×7.93) | 286.06 (×0.57) | 0.12 (×194.92) | 1.48 (×16.63) | 24.02 (×1.22) |
| **ours p=256** | **0.56 (×256.36)** | **1.50 (×96.97)** | **20.25 (×8.11)** | **0.05 (×512.42)** | **0.33 (×74.74)** | **5.02 (×5.83)** |
| fftp=512      | 1.02 (×141.97) | 10.63 (×13.68) | 164.50 (×1.00) | 0.09 (×267.23) | 1.09 (×22.60) | 17.03 (×1.72) |
| rfftp=512     | 0.86 (×167.28) | 10.03 (×14.51) | 156.66 (×1.05) | 0.08 (×312.61) | 0.96 (×25.68) | 15.05 (×1.94) |
| **ours p=512** | **0.31 (×461.14)** | **1.38 (×105.78)** | **20.13 (×8.16)** | **0.03 (×764.69)** | **0.32 (×76.56)** | **5.01 (×5.84)** |
| fftp=1024     | 0.58 (×249.23) | 6.44 (×22.59) | 100.22 (×1.64) | 0.06 (×382.35) | 0.83 (×29.76) | 13.01 (×2.25) |
| rfftp=1024    | 0.49 (×295.88) | 5.88 (×24.73) | 92.24 (×1.78) | 0.05 (×447.79) | 0.70 (×35.15) | 11.02 (×2.65) |
| **ours p=1024** | **0.19 (×767.76)** | **1.31 (×110.82)** | **20.06 (×8.19)** | **0.02 (×1,014.39)** | **0.32 (×77.50)** | **5.00 (×5.84)** |
| fftp=4096     | 0.25 (×575.07) | 3.30 (×44.12) | 52.05 (×3.16) | N/A           | N/A           | N/A             |
| rfftp=4096    | 0.21 (×698.79) | 2.78 (×52.25) | 44.04 (×3.73) | N/A           | N/A           | N/A             |
| **ours p=4096** | **0.09 (×1,531.54)** | **1.27 (×114.92)** | **20.02 (×8.21)** | **N/A**       | **N/A**         | **N/A**           |

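Our method achieves the lowest memory footprint across all tested configurations, including the large-batch settings (`B=256`) where the FFT-based baselines with small `p` use more memory than full fine-tuning.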
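
The factors in parentheses can be reproduced as the full fine-tuning memory divided by the method's memory in the same column. The sketch below uses a hypothetical helper, `reduction_factor`, with values copied from the table. For very small entries, recomputing from the rounded values shown gives a slightly different factor than the table reports, presumably because the reported factors were computed from unrounded measurements.

```python
def reduction_factor(full_mb, method_mb):
    """How many times smaller method_mb is than full_mb (below 1 means larger)."""
    return full_mb / method_mb


# Values copied from the table above (peak memory in MB).
print(f"{reduction_factor(144.33, 20.31):.2f}")   # lora, D=4096, B=1        -> 7.11
print(f"{reduction_factor(164.25, 551.50):.2f}")  # fftp=128, D=4096, B=256  -> 0.30
print(f"{reduction_factor(164.25, 20.50):.2f}")   # ours p=128, D=4096, B=256 -> 8.01
```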





