This folder contains code to benchmark the running speed (achieved TFLOPS) of our model, Mamba, and the original test-time training (TTT) layer.
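For readers unfamiliar with the metric, here is a minimal sketch of how an "achieved TFLOPS" number is typically derived from a FLOP count and a measured runtime. It is not taken from the benchmark scripts, which may count FLOPs differently. The helper names `matmul_flops` and `achieved_tflops` and the example timing are hypothetical.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def achieved_tflops(total_flops: float, elapsed_seconds: float) -> float:
    """Convert a FLOP count and a wall-clock time into TFLOPS (1e12 FLOP/s)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_flops / elapsed_seconds / 1e12


# Hypothetical example: a 4096 x 4096 x 4096 matmul measured at 1.5 ms.
flops = matmul_flops(4096, 4096, 4096)
print(f"{achieved_tflops(flops, 1.5e-3):.1f} TFLOPS")
```

On a GPU the elapsed time should be measured after synchronizing the device, otherwise the timer only captures the kernel launch.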

## How to run

To measure the speed of our model on your hardware:
```bash
python3 test_speed_lact_bidirectional_pytorch.py  # benchmarks our PyTorch implementation
python3 test_speed_lact_bidirectional_triton_fused.py  # benchmarks our slightly faster implementation with Triton kernel fusion
```

To measure the speed of Mamba on your hardware:
```bash
python3 test_speed_mamba.py
```

To test the speed of the original TTT implementation with inference kernels:

```bash
cd original_ttt_speed/
python3 test_ttt_linear.py  # original TTT-Linear

# TTT-MLP: this kernel only runs on H100 and only supports
# head-dim=64 with a minibatch size of 64
python3 test_ttt_mlp.py
```


## Tuning the shape params
The shape parameters are annotated in the code: see line 217 of `test_speed_lact_bidirectional_pytorch.py` and line 254 of `test_speed_lact_bidirectional_triton_fused.py`. The comments there explain how to tune these parameters and list the values we commonly use.