# Vision-LSTM (ViL)

PyTorch implementation and pre-trained models of Vision-LSTM (ViL), an adaptation of xLSTM to computer vision.


## License

This project is licensed under the MIT License, except for the following folders/files,
which are licensed under the AGPL-3.0 license:
- src/vislstm/modules/xlstm
- vision_lstm/vision_lstm.py
- vision_lstm/vision_lstm2.py

# Get started

This codebase offers both a lightweight "architecture-only" implementation of Vision-LSTM and
a full training pipeline.

## Architecture only
The package vision_lstm provides a standalone implementation in the style of [timm](https://github.com/huggingface/pytorch-image-models).


# Version1 pre-trained models

In the first iteration of ViL, models were trained with (i) bilateral_avg pooling instead of bilateral_concat,
(ii) causal conv1d instead of conv2d before q and k, (iii) no biases in projections and layernorms, and (iv) 224 resolution
for the whole training process instead of pre-training at 192 resolution followed by a short fine-tuning at 224
resolution. The updated settings improve the ImageNet-1K accuracy of a ViL-T from 77.3% to 78.3%. See Appendix A in the paper
for more details. We recommend using VisionLSTM2 instead of VisionLSTM, but we keep support for the initial version as-is.
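
As a rough illustration of the pooling difference in (i), the sketch below assumes that bilateral_avg averages the first and last token of the sequence, while bilateral_concat concatenates them. This reading, the helper names, and the toy values are assumptions made for illustration and are not taken from the repository's code. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
# Hypothetical helpers: a token sequence is a list of tokens,
# and each token is a list of floats of length `dim`.

def bilateral_avg(tokens):
    """Average the first and last token; output width is dim."""
    first, last = tokens[0], tokens[-1]
    return [(a + b) / 2 for a, b in zip(first, last)]


def bilateral_concat(tokens):
    """Concatenate the first and last token; output width is 2 * dim."""
    return list(tokens[0]) + list(tokens[-1])


# Three tokens of dimension 2 (made-up values).
tokens = [[1.0, 2.0], [5.0, 5.0], [3.0, 6.0]]
avg_pooled = bilateral_avg(tokens)        # [2.0, 4.0]
concat_pooled = bilateral_concat(tokens)  # [1.0, 2.0, 3.0, 6.0]
```

Under this assumption, concatenation doubles the feature width that the classification head receives, whereas averaging keeps it at the token dimension.
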
Pre-trained models of the first iteration can be loaded as follows:


