Cluster-Masked Scanning and Pretraining for Enhanced xLSTM Vision Performance

ICLR 2026 Conference Submission 16949 Authors

19 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: LSTM, Cluster-Masked, Autoregressive Pretraining, Visual Tasks, Visual Pretraining
TL;DR: MAL is a framework that enhances xLSTM's visual representation learning through cluster-masked pretraining and cluster scanning strategies, significantly outperforming traditional supervised models on a range of visual tasks.
Abstract: While modern recurrent architectures like xLSTM show promise for vision tasks, their potential has been hindered by the challenge of effectively applying autoregressive pretraining, a cornerstone of NLP success, to 2D image data. This paper introduces MAL, a framework that unlocks autoregressive learning for vision-oriented xLSTMs. Our core innovation is a cluster-masked pretraining strategy, which reorganizes an image into a sequence of semantically meaningful local clusters. This approach creates a more structured input sequence uniquely suited to xLSTM's memory mechanisms. By combining this with a cluster scanning strategy that defines an optimal processing order, MAL learns powerful visual representations by predicting entire image regions autoregressively. Our experiments show that this pretraining scheme allows MAL to significantly outperform traditional supervised models, fully leveraging the scaling potential of xLSTM and setting a new performance benchmark.
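The abstract does not specify how the cluster-masked scan is implemented, so the following is only a minimal NumPy sketch of the general idea: patches are grouped into small spatial clusters, the 1D sequence is reordered so each cluster's patches are contiguous, and whole clusters are masked out as autoregressive prediction targets. The function names (patchify, cluster_scan, cluster_masked_targets) and the fixed grid-window clustering are assumptions for illustration, not the authors' method.

```python
import numpy as np

def patchify(image, patch):
    # Split an (H, W, C) image into non-overlapping (patch x patch) patches,
    # returned in raster order as a (num_patches, patch*patch*C) array.
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1), (gh, gw)

def cluster_scan(grid_hw, cluster=2):
    # Hypothetical cluster scan: walk the patch grid in (cluster x cluster)
    # windows and emit patch indices window by window, so spatially adjacent
    # patches stay contiguous in the 1D sequence fed to the xLSTM.
    gh, gw = grid_hw
    order = []
    for cy in range(0, gh, cluster):
        for cx in range(0, gw, cluster):
            for y in range(cy, min(cy + cluster, gh)):
                for x in range(cx, min(cx + cluster, gw)):
                    order.append(y * gw + x)
    return np.array(order)

def cluster_masked_targets(patches, order, cluster=2, mask_ratio=0.5, seed=0):
    # Mask whole clusters (rather than individual patches) and return the
    # visible clusters plus the masked clusters as prediction targets.
    rng = np.random.default_rng(seed)
    seq = patches[order]                          # patches in cluster-scan order
    n_clusters = len(order) // (cluster * cluster)
    clusters = seq.reshape(n_clusters, cluster * cluster, -1)
    masked = rng.random(n_clusters) < mask_ratio  # boolean mask over clusters
    return clusters[~masked], clusters[masked], masked

image = np.random.rand(32, 32, 3).astype(np.float32)
patches, grid = patchify(image, patch=4)          # 8x8 grid of 4x4 patches
order = cluster_scan(grid, cluster=2)             # 16 clusters of 4 patches
visible, targets, mask = cluster_masked_targets(patches, order)
print(visible.shape, targets.shape, int(mask.sum()), "clusters masked")
```

In an actual pretraining loop, the visible clusters would be encoded by the xLSTM in scan order and the model trained to predict each masked cluster's patches autoregressively; the choice of clustering (semantic vs. the fixed grid used here) is where MAL's strategy would differ from this sketch.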
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16949