Keywords: Interpretation, Robustness, Large Language Models, State-space models
Abstract: We introduce \textit{\textbf{Baleen}}, a family of state space models that unifies \textbf{stochastic selection} with the \textbf{information bottleneck} principle to build interpretable and robust long‑context learners. Unlike the deterministic gates of Mamba/Mamba2, Baleen treats selection as a random variable and regularizes it with a closed‑form KL divergence to a sparsity prior: (i) \textit{\textbf{\bib}} samples Bernoulli state‑transition gates; (ii) \textit{\textbf{\gib}} samples Exponential time intervals. This yields an explicit trade‑off between retention and compression and exposes token‑level selection heatmaps at inference time for self‑interpretation. On language benchmarks, \ib improves average accuracy over Mamba2 by +0.95 points at 370M‑parameter pretraining and +1.38 points at 7B finetuning. Baleen is also more robust to localized perturbations and adversarial attacks: under CIFAR‑10 sequence perturbation, prefix‑perturbation damage falls to 0.6\% versus 26.5\% for Mamba2 (average accuracy under attacks: 0.542 vs. 0.385). Finally, Baleen’s self‑interpretations outperform Integrated Gradients and Grad‑CAM in average fidelity across four text classification tasks. We will release our Baleen‑7B models on Hugging Face together with code, checkpoints, and an interactive selection‑heatmap demo.
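As a minimal sketch of the Bernoulli‑gated selection described in the abstract: the closed‑form KL between a posterior Bernoulli($p$) gate and a Bernoulli($\rho$) sparsity prior is $p\log\tfrac{p}{\rho} + (1-p)\log\tfrac{1-p}{1-\rho}$. The snippet below assumes a straight‑through estimator and a fixed prior rate; the function name, estimator choice, and prior rate are illustrative assumptions, not Baleen's released implementation.

```python
import torch

def bernoulli_gate_with_kl(logits: torch.Tensor, prior_rate: float = 0.1):
    """Sample per-token Bernoulli selection gates and return the closed-form
    KL divergence to a Bernoulli(prior_rate) sparsity prior.

    `logits` has shape (batch, seq_len). The straight-through estimator and
    the prior rate are illustrative, not Baleen's actual design choices.
    """
    p = torch.sigmoid(logits)                      # posterior gate probabilities
    # Straight-through sampling: hard 0/1 gates forward, sigmoid gradient backward.
    hard = torch.bernoulli(p)
    gates = hard + p - p.detach()

    # Closed-form KL( Bernoulli(p) || Bernoulli(rho) ), averaged over tokens.
    rho = torch.full_like(p, prior_rate)
    kl = p * (torch.log(p + 1e-8) - torch.log(rho)) \
         + (1 - p) * (torch.log(1 - p + 1e-8) - torch.log(1 - rho))
    return gates, kl.mean()
```

In such a setup, the sampled gates would modulate the state transitions, the KL term would be added to the training loss to trade retention against compression, and the gate probabilities `p` could be rendered as the token‑level selection heatmaps the abstract refers to.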
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15041