Keywords: Interpretation, Robustness, Large Language Models, State-space models
Abstract: We introduce \textit{\textbf{Baleen}}, a family of state space models that unifies \textbf{stochastic selection} with the \textbf{information bottleneck} principle to build interpretable and robust long‑context learners. Unlike the deterministic gates of Mamba/Mamba2, Baleen treats selection as a random variable and regularizes it with a closed‑form KL divergence to a sparsity prior: (i) \textit{\textbf{\bib}} samples Bernoulli state‑transition gates; (ii) \textit{\textbf{\gib}} samples Exponential time intervals. This yields an explicit trade‑off between retention and compression and exposes token‑level selection heatmaps at inference time for self‑interpretation. On language benchmarks, \ib improves average accuracy over Mamba2 by +0.95 points at 370M‑parameter pretraining and +1.38 points at 7B finetuning. Baleen is also more robust to localized perturbations and adversarial attacks: under CIFAR‑10 sequence perturbation, prefix‑perturbation damage falls to 0.6\% versus 26.5\% for Mamba2 (average accuracy under attacks: 0.542 vs. 0.385). Finally, Baleen’s self‑interpretations outperform Integrated Gradients and Grad‑CAM in average fidelity across four text classification tasks. We will release our Baleen‑7B models on Hugging Face together with code, checkpoints, and an interactive selection‑heatmap demo.
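As a minimal sketch of the Bernoulli‑gated selection described in the abstract: the closed‑form KL between a posterior Bernoulli($p$) gate and a Bernoulli($\rho$) sparsity prior is $p\log\tfrac{p}{\rho} + (1-p)\log\tfrac{1-p}{1-\rho}$. The snippet below assumes a straight‑through estimator and a fixed prior rate; the function name, estimator choice, and prior rate are illustrative assumptions, not Baleen's released implementation.

```python
import torch

def bernoulli_gate_with_kl(logits: torch.Tensor, prior_rate: float = 0.1):
    """Sample per-token Bernoulli selection gates and return the closed-form
    KL divergence to a Bernoulli(prior_rate) sparsity prior.

    `logits` has shape (batch, seq_len). The straight-through estimator and
    the prior rate are illustrative, not Baleen's actual design choices.
    """
    p = torch.sigmoid(logits)                      # posterior gate probabilities
    # Straight-through sampling: hard 0/1 gates forward, sigmoid gradient backward.
    hard = torch.bernoulli(p)
    gates = hard + p - p.detach()

    # Closed-form KL( Bernoulli(p) || Bernoulli(rho) ), averaged over tokens.
    rho = torch.full_like(p, prior_rate)
    kl = p * (torch.log(p + 1e-8) - torch.log(rho)) \
         + (1 - p) * (torch.log(1 - p + 1e-8) - torch.log(1 - rho))
    return gates, kl.mean()
```

In such a setup, the sampled gates would modulate the state transitions, the KL term would be added to the training loss to trade retention against compression, and the gate probabilities `p` could be rendered as the token‑level selection heatmaps the abstract refers to.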
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15041