Architectural and Inferential Inductive Biases for Exchangeable Sequence Modeling

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Exchangeability, Inductive Bias, Epistemic uncertainty, Multi-step inference, Transformer architecture
TL;DR: We investigate autoregressive models for exchangeable sequences in decision-making, showing that multi-step inference improves downstream decisions and that standard causal architectures outperform existing custom ones.
Abstract: Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences (observations that are i.i.d. when conditioned on some latent factor), enabling uncertainty to be modeled directly through missing data rather than through an explicit latent parameter. Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: its inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of work in Bayesian statistics advocates multi-step autoregressive generation; we demonstrate that this "correct" approach enables superior uncertainty quantification, which translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap in recently proposed Transformer architectures for exchangeable sequences (Müller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024) and prove that, despite introducing significant computational overhead, they cannot in fact guarantee exchangeability. Through empirical evaluation, we find that these custom architectures can significantly underperform standard causal masking, highlighting the need for new architectural innovations in Transformer-based modeling of exchangeable sequences.
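To make the single-step vs. multi-step distinction concrete, below is a minimal Python sketch that uses an analytically tractable Beta-Bernoulli exchangeable sequence as a stand-in for a learned autoregressive model (the toy model, function names, and parameters are illustrative assumptions, not the paper's implementation). The single-step predictive returns one probability that mixes epistemic and aleatoric uncertainty, whereas autoregressively imputing a long future and reading off its empirical frequency approximates a posterior draw of the latent success rate, so the spread across many imputations isolates epistemic uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_predictive(context, a0=1.0, b0=1.0):
    """Posterior-predictive P(next = 1 | context) for a Beta-Bernoulli
    exchangeable sequence (stand-in for a learned autoregressive model)."""
    s = sum(context)
    return (a0 + s) / (a0 + b0 + len(context))

def multi_step_epistemic(context, horizon=500, n_samples=200):
    """Multi-step inference: autoregressively impute `horizon` future
    observations, then read off the long-run frequency of each imputed
    sequence. The spread across imputations reflects epistemic uncertainty."""
    estimates = []
    for _ in range(n_samples):
        seq = list(context)
        for _ in range(horizon):
            p = one_step_predictive(seq)
            seq.append(int(rng.random() < p))
        estimates.append(np.mean(seq[len(context):]))
    return float(np.mean(estimates)), float(np.std(estimates))

context = [1, 0, 1, 1, 0]                 # observed exchangeable data
p_single = one_step_predictive(context)   # single number; mixes epistemic
                                          # and aleatoric uncertainty
mean_theta, epistemic_sd = multi_step_epistemic(context)
print(f"single-step predictive    : {p_single:.3f}")
print(f"multi-step posterior mean : {mean_theta:.3f} +/- {epistemic_sd:.3f}")
```

As the observed context grows, the multi-step spread shrinks while the single-step predictive stays a single undifferentiated probability, which is the gap the abstract's downstream decision-making experiments exploit.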
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 25099