Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

ICLR 2026 Conference Submission 14637 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: adversarial robustness, inference-compute scaling, VLMs, efficiency
TL;DR: Adversarial pretraining makes test-time compute scaling a better defense.
Abstract: Models remain susceptible to adversarially out-of-distribution (OOD) data despite large training-compute and research investments in their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing that LLM reasoning helps models adhere to top-level specifications designed to thwart attacks, yielding a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of inference compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute scaling can offer benefits even in such cases. We argue that compositional generalization, through which OOD data is understood via its in-distribution (ID) components, fuels successful application of defensive specifications to adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as attacked data's contents become more in-distribution. We empirically support this hypothesis across various vision language models and attack types, finding that robustness gains from test-time compute are present as long as specification following on OOD data is enabled by compositional generalization, while RL finetuning and long reasoning traces are not critical. For example, adding test-time defensive specifications to a VLM robustified via adversarial pretraining lowers the success rate of gradient-based multimodal attacks, but the same intervention provides no such benefit to non-robustified models. This correlation of inference compute's robustness benefit with base-model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding the compositional generalization needed to handle OOD data. Accordingly, we argue for layering train-time and test-time defenses to obtain their synergistic benefit.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14637