Keywords: adversarial robustness, inference-compute scaling, VLMs, efficiency
TL;DR: Adversarial pretraining makes test-time compute scaling a better defense.
Abstract: Test-time reasoning has raised benchmark performances and even shown promise in addressing the historically intractable problem of making models robust to adversarially out-of-distribution (OOD) data. Indeed, recent work used reasoning to aid satisfaction of model specifications designed to thwart attacks, finding a striking correlation between LLM reasoning effort and robustness to jailbreaks.
However, this benefit fades when stronger (e.g. gradient-based or multimodal) attacks are used.
This may be expected as models often can't follow instructions on the adversarially OOD data created by such attacks, and instruction following is needed to act in accordance with the attacker-thwarting spec.
Thus, we hypothesize that the test-time robustness benefits of specs are unlocked by initial robustness sufficient to follow instructions on OOD data.
Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the components of attacked data.
Guided by the RICH, we test models of varying initial-robustness levels, finding inference-compute adds robustness even to white-box multimodal attacks, provided the model has sufficient initial robustness.
Further evidencing a rich-get-richer dynamic, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but such scaling adds significant robustness if we first robustify its vision encoder (creating the first adversarially robust reasoning VLM in the process).
Robustifying models makes attacked components of data more in-distribution (ID), and the RICH suggests this fuels compositional generalization -- understanding OOD data via its ID components -- to following spec instructions on adversarial data.
Consistently, we find test-time defenses both build and depend on train-time data and defenses.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14637
Loading