Position - The Rashomon Attack Surface (RAS): Navigating Predictive Multiplicity to Route Around AI Safety
Track: Position paper track
Published Or Accepted: false
Keywords: Rashomon effect, predictive multiplicity, AI safety, adversarial machine learning, large language models, decoding and routing policies, selective jailbreaks, orchestration/agents, multiplicity-aware risk (MAR), consensus-gating, counterfactual-veto
TL;DR: We argue that predictive multiplicity is an attack surface and propose a practical attacker loop (fingerprint → hop → selective) with set-aware evaluation (MAR, transfer gap, hop gain) and lightweight defenses (consensus-gating, counterfactual-veto).
Abstract: Modern AI deployments seldom run a single decision pathway; production stacks realize many near-equivalent routes (model + decoding/routing/tools) that meet the same quality bar yet differ on edge cases. Safety evaluations and red-teaming, however, often assume one fixed route, hiding risk and obscuring attack/defense levers. We present a deployment-realistic attacker loop, consisting of Fingerprint (few probes), Hop (small, user-plausible retries or decoding/routing tweaks), and Selective jailbreak (target weaker neighbors), and a set-aware evaluation checklist: report set-level risk, measure neighbor differences, and probe steerability. We also propose two practical defenses, Consensus-Gating and Counterfactual-Veto, that reason over counterfactual neighbors with modest overhead. This reframing aligns evaluation with how systems are actually used and helps explain selective, only partially transferable jailbreaks reported in recent studies.
Submission Number: 14