Position - The Rashomon Attack Surface (RAS): Navigating Predictive Multiplicity to Route Around AI Safety

Published: 24 Dec 2025, Last Modified: 24 Dec 2025, MURE Workshop Poster, CC BY 4.0
Track: Position paper track
Published Or Accepted: false
Keywords: Rashomon effect, predictive multiplicity, AI safety, adversarial machine learning, large language models, decoding and routing policies, selective jailbreaks, orchestration/agents, multiplicity-aware risk (MAR), consensus-gating, counterfactual-veto
TL;DR: We argue that predictive multiplicity is an attack surface and propose a practical attacker loop (fingerprint → hop → selective) with set-aware evaluation (MAR, transfer gap, hop gain) and lightweight defenses (consensus-gating, counterfactual-veto).
Abstract: Modern AI deployments seldom run a single decision pathway; production stacks realize many near-equivalent routes (model + decoding/routing/tools) that meet the same quality bar yet differ on edge cases. Safety evaluations and red-teaming, however, often assume one fixed route, hiding risk and obscuring attack/defense levers. We present a deployment-realistic attacker loop of Fingerprint (few probes), Hop (small, user-plausible retries or decoding/routing tweaks), and Selective jailbreak (target weaker neighbors), together with a set-aware evaluation checklist: report set-level risk, measure neighbor differences, and probe steerability. We also propose two practical defenses, Consensus-Gating and Counterfactual-Veto, that reason over counterfactual neighbors with modest overhead. This reframing aligns evaluation with how systems are actually used and helps explain selective, only-partially-transferable jailbreaks reported in recent studies.
Submission Number: 14