Abstract: Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role, a concept we call *role separation*, is crucial for consistent multi-role behavior. Although recent work often focuses on state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine *role-separation learning*: the process of teaching LLMs to robustly distinguish system and user tokens. Through a *simple, controlled experimental framework*, we find that fine-tuned models often rely on two proxies for role identification: (1) exploiting the task type, and (2) relying on proximity to the begin-of-text token. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing *invariant signals* that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, modifying position IDs helps the model learn clearer distinctions and reduces its reliance on superficial proxies. Through this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
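The abstract does not spell out how the position IDs are adjusted, so the sketch below is only a hedged illustration of the general idea: insert a fixed gap into the position IDs at the system/user boundary so that the role boundary is marked by an input-level signal rather than by prompt content. The gap size (`GAP`) and the choice to shift only the user segment are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch of a position-ID adjustment that inserts an invariant "gap"
# between system and user tokens. This illustrates the general idea described
# in the abstract, NOT the paper's exact scheme: the offset size (GAP) and the
# decision to shift only the user segment are assumptions for this example.
import torch

GAP = 256  # hypothetical constant offset marking the system/user boundary


def build_position_ids(system_len: int, user_len: int) -> torch.Tensor:
    """Return position IDs where user tokens are shifted by a fixed gap,
    so the role boundary is encoded independently of prompt content."""
    sys_pos = torch.arange(0, system_len)
    # User tokens continue counting, but offset by GAP past the system segment.
    usr_pos = torch.arange(system_len + GAP, system_len + GAP + user_len)
    return torch.cat([sys_pos, usr_pos]).unsqueeze(0)  # shape: (1, seq_len)


# Example: 12 system tokens followed by 20 user tokens.
position_ids = build_position_ids(system_len=12, user_len=20)

# With a Hugging Face causal LM that accepts explicit positions (e.g., a
# LLaMA-style model), the adjusted positions could be passed alongside the
# usual inputs during fine-tuning:
#   outputs = model(input_ids=input_ids,
#                   attention_mask=attention_mask,
#                   position_ids=position_ids,
#                   labels=labels)
print(position_ids)
```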
Lay Summary: Modern AI systems like ChatGPT handle multiple types of instructions simultaneously: system rules that define their behavior, user questions, and information from external tools. For these systems to work safely and reliably, they must clearly distinguish between trusted instructions (from the system) and potentially untrusted input (from users)—much like how a bank teller must distinguish between their manager's policies and customer requests.
While existing training methods achieve impressive performance on standard tests, we discovered they succeed for the wrong reasons. Instead of truly learning to identify different roles, AI models rely on unreliable shortcuts: they assume certain types of tasks are always instructions, or they simply follow whatever text appears first in their input. This creates serious security risks—imagine if that bank teller started following customer instructions to override bank policies just because the customer mentioned "account management."
We tested this by training AI models on safe examples, then evaluating them on tricky situations they hadn't seen before. The models consistently failed, revealing they hadn't learned genuine role separation but had simply memorized patterns from their training.
Traditional solutions like showing the AI more varied examples only create temporary fixes—new shortcuts inevitably emerge. Instead, we developed a technique that strengthens the fundamental signals that should distinguish different roles. By modifying how the AI processes the position of different text segments, we help it learn clearer distinctions between trusted and untrusted inputs.
This approach significantly improves AI reliability and security without compromising performance on normal tasks—a crucial advancement as AI systems are deployed in sensitive applications like healthcare and finance.
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: LLM; role-separation; shortcut learning; position encoding
Submission Number: 14427