Mechanistic Analysis of Demonstration Conflicts in In-Context Learning

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: In-Context Learning, Rule Inference, Conflict Resolution, Mechanistic Interpretability
Abstract: In-context learning (ICL) enables large language models (LLMs) to perform novel tasks from few-shot demonstrations. However, demonstrations can naturally contain noise and conflicting examples, making this capability vulnerable. To understand how models process such conflicts, we study demonstration-dependent tasks in which models must infer an underlying pattern, a setting we characterize as rule inference. We find that a single corrupted demonstration causes substantial performance degradation, with the corrupted rule accounting for 80\% of wrong predictions despite appearing in only one position. This systematic misleading effect motivates our investigation of how models process conflicting evidence internally. Using linear probes and logit lens analysis, we find that models encode both the correct and the corrupted rule in early layers but develop prediction confidence only in late layers, revealing a two-phase computational structure. We identify attention heads in each phase that underlie these reasoning failures: Vulnerability Heads in early-to-middle layers exhibit positional attention bias and high sensitivity to corruption, while Susceptible Heads in late layers sharply reduce support for the correct prediction when exposed to a single piece of corrupted evidence. Targeted ablation validates these findings: masking the identified heads improves performance by over 10\%. Our work establishes a mechanistic foundation for understanding how LLMs process conflicting evidence during in-context rule learning.
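The abstract describes two analysis steps, a logit-lens readout across layers and targeted ablation of identified attention heads. The following is a minimal sketch of how such an analysis can be set up with Hugging Face `transformers`; it is not the authors' code. GPT-2 as the model, the few-shot prompt, and the head indices are stand-in assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation): logit-lens readout of a
# candidate rule continuation across layers, then ablation of hypothetical
# "susceptible" heads via the head_mask argument of GPT-2's forward pass.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical rule-inference prompt ("reverse the word"), with the second
# demonstration corrupted to conflict with the rule.
prompt = (
    "cat -> tac\n"
    "dog -> dog\n"   # corrupted demonstration
    "sun -> nus\n"
    "map ->"
)
inputs = tokenizer(prompt, return_tensors="pt")
# First subword token of the correct continuation, used as a probe target.
correct_id = tokenizer(" pam", add_special_tokens=False).input_ids[0]

# --- Logit lens: decode the final-position hidden state at every layer. ---
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
ln_f, lm_head = model.transformer.ln_f, model.lm_head
for layer, h in enumerate(out.hidden_states):
    logits = lm_head(ln_f(h[:, -1, :]))           # project through the unembedding
    prob = logits.softmax(-1)[0, correct_id].item()
    print(f"layer {layer:2d}: p(correct continuation) = {prob:.4f}")

# --- Targeted ablation: mask hypothetical late-layer heads. ---
# head_mask has shape (num_layers, num_heads); 1.0 keeps a head, 0.0 ablates it.
n_layers, n_heads = model.config.n_layer, model.config.n_head
head_mask = torch.ones(n_layers, n_heads)
for layer, head in [(10, 3), (11, 7)]:            # hypothetical head indices
    head_mask[layer, head] = 0.0
with torch.no_grad():
    ablated = model(**inputs, head_mask=head_mask)
p_ablated = ablated.logits[0, -1].softmax(-1)[correct_id].item()
print(f"p(correct continuation) with heads masked: {p_ablated:.4f}")
```

In this sketch, the per-layer probabilities would trace when confidence in the correct continuation emerges, and comparing the masked and unmasked runs would show whether the ablated heads were suppressing it, mirroring the two-phase structure and head-masking intervention described in the abstract.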
Primary Area: interpretability and explainable AI
Submission Number: 7964