Keywords: In-Context Learning, Demonstration Selection
Abstract: Many studies show that not all demonstrations help in-context learning (ICL), which limits performance. In this paper, we therefore use gradient flow to analyze why demonstrations become ineffective. By setting the gradient flow to zero, we identify two cases of ineffectiveness: either the model has already learned the information, or the information is irrelevant to the query. We further prove that in a multi-layer attention model, disparities in effectiveness are amplified with depth, directing attention toward effective demonstrations. Building on this analysis, we propose GradS, which selects demonstrations via gradient-flow signals and explicitly accounts for information the model has already assimilated. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experiments confirm that the disparity in demonstration effectiveness is magnified as model depth increases, substantiating our derivations. Moreover, GradS achieves an average relative improvement of $1.3\%$ over the strongest baselines, setting new SOTA results in demonstration selection.
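The selection idea described in the abstract can be illustrated with a minimal sketch. The function and score values below are hypothetical (the paper's actual GradS scoring is not given here); the sketch only shows the high-level recipe: score each candidate demonstration by a gradient-flow signal, discard near-zero scores (which the abstract associates with information already learned by the model or irrelevant to the query), and keep the strongest remaining candidates.

```python
def select_demonstrations(scores, k, eps=1e-6):
    """Hypothetical GradS-style selector.

    scores: dict mapping demonstration id -> gradient-signal magnitude.
    Returns the ids of the k demonstrations with the largest signals,
    after dropping near-zero (ineffective) ones.
    """
    # Drop demonstrations whose gradient signal is (near) zero:
    # per the abstract, these are already learned or irrelevant.
    effective = {d: s for d, s in scores.items() if abs(s) > eps}
    # Rank the remainder by signal strength and keep the k strongest.
    ranked = sorted(effective, key=effective.get, reverse=True)
    return ranked[:k]

# Toy usage with made-up scores.
scores = {"demo_a": 0.9, "demo_b": 0.0, "demo_c": 0.4, "demo_d": 0.7}
print(select_demonstrations(scores, k=2))  # -> ['demo_a', 'demo_d']
```

In practice the scores would come from the gradient-flow analysis the paper derives; the thresholding step reflects the zero-gradient-flow condition that characterizes ineffective demonstrations.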
Paper Type: Long
Research Area: Language Models
Research Area Keywords: few-shot QA, prompting
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1512