Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Published: 24 Jun 2024, Last Modified: 24 Jun 2024
ICML 2024 MI Workshop Spotlight
License: CC BY 4.0
Keywords: Language Models, In-context learning, Universality, Causal Analysis
TL;DR: We demonstrate that language models solve retrieval tasks using a universal modular internal task decomposition, which can be leveraged to mitigate prompt injection.
Abstract: When solving challenging problems, language models (LMs) are able to identify relevant information in long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning text understanding to coding. We apply causal analysis on ORION to 18 open-source language models ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. Building on this high-level understanding, we demonstrate a proof-of-concept application for scalable internal oversight of LMs that mitigates prompt injection while requiring human supervision on only a single input.
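The causal analysis described in the abstract rests on intervening on the residual stream at the last token position. Below is a minimal sketch of that kind of intervention, written with GPT-2 via Hugging Face transformers and plain PyTorch forward hooks; the layer index, prompts, and helper names are illustrative assumptions, not the paper's exact setup or the ORION tasks.

```python
# Minimal sketch of residual-stream activation patching at the last token
# position, in the spirit of the paper's causal analysis. Assumes gpt2 via
# Hugging Face transformers; LAYER and the prompts are hypothetical choices
# for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # a "middle" layer in gpt2's 12-layer stack (illustrative)


def last_token_residual(prompt: str, layer: int) -> torch.Tensor:
    """Run the model and cache the residual stream at the last token after `layer`."""
    cache = {}

    def hook(_module, _inputs, output):
        # output[0] is the block's hidden states: (batch, seq, hidden)
        cache["resid"] = output[0][:, -1, :].detach().clone()

    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["resid"]


def run_with_patched_last_token(prompt: str, layer: int, resid: torch.Tensor) -> str:
    """Run `prompt`, overwriting the last-token residual stream after `layer` with `resid`."""

    def hook(_module, _inputs, output):
        hidden = output[0].clone()
        hidden[:, -1, :] = resid
        return (hidden,) + output[1:]  # returning a value replaces the block output

    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits
    handle.remove()
    return tokenizer.decode(logits[0, -1].argmax().item())


# Patch the last-token representation from one prompt into another context
# and inspect how the prediction changes.
source = "The capital of France is"
target = "The capital of Germany is"
resid = last_token_residual(source, LAYER)
print(run_with_patched_last_token(target, LAYER, resid))
```

If the patched middle-layer representation carries the "request" while later layers handle retrieval from the context, this kind of swap changes which entity the model retrieves; sweeping the layer index is one way to probe where that division of labor occurs.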
Submission Number: 40