Keywords: Instruction-following, Large language models (LLMs), Linear Probes, Representation Engineering, AI agents
TL;DR: Using linear probes, we identified a dimension in LLMs' input embedding space linked to instruction-following. Modifying representations along this dimension improves instruction-following, offering insights into enhancing the behavior of AI agents.
Abstract: Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided guidelines. However, LLMs often fail to follow even simple instructions. To improve instruction-following behavior and prevent undesirable outputs, we need a deeper understanding of how LLMs' internal states relate to these outcomes.
Our analysis of LLM internal states revealed a dimension in the input embedding space linked to successful instruction-following. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality.
This work provides insights into the internal workings of LLMs' instruction-following, paving the way for reliable LLM agents.
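To make the approach concrete, the following is a minimal sketch (not the authors' implementation) of the general recipe the abstract describes: fit a linear probe on per-prompt representations labeled by instruction-following success, take the probe's weight vector as the candidate direction, and shift representations along it. The representations, labels, dimensionality, and steering strength below are synthetic placeholders.

```python
# Minimal sketch (synthetic data, not the paper's code): a linear probe
# identifies a direction associated with instruction-following success,
# and representations are steered along that direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder representations: in the paper's setting these would be LLM
# input-embedding representations of prompts, labeled by whether the model
# successfully followed the instruction.
d, n = 64, 500                                   # assumed dimension / sample count
success_reps = rng.normal(loc=+0.5, scale=1.0, size=(n, d))
failure_reps = rng.normal(loc=-0.5, scale=1.0, size=(n, d))
X = np.vstack([success_reps, failure_reps])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Linear probe: its (normalized) weight vector is the candidate
# "instruction-following" direction in representation space.
probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def steer(rep: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Shift a representation along the probe direction.

    alpha is a hypothetical steering strength; in practice it would be
    tuned so that response quality is not degraded.
    """
    return rep + alpha * direction

# Example: steering a failure-like representation raises the probe's
# predicted probability of successful instruction-following.
rep = failure_reps[0]
print("before:", probe.predict_proba(rep.reshape(1, -1))[0, 1])
print("after: ", probe.predict_proba(steer(rep).reshape(1, -1))[0, 1])
```

In an actual LLM pipeline, the steering step would be applied to the model's activations (e.g., via forward hooks) rather than to standalone vectors; the sketch only illustrates the probe-then-steer logic.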
Email Of Author Nominated As Reviewer: jh2324@cam.ac.uk
Submission Number: 10