Focus On This, Not That! Steering LLMs With Adaptive Feature Specification

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: instruction tuning, LLMs, spurious correlations, robustness, distribution shift, bias
TL;DR: We introduce a method that trains LLMs to adaptively condition their task behaviour on specified features.
Abstract: Despite the success of Instruction Tuning (IT) in training large language models (LLMs) to perform arbitrary user-specified tasks, these models often still leverage spurious or biased features learned from their training data, leading to undesired behaviours when deploying them in new contexts. In this work, we introduce *Focus Instruction Tuning* (FIT), which trains LLMs to condition their responses by "focusing on" specific features whilst ignoring others, leading to different behaviours based on which features are specified. Across several experimental settings, we show that focus-tuned models can be adaptively steered by focusing on different features at inference time, such as (a) improving robustness by focusing on task-causal features and ignoring spurious features, and (b) mitigating bias by ignoring demographic categories. Furthermore, FIT can steer behaviour in new contexts, generalising under distribution shift and to new, unseen features at inference time, thereby facilitating more robust, fair, and explainable LLM applications in real-world environments.
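To make the steering interface concrete, below is a minimal sketch of how a FIT-style focus specification might be attached to a task prompt at inference time. The template wording, the function name `build_focus_prompt`, and the example features are illustrative assumptions for exposition; the paper's actual prompt format and training procedure are not reproduced here.

```python
# Minimal sketch: wrapping a task prompt with "focus on X, ignore Y"
# instructions, so a focus-tuned model can be steered at inference time.
# Template wording and feature names are hypothetical, not the paper's exact format.

def build_focus_prompt(task_instruction: str, example_input: str,
                       focus: list[str] | None = None,
                       ignore: list[str] | None = None) -> str:
    """Compose a prompt telling the model which features to condition on."""
    parts = [task_instruction]
    if focus:
        parts.append("Focus on the following features: " + ", ".join(focus) + ".")
    if ignore:
        parts.append("Ignore the following features: " + ", ".join(ignore) + ".")
    parts.append("Input: " + example_input)
    return "\n".join(parts)


if __name__ == "__main__":
    # Example: steering a sentiment classifier away from a spurious feature.
    prompt = build_focus_prompt(
        task_instruction="Classify the sentiment of the review as positive or negative.",
        example_input="The plot twist was brilliant, though the cinema was freezing.",
        focus=["the reviewer's opinion of the film"],
        ignore=["mentions of the viewing environment"],
    )
    print(prompt)
```

Under this reading, the same trained model yields different behaviours simply by varying the `focus` and `ignore` lists, including for features never seen during training.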
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9611