Keywords: LLM Alignment, Inference-Time Intervention, Mutual Information, Graph Network, Misalignment Estimation, Uncertainty Quantification
Abstract: Effectively mitigating the misalignment of large language models (LLMs) is crucial for ensuring secure AI applications. The Inference-Time Intervention (ITI) technique, which applies interventions to internal representations along the probed alignment direction during inference, offers substantial alignment improvements at minimal cost. However, previous ITI methods adopt a coarse sentence-level analysis that neglects the misalignment discrepancy among varied tokens, resulting in a deviant alignment direction and inflexible intervention strength.
In this work, we propose a Token-Aware Inference-Time Intervention (TA-ITI) approach to fully utilize token-level alignment information, thereby realizing superior post-intervention performance. TA-ITI primarily consists of Mutual Information-Guided Token-level Graph Aggregation (MIG) and Misalignment-aware Adaptive Token-level Intervention (MAI). MIG develops an MI-guided graph to exploit the tokens' informative interactions for representation enrichment, thus improving alignment probing and facilitating subsequent intervention.
MAI comprehensively perceives the token-level misalignment degree from both token representations and predictions to guide the adaptive adjustment of intervention strength, thereby enhancing final alignment performance. Extensive experiments on three alignment capabilities demonstrate the efficacy of TA-ITI, which notably surpasses the baseline by 25.8\% on the primary metric of truthfulness.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2152