Keywords: LLM Alignment, Inference-Time Intervention, Mutual Information, Graph Network, Misalignment Estimation, Uncertainty Quantification
Abstract: Effectively mitigating the misalignment of large language models (LLMs) is crucial for ensuring secure AI applications. The Inference-Time Intervention (ITI) technique, which applies interventions to internal representations along the probed alignment direction during inference, offers substantial alignment improvements at minimal cost. However, previous ITI methods adopt a coarse sentence-level analysis that neglects the misalignment discrepancy among varied tokens, resulting in a deviant alignment direction and inflexible intervention strength.
In this work, we propose a Token-Aware Inference-Time Intervention (TA-ITI) approach to fully utilize token-level alignment information, thereby realizing superior post-intervention performance. TA-ITI primarily consists of Mutual Information-Guided Token-level Graph Aggregation (MIG) and Misalignment-aware Adaptive Token-level Intervention (MAI). MIG develops an MI-guided graph to exploit the tokens' informative interactions for representation enrichment, thus improving alignment probing and facilitating subsequent intervention.
MAI comprehensively perceives the token-level misalignment degree from both token representations and predictions to guide the adaptive adjustment of intervention strength, thereby enhancing final alignment performance. Extensive experiments on three alignment capabilities demonstrate the efficacy of TA-ITI, which notably surpasses the baseline by 25.8\% on the primary metric of truthfulness.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2152