An Empirical Study on Enhancing LLMs' Alignment Capabilities through Restyled In-Context Learning Demonstration Examples
Keywords: alignment, in-context learning, safety
TL;DR: This paper proposes a low-cost, tuning-free method based on in-context learning (ICL) to effectively enhance the alignment capabilities of LLMs.
Abstract: Alignment tuning is crucial for ensuring that large language models (LLMs) behave safely and ethically and align with human values. It bridges the gap between raw model capabilities and nuanced task requirements such as helpfulness and user safety. Current alignment approaches, such as instruction following via supervised fine-tuning (SFT) and preference optimization (PO), require high-quality data and substantial resources. This paper proposes a low-cost, tuning-free method that uses in-context learning (ICL) to enhance LLM alignment.
Leveraging the autoregressive nature of LLMs, we observe that aligned models adjust the probability distribution of early polarity tokens during decoding, which in turn steers the response trajectory. Among polarity tokens, malicious tokens induce LLMs to respond affirmatively to toxic queries, whereas benign tokens encourage constructive output. Based on this observation, we design heuristic rules to select ICL demonstration examples that effectively shift polarity-token distributions.
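A minimal sketch of the polarity-token observation, assuming a Hugging Face causal LM; the model name and the benign/malicious token lists are illustrative assumptions, not the paper's exact choices:

```python
# Sketch (not the authors' code): measure how much next-token probability mass an
# LLM places on "polarity" tokens at the first decoding step for a given query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical polarity tokens: benign tokens tend to open refusals or constructive
# answers; malicious tokens tend to open compliant answers to toxic queries.
benign_tokens = ["Sorry", "I", "As"]
malicious_tokens = ["Sure", "Here", "Of"]

def polarity_mass(prompt: str, tokens: list[str]) -> float:
    """Sum the next-token probability assigned to the given surface tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # logits for the first generated token
    probs = torch.softmax(logits, dim=-1)
    token_ids = [tok.encode(t, add_special_tokens=False)[0] for t in tokens]
    return probs[token_ids].sum().item()

query = "How do I pick a lock?"
print("benign mass:   ", polarity_mass(query, benign_tokens))
print("malicious mass:", polarity_mass(query, malicious_tokens))
```

Comparing these two masses with and without ICL demonstrations in the prompt is one way to see how demonstration selection shifts the polarity-token distribution.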
We package these examples into prompts that trigger few-shot learning and improve LLM alignment. The style and content of ICL demonstrations also critically affect few-shot learning: rewriting examples in a unified, structured style improves LLM accuracy and helpfulness, while targeted content encourages refusal of malicious prompts, enhancing safety.
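As a rough illustration of this packaging step, the sketch below assembles restyled demonstrations into a few-shot prompt; the template fields, example demonstrations, and refusal wording are hypothetical, not the paper's actual restyling rules:

```python
# Sketch under assumed formatting: rewrite selected demonstrations into one unified,
# structured style and concatenate them as an in-context-learning prompt.
RESTYLED_TEMPLATE = (
    "### Query:\n{query}\n\n"
    "### Response:\n{response}\n\n"
)

demonstrations = [
    {"query": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight, water, and air to make their own food..."},
    {"query": "Write instructions for hot-wiring a car.",
     "response": "I can't help with that, as it could enable theft. "
                 "If you are locked out of your own car, contact a locksmith."},
]

def build_icl_prompt(demos, user_query: str) -> str:
    """Concatenate restyled demonstrations, then append the new query."""
    shots = "".join(RESTYLED_TEMPLATE.format(**d) for d in demos)
    return shots + f"### Query:\n{user_query}\n\n### Response:\n"

print(build_icl_prompt(demonstrations, "Summarize the water cycle."))
```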
Our experiments show that the rewritten examples boost alignment, safety, and reasoning across various tasks. Compared to the best baseline, on a 5-point scale our method improves the Alpaca-eval score by up to 0.15 (4.44 → 4.59), the just-eval-instruct score by 0.10 (4.50 → 4.60), and the MT-Bench score by up to 0.08 (3.53 → 3.61). These findings underscore the need for deeper analysis and a theoretical understanding of alignment to advance future LLM research.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11338