Introspective Adversarial Learning: Autonomous and Continual Preference Learning for LLM Alignment

08 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM alignment, Self-Play
TL;DR: IAL enables self-improving LLM alignment without human labels, surpassing existing methods.
Abstract: Large Language Models (LLMs) exhibit impressive capabilities across diverse tasks, yet aligning their outputs with human preferences remains a significant and costly challenge. Traditional alignment methods like Reinforcement Learning from Human Feedback (RLHF) depend heavily on extensive human-annotated preference data, which is difficult to scale. We propose Introspective Adversarial Learning (IAL), a novel alignment framework that enables LLMs to autonomously refine their own outputs through iterative self-improvement, without requiring additional human supervision. IAL introduces a Player-Advisor mechanism where the Player generates candidate responses and the Advisor provides constructive refinement strategies. The refined responses are evaluated by a reward model, and the contrast between initial and improved outputs drives a Preference Transductive Learning process. This reflective cycle allows the model to generate high-quality preference data internally and progressively enhance alignment. Experiments on the zephyr-7b-sft-full model, evaluated via the HuggingFace Open LLM Leaderboard and MT-Bench, show that IAL consistently improves alignment performance while preserving strong general task capabilities. Compared to state-of-the-art methods such as SPIN, SPA, and DPO, IAL achieves superior results without relying on costly human preference annotations, offering a scalable and efficient pathway toward better-aligned LLMs.
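The reflective cycle described in the abstract can be sketched as follows. This is a minimal illustrative loop, not the paper's implementation: `player_generate`, `advisor_refine`, and `reward_model` are toy stand-ins (assumptions) for the Player, the Advisor's refinement step, and the reward model, and the returned pair stands in for the internally generated preference data that drives the Preference Transductive Learning step.

```python
# Toy sketch of one IAL self-improvement iteration (all functions below
# are illustrative stand-ins, not the paper's actual components).

def player_generate(prompt: str) -> str:
    # Player: produce an initial candidate response (toy stand-in).
    return f"draft answer to: {prompt}"

def advisor_refine(prompt: str, response: str) -> str:
    # Advisor: apply a constructive refinement strategy to the
    # Player's candidate (toy stand-in).
    return response + " [refined: added justification]"

def reward_model(prompt: str, response: str) -> float:
    # Reward model: score a response; here, longer is better
    # (a toy heuristic, purely for illustration).
    return float(len(response))

def ial_iteration(prompt: str) -> tuple[str, str]:
    """Run one reflective cycle and return a (chosen, rejected)
    preference pair generated internally, without human labels."""
    initial = player_generate(prompt)
    refined = advisor_refine(prompt, initial)
    # The contrast between initial and refined outputs, as scored by
    # the reward model, yields the preference pair.
    if reward_model(prompt, refined) > reward_model(prompt, initial):
        return refined, initial
    return initial, refined

chosen, rejected = ial_iteration("Explain overfitting.")
```

In the full framework, the resulting `(chosen, rejected)` pairs would feed a DPO-style preference-optimization update, and the loop would repeat over successive iterations.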
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3094