Large Language Model Value Alignment via Multi-Stage Fine-Tuning and Expert-Annotated Supervision

Published: 19 Jun 2025, Last Modified: 12 Jul 2025
Venue: 4th Muslims in ML Workshop co-located with ICML 2025 (Oral)
License: CC BY 4.0
Submission Track: Track 1: Machine Learning Research by Muslim Authors
Keywords: LLM, Value Alignment
Abstract: Ensuring that large language models (LLMs) generate responses aligned with human values is a critical challenge in AI safety and deployment. We present a multi-stage alignment framework that combines expert annotation, structured arbitration, and iterative fine-tuning. In our approach, model responses to diverse user prompts are rated by multiple experts along key alignment dimensions. Cases with conflicting ratings are escalated to senior-expert arbitration, yielding high-confidence consensus labels. This curated supervision drives successive rounds of model fine-tuning, with each iteration further improving alignment. To safeguard conversational quality, we use Sentence-BERT to quantitatively measure dialogue coherence before and after alignment. Our experiments show that this process improves alignment while maintaining or enhancing coherence and relevance. The framework provides a systematic, scalable approach to aligning LLMs with human values and intent.
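To make the annotation-and-arbitration stage concrete, here is a minimal Python sketch of how conflicting expert ratings might be detected and escalated. The rating labels, the agreement threshold, and the function name are illustrative assumptions; the paper does not publish this code.

```python
# Hypothetical sketch of the rating-and-arbitration stage: each response is
# rated by several experts; high-agreement cases become consensus labels
# directly, while conflicting cases are flagged for senior-expert arbitration.
from collections import Counter

AGREEMENT_THRESHOLD = 0.75  # assumed fraction of experts that must agree

def consensus_or_escalate(ratings: list[str]) -> tuple[str, bool]:
    """Return (label, needs_arbitration) for one response's expert ratings."""
    label, count = Counter(ratings).most_common(1)[0]
    if count / len(ratings) >= AGREEMENT_THRESHOLD:
        return label, False  # high-confidence consensus label
    return label, True       # conflicting ratings: escalate to a senior expert

print(consensus_or_escalate(["aligned", "aligned", "aligned", "misaligned"]))
# ('aligned', False) — 3/4 agreement meets the assumed threshold
print(consensus_or_escalate(["aligned", "misaligned", "aligned", "misaligned"]))
# ('aligned', True) — tied ratings, escalated to arbitration
```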
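The coherence safeguard can be illustrated with the open-source sentence-transformers implementation of Sentence-BERT. Below is a minimal sketch, assuming adjacent-turn cosine similarity as the coherence metric and the all-MiniLM-L6-v2 checkpoint; neither choice is specified in the abstract.

```python
# Sketch of a Sentence-BERT dialogue-coherence score: mean cosine similarity
# between embeddings of adjacent turns. Model and metric are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint

def dialogue_coherence(turns: list[str]) -> float:
    """Mean cosine similarity between embeddings of adjacent dialogue turns."""
    embeddings = model.encode(turns, convert_to_tensor=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[i + 1]))
            for i in range(len(embeddings) - 1)]
    return sum(sims) / len(sims)

dialogue = ["How do I reset my password?",
            "Go to Settings and choose 'Reset password'.",
            "A reset link will then be emailed to you."]
print(f"coherence: {dialogue_coherence(dialogue):.3f}")
```

Comparing this score on dialogues sampled before and after a fine-tuning round gives the kind of quantitative before/after coherence check the abstract describes.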
Submission Number: 27