Keywords: Steerable Pluralistic Alignment, PEFT, MoE, LoRA, RLVR, GRPO
Abstract: Steerable pluralistic alignment aims to enable large language models (LLMs) to reliably adhere to diverse and potentially conflicting human values, particularly when target objectives involve multi-dimensional, compositional values. Current methods largely rely on prompt engineering or reasoning-time guidance, which often results in fragile and non‑persistent control once prompts are perturbed or omitted.
In this work, we study value-controllable alignment through discrete condition vectors and propose Verifiable-Reward-Routed LoRA, a parameter-efficient mixture-of-experts LoRA framework with conditioned gating. The gating mechanism dynamically routes computation among multiple LoRA experts based on an input value or moral condition vector. To ensure that such routing yields semantically compliant outputs, we formulate post-training as a reinforcement learning problem with verifiable rewards. We further introduce a conditional consistency reward, computed by an external model-based verifier implemented as a lightweight discriminator, and optimize the adapter parameters with Group Relative Policy Optimization (GRPO); an illustrative sketch of these mechanisms follows the abstract.
Experiments on the Touché23-ValueEval (value alignment) and MIC (moral alignment) benchmarks, using two 8-billion-parameter backbones, show that our method consistently outperforms prompt-based steering and multi-task PEFT baselines. It attains the highest overall controllability on micro-F1, macro-F1, and Jaccard similarity, a conclusion further reinforced by human pairwise evaluations.
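Below is a minimal, illustrative PyTorch sketch of the two mechanisms the abstract describes: a gate conditioned on a value vector that mixes LoRA experts, and a Jaccard-style conditional consistency reward scored from an external verifier's prediction. All names here (ConditionedMoELoRALinear, cond_dim, n_experts, conditional_consistency_reward) are hypothetical; this is a sketch of the general technique under assumed shapes and defaults, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedMoELoRALinear(nn.Module):
    """Frozen base linear layer plus several LoRA experts.

    A gate reads a discrete value/moral condition vector and produces
    mixture weights over the experts; the weighted sum of the low-rank
    updates is added to the frozen base projection, so only the adapters
    and the gate are trainable (PEFT).
    """

    def __init__(self, base: nn.Linear, n_experts: int = 4,
                 rank: int = 8, cond_dim: int = 10, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen

        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / rank
        # One (A, B) low-rank pair per expert; B starts at zero so the
        # adapted layer initially equals the base layer.
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        # The gate is conditioned on the value vector, not on the prompt,
        # so steering persists even when the prompt omits the values.
        self.gate = nn.Linear(cond_dim, n_experts)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); cond: (batch, cond_dim), e.g. multi-hot.
        weights = F.softmax(self.gate(cond), dim=-1)           # (batch, E)
        delta = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)
        delta = torch.einsum("bseo,be->bso", delta, weights)   # mix experts
        return self.base(x) + self.scaling * delta


def conditional_consistency_reward(pred: torch.Tensor,
                                   cond: torch.Tensor) -> float:
    """Jaccard-style verifiable reward in [0, 1].

    `pred` is the verifier's multi-hot prediction of which values the
    generated text expresses; `cond` is the intended condition vector.
    Both are boolean tensors of shape (cond_dim,).
    """
    inter = (pred & cond).sum()
    union = (pred | cond).sum().clamp(min=1)  # avoid division by zero
    return (inter / union).item()
```

In GRPO-style post-training, a reward of this form would be computed for each sampled completion in a group, and the resulting group-relative advantages would update only the adapter and gate parameters while the backbone remains frozen.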
Paper Type: Long
Research Area: Language Models
Research Area Keywords: fine-tuning, prompting, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4725