Abstract: Reinforcement Learning from Human Feedback (RLHF) and its variants have demonstrated remarkable performance in aligning models with human values and intentions to generate helpful, harmless, and honest responses. However, most of these methods rely on costly human-annotated pairwise comparisons for supervised alignment, which are ill-suited to list-level scenarios such as community question answering. Additionally, human preferences are shaped by multiple intrinsic attributes of responses, leading to inconsistent preference decisions.
Therefore, we propose **Self-supervised Attribute-aware dynamic preference ranking** (*SeAdpra*). It quantifies preference differences between responses via Attribute-Perceptual Distance Factors (APDF) and dynamically determines the list-wise alignment order, enabling fine-grained preference-difference learning and precise alignment with the optimal response.
We further constructed a challenging code-preference dataset, **StaCoCoQA**, and introduced more cost-effective and scalable preference evaluation metrics: **PrefHit** and **PrefRecall**. Extensive experimental results show that *SeAdpra* exhibits superior performance and generalizability on both StaCoCoQA and preference datasets from eight popular domains.
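The abstract does not define PrefHit and PrefRecall; the sketch below is a minimal, purely illustrative interpretation, assuming they behave like top-k hit and recall computed against a human preference ranking of candidate answers. The function names, signatures, and cutoff `k` are hypothetical and are not the paper's implementation.

```python
from typing import Sequence


def pref_hit(ranked_ids: Sequence[str], gold_best_id: str, k: int = 1) -> float:
    """1.0 if the human-preferred best answer appears in the model's top-k, else 0.0."""
    return 1.0 if gold_best_id in list(ranked_ids)[:k] else 0.0


def pref_recall(ranked_ids: Sequence[str], gold_top_ids: Sequence[str], k: int) -> float:
    """Fraction of the human top-k preferred answers recovered in the model's top-k."""
    if not gold_top_ids:
        return 0.0
    top_k = set(list(ranked_ids)[:k])
    return len(top_k & set(gold_top_ids)) / len(gold_top_ids)


# Example: a model ranks five community answers for one question (best first).
ranked = ["a3", "a1", "a5", "a2", "a4"]
print(pref_hit(ranked, gold_best_id="a1", k=1))             # 0.0
print(pref_recall(ranked, gold_top_ids=["a1", "a3"], k=2))  # 1.0
```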
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Preference Alignment, Community Question Answering, Large Language Model, Reinforcement Learning from Human Feedback
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 1832