Leveraging Domain Knowledge for Efficient Reward Modeling in RLHF: A Case-Study in E-Commerce Opinion Summarization

ACL ARR 2024 June Submission 4606 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: E-Commerce Opinion Summarization is the task of summarizing users' opinions expressed about a product (such as a laptop, a book, etc.). Prior approaches have failed to impart human-desirable properties to opinion summaries. Recently, Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values, which motivates us to leverage RLHF for our task. The key to this strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. Training $\varphi$ requires sizeable human preference data, usually on the order of tens of thousands of examples. However, human goals are subjective and vary from task to task, which prevents us from using a general-purpose off-the-shelf reward model and would necessitate large-scale preference annotation for our task, which is expensive and time-consuming. To address this challenge and still leverage RLHF, we propose a novel approach to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required ($21\times$ reduction) while advancing SOTA ($\sim4$-point ROUGE-L improvement, preferred by humans over SOTA $68\%$ of the time). Our technique also avoids the Alignment Tax and provides some interpretability. We release our code: [anon.4open.science/efficient-rlhf](https://anonymous.4open.science/r/reward-approx-social-choice-opp-summ-B380).
Paper Type: Long
Research Area: Summarization
Research Area Keywords: opinion summarization, summarization, alignment, reinforcement-learning-from-human-feedback
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 4606
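
The abstract's central step is learning a reward model $\varphi$ from pairwise human preferences. As a purely illustrative, minimal sketch (not the authors' implementation, and not their domain-knowledge infusion), assuming a generic pooled text encoder and hypothetical argument names, the standard Bradley-Terry-style preference objective used to train such a $\varphi$ looks like:

```python
# Illustrative sketch only: generic preference-based reward modeling.
# `encoder` is assumed to be any module mapping (input_ids, attention_mask)
# to pooled embeddings of size `hidden_size`; all names here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        self.score = nn.Linear(hidden_size, 1)  # maps a summary embedding to a scalar reward

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask)  # (batch, hidden_size)
        return self.score(pooled).squeeze(-1)             # (batch,) scalar reward per summary


def preference_loss(reward_chosen, reward_rejected):
    # -log sigma(r_chosen - r_rejected): trains the model to rank the
    # human-preferred summary above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

The sketch only shows the generic objective that preference annotations would train; the paper's claim is that infusing domain knowledge into $\varphi$ reduces the number of such annotated pairs needed by roughly $21\times$.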