AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Abstract: As the integration of large language models into daily life grows, there remains a lack of benchmarks for $\textit{advising on subjective and personal dilemmas}$. To address this, we introduce AdvisorQA, a benchmark for assessing LLMs' capability to offer advice on deeply personalized concerns, built from the LifeProTips Reddit forum. This forum features a dynamic interaction in which users post advice-seeking questions and receive an average of 8.9 pieces of advice per question and 164.2 upvotes from hundreds of users, embodying a $\textit{collective intelligence}$. The resulting benchmark comprises daily-life questions, diverse corresponding responses, and majority-vote rankings used to train our helpfulness metric. Baseline experiments with PPO and DPO validate the efficacy of models trained on AdvisorQA, as measured by our helpfulness metric as well as GPT-4 and human evaluations, and we analyze the limitations of each training method on subjective tasks. AdvisorQA marks a significant step toward QA systems that provide personalized and empathetic advice, demonstrating LLMs' improved understanding of human subjectivity.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets, evaluation methodologies, benchmarking
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 651