Abstract: Personalization in Information Retrieval (IR) is a topic the community has studied for a long time. However, collecting and curating high-quality training data incurs significant cost and time, especially when gathering user-related information. In this paper, we explore the usefulness of Large Language Models (LLMs) in generating synthetic documents tailored to users' personal interests based on user-related information. We introduce a new dataset, Sy-SE-PQA, to study the effectiveness of models fine-tuned on LLM-generated data and to examine how the complexity of personalization impacts model performance. We build Sy-SE-PQA on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Starting from the questions in SE-PQA, we generate synthetic answers using different prompting techniques and LLMs. Our findings suggest that LLMs have high potential for generating training data tailored to users' needs for neural retrieval models, and that such synthetic data can replace human-written training data. The code is publicly available.
Paper Type: Short
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Retrieval; Question Answering; Personalization; Generated Data
Contribution Types: Data resources
Languages Studied: English
Submission Number: 4581