Abstract: Personalization in Information Retrieval (IR) is a topic the community has studied for a long time. However, collecting and curating high-quality training data incurs significant cost and time, especially when gathering user-related information. In this paper, we explore the usefulness of Large Language Models (LLMs) in generating synthetic documents tailored to users' personal interests based on user-related information. We introduce a new dataset, Sy-SE-PQA, to study the effectiveness of models fine-tuned on LLM-generated data and to examine how the complexity of personalization impacts model performance. We build Sy-SE-PQA on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Starting from the questions in SE-PQA, we generate synthetic answers using different prompting techniques and LLMs. Our findings suggest that LLMs have high potential for generating training data tailored to users' needs for neural retrieval models, and that such synthetic data can replace human-written training data. The code is publicly available.
Paper Type: Short
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Retrieval; Question Answering; Personalization; Generated Data
Contribution Types: Data resources
Languages Studied: English
Submission Number: 4581