Keywords: conversational search, multi-intent
TL;DR: A large-scale synthetic dataset comprising over 100,000 high-quality, information-seeking conversations for conversational search
Abstract: In recent years, search engines have advanced significantly. Yet traditional ad-hoc search engines often struggle with complex search scenarios (e.g., multi-turn information seeking). This challenge has shifted attention towards conversational search, an approach in which the search engine interacts directly with the user to obtain more precise results. Progress in conversational search has been slow because data is scarce and real-world conversational search data is difficult to gather. To address these hurdles, we automatically construct a large-scale, high-quality conversational search dataset. Previous efforts to create such datasets often overlooked multiple intents and contextual information, or produced a biased dataset in which all queries in a dialogue were linked to a single positive passage. In our study, we incorporate multiple intents by starting from existing search sessions and converting each keyword-based query into multiple natural language queries based on the different latent intents present in the related passage. We then contextualize these natural language queries within the same session and organize them into a conversational search tree. A carefully designed dialogue discriminator assesses the consistency and coherence of every generated conversation and filters out substandard ones.
After extensive data cleaning, we introduce the \textbf{I}ntent-oriented and \textbf{C}ontext-aware \textbf{Conv}ersational search dataset (ICConv), a large-scale synthetic dataset comprising over 100,000 high-quality, information-seeking conversations. Human annotators evaluated ICConv on six dialogue- and search-related criteria, and it performed well. We further explore the statistical characteristics of ICConv and use it as a benchmark to validate the effectiveness of various conversational search methods.
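The conversational search tree and the discriminator filter described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `TurnNode` class, the reading of each root-to-leaf path as one candidate conversation, the `filter_conversations` helper, the threshold, and the toy scoring function are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class TurnNode:
    """Hypothetical tree node: one natural-language query and its positive passage."""
    query: str
    passage_id: str
    children: List["TurnNode"] = field(default_factory=list)


def iter_conversations(root: TurnNode) -> Iterator[List[TurnNode]]:
    """Yield every root-to-leaf path (assumed here to form one candidate conversation)."""
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        if not node.children:
            yield path
            continue
        # Push children in reverse so they are visited in their original order.
        for child in reversed(node.children):
            stack.append((child, path + [child]))


def filter_conversations(
    root: TurnNode,
    score: Callable[[List[TurnNode]], float],
    threshold: float = 0.5,
) -> List[List[TurnNode]]:
    """Keep conversations whose score reaches the threshold.

    `score` stands in for the dialogue discriminator, whose actual form
    is not specified in the abstract.
    """
    return [conv for conv in iter_conversations(root) if score(conv) >= threshold]


# Toy session: one root query that branches into two follow-up intents.
root = TurnNode("what is conversational search", "p1", [
    TurnNode("how does it differ from ad-hoc search", "p2", [
        TurnNode("which datasets exist for it", "p3"),
    ]),
    TurnNode("why is its data hard to collect", "p4"),
])


def toy_score(conv: List[TurnNode]) -> float:
    """Placeholder scorer: accepts only conversations with at least three turns."""
    return 1.0 if len(conv) >= 3 else 0.0


kept = filter_conversations(root, toy_score)
for conv in kept:
    print(" -> ".join(turn.passage_id for turn in conv))
```

In this sketch each turn carries its own positive passage, which mirrors the abstract's point that the queries of a dialogue should not all link to a single passage; a real discriminator would replace `toy_score` with a learned model.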
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14160