Student Lead Author Indication: Yes
Keywords: troubleshooting, data retrieval
Abstract: Problem-solution datasets are commonly used for generative AI training. Especially when building a domain-specific conversational system, both fine-tuning and retrieval-augmented generation rely heavily on the availability of high-quality datasets. In the system troubleshooting domain, few troubleshooting datasets are available for model tuning. We find annotating a troubleshooting dataset to be a nontrivial task, mainly due to the multi-modal nature of the data, which contains a mix of disparate data artifacts such as code, log messages, console outputs, commands, and descriptions in natural language. In this paper, we present a comprehensive approach to acquiring the most relevant online forum data as the answer to an input problem description. Our main goal is to retrieve the most relevant online forum post for a given problem description, either as a semi-supervised curation method for training datasets or as a retrieval mechanism for augmented generation. Our key idea is to effectively separate the data artifacts in the documents and assess relevance across pairs of heterogeneous artifact types. To this end, we utilize a bag of language models and then use a weighted accumulative score to find the most relevant answer. Compared with several baseline techniques, our method improves search ranking quality by at least 42.89\% over the best competitor. It also successfully ranks the ground-truth forum post within the top 10 in 96.1\% of the cases, significantly reducing human annotation effort.
Submission Number: 43