Keywords: Knowledge base retrieval, Large language models, Domain adaptation, Summarization, Clustering, Information retrieval, Enterprise support, Weak supervision, Fine-tuning
TL;DR: FineKB leverages domain-adapted LLM issue summarization and clustering-based retrieval to bridge noisy case descriptions and KB articles, improving top-3 retrieval accuracy from 24% to 66%.
Abstract: Retrieving relevant knowledge base (KB) articles for enterprise support cases is difficult due to the semantic mismatch between noisy, verbose case descriptions and concise KB content. We present FineKB, a domain-adaptive issue-summarization and cluster-aware retrieval framework that addresses this gap through (i) a fine-tuned LLM trained on teacher-generated pseudo-summaries to normalize heterogeneous case narratives, (ii) per-KB multi-centroid clustering that models the diverse sub-problems associated with each KB article, and (iii) a confidence-adaptive hybrid inference mechanism that augments high-confidence vector search with selective content lookup and LLM reasoning for ambiguous cases. At inference time, raw case text is embedded and matched against this summary-structured index, avoiding runtime summarization while improving alignment. Experiments on large-scale enterprise data show that FineKB achieves 65.39% Recall@3, substantially outperforming KB-content dense retrieval (42.73%). To support reproducible research on noisy-to-structured retrieval, we release FineKB-Vectors, a vectorized dataset containing case-summary and KB-article embeddings.
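The per-KB multi-centroid indexing and confidence-adaptive retrieval described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors, the number of centroids per KB (`k`), and the confidence threshold are all placeholder assumptions, and the LLM/content-lookup escalation step is reduced to a boolean flag.

```python
import numpy as np

def build_multi_centroid_index(kb_to_vecs, k=2, iters=10, seed=0):
    """Per-KB k-means: each KB article gets up to k centroids, one per
    sub-problem cluster of its case-summary embeddings (illustrative only)."""
    rng = np.random.default_rng(seed)
    index = []  # list of (kb_id, unit-normalized centroid) pairs
    for kb_id, vecs in kb_to_vecs.items():
        X = np.asarray(vecs, dtype=float)
        k_eff = min(k, len(X))
        # initialize centroids from k_eff distinct points
        C = X[rng.choice(len(X), size=k_eff, replace=False)].copy()
        for _ in range(iters):
            # assign each point to its nearest centroid, then recompute means
            d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for j in range(k_eff):
                pts = X[assign == j]
                if len(pts):
                    C[j] = pts.mean(0)
        for c in C:
            index.append((kb_id, c / (np.linalg.norm(c) + 1e-9)))
    return index

def retrieve(query_vec, index, top_n=3, conf_threshold=0.6):
    """Rank KBs by best cosine score over their centroids; flag low-confidence
    queries, which the paper would escalate to content lookup / LLM reasoning."""
    q = np.asarray(query_vec, dtype=float)
    q = q / (np.linalg.norm(q) + 1e-9)
    best = {}  # best cosine score per KB over all its centroids
    for kb_id, c in index:
        s = float(q @ c)
        if s > best.get(kb_id, -1.0):
            best[kb_id] = s
    ranked = sorted(best.items(), key=lambda kv: -kv[1])[:top_n]
    confident = bool(ranked) and ranked[0][1] >= conf_threshold
    return ranked, confident
```

In this sketch, a raw case embedding is compared against summary-derived centroids, so each KB article can match several distinct sub-problems; only queries below the threshold would trigger the heavier hybrid path.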
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13617