ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search

Ashish Chouhan, Saifeldin Mandour, Michael Gertz

Published: 2024, Last Modified: 07 Jul 2025CoRR 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to easily inspect corpus-wide trends and relationships. We present ClusterChat (The demo video and source code are available at: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.