Distinguishing Speech from Writing with Multi-Topic Clustering and Visual Explainable AI

Published: 28 Apr 2026, Last Modified: 04 May 2026, MSLD 2026 Poster, CC BY 4.0
Keywords: Large Language Models, Explainable AI, Visualization, Stylometry, Linguistic Features, Writing Style, Topic Modeling
TL;DR: We introduce an open-source interactive framework for multi-level topic and token-level XAI exploration. Combining topic modeling with syntactic features improves classification and supports analysis with fine-tuned and prompt-based LLMs.
Abstract: Distinguishing spoken from written language has long been central to linguistics and stylometry [1-7], yet computational approaches often prioritize classification accuracy over interpretability [9]. In this work, we present an open-source, interactive framework for analyzing register variation in U.S. presidential speech and writing. The framework integrates linguistically motivated syntactic features, multi-level topic modeling, transformer-based classification, and explainable AI (XAI). Our dataset consists of 41,306 balanced sentences (20,654 from spoken transcripts and 20,652 from presidential-authored books). We extract a broad set of syntactic and discourse-level features, including clause structure, pronoun usage, parse depth, modal verbs, negation, discourse markers, interjections, and punctuation density. We use a Multi-Level Topic Modeling (MTM) approach to capture semantic variation across levels of granularity: sentence embeddings are generated with the all-MiniLM-L6-v2 model [9] and clustered with BERTopic [10], which leverages HDBSCAN [11] for density-based clustering and UMAP [12] for dimensionality reduction and visualization. Topic modeling is performed iteratively across 5 to 400 clusters, allowing each sentence to be represented by hierarchical semantic assignments. For classification, we compare traditional machine learning models (SVM, Random Forest) with fine-tuned transformer models, including DeBERTa-v3-large [13] and GPT-Neo [14]. We additionally evaluate prompt-based classification using Claude Haiku [15] with zero-shot (ZS), few-shot (FS), chain-of-thought (CoT), linguistically guided (LG), and self-consistency (SC) strategies. To investigate interpretability, we extract token-level explanations using attention weights and Integrated Gradients, and compute discriminative token statistics across classes.
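The multi-level topic assignment described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it substitutes scikit-learn's KMeans for the BERTopic/HDBSCAN pipeline and random vectors for MiniLM sentence embeddings, and the `multilevel_assignments` helper and its `levels` parameter are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans

def multilevel_assignments(embeddings, levels=(2, 4, 8), seed=0):
    """Cluster the same embeddings at several granularities,
    giving each sentence one cluster id per level (a hierarchical
    semantic signature usable as classification features)."""
    assignments = {}
    for k in levels:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        assignments[k] = km.fit_predict(embeddings)
    return assignments

# Stand-in for sentence embeddings (e.g., 384-dim MiniLM vectors).
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 16))
assign = multilevel_assignments(emb)
print({k: len(set(v)) for k, v in assign.items()})
```

In the paper's setting the granularity sweep runs from 5 to 400 clusters; stacking the per-level cluster ids alongside the syntactic features yields the combined feature set fed to the classifiers.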
Results show that incorporating MTM significantly improves classification performance over syntactic features alone, with Random Forest models achieving approximately 0.84 F1 on spoken language. Fine-tuned DeBERTa achieves the highest overall accuracy (0.94), substantially outperforming prompt-based LLM approaches. Interestingly, XAI-derived token features contribute only marginal performance gains once MTM is included, suggesting that hierarchical topic structure already captures much of the discriminative semantic information. To support qualitative exploration, we develop an interactive Observable-based visualization framework featuring UMAP topic projections, topic-distribution comparisons, and token-level Integrated Gradients highlights. This dual-view interface enables researchers to examine how semantic clusters and token attributions differ across registers and to analyze model errors interactively. Our findings demonstrate that combining linguistically grounded features with multi-level semantic clustering yields strong and interpretable register classification. More broadly, this work illustrates how visualization and explainability can move stylometry beyond black-box prediction toward interpretable analysis.
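The discriminative token statistics mentioned above can be approximated with a smoothed log-odds ratio between the two registers. The abstract does not specify the exact statistic used, so this is a hedged sketch of one common choice; the `log_odds_tokens` function and the toy sentences are illustrative, not from the paper.

```python
import math
from collections import Counter

def log_odds_tokens(spoken, written, alpha=1.0):
    """Laplace-smoothed log-odds of token frequency in spoken vs.
    written sentences. Positive scores lean spoken, negative written."""
    s, w = Counter(), Counter()
    for sent in spoken:
        s.update(sent.lower().split())
    for sent in written:
        w.update(sent.lower().split())
    vocab = set(s) | set(w)
    ns, nw, v = sum(s.values()), sum(w.values()), len(vocab)
    return {t: math.log(((s[t] + alpha) / (ns + alpha * v)) /
                        ((w[t] + alpha) / (nw + alpha * v)))
            for t in vocab}

# Toy registers: fillers and second person mark speech,
# nominal style marks writing.
spoken = ["well you know we did it", "and you know it was great"]
written = ["the administration enacted the policy", "the policy was enacted"]
scores = log_odds_tokens(spoken, written)
print(scores["you"] > 0, scores["policy"] < 0)  # True True
```

Ranking tokens by these scores gives class-discriminative word lists that can be compared against the Integrated Gradients attributions in the visualization interface.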
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 151