Keywords: Large Language Models (LLMs), Knowledge Graphs, Hallucination Detection, Semantic Drift, Bias Monitoring, Fairness in AI, Continuous Evaluation, Trustworthy AI
TL;DR: MonitorLLM is a continuous knowledge graph–based framework that unifies structural reliability, hallucination detection, and bias monitoring across demographic groups to provide auditable, vendor-agnostic evaluation of generative AI.
Abstract: Large Language Models (LLMs) provide remarkable generative capabilities but remain vulnerable to hallucinations, semantic drift, and biased framings. Existing evaluation methods are static and dataset-bound, offering limited insight into how models evolve under real-world conditions. We present MonitorLLM, a knowledge graph–based framework for continuous and interpretable evaluation of generative AI. MonitorLLM compares deterministic, ontology-driven graphs with LLM-generated graphs from live news streams, quantifying deviations through schema-aware structural metrics and hallucination checks. To extend beyond factual reliability, we introduce a Generalized Prompt Framework (GPF) that probes diverse demographic, socioeconomic, and political groups, enabling the construction of bias-aware knowledge graphs and dissimilarity metrics. An adaptive anomaly detector integrates both structural and bias dimensions, capturing temporal drift and reliability shifts. Experiments across nine LLMs demonstrate that MonitorLLM highlights model stability, surfaces hallucinations, and reveals disparities in group-conditioned framings, offering a vendor-agnostic and auditable path toward trustworthy deployment of generative AI.
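To make the comparison step concrete, the following is a minimal sketch of how a graph dissimilarity and hallucination check could look, assuming knowledge graphs are represented as sets of (subject, predicate, object) triples; the function names, the Jaccard-style metric, and the example triples are illustrative assumptions, not the paper's actual implementation or metrics.

```python
# Illustrative sketch (assumption, not the paper's implementation):
# knowledge graphs as sets of (subject, predicate, object) triples,
# compared with a Jaccard-style dissimilarity and a simple check for
# triples the LLM asserts that the reference graph does not support.

def triple_dissimilarity(reference_kg: set, llm_kg: set) -> float:
    """1 minus Jaccard similarity of two triple sets (0 = identical, 1 = disjoint)."""
    if not reference_kg and not llm_kg:
        return 0.0
    overlap = len(reference_kg & llm_kg)
    union = len(reference_kg | llm_kg)
    return 1.0 - overlap / union

def hallucinated_triples(reference_kg: set, llm_kg: set) -> set:
    """Triples in the LLM-generated graph with no support in the reference graph."""
    return llm_kg - reference_kg

if __name__ == "__main__":
    reference_kg = {("Acme Corp", "acquired", "Widget Inc"),
                    ("Widget Inc", "headquartered_in", "Berlin")}
    llm_kg = {("Acme Corp", "acquired", "Widget Inc"),
              ("Widget Inc", "headquartered_in", "Munich")}  # unsupported claim
    print(f"dissimilarity = {triple_dissimilarity(reference_kg, llm_kg):.2f}")
    print("hallucinated:", hallucinated_triples(reference_kg, llm_kg))
```

The same set-based comparison could in principle be repeated per demographic group under the GPF prompts to yield group-conditioned dissimilarity scores, though the paper's own metrics may be richer than this sketch.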
Submission Number: 94