Measuring What Matters: Probing Transit Reasoning Consistency in Large Language Models

Published: 08 Nov 2025, Last Modified: 08 Nov 2025, NeurIPS 2025 Workshop NORA Poster, CC BY 4.0
Keywords: Composite Reasoning, Multi-step Reasoning, Commonsense Knowledge, KG Reasoning, Agentic Systems, LLM Consistency, LLM Evaluation
Abstract: We propose a benchmark along with a comprehensive evaluation framework for transit-domain Large Language Models that transcends traditional accuracy metrics by probing in-context learning capabilities and multi-step reasoning processes. Our approach introduces four complementary evaluation paradigms: Perturbation Chains, Narrative Coherence Checks, Minimal Edit Plausibility, and Cross-Modal Anchoring, which collectively assess how models adapt, reason, and maintain consistency under domain-specific constraints. Through systematic evaluation of four state-of-the-art models, we demonstrate substantial performance disparities in cascading reasoning scenarios despite similar baseline accuracy, revealing fundamental limitations in current evaluation methodologies. Our framework, together with the benchmark, provides actionable insights for post-training optimization strategies, enables principled comparison of retrieval-augmented versus tool-calling architectures, and establishes theoretical foundations for deploying specialized smaller models in safety-critical transit applications. The benchmark and evaluation suite will be shared with the community along with further extended studies.
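To make the evaluation idea concrete, the sketch below shows one way a Perturbation Chain consistency score could be computed: a base transit query is perturbed step by step, and the metric is the fraction of steps whose answer remains consistent with the previous one. This is a minimal illustration under assumed interfaces only; the `ask_model` callable, the example queries, and the consistency rule are hypothetical placeholders, not the benchmark's actual implementation.

```python
# Hypothetical sketch of a perturbation-chain consistency check.
# All names (ask_model, the queries, the consistency rule) are illustrative
# assumptions, not the paper's released evaluation code.
from typing import Callable, List


def perturbation_chain_score(
    ask_model: Callable[[str], str],
    base_query: str,
    perturbations: List[str],
    consistent: Callable[[str, str], bool],
) -> float:
    """Fraction of perturbation steps whose answer stays consistent
    with the answer to the preceding step."""
    previous_answer = ask_model(base_query)
    consistent_steps = 0
    for query in perturbations:
        answer = ask_model(query)
        if consistent(previous_answer, answer):
            consistent_steps += 1
        previous_answer = answer
    return consistent_steps / len(perturbations) if perturbations else 1.0


if __name__ == "__main__":
    # Toy stand-in for an LLM: the answer flips when a line closure is mentioned.
    def toy_model(query: str) -> str:
        return "take the bus" if "closed" in query else "take the metro"

    chain = [
        "Route from A to B if the Red Line is closed?",
        "Route from A to B if the Red Line is closed and it is raining?",
    ]
    score = perturbation_chain_score(
        toy_model,
        "Route from A to B?",
        chain,
        consistent=lambda prev, cur: prev == cur or "bus" in cur,
    )
    print(f"chain consistency: {score:.2f}")
```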
Submission Number: 31