Evaluating Global Decision Faithfulness of LLMs with Structured Tabular Decision Simulations

TMLR Paper8940 Authors

14 May 2026 (modified: 27 May 2026) · Under review for TMLR · CC BY 4.0

Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply that their decisions are grounded in relevant, domain-appropriate factors. In structured decision settings, such as medical triage, financial risk assessment, or policy analysis, reliable performance requires more than producing correct labels: a model should make consistent decisions across multiple instances and rely on relevant, domain-grounded decision factors. We introduce **Structured Tabular Decision Simulations (STaDS)**, an evaluation framework that casts expert-like decision problems into tabular form and evaluates LLMs along three behavioral dimensions: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. The third dimension extends faithfulness evaluation from local reasoning traces to global decision faithfulness: whether a model's stated decision factors align with the factors that behaviorally affect its predictions across many instances. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that predictive competence and global decision faithfulness are empirically separable: models frequently achieve high accuracy while exhibiting low or negative alignment between stated and behaviorally measured feature reliance. This accuracy-faithfulness gap is consistent across model families and domains, and remains visible in a targeted domain-specialized medical-model case study. Our results highlight that accuracy metrics alone are insufficient and motivate the adoption of global faithfulness evaluation as a complementary protocol.
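The notion of alignment between stated and behaviorally measured feature reliance can be illustrated with a small rank-correlation sketch. This is only an illustration: the abstract does not say which alignment statistic the paper uses, so the Spearman-style correlation here is an assumption, and the names `average_ranks`, `pearson`, and `rank_alignment` are hypothetical, as is the input format (one importance score per feature).

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def rank_alignment(stated, behavioral):
    """Spearman-style alignment between two {feature: importance} dicts.

    `stated` holds the model's self-reported importance per feature;
    `behavioral` holds a measured effect per feature (assumed here to be
    something like the performance drop when that feature is ablated).
    """
    if set(stated) != set(behavioral):
        raise ValueError("both inputs must cover the same features")
    features = sorted(stated)
    s = [stated[f] for f in features]
    b = [behavioral[f] for f in features]
    return pearson(average_ranks(s), average_ranks(b))
```

Under this sketch, a value near 1 means the features a model says it relies on are also the ones whose removal changes its predictions most, while a negative value means the stated ordering is roughly the reverse of the measured one.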

Submission Type: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=R4NninzmGb

Changes Since Last Submission: We summarize the major revisions below, organized by theme.

---

### Reframing and presentation

The paper title has been updated to **“Evaluating Global Decision Faithfulness of LLMs with Structured Tabular Decision Simulations”** to more precisely reflect the paper’s central contribution. The abstract and introduction have been substantially revised to foreground the faithfulness framing rather than the broader “understanding” framing, reducing conceptual ambiguity.

We also improved presentation and notation throughout the manuscript. A new notation table, Table 2, has been added to define the core symbols at first use. Abbreviations and symbols are now introduced before use, and section transitions have been revised for coherence. Sections 3.3 and 5.1 have also been compressed to reduce redundancy and improve readability.

---

### Feature ablation methodology and perturbation operators

We substantially expanded the treatment of LAO interventions. The main text now explicitly distinguishes deletion from replacement-based perturbation operators and motivates deletion as a clear **missing-information intervention**, rather than as a universal feature-attribution estimator in the style of classical XAI methods.

To address the concern that deletion changes the input structure, we compare deletion-LAO with four alternative operators: constant replacement, mean replacement, empirical marginal sampling, and column-wise permutation (Section 6.4, Appendix B.1, Tables 7 and 36--37).

To further validate deletion as a reliable behavioral intervention, we conduct a distribution-based perturbation check using 30 repeated generations at temperature 0.1, comparing within-condition disagreement with cross-condition distributional distance. Deletion produces statistically reliable distributional shifts beyond ordinary sampling variation ($p < 0.001$) without inflating within-condition variability, whereas column-wise permutation produces noisier counterfactuals for informative features. These results support deletion-LAO as the primary STaDS intervention.

---

### Post-processing audit

We added a systematic audit of the GPT-4.1-mini post-processing step (Section 6.4, Appendix B.3, Table 38). The audit compares raw extractable predictions with cleaned outputs under no-ablation, single-column ablation, and multi-column ablation settings. Post-audit accuracy remains at $1.00$ across all settings except Breast Cancer--Qwen3-8B under single-column ablation.

---

### Correlated group ablations

We added correlated group ablations for Iris and Breast Cancer (Section 6.4, Appendix B.4, Table 39 and Figures 21--22). These results confirm that single-feature LAO can underestimate higher-order feature reliance, but also show that such interactions are model- and dataset-specific.

---

### Domain-specialized model evaluation

We added a case study evaluating MedGemma-4B on two healthcare datasets, Breast Cancer and Pima Diabetes, compared against the general-domain Gemma3-4B baseline (Section 6.5, Appendix B.5, Figure 8, Table 40). The results show that domain specialization improves the apparent clinical relevance of self-reported attributions without necessarily ensuring alignment with LAO-based behavioral reliance. This reinforces the paper’s central claim that global decision faithfulness must be evaluated separately from predictive performance or explanation plausibility.

---

### Evaluation cost reporting

We added approximate prompt token counts to Table 4 for all 15 datasets at 100 test rows. Detailed wall-clock runtime for single-feature LAO evaluation and correlated group ablations is reported in Tables 9--10 and Figure 14 in Appendix A.1. These additions improve reproducibility and clarify the computational cost of applying STaDS across models and datasets.

---

### Hyperparameter sensitivity

We added a sensitivity analysis for the Penalized Accuracy penalty weights $\alpha$ and $\beta$ over the simplex $\alpha + \beta = 1$ (Section 6.4, Appendix B.2, Figure 20). The mean LAO-induced $\mathrm{PenAcc}$ degradation varies smoothly across the explored range without qualitative reversals, supporting the use of the neutral default $\alpha = \beta = 0.5$ rather than a tuned hyperparameter.

---

### Positioning relative to multi-hop QA and related benchmarks

We added Table 1, which systematically compares STaDS with multi-hop question answering, long-context QA, tabular QA, LLM-based tabular prediction, chain-of-thought faithfulness, and post-hoc XAI. The table distinguishes these paradigms along five axes: typical input, unit of evaluation, primary evaluation signal, whether repeated domain-level decisions are evaluated, and whether stated-versus-behavioral decision factors are compared. This addition clarifies that STaDS is not simply a tabular prediction benchmark, but a framework for evaluating repeated structured decision behavior and global decision faithfulness.
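To make the difference between deletion and the four replacement-based operators named under the feature ablation revisions concrete, the following is a minimal sketch on a table stored as a list of row dictionaries. It is an illustration under assumptions, not the paper's implementation: all function names are hypothetical, the constant default of `0.0` is arbitrary, mean replacement assumes a numeric column, and the exact operator definitions are those given in the paper's Section 6.4 and Appendix B.1.

```python
import random
from statistics import mean


def delete_column(rows, col):
    """Deletion: drop the column entirely, so the input structure changes."""
    return [{k: v for k, v in r.items() if k != col} for r in rows]


def constant_replace(rows, col, value=0.0):
    """Constant replacement: overwrite every cell with one fixed value."""
    return [{**r, col: value} for r in rows]


def mean_replace(rows, col):
    """Mean replacement: overwrite every cell with the column mean (numeric only)."""
    m = mean(r[col] for r in rows)
    return [{**r, col: m} for r in rows]


def marginal_sample(rows, col, rng):
    """Empirical marginal sampling: draw each cell independently, with
    replacement, from the values observed in that column."""
    pool = [r[col] for r in rows]
    return [{**r, col: rng.choice(pool)} for r in rows]


def permute_column(rows, col, rng):
    """Column-wise permutation: shuffle the observed values across rows,
    keeping the column's multiset of values unchanged."""
    values = [r[col] for r in rows]
    rng.shuffle(values)
    return [{**r, col: v} for r, v in zip(rows, values)]


if __name__ == "__main__":
    table = [
        {"age": 30.0, "glucose": 90.0},
        {"age": 50.0, "glucose": 140.0},
        {"age": 40.0, "glucose": 110.0},
    ]
    rng = random.Random(0)  # seeded so the stochastic operators are repeatable
    print(delete_column(table, "glucose"))
    print(mean_replace(table, "glucose"))
    print(permute_column(table, "glucose", rng))
```

Only `delete_column` changes the schema seen by the model; the other four keep the column present and alter its values, which is the distinction the revision draws between a missing-information intervention and replacement-based perturbation. Each function returns new rows and leaves the input table unmodified.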

Assigned Action Editor: ~Chris_J_Maddison1

Submission Number: 8940
