MetaCog-Bench: A Process-Based Benchmark for Evaluating Metacognitive Monitoring and Control in Large Language Models

29 Mar 2026 (modified: 27 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: We introduce MetaCog-Bench, a benchmark for evaluating metacognitive monitoring and control in large language models, grounded in the Nelson & Narens (1990) framework. Unlike prior benchmarks that rely on LLM-as-judge evaluation---which inflates scores when the same model family serves as both subject and evaluator---MetaCog-Bench uses exclusively deterministic evaluation: regex matching, keyword detection, JSON field verification, and Expected Calibration Error (ECE) computation. The benchmark comprises 147 tasks organized into five tiers spanning three metacognitive dimensions: Metacognitive Sensitivity (MS), Strategy Adaptation Frequency (SAF), and Cross-Domain Transfer Coefficient (CDTC). We evaluate seven models from six providers---including five proprietary frontier models, one proprietary mid-tier model, and one open-weight model (12B)---with three runs per model for statistical rigor. Grok-3-mini-fast achieves the highest overall score (0.864±0.009) with perfect metacognitive control (SAF=1.000), while DeepSeek-V3 follows closely (0.859±0.007) with the best confidence calibration (ECE=0.050). GPT-4o exhibits a striking monitoring-control dissociation: strong calibration (ECE=0.069) but weak sycophancy resistance (91.7%) and domain transfer (65.0%). The open-weight Open-Mistral-Nemo (12B) scores 0.710±0.026 overall but achieves near-proprietary sycophancy resistance (SAF=0.956), suggesting some metacognitive capabilities do not require frontier-scale models. All models achieve ≥96% ecological validity with unconstrained prompts versus ≤40% under JSON format constraints, demonstrating that structured output formats suppress metacognitive expression. A systematic keyword evaluation audit (100 sampled responses) validates the deterministic scoring pipeline at >96% accuracy.
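The abstract names Expected Calibration Error (ECE) as one of the benchmark's deterministic scoring components. A minimal sketch of a standard ECE computation is below; the equal-width binning and the bin count of 10 are assumptions for illustration, since the paper's exact binning scheme is not given here.

```python
# Sketch of Expected Calibration Error (ECE) over confidence bins.
# Equal-width bins and n_bins=10 are assumptions, not the paper's
# confirmed configuration.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)| over bins B_b."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each prediction to one equal-width bin; the first bin
        # also captures a confidence of exactly 0.0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)   # bin accuracy
        conf = sum(confidences[i] for i in idx) / len(idx)  # bin confidence
        ece += (len(idx) / n) * abs(acc - conf)
    return ece
```

A model that reports 95% confidence while answering correctly every time would score ECE = 0.05 under this sketch; lower is better, and 0.0 means perfectly calibrated.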
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 8162