Keywords: LLM evaluation, causal representation learning, observational studies
Abstract: Deriving actionable insights from language model evaluations to guide post-training is a central challenge, hampered by complex confounding effects and the prohibitive cost of controlled studies. In this paper, we propose a causal representation learning framework that uncovers a hierarchy of LLM capabilities purely from publicly available observational data. Drawing on insights from recent factor analysis \citep{ruan2024observational}, we model observed benchmark performance as a linear transformation of a few latent capability factors. Crucially, we model these latent factors as causally interrelated after controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset of over 1500 models evaluated on six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. The hierarchy we discover --- from general problem-solving to instruction-following, and finally to mathematical reasoning --- is more than an interpretation; it provides a causal roadmap for post-training. We demonstrate through targeted fine-tuning experiments that interventions on parent capabilities propagate to child capabilities as predicted by our model, offering validated, actionable guidance for practitioners.
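To make the abstract's modeling setup concrete, here is a minimal sketch of one way it can be written down; the notation ($x_i$, $z_i$, $u_i$, $\Lambda$, $B$) and the exact parametrization are our assumptions based on the abstract, not taken from the paper. For model $i$, let $x_i \in \mathbb{R}^{6}$ collect its six benchmark scores, $z_i \in \mathbb{R}^{3}$ its latent capability factors, and $u_i$ an encoding of its base model (the common confounder):
\begin{align*}
  x_i &= \Lambda z_i + \Gamma u_i + \varepsilon_i, && \text{(linear measurement of benchmark scores)} \\
  z_i &= B z_i + \Delta u_i + \eta_i, && \text{(linear causal model over latent capabilities)}
\end{align*}
where $B$ is strictly lower-triangular under the discovered causal ordering (general problem-solving $\to$ instruction-following $\to$ mathematical reasoning) and $\varepsilon_i$, $\eta_i$ are independent noise terms. Under this assumed form, an intervention on a parent capability in $z_i$ propagates through $B$ to its descendants, which is the prediction the fine-tuning experiments test.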
Primary Area: interpretability and explainable AI
Submission Number: 13100