An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

Emil Javurek; Valentyn Melnychuk; Jonas Schweisthal; Konstantin Hess; Dennis Frauen; Stefan Feuerriegel

An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

Emil Javurek, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Dennis Frauen, Stefan Feuerriegel

Published: 26 Jan 2026, Last Modified: 13 Feb 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Causal Machine Learning, Doubly Robust Estimation, Neyman-Orthogonality, Markov Decision Process

Abstract: Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential out- comes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we yield a novel meta-learner called DRQ-learner and establish that it is: (1) doubly robust (i.e., valid inference under model misspecification), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) achieves quasi-oracle efficiency (i.e., behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.

Primary Area: causal reasoning

Submission Number: 7404

Loading