Quantile-Optimal Policy Learning under Unmeasured Confounding

24 Sept 2024 (modified: 05 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Quantile Treatment Effect, Causal Inference, Offline Contextual Bandit
TL;DR: We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest $\alpha$-th quantile for some $\alpha \in (0, 1)$.
Abstract:

We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-th quantile for some $\alpha \in (0, 1)$. We focus on the offline setting, in which the data-generating process involves unobserved confounders. This problem poses three main challenges: (i) the nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), we use causal inference tools such as instrumental variables and negative controls to estimate the quantile objectives by solving nonlinear functional integral equations. We then adopt a minimax estimation approach with nonparametric models to solve these integral equations and construct conservative policy estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are $\tilde{O}(n^{-1/2})$ quantile-optimal under a mild coverage assumption on the offline dataset. To the best of our knowledge, these are the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy in the presence of unmeasured confounding.
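For concreteness, the objective can be formalized as follows; this is a standard statement consistent with the abstract, and the notation ($\Pi$, $R(\pi)$, $\hat{Q}^{\mathrm{pess}}_\alpha$) is ours rather than the paper's:

$$
Q_\alpha(\pi) \;=\; \inf\bigl\{\, q \in \mathbb{R} \,:\, \mathbb{P}\bigl(R(\pi) \le q\bigr) \ge \alpha \,\bigr\}, \qquad \hat{\pi} \;=\; \operatorname*{arg\,max}_{\pi \in \Pi} \; \hat{Q}^{\mathrm{pess}}_\alpha(\pi),
$$

where $R(\pi)$ denotes the reward obtained under policy $\pi$, $\Pi$ is the policy class, and $\hat{Q}^{\mathrm{pess}}_\alpha(\pi)$ is a conservative (pessimistic) estimate of $Q_\alpha(\pi)$ constructed from the offline data, so that the learned policy maximizes the pessimistic quantile estimate rather than a point estimate.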

Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3357