Generalized Linear Markov Decision Process

ICLR 2026 Conference Submission 18418 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Structured MDPs, Bellman Completeness, Generalized Linear Models, Offline Reinforcement Learning, Sample Efficiency
TL;DR: We propose a novel framework that integrates generalized linear models (GLMs) into reward modeling while preserving the structured representations of linear MDPs.
Abstract: The linear Markov Decision Process (MDP) provides a principled basis for reinforcement learning (RL) but assumes that both transitions and rewards are linear in the \textit{same} feature space. This severely limits its applicability when rewards are nonlinear or discrete. We introduce the Generalized Linear MDP (GLMDP), which retains linear transitions while modeling rewards with generalized linear models \textbf{under potentially different feature maps}. This separation is crucial: transitions may admit rich representations learned from large unlabeled trajectory datasets, while rewards can be modeled with limited labeled data. We show that GLMDPs are Bellman complete with respect to a new function class, enabling efficient value iteration. Building on this result, we develop algorithms with provable guarantees in both \textbf{offline} and \textbf{online} settings. For offline RL, we design pessimistic and semi-supervised value iteration methods that achieve policy-suboptimality bounds and demonstrate significant label-efficiency gains. For online RL, we propose an optimistic algorithm with a near-optimal regret bound. Together, these results broaden the scope of structured and sample-efficient RL to applications with complex reward structures, such as healthcare and e-commerce.
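To make the separation of feature maps concrete, a minimal sketch of the model described in the abstract is given below; the notation ($\phi$, $\psi$, $\mu_h$, $\theta_h$, and the link function $g$) is illustrative and assumed here, not taken from the submission itself.

$$
P_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle
\qquad \text{(linear transitions in feature map } \phi\text{)},
$$
$$
r_h(s, a) = g\big(\langle \psi(s, a), \theta_h \rangle\big)
\qquad \text{(GLM reward in a possibly different feature map } \psi\text{)}.
$$

Under this reading, taking $g$ to be the identity and $\psi = \phi$ recovers the standard linear MDP, while a nonlinear link (e.g., a logistic $g$ for binary rewards) captures the discrete or nonlinear reward structures the abstract highlights.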
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18418