Information-Theoretic State Variable Selection for Reinforcement Learning

TMLR Paper 6822 Authors

06 Jan 2026 (modified: 16 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Identifying the most suitable variables to represent the state is a fundamental challenge in Reinforcement Learning (RL). These variables must efficiently capture the information necessary for making optimal decisions. To address this problem, we introduce the Transfer Entropy Redundancy Criterion (TERC), an information-theoretic criterion that determines whether \textit{entropy is transferred} from state variables to actions during training. We define an algorithm based on TERC that provably excludes variables from the state that do not affect the agent's policy during learning. Our approach is policy-dependent, making it agnostic to the underlying learning algorithm. Consequently, we use our method to enhance efficiency across three different algorithm classes (represented by tabular Q-learning, Actor-Critic, and Proximal Policy Optimization (PPO)) in a variety of environments. Furthermore, to highlight the differences between the proposed methodology and current state-of-the-art feature selection approaches, we present a series of controlled experiments on synthetic data before generalizing to real-world decision-making tasks. We also introduce a representation of the problem, based on Bayesian networks, that compactly captures the transfer of information from state variables to actions.
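The following is an illustrative sketch, not the paper's exact TERC procedure, of the kind of selection rule the abstract describes: estimate, from logged (state, action) pairs collected during training, how much information each state variable transfers to the action given the remaining variables, and drop variables whose contribution is negligible. It assumes discrete variables and a simple plug-in (count-based) estimator of conditional mutual information; the helper names `conditional_mutual_information` and `select_state_variables` are hypothetical.

```python
# Illustrative sketch (not the paper's exact algorithm): rank state variables by
# the conditional mutual information I(A ; X_i | X_{-i}) estimated from logged
# (state, action) pairs, and drop variables whose estimate is negligible.
# Assumes discrete variables and a plug-in count-based estimator.
from collections import Counter
from math import log2

def conditional_mutual_information(samples, i):
    """Estimate I(A ; X_i | X_{-i}) from samples of the form (state_tuple, action)."""
    n = len(samples)
    joint = Counter()        # counts of (x_rest, x_i, a)
    rest_xi = Counter()      # counts of (x_rest, x_i)
    rest_a = Counter()       # counts of (x_rest, a)
    rest = Counter()         # counts of x_rest
    for state, action in samples:
        x_i = state[i]
        x_rest = tuple(v for j, v in enumerate(state) if j != i)
        joint[(x_rest, x_i, action)] += 1
        rest_xi[(x_rest, x_i)] += 1
        rest_a[(x_rest, action)] += 1
        rest[x_rest] += 1
    cmi = 0.0
    for (x_rest, x_i, a), c in joint.items():
        p_joint = c / n
        # ratio of p(a | x_rest, x_i) to p(a | x_rest)
        ratio = (c / rest_xi[(x_rest, x_i)]) / (rest_a[(x_rest, a)] / rest[x_rest])
        cmi += p_joint * log2(ratio)
    return cmi

def select_state_variables(samples, num_vars, threshold=1e-3):
    """Keep only variables that transfer non-negligible information to the action."""
    return [i for i in range(num_vars)
            if conditional_mutual_information(samples, i) > threshold]
```

For instance, with `samples = [((0, 1), 0), ((1, 1), 1), ((0, 0), 0), ((1, 0), 1)]` and `num_vars = 2`, only the first variable (which determines the action here) would exceed the threshold. The paper's criterion additionally comes with guarantees under its stated conditions; this sketch only conveys the estimate-and-exclude idea.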
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=6aNZjKVwLn
Changes Since Last Submission: We thank the reviewers and the Action Editor for their constructive feedback. We have carefully addressed all concerns raised in the meta-review and made substantial revisions to the original manuscript. Below we summarise the key changes:

1. **Quantified inference-efficiency benefits.** The meta-review noted that our inference-efficiency claims were previously insufficiently supported. We have now conducted systematic inference-timing experiments in appropriate evaluation settings, measuring wall-clock inference times for policies trained on the full state versus the TERC-selected state (a generic timing sketch is given after this list). The updated results show concrete speed-ups, reaching up to **2.6× on CPU**. These findings are reported in new **Tables 1 and 2**, with corresponding revisions to the **Introduction**, **Results**, and **Conclusion**.
2. **Empirical validation of Condition 1.** Reviewers questioned whether our assumption (Condition 1) is genuinely “weak.” We have added a new Appendix section that empirically evaluates Condition 1 across all environments by computing redundancy across all subset-pair sizes. We also clarify that, even when Condition 1 is not satisfied, TERC still identifies an information-theoretically optimal representation (though it may not be minimal).
3. **Improved notation and definitions.** We have added explicit definitions for all notation in Section 3 (Background), including set-difference notation (e.g., \(\mathcal{X}_{\backslash \mathcal{P}}\)), the meaning of “incomplete subset,” and other previously undefined terms. We also clarify the CPMCR definition and revise the wording of Lemma 1 and Theorems 1–2 to remove redundant structure and improve readability.
4. **Stability analysis.** Following reviewer suggestions, we now include a stability analysis showing that TERC selects the same state variables consistently across random seeds (i.e., selection is stable under re-training).
5. **Structural improvements.** In line with the reviewers’ recommendations, we condensed the Background section, improved the flow of Section 5 (CPMCR), and streamlined the overall presentation to make the manuscript easier to follow.
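As a hedged illustration of the wall-clock comparison mentioned in point 1, the sketch below times a trained policy on full versus reduced observations. The names `policy_full`, `policy_reduced`, `full_observations`, and `reduced_observations` are placeholders, not the paper's code, and the 2.6× figure comes from the paper's own experiments, not from this sketch.

```python
# Hypothetical timing sketch: compare average wall-clock inference time of a
# policy acting on the full state versus the TERC-selected (reduced) state.
import time

def mean_inference_time(policy, observations, repeats=100):
    """Average wall-clock seconds per action over repeated sweeps."""
    start = time.perf_counter()
    for _ in range(repeats):
        for obs in observations:
            policy(obs)
    return (time.perf_counter() - start) / (repeats * len(observations))

# Example usage (placeholders):
# t_full = mean_inference_time(policy_full, full_observations)
# t_reduced = mean_inference_time(policy_reduced, reduced_observations)
# print(f"speed-up: {t_full / t_reduced:.2f}x")
```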
Assigned Action Editor: ~Shaofeng_Zou1
Submission Number: 6822