Bridging Interpretability and Optimization: Provably Attribution-Weighted Actor–Critic in Reproducing-Kernel Hilbert Spaces

Na Li; Hangguan Shan; Wei Ni; Wenjie Zhang; Xinyu Li

10 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Reinforcement learning, explainable reinforcement learning, Shapley values, reproducing-kernel Hilbert spaces

TL;DR: This paper is the first to extend interpretability to RKHS-based Actor–Critic methods so that state attributions assist the optimization process, and it establishes a global non-asymptotic convergence bound under state perturbations.

Abstract: Actor–critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use *state attributions* to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose *RKHS-SHAP-based Advanced Actor–Critic (RSA2C)*, an attribution-aware, kernelized, two-timescale AC method comprising an Actor, a Value Critic, and an Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-based components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold expectations and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate the Actor gradients and the Advantage Critic targets. Theoretically, we derive a global, non-asymptotic convergence bound under *state perturbations*, in which the perturbation-error term characterizes stability and the convergence-error term characterizes efficiency. Empirical results on three standard continuous-control environments show that RSA2C achieves efficiency, stability, and interpretability.
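
As a rough illustration of the idea of turning per-dimension state attributions into a Mahalanobis-weighted kernel, the following minimal sketch uses a diagonal metric inside a Gaussian kernel. It is not the authors' implementation: the function names, the absolute-value normalization of attributions, and the scalar Gaussian form are all assumptions made here for illustration, and the abstract does not specify how RSA2C computes its gated weights or its operator-valued kernel.

```python
import numpy as np

def attribution_to_weights(phi, eps=1e-8):
    """Hypothetical mapping: scale |attributions| so the weights sum to
    (approximately) the state dimension. This normalization is an assumption."""
    mag = np.abs(np.asarray(phi, dtype=float))
    return mag.size * mag / (mag.sum() + eps)

def mahalanobis_gaussian_kernel(x, y, weights, bandwidth=1.0):
    """Gaussian kernel with a diagonal Mahalanobis metric diag(weights).
    Dimensions with larger weight contribute more to the squared distance."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    d2 = np.sum(weights * diff ** 2)
    return float(np.exp(-d2 / (2.0 * bandwidth ** 2)))

# Illustrative attributions for a 3-dimensional state (made-up values).
phi = [0.6, 0.0, -0.2]
w = attribution_to_weights(phi)

s0 = np.zeros(3)
k_same = mahalanobis_gaussian_kernel(s0, s0, w)
k_ignored_dim = mahalanobis_gaussian_kernel(s0, [0.0, 5.0, 0.0], w)
k_important_dim = mahalanobis_gaussian_kernel(s0, [1.0, 0.0, 0.0], w)
```

In this sketch, a change along a dimension with zero attribution leaves the kernel value at 1, while the same-sized change along a highly attributed dimension reduces it, which is the sense in which the metric weights state features unequally.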

Primary Area: reinforcement learning

Submission Number: 3528
