Aligning Agent Policies with Preferences: Human-Centered Interpretable Reinforcement Learning

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: reinforcement learning, interpretability, explainability, transparency, ai agents, interpretable reinforcement learning, learning from human feedback
Abstract: AI agents are increasingly developed for high-stakes decision-making in domains such as finance and education. These decisions are captured by a policy, which defines the agent's behavior across situations and contexts. A natural choice for training these policies is reinforcement learning (RL), but achieving strong performance in such complex settings typically requires representing policies with expressive function approximators. While effective, these representations are often not interpretable, hindering our ability to understand and collaborate with the resulting agents. Many desirable attributes of an interpretable policy, such as simplicity or alignment with institutional values, require human feedback. Yet existing methods typically collect such feedback only after training is complete, missing the opportunity to \textit{inform} the learning process itself. Consequently, an unaddressed challenge in interpretable RL is enabling AI agents to integrate preference feedback into policy generation. To address this gap, we propose PASTEL, a novel framework that aligns interpretable policies with human feedback during training. Our framework interleaves preference learning with an evolutionary algorithm, using updated preference estimates to guide the generation of better-aligned policies and using newly generated policies to query users and refine the preference model. Evolutionary algorithms enable exploration of the full space of policies; however, it is intractable to maintain separate preference estimates, such as win rates or utility values, for each individual policy in this infinite space. To handle this challenge, we propose to represent policies as feature vectors consisting of a finite set of meaningful attributes. For example, among a set of policies with similar performance, some may be more intuitive or more amenable to human intervention. To maximize the value of each user query, we employ a novel filtering technique that avoids presenting policies dominated in all dimensions, since repeated selection of a clearly superior policy provides little information. We validate our method with experiments on decision-tree-structured policies, which are widely considered interpretable, using synthetic preference data in two RL environments: CartPole and PotholeWorld. PASTEL produces substantially more preference-aligned decision-tree policies than both VIPER and RDPS in both environments; it also requires fewer preference queries to produce such policies and is more robust to preference noise. By bridging the gap between training RL agents and evaluating their explanations, we believe our work opens new avenues for developing more interpretable, user-centered RL systems.
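To illustrate two ideas described in the abstract, the sketch below shows (1) summarizing each candidate policy by a finite vector of meaningful attributes and (2) filtering out candidates that are Pareto-dominated in every dimension before posing a preference query. This is a minimal illustration under our own assumptions, not the authors' implementation; the helper names `policy_features` and `non_dominated`, and the example attributes, are hypothetical.

```python
import numpy as np

def policy_features(policy) -> np.ndarray:
    """Map a policy to a finite attribute vector, e.g. [task return, -tree depth],
    with signs chosen so that larger values are better.
    Placeholder for illustration only; the paper's actual attributes may differ."""
    raise NotImplementedError

def non_dominated(feature_matrix: np.ndarray) -> np.ndarray:
    """Return a boolean mask over rows that are NOT Pareto-dominated.

    Row i is dominated if some row j is >= it in every dimension and strictly >
    in at least one; asking a user to compare against such a candidate yields
    little preference information, so dominated candidates are dropped."""
    n = feature_matrix.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (np.all(feature_matrix[j] >= feature_matrix[i])
                    and np.any(feature_matrix[j] > feature_matrix[i])):
                keep[i] = False
                break
    return keep

# Example: three candidate policies described by (return, -tree depth).
features = np.array([
    [200.0, -4.0],   # high return, shallow tree
    [180.0, -9.0],   # dominated: worse on both axes, so never shown to the user
    [150.0, -2.0],   # lower return but simpler tree, kept as a genuine trade-off
])
print(non_dominated(features))  # [ True False  True]
```

Only the surviving candidates would then be paired up for user queries, so every comparison presents a genuine trade-off between attributes.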
Submission Number: 22