Keywords: reinforcement learning, agent policy alignment, sequence modeling
TL;DR: We offer a new perspective on agent policy alignment and introduce Categorical Decision Mamba, a reinforcement learning method based on sequence modeling.
Abstract: Recently, sequence modeling methods have been applied to off-policy reinforcement learning. One notable example is Decision Mamba, which incorporates the Mamba block into a Decision-Transformer-type neural network architecture. In this work, we take this recent sequential decision-making model as our starting point and build on its strengths. We propose a theoretical measure of how well the agent's policy aligns with that of a human expert, the Expected Agent Alignment Error (EA2E). Furthermore, we provide a complete theoretical proof that reducing the Wasserstein-1 distance between the distributions of the present model (agent) and the target model (agent) effectively aligns the agent's policy with that of the potential expert. Building on these theoretical results, we propose Categorical Decision Mamba (CDMamba), which is derived from Decision Mamba (DMamba). The core improvements of CDMamba are to use histograms of categorical distributions as inputs to the Mamba model and to minimize the Wasserstein-1 distance between distributions, which ultimately yields a trained model with an aligned policy and superior performance.
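As a rough illustration of the quantity the abstract refers to, the Wasserstein-1 distance between two categorical distributions defined on the same evenly spaced bins can be computed as the bin width times the sum of absolute differences between their cumulative distributions. The sketch below is not the authors' implementation: the function name `wasserstein1_categorical`, the `bin_width` parameter, and the assumption of a shared, evenly spaced support are all hypothetical choices made for this example.

```python
import numpy as np


def wasserstein1_categorical(p, q, bin_width=1.0):
    """Wasserstein-1 distance between two histograms on the same bins.

    Hypothetical helper for illustration only. Assumes both histograms
    share one evenly spaced support with spacing `bin_width`.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("histograms must have the same number of bins")

    # Normalize so each histogram sums to 1 (a valid categorical distribution).
    p = p / p.sum()
    q = q / q.sum()

    # In one dimension, W1 equals the integral of |CDF_p - CDF_q|,
    # which for evenly spaced bins is a sum scaled by the bin width.
    cdf_gap = np.cumsum(p - q)
    return float(bin_width * np.abs(cdf_gap).sum())


# All mass moves two bins to the right, so the distance is 2 bin widths.
d_far = wasserstein1_categorical([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])

# Identical histograms are at distance 0.
d_same = wasserstein1_categorical([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])
```

Because the cumulative sum is piecewise linear in the histogram entries, this expression could in principle serve as a training loss in an autodiff framework, though how CDMamba actually defines and minimizes the distance is specified in the paper, not here.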
Submission Number: 29