Keywords: Offline RL; Inverse Transition Learning; Expert Feedback; Constrained RL
TL;DR: We use expert demonstrations to derive constraints in a gradient-free manner and use them to infer the dynamics of the environment as well as high-performing, interpretable policies.
Abstract: Offline reinforcement learning is commonly used for sequential decision-making in domains such as healthcare and education, where the rewards are known and the transition dynamics T must be estimated from batch data. A key challenge across these tasks is learning a reliable estimate of the transition dynamics T that produces policies which are near-optimal, safe enough that they never take actions far from the best action with respect to their value functions, and informative enough that they communicate their uncertainties. Using an expert's feedback, we propose a new constraint-based, gradient-free approach that captures these desiderata for reliably learning a posterior distribution over the transition dynamics T. Our results demonstrate that by using our constraints, we learn a high-performing policy while considerably reducing the policy's variance over different datasets. We also show how combining uncertainty estimation with these constraints lets us infer a partial ranking of actions by their returns, yielding safer and more informative policies for planning.
Submission Number: 37