Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback

Hal Daumé III; John Langford; Amr Sharaf

15 Feb 2018 (modified: 10 Feb 2022) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: We consider reinforcement learning and bandit structured prediction problems in which loss feedback is very sparse, arriving only at the end of an episode. We introduce a novel algorithm, RESIDUAL LOSS PREDICTION (RESLOPE), that solves such problems by automatically learning an internal representation of a denser reward function. RESLOPE operates as a reduction to contextual bandits: it uses its learned loss representation to solve the credit assignment problem and a contextual bandit oracle to trade off exploration and exploitation. RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state-of-the-art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.
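As a rough illustration of the credit assignment idea in the abstract, the sketch below assumes the episode loss is modeled as a sum of per-step losses, so the training target for any one step is the observed episode loss minus the losses currently predicted for the other steps. The function name `residual_targets` and its arguments are hypothetical; this is not taken from the authors' code, and it omits the contextual bandit oracle that handles exploration.

```python
from typing import List


def residual_targets(predicted_step_losses: List[float],
                     episode_loss: float) -> List[float]:
    """Return one residual target per step (illustrative assumption only).

    Assumes an additive model: episode_loss ~ sum of per-step losses.
    The target for step t is the episode loss minus the predictions
    for every step other than t.
    """
    total = sum(predicted_step_losses)
    # total - p is the sum of predictions for all steps except the current one.
    return [episode_loss - (total - p) for p in predicted_step_losses]


if __name__ == "__main__":
    # Three steps with current per-step predictions, one end-of-episode loss.
    print(residual_targets([0.2, 0.5, 0.1], episode_loss=1.0))
```

If the per-step predictions already sum to the episode loss, each target equals its own prediction, so under this assumed additive model a step's estimate only moves when the predictions fail to explain the observed loss.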

TL;DR: We present a novel algorithm for solving reinforcement learning and bandit structured prediction problems with very sparse loss feedback.

Keywords: Reinforcement Learning, Structured Prediction, Contextual Bandits, Learning Reduction

Code: [hal3/reslope](https://github.com/hal3/reslope)
