Abstract: Offline Reinforcement Learning (Offline RL) is widely used to optimize task-oriented dialogue policies by training on pre-collected dialogues, which improves efficiency, especially when data is limited. However, traditional offline RL methods struggle to measure experience priority accurately, leading to the loss of valuable data and susceptibility to noisy samples. To address this, this paper proposes the Adjustable Mirror Loss (AMLoss) method, which redefines experience priority by quantifying the real-time incremental contribution of each experience to policy improvement. Specifically, the contribution is computed as the loss difference between the main and delayed Q-networks, with a larger difference indicating a more significant learning contribution and, consequently, a higher sampling priority. By emphasizing experiences that offer greater learning gains and deprioritizing those that are less effective or affected by noise, AMLoss helps retain critical data. Moreover, a Sum Tree structure is introduced for efficient hierarchical storage and weighted sampling of priorities. Experimental results confirm that AMLoss effectively prioritizes important experiences while filtering out noisy ones, leading to optimal performance across various tasks.
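The following is a minimal illustrative sketch (not the authors' implementation) of the two mechanisms the abstract describes: a per-sample priority taken as the gap between the losses of a main and a delayed Q-network, and a sum tree for priority-proportional sampling. All names (priority_from_loss_gap, SumTree), the squared-error loss form, and the discount factor gamma are assumptions made for illustration.

```python
# Hedged sketch of loss-gap priorities and sum-tree sampling; details are assumptions.
import random

import torch


def priority_from_loss_gap(main_q, delayed_q, batch, gamma=0.99):
    """Per-sample priority = |TD loss under main Q - TD loss under delayed Q|."""
    s, a, r, s_next, done = batch  # tensors: states, actions, rewards, next states, done flags
    with torch.no_grad():
        target = r + gamma * (1 - done) * delayed_q(s_next).max(dim=1).values
        q_delayed = delayed_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_main = main_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_main = (q_main - target).pow(2)        # per-sample loss of the main network
    loss_delayed = (q_delayed - target).pow(2)  # per-sample loss of the delayed network
    return (loss_main - loss_delayed).abs().detach()  # larger gap -> higher sampling priority


class SumTree:
    """Binary sum tree for O(log n) priority updates and proportional sampling.

    Capacity is assumed to be a power of two for simplicity.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # leaves hold priorities, internal nodes hold sums

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:  # propagate the new priority up to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self):
        """Draw a leaf index with probability proportional to its priority."""
        u = random.uniform(0.0, self.tree[1])  # tree[1] holds the total priority mass
        pos = 1
        while pos < self.capacity:  # descend until reaching a leaf
            left = 2 * pos
            if u <= self.tree[left]:
                pos = left
            else:
                u -= self.tree[left]
                pos = left + 1
        return pos - self.capacity
```

In this reading, experiences whose loss gap is large (i.e., where the main network still disagrees strongly with the delayed network) are sampled more often, while low-gap or noise-dominated samples are drawn less frequently; the actual AMLoss formulation may differ in how the gap is transformed into a priority.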
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Task-oriented Dialogue System, Dialogue Policy, Offline Reinforcement Learning, Experience Priority
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1053