Learning What Matters: Dynamic Experience Prioritization for Task-Oriented Dialogue Policy via Stage-aware Experience Management
Abstract: Experience replay plays a pivotal role in enhancing sample efficiency for reinforcement learning-based dialogue policy optimization. However, traditional random sampling and static heuristic strategies fail to dynamically exploit critical experiences as policy learning progresses through its stages, resulting in inefficient sampling and noise propagation. To address this issue, this paper presents a dynamic Stage-aware Experience Management (SEM) framework that establishes a quantitative mapping between policy learning stages and experience states to adjust replay priorities adaptively. The framework adopts a four-state experience paradigm to characterize the stages of policy learning and to provide a quantitative basis for experience management decisions. Moreover, a dual Q-network structure monitors loss discrepancies and trends in real time, classifying each experience as stable, forgotten, unmastered, or noisy. Benefiting from this dynamic stage-aware mechanism, SEM prioritizes replaying forgotten and unmastered experiences to strengthen weak links while suppressing noisy samples to reduce interference. Experiments on four public dialogue datasets verify the effectiveness and generalizability of SEM in dynamic priority management.
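The abstract does not spell out the classification rule, but it implies a decision based on the loss discrepancy between the two Q-networks and the loss trend across replays. The minimal Python sketch below illustrates one plausible reading of that mechanism; the thresholds LOW_LOSS and DISAGREE, the priority weights, and the function classify_experience are all hypothetical illustrations, not taken from the paper.

    # Hypothetical thresholds; the paper's actual values and formulas are not given here.
    LOW_LOSS = 0.05   # below this, an experience counts as mastered
    DISAGREE = 0.5    # dual-Q loss discrepancy above this flags likely noise

    def classify_experience(loss_a, loss_b, prev_loss):
        """Classify one replayed transition by dual-Q losses and loss trend.

        loss_a, loss_b: TD losses of the two Q-networks on this transition.
        prev_loss: mean loss recorded the last time the transition was replayed.
        Returns one of: 'stable', 'forgotten', 'unmastered', 'noisy'.
        """
        mean_loss = 0.5 * (loss_a + loss_b)
        if abs(loss_a - loss_b) > DISAGREE:
            return 'noisy'        # the two critics disagree -> likely label noise
        if mean_loss < LOW_LOSS:
            return 'stable'       # consistently low loss -> already mastered
        if prev_loss < LOW_LOSS:
            return 'forgotten'    # was mastered, loss rose again -> replay it
        return 'unmastered'       # persistently high loss -> keep practicing

    # Hypothetical priority weights: boost forgotten/unmastered, suppress noise.
    PRIORITY = {'forgotten': 2.0, 'unmastered': 1.5, 'stable': 0.5, 'noisy': 0.1}

Under such a scheme, replay sampling probabilities would be proportional to the state weights, so forgotten and unmastered transitions are drawn more often while noisy ones are largely screened out.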
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Task-oriented Dialogue System, Dialogue Policy, Off-policy Reinforcement Learning, Sampling Efficiency, Experience Priority
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: Task-oriented Dialogue System, Dialogue Policy, Off-policy Reinforcement Learning, Sampling Efficiency, Experience Priority
Submission Number: 2497