DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual Systems

ACL ARR 2026 January Submission10123 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0

Keywords: Dialog Policy, Dual Systems, Task Oriented Dialog, Contextual Multi-Armed Bandit, Exploration Exploitation Trade-off

Abstract: Task-oriented dialog systems often rely on static exploration strategies that do not adapt to the dialog context as it evolves, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space $\mathcal{C}$ capturing dialog progression, user uncertainty, and slot dependency. DyBBT uses a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference module (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art success rate, efficiency, and generalization, and human evaluations confirm that its decisions are well aligned with expert judgment. The code is available at https://anonymous.4open.science/r/DyBBT-C6B7 (Anonymous GitHub).
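
To make the switching idea concrete, the following is a minimal sketch of how a count-based meta-controller might route a turn to System 1 or System 2. It is not the authors' implementation: the abstract only states that the switch depends on cognitive states and visitation counts. The class name `MetaController`, the three-component state in [0, 1], the bucketing into `n_bins`, the bonus `1 / sqrt(n + 1)`, and the threshold value are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical cognitive state: (dialog progression, user uncertainty,
# slot dependency), each assumed to be normalized to [0, 1].
CognitiveState = Tuple[float, float, float]


class MetaController:
    """Illustrative count-based switch between a fast and a slow policy."""

    def __init__(self, bonus_threshold: float = 0.5, n_bins: int = 5) -> None:
        self.bonus_threshold = bonus_threshold  # assumed hyperparameter
        self.n_bins = n_bins                    # assumed discretization
        self.counts: Dict[Tuple[int, ...], int] = defaultdict(int)

    def _bucket(self, state: CognitiveState) -> Tuple[int, ...]:
        # Clip each component to [0, 1] and map it to a discrete bin so
        # that visitation counts can be kept per region of the space.
        return tuple(
            min(self.n_bins - 1, int(max(0.0, min(1.0, x)) * self.n_bins))
            for x in state
        )

    def bonus(self, state: CognitiveState) -> float:
        # Novelty bonus that shrinks as the region is visited more often.
        n = self.counts[self._bucket(state)]
        return 1.0 / math.sqrt(n + 1)

    def choose(self, state: CognitiveState) -> str:
        # Rarely visited regions go to the slow reasoner (System 2);
        # familiar regions go to the fast policy (System 1).
        use_slow = self.bonus(state) > self.bonus_threshold
        self.counts[self._bucket(state)] += 1
        return "system2" if use_slow else "system1"
```

Under these assumptions, a state that falls into an unvisited bin is sent to System 2, and repeated visits to the same bin lower the bonus until the controller falls back to System 1.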

Paper Type: Long

Research Area: Dialogue and Interactive Systems

Research Area Keywords: Dialogue and Interactive Systems, Machine Learning for NLP

Contribution Types: NLP engineering experiment, Approaches to low-resource settings

Languages Studied: English

Submission Number: 10123
