Zero-Shot Cross-Domain Dialogue State Tracking with Small LLMs: Learning to Think through Reinforcement Learning
Abstract: Dialogue State Tracking (DST) is essential for task-oriented dialogue systems to track user goals, but zero-shot adaptation to unseen domains poses significant challenges. This paper proposes an approach to enhancing small LLMs for zero-shot cross-domain DST using reinforcement learning (RL) with verifiable rewards. We introduce two novel techniques: a Dynamic Difficulty Sampling Pipeline, which adaptively selects training examples to improve learning efficiency, and a Difficulty-Weighted Fuzzy Match Reward Function, which provides granular feedback to mitigate sparse rewards and prioritize difficult slots. Using the Group Relative Policy Optimization (GRPO) algorithm, our method strengthens the reasoning capabilities of small LLMs, enabling robust generalization to new domains without further training. Experiments on MultiWOZ 2.1 and 2.4 show that our approach achieves state-of-the-art performance among small models and rivals larger ones while remaining computationally efficient. This work demonstrates the effectiveness of RL-based post-training for compact LLMs, paving the way for scalable, resource-efficient dialogue systems. Our code and model are available at https://anonymous.4open.science/r/DSTRL-769B.
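To make the reward design concrete, the following is a minimal sketch of how a difficulty-weighted fuzzy-match reward over dialogue-state slots could be computed. The function names (`dst_reward`, `fuzzy_ratio`), the per-slot difficulty weights, and the hallucination penalty are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a difficulty-weighted fuzzy-match reward for DST slots.
# The weighting scheme and penalty are illustrative assumptions.
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def dst_reward(predicted: dict, gold: dict, slot_difficulty: dict) -> float:
    """Score a predicted dialogue state against the gold state.

    Each gold slot contributes a fuzzy-match score scaled by its difficulty
    weight, so harder slots dominate the reward; slots predicted but absent
    from the gold state incur a small penalty.
    """
    if not gold:
        return 1.0 if not predicted else -0.1 * len(predicted)

    total_weight = sum(slot_difficulty.get(s, 1.0) for s in gold)
    score = 0.0
    for slot, gold_value in gold.items():
        weight = slot_difficulty.get(slot, 1.0)
        pred_value = predicted.get(slot, "")
        score += weight * fuzzy_ratio(pred_value, gold_value)

    # Penalize hallucinated slots not present in the gold state.
    penalty = 0.1 * sum(1 for s in predicted if s not in gold)
    return score / total_weight - penalty


# Example: the harder "train-leaveat" slot is weighted above "hotel-area",
# and the near-match "9:15" vs. "09:15" still earns partial credit.
gold = {"hotel-area": "centre", "train-leaveat": "09:15"}
pred = {"hotel-area": "centre", "train-leaveat": "9:15"}
weights = {"hotel-area": 1.0, "train-leaveat": 2.0}
print(round(dst_reward(pred, gold, weights), 3))
```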
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Dialogue State Tracking, Zero-shot Cross-domain, Reinforcement Learning, Chain-of-Thought
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Keywords: Dialogue State Tracking, Reinforcement Learning, LLMs
Submission Number: 2266