Zero-Shot Cross-Domain Dialogue State Tracking with Small LLMs: Learning to Think through Reinforcement Learning
Abstract: Dialogue State Tracking (DST) is essential for task-oriented dialogue systems to track user goals, but zero-shot adaptation to unseen domains poses significant challenges. This paper proposes an approach to enhancing small LLMs for zero-shot cross-domain DST using reinforcement learning (RL) with verifiable rewards. We introduce two novel techniques: a Dynamic Difficulty Sampling Pipeline, which adaptively selects training examples to improve learning efficiency, and a Difficulty-Weighted Fuzzy Match Reward Function, which provides granular feedback to mitigate sparse rewards and prioritize difficult slots. Using the Group Relative Policy Optimization (GRPO) algorithm, our method strengthens the reasoning capabilities of small LLMs, enabling robust generalization to new domains without further training. Experiments on MultiWOZ 2.1 and 2.4 show that our approach achieves state-of-the-art performance among small models and rivals larger ones while remaining computationally efficient. This work demonstrates the effectiveness of RL-based post-training for compact LLMs, paving the way for scalable, resource-efficient dialogue systems. Our code and model are available at https://anonymous.4open.science/r/DSTRL-769B.
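To make the reward design concrete, the following is a minimal sketch of how a difficulty-weighted fuzzy-match reward over dialogue-state slots could be computed. The function names (`dst_reward`, `fuzzy_ratio`), the per-slot difficulty weights, and the hallucination penalty are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a difficulty-weighted fuzzy-match reward for DST slots.
# The weighting scheme and penalty are illustrative assumptions.
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def dst_reward(predicted: dict, gold: dict, slot_difficulty: dict) -> float:
    """Score a predicted dialogue state against the gold state.

    Each gold slot contributes a fuzzy-match score scaled by its difficulty
    weight, so harder slots dominate the reward; slots predicted but absent
    from the gold state incur a small penalty.
    """
    if not gold:
        return 1.0 if not predicted else -0.1 * len(predicted)

    total_weight = sum(slot_difficulty.get(s, 1.0) for s in gold)
    score = 0.0
    for slot, gold_value in gold.items():
        weight = slot_difficulty.get(slot, 1.0)
        pred_value = predicted.get(slot, "")
        score += weight * fuzzy_ratio(pred_value, gold_value)

    # Penalize hallucinated slots not present in the gold state.
    penalty = 0.1 * sum(1 for s in predicted if s not in gold)
    return score / total_weight - penalty


# Example: the harder "train-leaveat" slot is weighted above "hotel-area",
# and the near-match "9:15" vs. "09:15" still earns partial credit.
gold = {"hotel-area": "centre", "train-leaveat": "09:15"}
pred = {"hotel-area": "centre", "train-leaveat": "9:15"}
weights = {"hotel-area": 1.0, "train-leaveat": 2.0}
print(round(dst_reward(pred, gold, weights), 3))
```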
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Dialogue State Tracking, Zero-shot Cross-domain, Reinforcement Learning, Chain-of-Thought
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Keywords: Dialogue State Tracking, Reinforcement Learning, LLMs
Submission Number: 2266