Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations
Abstract: Traditional task-oriented dialog (ToD) systems rely heavily on labor-intensive turn-level annotations, such as dialog states and policy labels, for training. This work explores whether large language models (LLMs) can be fine-tuned solely on natural language dialogs to perform ToD tasks, without requiring such annotations. We evaluate their ability to generalize to unseen domains and compare their performance with models trained on fully annotated data. Through extensive experiments with three open-source LLMs of varying sizes and two diverse ToD datasets, we find that models fine-tuned without turn-level annotations generate coherent and contextually appropriate responses. However, their task completion performance, measured by accurate execution of API calls, remains suboptimal, with the best models achieving only around 42% success in unseen domains. To improve task completion, we propose ZeroToD, a framework that incorporates a schema augmentation mechanism to enhance API call accuracy and overall task completion rates, particularly in out-of-domain settings. Through neural activation analysis, we show that augmentation enables models to recognize semantic similarities across domains in lower layers while maintaining domain-specific distinctions in higher layers. We also compare ZeroToD with fine-tuning-free alternatives, such as prompting off-the-shelf LLMs, and find that our framework enables smaller fine-tuned models to outperform large-scale proprietary LLMs in task completion. Additionally, a human study evaluating informativeness, fluency, and task completion confirms our empirical findings. These findings suggest the feasibility of developing cost-effective, scalable, and zero-shot generalizable ToD systems for real-world applications.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: NLP, Transformers, LLaMA, Flan-T5, ToD, Generalization, Neuron Activation Analysis, SGD, KETOD, Data Augmentation, Dialog Systems
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Previous URL: https://openreview.net/forum?id=E4IcDRZK1P
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We received a very generic meta-review; a more critical review would help us improve the work. As for the reviewers, all but one did not respond during the rebuttal phase.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: No
B1 Cite Creators Of Artifacts: N/A
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix H
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix H
D2 Recruitment And Payment: Yes
D2 Elaboration: Appendix H
D3 Data Consent: Yes
D3 Elaboration: Appendix H
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Appendix H
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix H
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 941