Abstract: Recent work has sought to optimize task-oriented dialogue systems with reinforcement learning (RL) by building user simulators. However, most of these efforts train the dialogue system with only a single user simulator. In this paper, we propose a framework called MUST that improves the dialogue agent by utilizing multiple user simulators simultaneously, as shown in Figure 1. The proposed MUST raises two core research problems: (1) how to effectively schedule these different simulators during RL training, and (2) what model architecture to use for a user simulator with better generalization capability. To tackle the first problem, we formulate the simulator-selection task for training the system agent as a Multi-Armed Bandit (MAB) problem and adapt an Upper Confidence Bound (UCB) algorithm, UCB1, to guide the selection process. To address the second problem, we present a new user simulator model called U-GPT, based on the Generative Pre-trained Transformer (GPT). Extensive empirical results demonstrate that a dialogue system trained with MUST outperforms those trained with a single user simulator, and that our modified UCB1 algorithm accelerates MUST training. Furthermore, both direct and indirect evaluations show that our GPT-based user simulator outperforms previous learning-based simulators.
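To make the simulator-selection step concrete, the following is a minimal sketch of the standard UCB1 rule applied to choosing among user simulators. This is an illustrative assumption, not the paper's exact modified algorithm: the function names (`ucb1_select`) and the reward bookkeeping are hypothetical, and the reward here is a stand-in for whatever training signal the dialogue agent receives from each simulator.

```python
import math
import random

def ucb1_select(counts, rewards, t):
    """Pick the index of the user simulator with the highest UCB1 score.

    counts[i]  : number of times simulator i has been selected so far
    rewards[i] : cumulative reward observed when training with simulator i
    t          : total number of selection rounds so far
    """
    # Try every simulator once before applying the UCB1 formula.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # UCB1 score: empirical mean reward plus an exploration bonus
    # that shrinks as a simulator is selected more often.
    scores = [
        rewards[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda i: scores[i])

# Toy usage: three simulators with different (hidden) mean rewards.
random.seed(0)
true_means = [0.2, 0.8, 0.5]
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]
for t in range(1, 501):
    i = ucb1_select(counts, rewards, t)
    r = 1.0 if random.random() < true_means[i] else 0.0  # simulated reward
    counts[i] += 1
    rewards[i] += r
```

Over many rounds, the selection concentrates on the simulator yielding the highest average reward while still occasionally exploring the others, which is the balance MUST exploits when deciding which simulator to train against next.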
Paper Type: long