A randomized prospective study of a hybrid rule- and data-driven virtual patient

Published: 01 Jan 2024, Last Modified: 16 May 2025, Venue: Nat. Lang. Eng. 2024, License: CC BY-SA 4.0
Abstract: Randomized prospective studies represent the gold standard for experimental design. In this paper, we present a randomized prospective study to validate the benefits of combining rule-based and data-driven natural language understanding methods in a virtual patient dialogue system. The system uses a rule-based pattern matching approach together with a machine learning (ML) approach in the form of a text-based convolutional neural network, combining the two methods with a simple logistic regression model to choose between their predictions for each dialogue turn. In an earlier, retrospective study, the hybrid system yielded a nearly 50% error reduction on our initial data, in part due to the differential performance between the two methods as a function of label frequency. Given these gains, and considering that our hybrid approach is unique among virtual patient systems, we compare the hybrid system to the rule-based system by itself in a randomized prospective study. We evaluate 110 unique medical student subjects interacting with the system over 5,296 conversation turns, to verify whether similar gains are observed in a deployed system. This prospective study broadly confirms the findings from the earlier one but also highlights important deficits in our training data. The hybrid approach still improves over either rule-based or ML approaches individually, even handling unseen classes with some success. However, we observe that live subjects ask more out-of-scope questions than expected. To better handle such questions, we investigate several modifications to the system combination component. These show significant overall accuracy improvements and modest F1 improvements on out-of-scope queries in an offline evaluation. We provide further analysis to characterize the difficulty of the out-of-scope problem that we have identified, as well as to suggest future improvements over the baseline we establish here.
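The per-turn system combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature set (`rule_matched`, `ml_confidence`, `agree`), the weight values, the label names, and the names `TurnPredictions`, `prob_use_rule`, and `choose_label` are all hypothetical, chosen only to show how a logistic regression model might choose between a rule-based prediction and a CNN prediction for one dialogue turn.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnPredictions:
    """Outputs of the two NLU components for a single dialogue turn."""
    rule_label: Optional[str]  # None when no rule pattern matched
    ml_label: str              # label predicted by the ML classifier
    ml_confidence: float       # classifier probability for ml_label, in [0, 1]


# Hypothetical weights for illustration only. A real chooser would be
# fitted on held-out turns where the correct component is known.
WEIGHTS = {
    "bias": -1.0,
    "rule_matched": 3.0,
    "ml_confidence": -4.0,
    "agree": 1.0,
}


def prob_use_rule(p: TurnPredictions) -> float:
    """Logistic regression estimate that the rule-based label should be used."""
    rule_matched = 1.0 if p.rule_label is not None else 0.0
    agree = 1.0 if p.rule_label == p.ml_label else 0.0
    z = (
        WEIGHTS["bias"]
        + WEIGHTS["rule_matched"] * rule_matched
        + WEIGHTS["ml_confidence"] * p.ml_confidence
        + WEIGHTS["agree"] * agree
    )
    return 1.0 / (1.0 + math.exp(-z))


def choose_label(p: TurnPredictions) -> str:
    """Pick one component's prediction for this turn."""
    if p.rule_label is None:
        # Nothing to choose between: fall back to the ML prediction.
        return p.ml_label
    return p.rule_label if prob_use_rule(p) >= 0.5 else p.ml_label


if __name__ == "__main__":
    low_conf = TurnPredictions("chief_complaint", "pain_location", 0.20)
    high_conf = TurnPredictions("chief_complaint", "pain_location", 0.95)
    print(choose_label(low_conf), choose_label(high_conf))
```

With these illustrative weights, the chooser keeps the rule-based label when the classifier is unsure and defers to the classifier when it is confident. This sketch has no option for declining both predictions, so it does not address the out-of-scope questions the abstract identifies as a problem.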