Abstract: Effective testing of sequence-to-sequence (seq2seq) models, such as those used in question answering (QA) systems, is essential for ensuring their reliability. Recent efforts have introduced metamorphic testing strategies that detect bugs without requiring ground-truth labels, but the efficiency of these methods remains limited by their lack of test case prioritization: executing all test cases without any ordering wastes resources and slows fault discovery. In this paper, we propose a white-box prioritization framework that ranks test cases based on internal signals extracted from the underlying model. Building on prior work that introduced two white-box techniques (GRI and WALI) for identifying vulnerable tokens, we adapt these techniques to the task of test prioritization. Instead of generating new test inputs, our methods analyze test cases produced by QAQA and prioritize those most likely to uncover faults. We evaluate our approaches on three widely used QA datasets: BoolQ, NarrativeQA, and SQuAD2. Experimental results show that GRI significantly improves the rate of bug detection under constrained testing budgets, while WALI performs comparably to baseline methods. Our findings demonstrate the value of incorporating white-box insights into the prioritization process, offering a more efficient and effective way to test QA systems.
External IDs: dblp:conf/qrs/ShaoDYZS25