- Keywords: Semantic parsing, Text-to-SQL, Fuzzing, Software Testing
- TL;DR: We develop test suites to approximate the semantic accuracy of Text-to-SQL systems.
- Abstract: We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models, where a predicted query is semantically correct if its denotation is the same as that of the gold query on every possible database. Our method distills a small test suite of databases that achieves high code coverage for the gold query from a large number of randomly generated databases. At evaluation time, it computes the denotation accuracy of the predicted queries on the distilled test suite, thereby efficiently calculating a tight upper bound for semantic accuracy. We generate a distilled test suite for SPIDER, COSQL, and SPARC, and evaluate 21 models submitted to the SPIDER leaderboard. We manually examine 100 predictions where our approach disagrees with the current metric, and verify that our method is always correct. The current metric of SPIDER leads to a 2.5% false negative rate on average and 8.1% in the worst case, indicating that test suite accuracy is needed to better reflect progress in semantic parsing.
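
The evaluation step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes SQLite as the execution engine, a hypothetical one-table schema `people(name, age)`, and a simplified notion of denotation that treats the result as a multiset of rows and ignores row order (queries with `ORDER BY` would need an order-sensitive comparison). The helper names `make_db`, `denotation`, and `passes_test_suite` are invented for illustration.

```python
import sqlite3
from collections import Counter


def make_db(rows):
    """Build an in-memory database with one hypothetical table, people(name, age)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    return conn


def denotation(conn, sql):
    """Return the query result as a multiset of rows (row order is ignored)."""
    return Counter(conn.execute(sql).fetchall())


def passes_test_suite(pred_sql, gold_sql, databases):
    """True only if the prediction and the gold agree on every database in the suite."""
    for conn in databases:
        try:
            pred = denotation(conn, pred_sql)
        except sqlite3.Error:
            return False  # assumption: a prediction that fails to execute counts as wrong
        if pred != denotation(conn, gold_sql):
            return False
    return True


gold = "SELECT name FROM people WHERE age > 30"
pred = "SELECT name FROM people WHERE age >= 30"  # semantically different from gold

db_a = make_db([("Ann", 25), ("Bob", 40)])  # no row with age exactly 30
db_b = make_db([("Cy", 30), ("Dee", 35)])   # boundary row exposes the difference

single_db_result = passes_test_suite(pred, gold, [db_a])   # True: db_a cannot tell them apart
suite_result = passes_test_suite(pred, gold, [db_a, db_b])  # False: db_b distinguishes them
print(single_db_result, suite_result)
```

A prediction that passes on one database may still be wrong, which is why agreement on a suite only bounds semantic accuracy from above. Adding a database that exercises the boundary condition tightens that bound.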