Abstract: While Text-to-SQL systems achieve high accuracy, existing efficiency metrics like the Valid Efficiency Score (VES) prioritize execution time—a metric we prove is fundamentally decoupled from consumption-based cloud billing. This paper evaluates the cost trade-offs between Reasoning and Non-Reasoning Large Language Models across 180 query executions on Google BigQuery using the 230GB StackOverflow dataset. Our analysis reveals that: (1) Reasoning models process 44.5% fewer bytes than Non-Reasoning counterparts while maintaining equivalent correctness (96.7%–100%); (2) execution time correlates weakly with query cost (r = 0.16), indicating that speed optimization does not imply cost efficiency; and (3) Non-Reasoning models exhibit extreme cost variance of up to 3.4×, producing outliers exceeding 36GB per query due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines to mitigate financial risks in cost-sensitive enterprise environments.