Optimizing Large Language Models for Robust Domain-Specific Text-to-SQL: From Prompting to Preference Alignment
Track: Scientific Track
Keywords: Text-to-SQL, Large Language Models, RLAIF, ORPO, Preference Alignment, Prompt Engineering, Constrained Decoding, Robustness
TL;DR: We provide a reproducible pipeline for domain-specific Text-to-SQL, demonstrating that monolithic alignment via ORPO avoids the catastrophic collapse of PPO while offering a low-latency, single-pass alternative to complex agentic workflows.
Abstract: This work explores the optimization of Large Language Models (LLMs) for the task of generating SQL queries from natural language (NL2SQL), a critical capability for democratizing access to domain-specific data. While recent benchmarks show promising results for LLMs, deployment in real-world analytical processing requires strict adherence to SQL grammar, deep domain understanding, and robustness against out-of-scope queries. We present a comprehensive study evaluating three stages of optimization: (1) advanced prompting strategies, including Chain-of-Thought and multi-turn conversational handling; (2) constrained decoding to enforce syntactic validity; and (3) Reinforcement Learning from AI Feedback (RLAIF). We specifically compare Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO) using a novel reward modeling approach based on execution and semantic principles.
Our results reveal that while standard PPO suffers from reward sparsity and catastrophic collapse on 7B models, monolithic alignment via ORPO scales efficiently to 20B-parameter models. This provides a stable alternative to expensive inference-time scaling: a highly reproducible, single-pass pipeline for adapting open-weights models to complex data environments that also serves as a low-latency alternative to agentic systems.
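The abstract mentions a reward model based on execution principles but does not specify it. As a rough illustration only, the sketch below shows one way an execution-based reward signal could be computed with Python's standard `sqlite3` module. The function name `execution_reward`, the three reward values, and the toy `orders` schema are all assumptions made for this example and are not taken from the paper.

```python
import sqlite3


def execution_reward(schema_sql, predicted_sql, gold_sql):
    """Hypothetical execution-based reward (not the authors' reward model).

    Returns  1.0 if the predicted query returns the same rows as the gold query,
             0.0 if it runs but returns different rows,
            -1.0 if it fails to execute.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build a throwaway in-memory database from the schema script.
        conn.executescript(schema_sql)
        # Run the reference query first so its result is unaffected by the prediction.
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(predicted_sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            # Syntax errors, unknown columns, multi-statement input, etc.
            return -1.0
        # Compare as order-insensitive multisets of rows.
        same = sorted(gold_rows, key=repr) == sorted(pred_rows, key=repr)
        return 1.0 if same else 0.0
    finally:
        conn.close()


# Toy schema and data, invented for this example.
SCHEMA = """
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.5);
"""

if __name__ == "__main__":
    gold = "SELECT id FROM orders WHERE amount > 20"
    print(execution_reward(SCHEMA, "SELECT id FROM orders WHERE amount > 20.0", gold))
    print(execution_reward(SCHEMA, "SELECT id FROM orders", gold))
    print(execution_reward(SCHEMA, "SELEC id FROM orders", gold))
```

A signal of this kind is sparse, since most malformed queries receive the same penalty, which is consistent with the reward-sparsity issue the abstract reports for PPO. Preference-based methods such as DPO and ORPO could instead use such scores to rank candidate queries into chosen and rejected pairs, though whether the authors do so is not stated here.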
Submission Number: 11