Surface Mastery, Deep Failures: ATC-QA Benchmark Uncovers Critical Limitations of LLMs in Aviation Safety

Authors: ACL ARR 2025 May Submission 254 Authors

10 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · Readers: Everyone · License: CC BY 4.0
Abstract: We present ATC-QA, a novel benchmark for evaluating large language models (LLMs) in aviation safety applications. Derived from 43,264 qualified Aviation Safety Reporting System (ASRS) reports, our benchmark comprises 47,151 question-answer pairs spanning seven question types and four difficulty levels. Experimental evaluation of nine representative LLMs reveals a striking dichotomy: apparent mastery of classification tasks (up to 95% accuracy) coupled with profound failures in critical capabilities. We identify a pronounced "terminology generation bottleneck" in which even top-performing models achieve only 20% accuracy on fill-in-the-blank questions, a 75-percentage-point drop from their classification performance. Our analysis further uncovers systematic process-result discrepancies in calculation tasks, where models produce correct numerical answers (53-82% accuracy) through fundamentally flawed reasoning processes (8-55% correctness). We also observe counter-intuitive performance patterns across difficulty levels: models often perform better on more complex questions, suggesting fundamental differences between human-perceived and machine-perceived complexity. Architecture choice significantly impacts performance beyond parameter count, with similar-sized models showing up to 4× performance gaps on domain-specific tasks. ATC-QA provides a framework for assessing LLM capabilities in safety-critical environments where domain expertise is essential, highlighting the need for specialized evaluation in high-stakes domains.
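To make the evaluation setup concrete, the sketch below shows one plausible way to score an ATC-QA-style test set by question type and difficulty, as described in the abstract. This is not the authors' released tooling; the file name and field names ("question_type", "difficulty", "answer") are assumptions, since the paper's data format is not specified here, and exact-match scoring is used only as a simple stand-in for the paper's metrics.

```python
# Hedged sketch: aggregate exact-match accuracy by question type and
# difficulty for a benchmark of ATC-QA-style QA items. Field names and
# the JSONL format are assumptions, not the official release schema.
import json
from collections import defaultdict

def load_items(path):
    """Load benchmark items from a JSON Lines file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def accuracy_by_group(items, predictions, key):
    """Exact-match accuracy grouped by a metadata field such as
    'question_type' or 'difficulty'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = item[key]
        total[group] += 1
        if pred.strip().lower() == item["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Hypothetical usage (paths and run_model are placeholders):
# items = load_items("atc_qa_test.jsonl")
# preds = [run_model(item["question"]) for item in items]
# print(accuracy_by_group(items, preds, "question_type"))
# print(accuracy_by_group(items, preds, "difficulty"))
```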
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Benchmarking, Evaluation, NLP Datasets, Evaluation Methodologies, Metrics, Language Resources
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Keywords: Benchmarking, Evaluation, NLP Datasets, Evaluation Methodologies, Metrics, Language Resources
Submission Number: 254