Abstract: Cellular networks, e.g., 4G/5G, rely on complex technical specifications to ensure correct functionality; however, these specifications often contain flaws or ambiguities. In this paper, we investigate the application of Large Language Models (LLMs) to automated cellular network specification refinement. We identify Change Requests, which record specification revisions, as a key source of domain-specific data and formulate specification refinement as three complementary sub-tasks. We introduce CR-Eval, a benchmark of 200 security-related test cases, and evaluate 17 open-source and 14 proprietary models. The best-performing model, GPT-o3-mini, identifies weaknesses in more than 127 of the 200 test cases within five trials. We further study LLM specialization, showing that a fine-tuned 8B model can outperform advanced LLMs such as DeepSeek-R1 and Qwen3-235B. Evaluations on 30 real-world cellular attacks demonstrate the practical value of this approach and the challenges that remain. The codebase and benchmark are available at https://github.com/jianshuod/CR-Eval.