WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
Keywords: large language models, mathematical reasoning, wireless communications, reinforcement learning, benchmark datasets
TL;DR: We train 0.5B-7B models with GRPO on WirelessMathBench-XL (4,027 wireless-math problems), reaching 39.5% accuracy (near GPT-4o's 40.4%) with only 7B parameters while unexpectedly gaining 8.4 points on general math benchmarks.
Abstract: Large language models excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance.
We present \textbf{WirelessMathBench-XL}, the first training-scale benchmark for wireless mathematics, comprising 4,027 expert-validated problems from 970 state-of-the-art papers.
To validate dataset quality, we train \textbf{WirelessMathLM}, a family of models (0.5B, 3B, 7B), with Group Relative Policy Optimization (GRPO) using binary verification rewards. WirelessMathLM-7B achieves 39.5\% accuracy, approaching GPT-4o (40.4\%) while using approximately 100 times fewer parameters than DeepSeek-R1 (671B, 57.4\%), with substantial gains over the corresponding base models at every scale (0.5B: +11\%, 3B: +103\%, 7B: +81\%).
Our controlled experiments reveal three findings that challenge prevailing assumptions about domain specialization. First, verification-based reinforcement learning outperforms supervised fine-tuning by +23\% relative (25.1\% vs 20.4\%) on identical data, showing that exploration against deterministic verifiers enables learning beyond labeled examples. Second, specialized training strengthens foundational capabilities: models gain +8.4 points across five general mathematical benchmarks and improve on knowledge, reasoning, and programming tasks without regression. Third, performance improvements distribute uniformly across 20 wireless subdomains regardless of training prevalence, demonstrating generalizable principle learning rather than pattern memorization.
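To make the training signal concrete, the sketch below illustrates one plausible form of the binary verification reward and the group-relative advantage that GRPO computes over a group of sampled completions. The sympy-based symbolic equivalence check and the helper names (binary_reward, grpo_advantages) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): binary verification reward + GRPO group-relative advantage.
from statistics import mean, pstdev
import sympy

def binary_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 iff the model's final expression is symbolically equal to the reference (assumed verifier)."""
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference_answer))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward by the mean and std of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: a group of 4 sampled completions for one problem, two verified correct.
rewards = [binary_reward("2*x", "x + x"), 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

Because the reward is deterministic and binary, correct completions in a group receive positive advantages and incorrect ones negative, which is the exploration signal the abstract contrasts with supervised fine-tuning.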
Primary Area: datasets and benchmarks
Submission Number: 23248