Keywords: LLM for Network Systems, Dynamic Benchmark
Abstract: As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to their static design, exhibit high statistical variance from limited dataset sizes, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalize across applications, effectively addressing the challenges that the diversity of network tasks poses for dynamic benchmarking. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves statistical reliability among LLM agents (confidence interval overlap reduced from 85% to 0%), (2) agents achieve only 13–38% average performance (as low as 3%) on large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as supervised fine-tuning (SFT) and reinforcement learning (RL) on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md
Primary Area: datasets and benchmarks
Submission Number: 13584