PSMBench: A Benchmark and Dataset for Evaluating LLMs Extraction of Protocol State Machines from RFC Specifications

Zilin Shen; Xinyu Luo; Imtiaz Karim; Elisa Bertino

Published: 18 Sept 2025, Last Modified: 23 Jan 2026

Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)

License: CC BY 4.0

Keywords: Benchmarking, Dataset, Large Language Model, Network Protocols, Protocol State Machine Extraction

TL;DR: We propose a benchmark and dataset for evaluating how well LLMs extract state machines from network protocol standards, curate the dataset, and benchmark several open and closed LLMs.

Abstract: Accurately extracting protocol state machines (PSMs) from the long, densely written Request-for-Comments (RFC) standards that govern Internet-scale communication remains a bottleneck for automated security analysis and protocol testing. In this paper, we introduce RFC2PSM, the first large-scale dataset that pairs 1,580 pages of cleaned RFC text with 108 manually validated states and 297 transitions, covering 14 widely deployed protocols that span the data-link, transport, session, and application layers. Built on this corpus, we propose PsmBench, a benchmark that (i) feeds chunked RFC text to an LLM, (ii) prompts the model to emit a machine-readable PSM, and (iii) scores the output with structure-aware, semantic fuzzy-matching metrics that reward partially correct graphs. A comprehensive baseline study of nine state-of-the-art open and commercial LLMs reveals a persistent state–transition gap: models identify many individual states (up to $0.82$ F1) but struggle to assemble coherent transition graphs ($\leq 0.38$ F1), highlighting challenges in long-context reasoning, alias resolution, and action/event disambiguation. We release the dataset, evaluation code, and all model outputs as open source, providing a fully reproducible starting point for future work on reasoning over technical prose and generating executable graph structures. RFC2PSM and PsmBench aim to catalyze cross-disciplinary progress toward LLMs that can interpret and verify the protocols that keep the Internet safe.
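To make step (iii) concrete, the following is a minimal sketch of how a fuzzy-matching F1 over states and transitions could be computed. It is not the benchmark's actual metric: the function names (`similarity`, `transition_similarity`, `fuzzy_f1`), the 0.8 threshold, the greedy one-to-one matching, and the use of `difflib.SequenceMatcher` as a stand-in for semantic similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a semantic matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def transition_similarity(a, b):
    """Score two (source, event, target) triples by their weakest component."""
    return min(similarity(x, y) for x, y in zip(a, b))


def fuzzy_f1(predicted, reference, sim=similarity, threshold=0.8):
    """Greedily match each prediction to at most one reference item, then compute F1."""
    unmatched = list(reference)
    matched = 0
    for p in predicted:
        # Pick the closest reference item that has not been matched yet.
        best = max(unmatched, key=lambda r: sim(p, r), default=None)
        if best is not None and sim(p, best) >= threshold:
            unmatched.remove(best)
            matched += 1
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical TCP-like example: the alias "SYN-SENT" should match "SYN_SENT".
pred_states = ["CLOSED", "SYN-SENT", "ESTABLISHED", "BOGUS"]
ref_states = ["CLOSED", "SYN_SENT", "ESTABLISHED", "LISTEN"]
state_f1 = fuzzy_f1(pred_states, ref_states)

pred_transitions = [("CLOSED", "active open", "SYN-SENT")]
ref_transitions = [
    ("CLOSED", "active open", "SYN_SENT"),
    ("SYN_SENT", "rcv SYN,ACK", "ESTABLISHED"),
]
transition_f1 = fuzzy_f1(pred_transitions, ref_transitions, sim=transition_similarity)

print(f"state F1 = {state_f1:.2f}, transition F1 = {transition_f1:.2f}")
```

Scoring a transition by its weakest component illustrates why transition F1 can trail state F1: a triple counts only if the source, the event, and the target all match.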

Croissant File: json

Dataset URL: https://huggingface.co/datasets/zilinlin/RFC2PSM

Code URL: https://github.com/Zilinlin/RFC_PSM_Benchmark

Supplementary Material: zip

Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling

Submission Number: 2060
