Abstract: We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on generating complete specification code for operating system kernel verification tasks. The benchmark first casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs must understand the provided verification assumptions and the syntax and semantics space to be searched, and then generate a complete specification for a potentially buggy operating system code implementation, guided by a high-level functional description of the operating system. The benchmark is built on a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each a long-context task of about 30,000 tokens. Our comprehensive evaluation of 10 LLMs shows that current LLMs achieve only limited performance on specification generation tasks for operating system verification. It also reveals significant disparities in their performance on the benchmark, differentiating their abilities on long-context code generation tasks.
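To give a rough sense of the target output (a sketch of my own, not an example drawn from the benchmark itself): Hyperkernel's state-machine specifications are written in Python over Z3, so a generated specification is a Python function that relates old and new kernel state. The syscall name, the `ofile` file-descriptor table, and the success condition below are all hypothetical.

```python
# Illustrative, assumed sketch of a Hyperkernel-style specification;
# not taken from OSVBench. Requires the z3-solver package.
from z3 import And, Array, BitVec, BitVecSort, BitVecVal, If, Select, Store

FD = BitVecSort(64)    # file-descriptor index sort (assumed)
FILE = BitVecSort(64)  # file-object identifier sort (assumed)

def spec_sys_dup(ofile_old, fd_src, fd_dst):
    """Hypothetical spec for a 'dup'-like syscall: it succeeds iff fd_src
    is open and fd_dst is free; on failure the state is unchanged."""
    src_open = Select(ofile_old, fd_src) != BitVecVal(0, 64)
    dst_free = Select(ofile_old, fd_dst) == BitVecVal(0, 64)
    ok = And(src_open, dst_free)
    # New state: fd_dst maps to fd_src's file object on success.
    updated = Store(ofile_old, fd_dst, Select(ofile_old, fd_src))
    return If(ok, updated, ofile_old)

# A verifier would then check that the (potentially buggy) kernel
# implementation refines the specified state transition.
ofile = Array('ofile', FD, FILE)
fd_a, fd_b = BitVec('fd_a', 64), BitVec('fd_b', 64)
ofile_new = spec_sys_dup(ofile, fd_a, fd_b)
```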
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, NLP Applications
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7748