Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification

Published: 25 May 2026, Last Modified: 25 May 2026CTB@ICML 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Large Language Models (LLMs) Evaluation, Procedural Reasoning, Operating Systems (OS), Step-Level Verification
TL;DR: SVAC tests LLMs on tracing OS algorithms step-by-step. Models struggle with state tracking, causing error cascades. Notably, Chain-of-Thought harms accuracy while few-shot works best.
Abstract: Large language models (LLMs) are increasingly applied to technical domains requiring structured, multi-step algorithmic reasoning. Yet systematic benchmarks that isolate intermediate reasoning fidelity — rather than merely checking final answers — remain scarce. We introduce SVAC (Step-Level Verification of Algorithm Correctness), a benchmark that evaluates LLMs on their ability to manually trace three canonical Operating System resource-management algorithms: page replacement (FIFO, LRU, OPT, Clock), process scheduling (SJF, Round Robin, MLFQ), and disk scheduling (FCFS, SSTF, SCAN, C-SCAN, LOOK). For each problem instance, models must produce a full execution trace under strict constraints: no code, no external tools, and no step abbreviation. Ground-truth traces generated by deterministic solvers enable fine-grained, step-level scoring across five complexity levels and three prompting strategies (zero-shot, few-shot, chain-of-thought). Our evaluation reveals characteristic failure modes including early error cascades, hit/miss misclassification, and algorithm-specific state corruption that vary systematically across model families, algorithm types, and input complexity. SVAC provides the first unified framework for assessing OS-algorithm reasoning in LLMs, with implications for understanding the depth and limits of procedural reasoning in current foundation models.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133
Loading