Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite

William H English; Chase Walker; Dominic Simon; Sumit Kumar Jha; Rickard Ewetz

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Benchmark, Temporal Logic, Linear Temporal Logic, Verification, Natural Language

TL;DR: We present VLTL-Bench, a benchmark unifying lifting, grounding, translation, and verification for NL-to-LTL. While prior models excel at lifting and translation, our evaluation reveals grounding into concrete state spaces remains a major challenge.

Abstract: Empirical evaluation of state-of-the-art natural language (NL) to temporal logic (TL) translation systems reveals near-perfect performance on existing benchmarks. However, current studies measure only how accurately NL is translated into formal TL and ignore a system’s capacity to ground atomic propositions in new scenarios or environments. This capacity is critical because it is necessary for verifying the resulting formulas in a concrete state space. In this paper, we introduce the Verifiable Linear Temporal Logic Benchmark (VLTL-Bench), a unifying benchmark for automated NL-to-LTL translation. The dataset consists of three unique state spaces and thousands of diverse natural language specifications paired with their corresponding formal temporal logic specifications. The benchmark also contains sample traces against which the temporal logic expressions can be verified. While the benchmark directly supports end-to-end evaluation, we observe that many frameworks decompose the process into i) lifting, ii) grounding, iii) translation, and iv) verification. The benchmark provides ground truths after each of these steps so that researchers can evaluate and improve individual substeps of the overall problem. Using the benchmark, we evaluate several state-of-the-art NL-to-TL translation models and frameworks, including nl2spec, NL2TL, NL2LTL, Lang2LTL, sequence-to-sequence translation, and various LLM prompting techniques. Our evaluation confirms that existing work reliably performs lifting and translation with high accuracy, but it also exposes how these systems struggle to ground the translation into a state space, a difficulty that stems from the lack of existing datasets.
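To make the four-step decomposition concrete, the following is a minimal sketch in Python, not the benchmark's actual data format or tooling. It assumes a tuple-based formula encoding, finite-trace semantics (with a strong "next" operator), and hypothetical names such as `prop_1`, `in_kitchen`, and `holding_cup`, none of which come from the paper. The lifted formula uses placeholder propositions, `ground` substitutes state-space predicates for them, and `holds` checks the grounded formula against a sample trace.

```python
from typing import FrozenSet, List, Mapping, Union

# A formula is either an atomic proposition (str) or a tuple (operator, *operands).
Formula = Union[str, tuple]
# A trace is a finite sequence of states; each state is the set of propositions true in it.
Trace = List[FrozenSet[str]]


def ground(formula: Formula, mapping: Mapping[str, str]) -> Formula:
    """Replace lifted placeholders (e.g. 'prop_1') with grounded predicates."""
    if isinstance(formula, str):
        return mapping.get(formula, formula)
    op, *args = formula
    return (op, *(ground(a, mapping) for a in args))


def holds(formula: Formula, trace: Trace, i: int = 0) -> bool:
    """Evaluate a formula at position i of a non-empty finite trace.

    Assumption: finite-trace semantics, where 'X' is false at the last state.
    """
    if isinstance(formula, str):
        return formula in trace[i]
    op, *args = formula
    if op == "not":
        return not holds(args[0], trace, i)
    if op == "and":
        return holds(args[0], trace, i) and holds(args[1], trace, i)
    if op == "or":
        return holds(args[0], trace, i) or holds(args[1], trace, i)
    if op == "implies":
        return (not holds(args[0], trace, i)) or holds(args[1], trace, i)
    if op == "X":  # next
        return i + 1 < len(trace) and holds(args[0], trace, i + 1)
    if op == "F":  # eventually
        return any(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "G":  # always
        return all(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "U":  # until
        return any(
            holds(args[1], trace, j)
            and all(holds(args[0], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical NL input: "Whenever the robot is in the kitchen, it eventually holds the cup."
# i) Lifting and iii) translation yield a formula over placeholders: G(prop_1 -> F prop_2).
lifted = ("G", ("implies", "prop_1", ("F", "prop_2")))
# ii) Grounding maps placeholders to predicates of a (hypothetical) state space.
mapping = {"prop_1": "in_kitchen", "prop_2": "holding_cup"}
grounded = ground(lifted, mapping)

# iv) Verification against sample traces.
satisfying = [frozenset(), frozenset({"in_kitchen"}), frozenset({"holding_cup"})]
violating = [frozenset({"in_kitchen"}), frozenset()]
print(holds(grounded, satisfying), holds(grounded, violating))
```

In this sketch a grounding error (for example, mapping `prop_2` to the wrong predicate) leaves the lifted formula structurally correct yet changes the verification outcome on the same traces, which is the kind of failure that translation accuracy alone would not reveal.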

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 20605
