VERIFY: A Novel Multi-Domain Dataset Grounding LTL in Contextual Natural Language via Provable Intermediate Logic

Published: 26 Jan 2026 · Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: LLM, Linear Temporal Logic, model checking, formal verification
Abstract: Bridging the gap between the formal precision of system specifications and the nuances of human language is critical for reliable engineering, robotics, and AI safety, yet it remains a major bottleneck. Prior efforts to ground formal logic remain fragmented, yielding datasets that are small-scale (~2-5k examples), domain-specific, or that translate logic into overly technical forms rather than context-rich natural language (NL), and thus fail to adequately bridge formal methods and practical NLP. To address this gap, we introduce VERIFY, the first large-scale dataset designed to unify these elements. VERIFY contains more than 200k rigorously generated triplets, each comprising a Linear Temporal Logic (LTL) formula; a structured, human-readable 'Intermediate Technical Language' (ITL) representation designed as a bridge between logic and text; and a domain-specific NL description contextualized across 13 diverse domains. VERIFY's construction pipeline ensures high fidelity: LTL formulas are enumerated and verified via model checking, mapped to the novel ITL representation using a provably complete formal grammar, and then translated into context-aware NL via LLM-driven generation. We guarantee data quality through extensive validation protocols, including manual expert verification of 10,000 diverse samples; automated semantic consistency checks judged by Llama 3.3 estimate semantic correctness at over 97%. Initial experiments demonstrate VERIFY's scalability, logical complexity, and contextual diversity, which significantly challenge standard models such as T5 and Llama 3.
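The LTL-to-ITL step described in the abstract can be pictured as a grammar-directed, recursive rendering of an LTL syntax tree into structured English. The sketch below is purely illustrative and assumes nothing about the paper's actual grammar: the operator phrasings, the tuple-based AST encoding, and the `ltl_to_itl` function are all hypothetical.

```python
# Hypothetical sketch of an LTL -> intermediate-technical-language mapping.
# The operator phrasings and AST encoding are illustrative assumptions,
# not the grammar used in the VERIFY pipeline.

TEMPORAL_OPS = {"G": "always", "F": "eventually", "X": "in the next step"}

def ltl_to_itl(node):
    """Recursively render an LTL AST (nested tuples) as structured English."""
    if isinstance(node, str):            # atomic proposition, e.g. "request"
        return node
    op, *args = node
    if op in TEMPORAL_OPS:               # unary temporal operators G, F, X
        return f"{TEMPORAL_OPS[op]} ({ltl_to_itl(args[0])})"
    if op == "U":                        # binary "until"
        return f"({ltl_to_itl(args[0])}) until ({ltl_to_itl(args[1])})"
    if op == "->":                       # implication
        return f"if ({ltl_to_itl(args[0])}) then ({ltl_to_itl(args[1])})"
    if op == "&":                        # conjunction
        return f"({ltl_to_itl(args[0])}) and ({ltl_to_itl(args[1])})"
    raise ValueError(f"unknown operator: {op}")

# Classic response property: every request is eventually granted.
formula = ("G", ("->", "request", ("F", "grant")))
print(ltl_to_itl(formula))
# → always (if (request) then (eventually (grant)))
```

Because each grammar rule maps one logical operator to one fixed template, a mapping of this shape is total over the operators it covers, which is the flavor of completeness guarantee the abstract attributes to the real ITL grammar.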
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23392