jqBench: a benchmark for reading and editing JSON from natural language and/or examples

ICLR 2026 Conference Submission 14361 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: JSON, benchmark, code generation, nl-to-code, programming-by-example
TL;DR: We introduce a benchmark for reading and editing JSON data using natural language and/or examples, with a focus on the `jq` tool.
Abstract: We introduce jqBench, a new benchmark for evaluating language models on JSON querying and transformation tasks, where the intent can be specified using natural language and/or examples. While jqBench is mainly aimed at the `jq` tool, it can also be used to evaluate other programming languages that query and/or transform JSON. Benchmark instances are automatically created from two rich sources of data: Stack Overflow discussions (751 instances with instructions and examples, called jqStack) and the Spider dataset for SQL generation from natural language (893 instances with instructions and JSON Schema, called jqSpider). We describe and analyze the automated pipeline for benchmark creation, and perform extensive baseline experiments across different models to analyze task complexity and failure modes. Using implicit feedback, the best model (Claude Opus 4.1) scores 77% on jqStack and 81% on jqSpider. Additionally, we show (1) that access to the documentation surprisingly does not help, (2) that `jq` performs comparably to Python, and (3) that automatic feedback (and therefore examples) is crucial. Besides the final benchmarks, we release the intermediate artifacts from each generation step (including failed or invalid conversions) as well as an LLM-friendly version of the documentation, to facilitate further research on JSON querying and transformation.
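To illustrate the kind of task the abstract describes, a hypothetical instance (not drawn from the released benchmarks) could pair a natural-language instruction such as "keep only users older than 30 and return their names" with a small JSON input, expecting a `jq` program like the following sketch:

```sh
# Hypothetical example, not an actual jqBench instance:
# filter an array of user objects and keep only the names of users over 30.
echo '[{"name":"Ada","age":36},{"name":"Bob","age":25}]' \
  | jq '[.[] | select(.age > 30) | .name]'
# Output: ["Ada"]
```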
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14361