jqBench: a benchmark for reading and editing JSON from natural language and/or examples

Published: 26 Jan 2026 · Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: JSON, benchmark, code generation, nl-to-code, programming-by-example
TL;DR: We introduce a benchmark for reading and editing JSON data using natural language and/or examples, with a focus on the `jq` tool.
Abstract: We introduce jqBench, a new benchmark for evaluating language models on JSON querying and transformation tasks, where the intent can be specified using natural language and/or examples. While jqBench primarily targets the jq tool, it can also be used to evaluate other programming languages that query and/or transform JSON. Benchmarks are automatically created from two rich sources of data: Stack Overflow discussions (1496 instances with instructions and examples, called jqStack) and the Spider dataset for SQL generation from natural language (859 instances with instructions and JSON Schema, called jqSpider). We describe and analyze the automated pipeline for benchmark creation, and perform extensive baseline experiments across models to analyze task complexity and failure modes. Using implicit feedback, the best model (Opus 4.1) scores 76% on the jqStack benchmarks and 81% on the jqSpider benchmarks. Additionally, we show (1) that access to the documentation surprisingly does not help, (2) that jq lags behind Python, and (3) that automatic feedback (and therefore examples) is crucial. Besides the challenging benchmarks, we release 13K converted but filtered cases for training purposes.
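To make the task format concrete, here is a minimal sketch of the kind of instance the abstract describes. The instruction, input document, and jq filter below are hypothetical illustrations, not drawn from jqStack or jqSpider; the Python function shows the equivalent solution in the comparison language mentioned in the abstract.

```python
import json

# Hypothetical jqStack-style instance (illustration only):
#   instruction: "Return the names of all users older than 30."
#   a jq solution: jq '[.users[] | select(.age > 30) | .name]'
doc = json.loads(
    '{"users": [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 25}]}'
)

def solve(data):
    """Python equivalent of the jq filter above."""
    return [u["name"] for u in data["users"] if u["age"] > 30]

print(solve(doc))  # prints ['Ada']
```

An instance like this can be checked automatically by comparing the program's output against the expected JSON, which is what makes example-based feedback loops (as studied in the paper) possible.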
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14361