Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks
Keywords: Semantic Parsing, Text-to-Query, Natural Language Interfaces, Large Language Models (LLMs), Benchmarking, Execution-based Evaluation, Enterprise Data Systems, Jira Query Language (JQL), Query Generation, Text-to-SQL
TL;DR: We release Jackal, a 100,000-pair text-to-JQL benchmark with execution-based scoring and a Jira snapshot, evaluating 23 LLMs on execution accuracy, exact match, and canonical exact match.
Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira.
Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping
natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL
benchmark comprising 100,000 natural language (NL) requests paired with validated JQL
queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect
real-world usage, each JQL query is associated with four types of user requests: (i) Long NL,
(ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. Alongside the corpus,
we release an execution-based scoring toolkit and a static snapshot of the evaluated Jira
instance for reproducibility. We report text-to-JQL
results for 23 Large Language Models (LLMs), spanning a range of parameter sizes and both
open- and closed-source models, on execution accuracy, exact match, and canonical exact match. In this
paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best
overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally
across four user request types. Performance varies significantly across user request types:
(i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv)
Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and
executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and
sets a new, execution-based challenge for future research in Jira enterprise data.
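The three reported metrics can be sketched as follows. This is a minimal illustration only: the function names, the canonicalization rules (lowercasing and whitespace collapsing), and the set-based execution comparison are assumptions for exposition, not the benchmark's actual scoring toolkit.

```python
def exact_match(pred: str, gold: str) -> bool:
    # Raw string equality between predicted and gold JQL.
    return pred == gold

def canonicalize(jql: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(jql.lower().split())

def canonical_exact_match(pred: str, gold: str) -> bool:
    # Equality after normalizing superficial formatting differences.
    return canonicalize(pred) == canonicalize(gold)

def execution_accuracy(pred_issues: set[str], gold_issues: set[str]) -> bool:
    # A predicted query is counted correct if executing it against the
    # Jira instance retrieves the same set of issues as the gold query.
    return pred_issues == gold_issues
```

For example, `project = ABC ORDER BY created` and `project = abc order by created` fail exact match but pass canonical exact match, while execution accuracy compares only the retrieved issue sets.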
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21994