Keywords: LLMs, structured JSON, schema adherence, parseability, GRPO, multi-reward reinforcement learning, Group Relative Policy Optimization, schema-faithful generation, JSON validation, LoRA fine-tuning, Qwen models, small language models, open models, biomanufacturing, pharma, regulatory traceability, schema compliance, structured extraction, constrained decoding, auditable outputs, on-prem deployment, adjusted match, adjusted noise, parse success, synthetic data, semantic judge reward, reinforcement learning from feedback, lightweight training, model interpretability, JSON generation
Abstract: We present a method that teaches small and medium-sized language models to generate syntactically valid, schema-conformant JSON without relying on slow grammar-constrained decoding. Using multi-reward reinforcement learning with Group Relative Policy Optimization (GRPO), the models learn to follow structural rules, match keys and values accurately, and stay consistent with human-defined schemas. The approach trains efficiently on limited hardware via LoRA fine-tuning of open Qwen models and produces auditable, parseable outputs suited to real-world use in regulated industries such as biomanufacturing and pharma.
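To make the multi-reward setup concrete, below is a minimal sketch of how per-completion rewards for parse success, schema adherence, and key/value match (the components named in the keywords) could be combined for GRPO-style training. The helper names (`parse_reward`, `schema_reward`, `match_reward`, `total_reward`), the weights, and the penalty for extra keys are illustrative assumptions, not the paper's published formulation; the semantic judge and adjusted-noise rewards are omitted.

```python
import json

def parse_reward(completion: str) -> float:
    """1.0 if the completion is valid JSON, else 0.0 (parse-success reward)."""
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def schema_reward(completion: str, schema_keys: set[str]) -> float:
    """Fraction of required top-level keys present, with a small
    penalty for hallucinated extra keys. Penalty weight is an assumption."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict) or not schema_keys:
        return 0.0
    present = schema_keys & obj.keys()
    extra = obj.keys() - schema_keys
    return max(0.0, (len(present) - 0.5 * len(extra)) / len(schema_keys))

def match_reward(completion: str, reference: dict) -> float:
    """Fraction of reference key/value pairs reproduced exactly."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict) or not reference:
        return 0.0
    hits = sum(1 for k, v in reference.items() if obj.get(k) == v)
    return hits / len(reference)

def total_reward(completion: str, schema_keys: set[str], reference: dict,
                 w_parse: float = 0.3, w_schema: float = 0.3,
                 w_match: float = 0.4) -> float:
    """Weighted sum of the component rewards; in GRPO these scalar scores
    would then be advantage-normalized within each group of sampled
    completions for the same prompt."""
    return (w_parse * parse_reward(completion)
            + w_schema * schema_reward(completion, schema_keys)
            + w_match * match_reward(completion, reference))

# Example: score one sampled completion against a toy schema/reference.
ref = {"batch_id": "B-102", "ph": 7.2}
out = '{"batch_id": "B-102", "ph": 7.2}'
print(total_reward(out, schema_keys=set(ref), reference=ref))  # -> 1.0
```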
Paper Type: Long
Research Area: Hierarchical Structure Prediction, Syntax, and Parsing
Research Area Keywords: quantization; pruning; distillation; parameter-efficient-training; data-efficient training; data augmentation; LLM Efficiency; NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2143