Keywords: Visual Reasoning, Hierarchical Benchmark
Abstract: Despite the rapid advancement of Multimodal Large Language Models (MLLMs), their reasoning capabilities are often constrained by perceptual fragility and a lack of transparent logical derivation. This frequently leads to cascaded failures, in which minor perceptual inaccuracies propagate through the reasoning chain. We propose FLUSH-Gen, a novel automated rule-based generation framework that ensures rigorous logical consistency by decoupling visual synthesis from visual attributes. Leveraging this framework, we introduce FLUSHPuzzle, a hierarchical benchmark of 20,000 instances spanning 30 perception primitives and 200 reasoning subclasses. Unlike existing benchmarks, each of our 20,000 samples is paired with a verifiable reasoning trace explicitly mapped to low-level visual elements, enabling fine-grained diagnostic evaluation. Our experiments demonstrate that fine-tuning 8B-parameter models on the FLUSHPuzzle training set yields significant performance gains: an absolute accuracy improvement of 15.8%, making them competitive with proprietary models such as Gemini 3 Pro.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation: benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2507