Keywords: Visual Reasoning, Hierarchical Benchmark
Abstract: Despite the rapid advancement of Multimodal Large Language Models (MLLMs), their reasoning capabilities are often constrained by perceptual fragility and a lack of transparent logical derivation. This frequently leads to cascaded failures, in which minor perceptual inaccuracies propagate through the reasoning chain. We propose FLUSH-Gen, a novel automated rule-based generation framework that ensures rigorous logical consistency by decoupling visual synthesis from visual attributes. Leveraging this framework, we introduce FLUSHPuzzle, a hierarchical benchmark of 20,000 instances spanning 30 perception primitives and 200 reasoning subclasses. Unlike existing benchmarks, each of our 20,000 samples is paired with a verifiable reasoning trace explicitly mapped to low-level visual elements, enabling fine-grained diagnostic evaluation. Our experiments demonstrate that fine-tuning 8B-parameter models on the FLUSHPuzzle training set yields significant performance gains: an absolute accuracy improvement of 15.8%, making them competitive with proprietary models such as Gemini 3 Pro.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation: benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2507