CombiGraph-Vis: A Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mathematical reasoning, Multimodal benchmark, Discrete mathematics benchmark, Olympiad benchmark
TL;DR: We introduce CombiGraph-Vis, a 1,135-problem discrete mathematics benchmark drawn from the Iranian Informatics Olympiad that achieves strong model separation, with accuracy ranging from 16% to 76% across leading AI systems.
Abstract: Progress on math-reasoning benchmarks such as GSM8K and MATH500 has eroded their ability to discriminate among strong systems, motivating harder tests that separate capabilities more sharply. We introduce CombiGraph-Vis, an Olympiad-style benchmark of 1,135 short-answer, multiple-choice, and yes/no problems drawn from the first and second rounds of the Iranian Informatics Olympiad, with 35% of items being multimodal (containing images). The benchmark focuses on discrete mathematics with a computer-science accent: combinatorics, algorithmic techniques, and graph theory, along with probability, discrete and computational geometry, combinatorial game theory, formal languages and automata, conceptual data structures, and logic-driven puzzles. To make the benchmark immediately accessible, we include corrected official solutions, fixed via an agentic pipeline with human oversight, as well as clear, classroom-style rewrites produced with Gemini 2.5 Pro that elaborate on terse reasoning. Our evaluation suite covers standard accuracy across answer formats and includes protocols for test-time scaling and self-verification, spanning model families from Google, OpenAI, and DeepSeek. On single-sample accuracy, models range from 16.15% (gemma-3-4b-it) to 75.82% (gemini-2.5-pro), demonstrating strong separation compared to saturated benchmarks. We also introduce an LLM-assisted scheme for labeling problem-solving techniques, designed to enable quantitative and qualitative comparisons between human and model solutions across topics and difficulty bands. We release all data, corrected solutions, classroom-style rewrites, evaluation code, and synthetic technique labels under an open-source license to facilitate advances in multimodal algorithmic reasoning.
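To illustrate the kind of evaluation protocol the abstract describes (single-sample accuracy and test-time scaling via repeated sampling with a majority vote), the following is a minimal sketch. The function names (`query_model`, `normalize_answer`) and the record fields are hypothetical placeholders assumed for illustration, not the benchmark's released evaluation code.

```python
# Minimal sketch of single-sample and majority-vote (test-time scaling) accuracy.
# `query_model`, `normalize_answer`, and the record fields are hypothetical
# placeholders, not the benchmark's released API.
from collections import Counter
from typing import Callable, Dict, List


def normalize_answer(raw: str) -> str:
    """Canonicalize an answer string (trim, lowercase) before comparison."""
    return raw.strip().lower()


def evaluate(
    problems: List[Dict],               # each: {"prompt": str, "answer": str}
    query_model: Callable[[str], str],  # returns one sampled answer per call
    k: int = 1,                         # samples per problem; k=1 is single-sample
) -> float:
    """Return accuracy over the problem set; k>1 aggregates by majority vote."""
    correct = 0
    for prob in problems:
        samples = [normalize_answer(query_model(prob["prompt"])) for _ in range(k)]
        prediction = Counter(samples).most_common(1)[0][0]  # most frequent answer
        correct += prediction == normalize_answer(prob["answer"])
    return correct / len(problems)


if __name__ == "__main__":
    toy = [{"prompt": "How many edges does K_4 have?", "answer": "6"}]
    print(evaluate(toy, query_model=lambda p: "6", k=5))  # -> 1.0
```

A self-verification protocol of the kind mentioned in the abstract could reuse the same loop, replacing the majority vote with a second model call that accepts or rejects each sampled answer.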
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 22792