CombiGraph-Vis: A Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mathematical reasoning, Multimodal benchmark, Discrete mathematics benchmark, Olympiad benchmark
TL;DR: We introduce CombiGraph-Vis, a 1,135-problem discrete mathematics benchmark drawn from the Iranian Informatics Olympiad that achieves strong model separation, with accuracy ranging from 16% to 76% across leading AI systems.
Abstract: Progress on math-reasoning benchmarks such as GSM8K and MATH500 has eroded their ability to discriminate among strong systems, motivating harder tests that separate capabilities more sharply. We introduce CombiGraph-Vis, an Olympiad-style benchmark of 1,135 short-answer, multiple-choice, and yes/no problems drawn from the first and second rounds of the Iranian Informatics Olympiad, with 35% of items being multimodal (containing images). The benchmark focuses on discrete mathematics with a computer-science accent: combinatorics, algorithmic techniques, and graph theory, along with probability, discrete and computational geometry, combinatorial game theory, formal languages and automata, conceptual data structures, and logic-driven puzzles. To make the benchmark immediately accessible, we include corrected official solutions, fixed via an agentic pipeline with human oversight, as well as clear, classroom-style rewrites produced with Gemini 2.5 Pro that elaborate on terse reasoning. Our evaluation suite covers standard accuracy across answer formats and includes protocols for test-time scaling and self-verification, spanning model families from Google, OpenAI, and DeepSeek. On single-sample accuracy, models range from 16.15% (gemma-3-4b-it) to 75.82% (gemini-2.5-pro), demonstrating strong separation compared to saturated benchmarks. We also introduce an LLM-assisted scheme for labeling problem-solving techniques, designed to enable quantitative and qualitative comparisons between human and model solutions across topics and difficulty bands. We release all data, corrected solutions, classroom-style rewrites, evaluation code, and synthetic technique labels under an open-source license to facilitate advances in multimodal algorithmic reasoning.
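To illustrate the kind of evaluation protocol the abstract describes (single-sample accuracy and test-time scaling via repeated sampling with a majority vote), the following is a minimal sketch. The function names (`query_model`, `normalize_answer`) and the record fields are hypothetical placeholders assumed for illustration, not the benchmark's released evaluation code.

```python
# Minimal sketch of single-sample and majority-vote (test-time scaling) accuracy.
# `query_model`, `normalize_answer`, and the record fields are hypothetical
# placeholders, not the benchmark's released API.
from collections import Counter
from typing import Callable, Dict, List


def normalize_answer(raw: str) -> str:
    """Canonicalize an answer string (trim, lowercase) before comparison."""
    return raw.strip().lower()


def evaluate(
    problems: List[Dict],               # each: {"prompt": str, "answer": str}
    query_model: Callable[[str], str],  # returns one sampled answer per call
    k: int = 1,                         # samples per problem; k=1 is single-sample
) -> float:
    """Return accuracy over the problem set; k>1 aggregates by majority vote."""
    correct = 0
    for prob in problems:
        samples = [normalize_answer(query_model(prob["prompt"])) for _ in range(k)]
        prediction = Counter(samples).most_common(1)[0][0]  # most frequent answer
        correct += prediction == normalize_answer(prob["answer"])
    return correct / len(problems)


if __name__ == "__main__":
    toy = [{"prompt": "How many edges does K_4 have?", "answer": "6"}]
    print(evaluate(toy, query_model=lambda p: "6", k=5))  # -> 1.0
```

A self-verification protocol of the kind mentioned in the abstract could reuse the same loop, replacing the majority vote with a second model call that accepts or rejects each sampled answer.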
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 22792