Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
TL;DR: We propose a new dataset and benchmark for training and evaluating mathematical reasoning in LLMs.
Abstract: Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems.
However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts.
In addition, current benchmarks are prone to contamination, leading to unreliable evaluations.
In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions.
Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in **AoPS-Instruct**, a dataset of more than 600,000 high-quality QA pairs.
Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks.
Moreover, we build an automatic pipeline that produces **LiveAoPSBench**, an evolving, timestamped evaluation set derived from the latest forum data, which serves as a contamination-resistant benchmark for assessing LLM performance.
Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain.
Lay Summary: Most existing LLMs struggle with advanced math problems because there is very little high-quality training data for Olympiad-level questions, and existing benchmarks often include problems the models have already seen during pre-training, making evaluations unreliable. To address this, we built an automated pipeline that mines the Art of Problem Solving forum for genuine competition-level problems and community-provided solutions, then uses open-source LLMs to extract and clean more than 600,000 question-answer pairs, creating the AoPS-Instruct dataset. We also developed LiveAoPSBench, an evolving evaluation set drawn from the latest forum posts, which filters out any overlap with earlier data to avoid contamination. By fine-tuning various LLMs on AoPS-Instruct, we observed marked improvements in their ability to solve challenging math problems. Furthermore, tracking performance over time on LiveAoPSBench revealed that many models perform worse on newer questions, indicating that past successes often stemmed from having seen similar problems during pre-training rather than from genuine reasoning skill. This work offers a scalable way to generate and maintain large, reliable datasets for advanced mathematical reasoning, helping researchers better understand and advance the true capabilities of LLMs in this domain.
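To make the timestamp-based, contamination-resistant idea concrete, here is a minimal illustrative sketch, not the authors' released pipeline; the record fields, cutoff date, and variable names are assumptions for illustration only.

```python
from datetime import date

# Hypothetical AoPS-derived QA records; each carries the forum post date.
qa_pairs = [
    {"question": "...", "answer": "...", "post_date": date(2023, 5, 1)},
    {"question": "...", "answer": "...", "post_date": date(2024, 11, 20)},
]

# Assumed knowledge cutoff of the model under evaluation.
MODEL_CUTOFF = date(2024, 6, 1)

# Keep only problems posted after the cutoff, so the model could not have
# seen them during pre-training (the contamination-resistant principle).
live_eval_set = [qa for qa in qa_pairs if qa["post_date"] > MODEL_CUTOFF]
```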
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/DSL-Lab/aops
Primary Area: Deep Learning->Large Language Models
Keywords: Mathematical Reasoning, Large Language Models, Evaluation
Submission Number: 13108