MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This research presents MA-LoT, a novel framework that enhances the Lean4 reasoning ability of LLMs by enabling collaboration between prover models and the Lean executor through Long CoT with formal reasoning.
Abstract: Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proofs or to perform tree search, but they fail to balance these tasks. We propose **MA-LoT**: *Model-CollAboration Lean-based Long Chain-of-Thought*, a comprehensive framework for Lean4 theorem proving that addresses this issue. It separates the cognitive tasks of natural-language whole-proof generation and error analysis for proof correction through a model-collaboration method, achieved via structured interaction between the LLM and the Lean4 verifier within Long CoT. To implement the framework, we propose the novel *LoT-Transfer Learning* training-inference pipeline, which endows LLMs with Long CoT thinking capability without special data annotation. Extensive experiments show that our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset, substantially outperforming the DeepSeek-V3 (33.61%), single-model tree-search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the broader potential of combining Long CoT with formal verification for more insightful generation.
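The collaboration loop described in the abstract can be summarized as pseudocode: a prover model drafts a whole proof in Long CoT, the Lean4 verifier checks it, and a corrector model revises failed attempts from the error log. The sketch below is an illustration under assumed interfaces; the helper functions (`query_prover`, `query_corrector`, `run_lean4`) are hypothetical placeholders, not the authors' actual API.

```python
# Minimal sketch of a prover/verifier/corrector collaboration loop.
# All helpers below are hypothetical placeholders, not the paper's implementation.

from dataclasses import dataclass


@dataclass
class LeanResult:
    success: bool
    error_log: str


def query_prover(statement: str) -> str:
    """Ask the Long-CoT prover model for a whole-proof attempt (placeholder)."""
    raise NotImplementedError


def query_corrector(statement: str, proof: str, error_log: str) -> str:
    """Ask the corrector model to revise a failed proof using Lean feedback (placeholder)."""
    raise NotImplementedError


def run_lean4(statement: str, proof: str) -> LeanResult:
    """Compile the candidate proof with the Lean4 verifier (placeholder)."""
    raise NotImplementedError


def prove(statement: str, max_rounds: int = 4) -> str | None:
    """Whole-proof generation followed by verifier-guided correction rounds."""
    proof = query_prover(statement)
    for _ in range(max_rounds):
        result = run_lean4(statement, proof)
        if result.success:
            return proof
        # Feed the Lean error log back so the corrector can analyze and revise.
        proof = query_corrector(statement, proof, result.error_log)
    return None  # no verified proof within the round budget
```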
Lay Summary: How can we make AI models reason like humans? Not just produce words based on an educated guess learned from training data, but truly ground the reasoning in a concrete foundation. Researchers approach this question by casting mathematical reasoning in verifiable frameworks and using LLMs to solve the problems. Traditional methods either guess the entire proof in one shot, which often produces mistakes that are hard to correct, or tediously check each step, which slows everything down. Our work introduces **MA-LoT**, a new kind of AI framework that learns to do both: plan and double-check the proof like a mathematician. One model sketches the big idea of a proof, and another model revises it based on feedback, just like a student learning from a tutor’s comments. To make this work, we created a new training-inference method called *LoT-Transfer Learning*. It helps the model develop deep thinking skills without needing huge amounts of hand-labeled data. As a result, our framework solved some of the toughest problems from international math contests, more accurately and efficiently than prior approaches. This isn’t just about math. It’s a step toward AI systems that can think carefully, learn from mistakes, and help in fields where correctness really matters, like software verification and education.
Link To Code: https://github.com/RickySkywalker/LeanOfThought-Official.git
Primary Area: Deep Learning->Large Language Models
Keywords: LLM reasoning, math QA, Lean4
Submission Number: 9122