\documentclass{article}

\usepackage{microtype}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumitem}
\usepackage{multirow}

\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{icml2026}

\icmltitlerunning{Candidate Exposure Before Lean}

\begin{document}

\twocolumn[
  \icmltitle{Before Lean Checks: Candidate Exposure in Proof-Action Ranking}

  \begin{icmlauthorlist}
  \icmlauthor{Anonymous Authors}{anon}
  \end{icmlauthorlist}

  \icmlaffiliation{anon}{Anonymous Institution}
  \icmlcorrespondingauthor{Anonymous Author}{anon.email@domain.com}
  \icmlkeywords{AI for mathematics, theorem proving, Lean, proof-state representation, verification}

  \vskip 0.3in
]

\printAffiliationsAndNotice{}

\begin{abstract}
Formal proof agents do not fail only when they cannot finish a theorem; they can fail earlier, when ranking hides the tactics Lean would have accepted. This paper studies that pre-check handoff: which proof-state representations and family priors keep useful candidate tactics inside the short list sent to Lean? We separate inputs visible before the next action from future-tactic metadata, then compare unguided retrieval, hard family routing and a soft family prior on a curated mathlib4 subset. Trace matches suggest that soft family guidance is competitive with unguided retrieval and that hard routing loses alternatives. A 500-state Lean check gives the more consequential picture: unguided retrieval has the strongest top-five Lean acceptance, while soft guidance is close but not better. The result reframes family prediction as a ranking-control problem: useful proof agents should surface several legal next moves for Lean, not merely imitate one recorded trace step.
\end{abstract}

\section{Introduction}

Formal proof assistants make feedback for proof agents concrete: a proposed action can be checked by Lean~\cite{lean4}, Coq or Isabelle rather than judged only by surface plausibility. This verifier-backed setting underlies CoqGym, LeanDojo/ReProver and recent formal-mathematics language-model systems~\cite{coqgym,leandojo,gptf,formalstatementcurriculum}. It matters for mathematical agents because Lean can provide feedback only on actions that the learned system actually sends for checking.

Many proof-agent systems focus on model scale, search width, premise retrieval or tactic generation~\cite{tactictoe,deepguidance,rlcop,hypertree}. The proof state itself often enters as whatever text or metadata the environment happens to expose. In a verifier-backed loop, that choice is consequential. If a representation hides useful hypotheses, available premises or goal shape, the system may never place a verifiable action near the top. If a representation overfits to incidental names and formatting, it may learn surface regularities that fail on new theorems.

We call this handoff \emph{candidate exposure}: whether a learned ranking places Lean-acceptable actions inside the short list that will be checked. Classification accuracy asks whether a representation predicts a coarse action type; similarity retrieval asks whether a candidate resembles a past proof step; exposure asks whether those signals deliver useful actions to Lean.

This framing matters beyond the small family-prior method used here. A proof agent that never exposes legal alternatives receives no accepted branches and no Lean feedback for repairing local identifiers or tactic arguments, so exposure decides whether verifier feedback can reach the next round of modeling.

A Lean proof state contains the target proposition, local hypotheses, types, dependencies, notation, library references and syntactic structure. Treating its pretty-printed text as a neutral input risks confusing proof-relevant signal with formatting, local names or incidental notation. Replacing the state with handcrafted summaries may remove useful context. The representation choice is therefore part of the proof-search design.

There is also a sharper availability rule. Before the next tactic is chosen, a prover may see the current goal, local context, theorem name, file path, imports and premises retrieved from the library. The prover has not yet seen the premises that the next tactic will actually use, and it has not seen the syntax tree of that future tactic. Those fields are easy to extract from a replayed trace, but using them as input changes the task into hindsight prediction. We therefore keep them separate throughout the paper.

We make this question concrete through tactic-family prediction. Instead of immediately generating a complete tactic string, the learner predicts a coarse family such as \texttt{intro}, \texttt{rw}, \texttt{simp}, \texttt{exact}, \texttt{apply}, \texttt{cases} or \texttt{constructor}. This is the level at which many proof agents make their first routing decision: introduce variables, rewrite, simplify, split a goal, apply a theorem or close by assumption. We then use an Exposure-Preserving Family Prior (EPFP) as a small test of how family information changes the list sent to Lean. EPFP softly reorders retrieved candidate tactics while keeping off-family alternatives available for checking.

The study has four parts. First, it treats candidate exposure as the handoff between ranking and Lean checking. Second, it separates state-visible representations from future-tactic premises and tactic AST metadata. Third, it uses EPFP as a compact family-prior example: soft guidance changes rankings without discarding off-family candidates, whereas hard family gates can hide plausible tactics. Finally, it checks the ranked candidates in Lean on held-out intermediate states and compares that result with trace and family matches.

The experiments follow the handoff itself: choose a state representation, use family predictions to reorder retrieved tactics, and then check the resulting lists in Lean. Figure~\ref{fig:candidate_path} summarizes this path and the fields kept out of the main inputs.

\begin{figure*}[t]
\centering
\includegraphics[width=0.9\textwidth]{figures/candidate_path.pdf}
\caption{Candidate path before Lean checking. The main experiments use only state-visible inputs; future-tactic premise and AST fields are kept as hindsight-only comparisons.}
\label{fig:candidate_path}
\end{figure*}

\section{Related Work}

\paragraph{Learning-guided proof search.}
Learning-guided theorem proving has long connected proof states to action ranking. TacticToe learns tactic-level guidance for HOL4 and uses it inside Monte Carlo tree search~\cite{tactictoe}. Deep network-guided proof search showed that learned guidance can improve first-order automated theorem proving~\cite{deepguidance}. Reinforcement-learning systems such as rlCoP learn from previous proof attempts to guide large-scale proof search~\cite{rlcop}. GamePad studies proof-step prediction and proof-state evaluation in Coq~\cite{gamepad}. These systems demonstrate the value of learned search guidance; the present paper focuses on the representation-to-exposure step before the search policy consumes it.

\paragraph{Lean environments and theorem-proving datasets.}
Proof-assistant datasets and environments provide the substrate for this kind of study. CoqGym provides a Coq proof environment and ASTactic model for generating Coq tactics as abstract syntax trees~\cite{coqgym}. HOList introduced a higher-order theorem-proving environment for machine learning over HOL Light~\cite{holist}, while miniF2F emphasizes cross-system formal mathematical problem solving~\cite{minif2f}. Lean's mathematical library makes large-scale Lean studies possible~\cite{mathlib}. Lean-based systems show that proof traces can improve language-model theorem proving~\cite{pact}, and LeanDojo/ReProver provides tools for retrieval-augmented theorem proving, premise selection and interaction with Lean proof states~\cite{leandojo}.

\paragraph{Language models and premise selection.}
Language-model theorem provers such as GPT-f frame formal proof search as action generation against a verifier~\cite{gptf}; expert-iteration and curriculum approaches use proof search itself to generate new training signal~\cite{formalstatementcurriculum}; and HyperTree Proof Search connects neural proposal models with online search across formal environments~\cite{hypertree}. Draft, Sketch, and Prove uses informal proof sketches to guide formal proving~\cite{draftsketchprove}. Recent Lean 4 work also studies statement autoformalization and theorem retrieval, including ATLAS~\cite{atlas}, Mathlib4 semantic search~\cite{mathlibsemanticsearch} and LeanSearch v2~\cite{leansearchv2}. Premise selection is a closely related challenge: DeepMath studies neural premise selection for large formal corpora~\cite{deepmath}, while LeanDojo/ReProver combines premise retrieval with Lean interaction~\cite{leandojo}. Our representation protocol follows the same availability principle: retrieved premises may be used only when they are selected from information available before the next tactic, whereas annotated premises from the target tactic are future information.

\paragraph{Trace supervision and verifier feedback.}
Learning-to-rank methods such as RankNet optimize ranking functions from supervised signals~\cite{ranknet}. In proof-action ranking, the supervised signal is often an exact match to the traced tactic, but the verifier accepts a wider set of valid continuations. The same distinction applies to exploration in verifier-backed agents: a ranking method can be evaluated by both offline trace matches and the candidates Lean accepts.

Our study narrows this line of work to the last ranking step before a tactic reaches Lean, where trace matching and direct Lean checking can be compared on the same candidate lists.

\section{Problem Setup}

\paragraph{Verifier-backed proof-action ranking.}
A proof assistant gives a sharp feedback signal. A proof-action ranking method proposes a short list of candidates, and Lean determines which, if any, are valid continuations. A learned system rarely searches over all possible tactic strings. It first maps a proof state \(s\) to a representation \(\phi(s)\), scores a candidate set \(C(s)\), and only then sends a few candidates to the verifier. Representation is therefore part of the search policy: it helps determine which proof information can affect the actions Lean sees.

This motivates candidate exposure. For a ranked candidate list, exposure asks whether a Lean-accepted tactic appears among the top \(k\) candidates. Large LeanDojo traces record the tactic that appeared in the proof while omitting other alternatives Lean might have accepted. We keep trace matches, family matches and Lean acceptance as separate measurements. This connects representation learning to verifier feedback: Lean can accept an action only after it appears in the checked list.
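
One way to write this measurement down, offered as a sketch rather than as the exact definition behind the tables: let \(S\) be the set of held-out states, let \(\pi_m(s)_{1:k}\) be the first \(k\) entries of the list that strategy \(m\) produces for state \(s\), and let \(\mathrm{Lean}(s,a)\in\{0,1\}\) indicate whether Lean accepts tactic \(a\) in \(s\). These symbols are illustrative notation introduced only here.
\[
  \mathrm{Exp}@k(s;m)=\max_{a\in\pi_m(s)_{1:k}}\mathrm{Lean}(s,a),
  \qquad
  \mathrm{Accept}@k(m)=\frac{1}{|S|}\sum_{s\in S}\mathrm{Exp}@k(s;m).
\]
Replacing \(\mathrm{Lean}(s,a)\) with an exact-string or same-family indicator against the recorded tactic gives the corresponding trace-match and family-match rates, which is why the three measurements can be computed on the same lists and still disagree.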

A trace match is a narrow event. A proof may record \texttt{rw [h]}, while \texttt{simpa [h]} or an \texttt{exact} proof term may also be legal in the same state. The reverse can happen after transfer: a candidate can look textually close to the trace but fail because a local identifier, implicit argument or type-class instance is reconstructed differently. For this reason, the execution section checks the same ranked lists in Lean rather than treating a trace miss as a useless candidate.

In the retrieval study, an action \(a\) has its own stored training state \(s_a\) and family \(f(a)\). Unguided retrieval uses a similarity score \(\mathrm{sim}(\phi(s),\phi(s_a))\). EPFP adds a family prior:
\[
  r_\lambda(s,a)=\mathrm{sim}(\phi(s),\phi(s_a))
  +\lambda\,p_\theta(f(a)\mid\phi(s)),
\]
where \(p_\theta\) is the tactic-family model and \(\lambda\) controls how much family information can reorder candidates. EPFP keeps every candidate rankable by similarity; family scores only perturb the order. Hard family guidance is a lexicographic gate: candidates from the top predicted family are ranked before candidates from lower predicted families, regardless of small similarity differences. True-family guidance is included as a ceiling.
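
For contrast, the hard gate can be sketched as a lexicographic sort key. The form below is an assumption about one natural implementation of the single-family gate, not a specification of the released code. With \(\hat y(s)=\arg\max_{y}p_\theta(y\mid\phi(s))\),
\[
  r_{\mathrm{hard}}(s,a)=\bigl(\mathbf{1}[f(a)=\hat y(s)],\;
  \mathrm{sim}(\phi(s),\phi(s_a))\bigr),
\]
with candidates sorted in decreasing lexicographic order. Similarity only breaks ties inside each block, so no similarity gap can lift an off-family candidate above an on-family one. A variant that orders all families by predicted probability would replace the indicator with the negated rank of \(f(a)\) under \(p_\theta\). In EPFP, by comparison, the family term lies in \([0,\lambda]\) for \(\lambda\ge 0\), so a sufficiently large similarity gap always wins.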

\paragraph{Why this layer matters for AI4Math.}
Self-improving mathematical agents need a reliable feedback loop: propose actions, verify them, learn from successes and failures, and improve future rankings. Candidate exposure is the point where this loop can start or stall. Rankings that bring plausible actions forward give the agent accepted actions, rejected actions and Lean error messages to learn from. Measuring the representation-to-ranking layer tells system builders whether a failure happens before Lean has a chance to judge the action.

\section{Task and Dataset}

\paragraph{Proof-step examples.}
The unit of analysis is a checked Lean proof step immediately before a tactic is applied. For each LeanDojo theorem file, the dataset records the source location, theorem name, step index, main goal, local context and optional metadata:
\[
  x_i=(\mathrm{file},\mathrm{theorem},\mathrm{step},\mathrm{goal},
  \mathrm{context},\ldots).
\]
The target is the family of the next tactic,
\(y_i=\mathrm{family}(\mathrm{next\_tactic}_i)\). The full tactic string is retained for next-action retrieval and execution checks.

\paragraph{Why tactic families.}
Full tactic prediction is a structured-generation problem involving theorem names, rewrite lists, local identifiers and nested proof terms. Tactic families expose the first routing decision in that problem: does the represented state contain enough visible signal to choose the kind of next proof move? The abstraction is still nontrivial because similar state text may support \texttt{rfl}, \texttt{rw}, \texttt{simp}, \texttt{ring} or \texttt{exact}, depending on local context and available facts.

\paragraph{Controlled mathlib4 subset.}
The main dataset is mixed enough to stress first-stage routing while staying small enough to inspect. It combines the earlier S3 arithmetic, mixed-library and proof-dense subset with four S4 dense-core modules chosen by source-level screening and small-batch trace checks. The resulting collection has 3,723 proof steps across 46 traced files and 81 tactic-family labels. The label distribution is challenging: \texttt{rw}, \texttt{simp}, \texttt{exact} and \texttt{simpa} dominate, leaving a long tail of rare families. An S4-E extension adds two dense-core modules and is kept in the appendix.

\paragraph{Theorem-level splits.}
All reported splits are by theorem name, not by individual proof step. This prevents adjacent steps from the same proof from appearing in both training and test sets, which would leak local identifiers, intermediate facts and proof style. The protocol is stricter than random step-level splitting and better matches the intended use case: ranking actions for theorems not seen during training.
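
Stated in symbols (illustrative notation, not used elsewhere in the paper): let \(t(i)\) be the theorem name of example \(i\) and let the set of theorem names be partitioned into disjoint training, validation and test parts.
\[
  \mathcal{T}=\mathcal{T}_{\mathrm{train}}\sqcup\mathcal{T}_{\mathrm{val}}\sqcup\mathcal{T}_{\mathrm{test}},
  \qquad
  \mathcal{D}_{\mathrm{part}}=\{\,i : t(i)\in\mathcal{T}_{\mathrm{part}}\,\}.
\]
Every step of a given theorem then lands in exactly one part, whereas a step-level split would partition the indices \(i\) directly and could place steps \(i\) and \(i+1\) of the same proof on opposite sides.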

\paragraph{State-visible premise information.}
Premise features are deployable only if they are obtained at the point where a prover could obtain them. Retrieved-premise features are built by searching the training corpus with the current proof state and then appending the returned premise names. They are allowed because the retriever uses only information available before the next action. Future-premise features come from the recorded tactic itself; they answer a different question, namely how much family signal would be present if premise selection and tactic-argument discovery were already solved. This distinction is syntactically small but interpretively large.
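
The difference can be made explicit with notation introduced only for this illustration: let \(R(s;\mathcal{D}_{\mathrm{train}})\) be the premise names the retriever returns for state \(s\), let \(a^\star\) be the recorded next tactic with annotated premises \(\mathrm{prem}(a^\star)\), and let \(\oplus\) denote concatenation of text fields.
\[
  \phi_{\mathrm{retr}}(s)=\phi(s)\oplus R(s;\mathcal{D}_{\mathrm{train}}),
  \qquad
  \phi_{\mathrm{fut}}(s,a^\star)=\phi(s)\oplus\mathrm{prem}(a^\star).
\]
Only the second map takes \(a^\star\) as an argument, which is the precise sense in which it is a hindsight input.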

\section{Method}

\paragraph{Overview.}
The method has two layers. First, we materialize several representations of the same checked proof states and train lightweight tactic-family predictors under by-theorem splits. Second, we attach the learned family distribution to retrieval through EPFP, which is designed to preserve candidate exposure rather than gate candidates away. This ordering separates two questions: classification measures whether a representation carries family information, while EPFP asks whether that information helps Lean see useful actions.
EPFP is deliberately small: learned family information affects priority while keeping plausible alternatives visible to the verifier. Comparing EPFP with unguided retrieval, hard family gates and true-family guidance shows when a family signal helps the ranking and when a discrete routing decision blocks it.

\paragraph{Representations.}
We compare proof-state views that differ in what is available before the next tactic is known. Raw and normalized text preserve the pretty-printed goal and local context as exposed by Lean, with normalized text removing selected names, numbers and whitespace variation. Structured summaries keep shallow goal features such as goal shape, hypothesis count, token counts, connective counts, arithmetic-operator counts and coarse head symbols. State-metadata views add file, module and theorem identifiers. Retrieved-premise views add premise names retrieved from the training corpus using the current proof state; they never read the held-out step's gold tactic premises. Future-premise views use the next tactic's annotated premises and tactic AST metadata and are reported only as a ceiling.

The state-metadata view is intentionally modest. File and theorem names can encode domain, notation and local proof idiom, but they also tempt memorization of nearby library style. We use theorem-level splits as the main guard and keep file-level, duplicate and grouped-family supporting checks in the artifact.

\paragraph{Leakage check.}
LeanDojo can annotate a tactic with the premises it actually used and with the syntax tree of that tactic. Those fields become available only after the next tactic is known. All deployable representations therefore exclude \texttt{premises}, \texttt{annotated\_tactic}, \texttt{ast\_summary}, \texttt{state\_after} and \texttt{next\_tactic}. The representation previously called premise-aware is renamed future-premise and is kept out of the main conclusions.

\paragraph{Models.}
We use small classifiers: majority class, a keyword heuristic, bag-of-words Naive Bayes, TF-IDF logistic regression and TF-IDF linear SVM. This keeps the representation comparison readable: each proof-state view is evaluated with the same model families, splits and targets.

\paragraph{Evaluation protocol.}
All splits are by theorem name, so proof steps from the same theorem never appear in both training and test sets. For tactic-family prediction, accuracy reflects common families, macro-F1 exposes rare-family weakness, and top-\(k\) accuracy measures whether the correct family remains visible beyond the first guess.
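
For reference, the standard forms of these metrics are sketched below. The symbols are illustrative: \(\mathcal{Y}\) is the set of evaluated families, \(P_y\) and \(R_y\) are per-family precision and recall on the test split, and \(\mathrm{top}_k(\cdot)\) returns the \(k\) most probable families. Whether \(\mathcal{Y}\) ranges over all labels or only those present in a given test split is an assumption left open here.
\[
  \mathrm{MacroF1}=\frac{1}{|\mathcal{Y}|}\sum_{y\in\mathcal{Y}}\frac{2P_yR_y}{P_y+R_y},
  \qquad
  \mathrm{Top}@k=\frac{1}{n}\sum_{i=1}^{n}
  \mathbf{1}\bigl[y_i\in\mathrm{top}_k\,p_\theta(\cdot\mid\phi(s_i))\bigr],
\]
with the F1 term conventionally set to zero when \(P_y+R_y=0\). Because every family contributes equally to the macro average, a rare family with a handful of test steps weighs as much as \texttt{rw}, which is why macro-F1 can stay low while accuracy looks reasonable.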

We then ask how that information changes the actions that Lean would see. For each held-out proof state, training-set tactics form the candidate set and unguided retrieval ranks them by TF-IDF similarity. EPFP adds the learned family prior from the setup above; hard family guidance puts all candidates from the single predicted family ahead of all others; top-\(m\) and reciprocal-rank-fusion variants provide less brittle family-aware baselines; true-family guidance gives a ceiling. EPFP weights are selected on a validation split. The ranking tables track trace matches, family matches and the mean rank of the first true-family candidate.
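
The reciprocal-rank-fusion variant is not spelled out above. Its commonly used form is sketched here as an assumption about the variant, not as a description of the exact implementation. Let \(\rho_{\mathrm{sim}}(a)\) be the rank of \(a\) under similarity alone, let \(\rho_{\mathrm{fam}}(a)\) be its rank when candidates are sorted by \(p_\theta(f(a)\mid\phi(s))\), and let \(\kappa>0\) be a smoothing constant (all hypothetical symbols).
\[
  r_{\mathrm{RRF}}(s,a)=\frac{1}{\kappa+\rho_{\mathrm{sim}}(a)}
  +\frac{1}{\kappa+\rho_{\mathrm{fam}}(a)}.
\]
Each term is positive and at most \(1/(\kappa+1)\), so a poor family rank lowers a candidate's score without removing it from the list, in contrast to the hard gate.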

We keep three measurements separate. Trace and family matches ask whether a retrieved candidate equals the recorded next tactic or shares its family. Lean applicability asks whether Lean accepts a transferred candidate as a legal one-step tactic in a reconstructed intermediate state. Full LeanDojo proof search would execute tactics in the original interactive environment across multiple search steps; here we focus on the one-step check.

The behavior is direct. When the top predicted family \(\hat y\) is wrong, a hard gate can fill the top-\(k\) list with candidates from that wrong family and hide a nearby candidate that Lean might accept. EPFP avoids this all-or-nothing behavior because every candidate remains rankable by similarity and family information only changes priority. For two candidates \(a,b\), EPFP ranks \(a\) above \(b\) when
\[
  \Delta_{\mathrm{sim}}+\lambda \Delta_p > 0,
\]
where \(\Delta_{\mathrm{sim}}\) is their similarity gap and \(\Delta_p\) is their family-probability gap. Since \(\Delta_p\in[-1,1]\) and \(\lambda\ge 0\), the family term can shift the comparison by at most \(\lambda\), so candidates separated by more than \(\lambda\) in similarity keep their order. The family prior therefore acts as a local preference when \(\lambda\) is small, rather than as a rule that removes alternatives.
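
A worked instance with hypothetical numbers, chosen for illustration and not taken from the experiments, shows both regimes. We read the gaps as \(\Delta_{\mathrm{sim}}=\mathrm{sim}_a-\mathrm{sim}_b\) and \(\Delta_p=p_a-p_b\), and take \(\lambda=0.1\).
\begin{align*}
  \text{near tie:}\quad & \Delta_{\mathrm{sim}}=-0.03,\ \Delta_p=0.5
    &&\Rightarrow\ -0.03+0.1\times 0.5=0.02>0,\\
  \text{clear gap:}\quad & \Delta_{\mathrm{sim}}=-0.15,\ \Delta_p=1
    &&\Rightarrow\ -0.15+0.1\times 1=-0.05<0.
\end{align*}
In the first case the family prior lifts \(a\) above the slightly more similar \(b\). In the second, \(b\) leads by more than \(\lambda\), so even the largest possible family gap cannot reverse the order.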

To connect this layer to verification, we check ranked tactics on held-out mathlib states and record each \((\mathrm{query}, \mathrm{strategy}, \mathrm{rank}, \mathrm{tactic})\) item sent to Lean. This lets us compare trace matching, family matching and Lean acceptance for the same candidate lists.

\section{Results}

\paragraph{Dataset shape.}
The dataset is a compact routing set. Its 3,723 proof steps cover many theorem names and 81 tactic families, with most probability mass in a few common families. Head-family accuracy is therefore easy to improve by exploiting common tactics, while macro-F1 reflects how little data exists for many rare proof moves.

\paragraph{RQ1: state-visible representations carry family signal.}
The first result is that representation changes the proof signal visible to even small models. The leakage-checked study separates deployable state-only and retrieved-premise inputs from the future-premise ceiling. Inputs that use the next tactic's annotated premises or tactic AST are best read as hindsight information.

Here we report by-theorem results; the artifact contains file-level, per-family and rare/mid/head supporting checks. State-visible models recover head-family signal, but macro-F1 remains low, confirming that rare proof moves are still difficult with shallow state text. Future-premise numbers estimate the headroom left for premise selection and tactic-argument generation.

The deployable conclusion therefore rests on state-visible features and on premises retrieved without reading the held-out tactic.

This pattern separates head-family signal from long-tail weakness. The head families are learnable from visible state text; the tail remains weak; and premise-bearing features help only when the premises are obtained by a legal retrieval step. The path for improvement is therefore concrete: better premise retrieval and argument construction are likely to matter more than adding another shallow classifier on the same pretty-printed state.

\input{generated_tables}

\paragraph{RQ2: soft family priors are safer than hard gates.}
The main ranking result is that hard symbolic routing is brittle. In mixed proof-step retrieval, an early family mistake can hide otherwise similar candidates. Across by-theorem splits, hard gating pushes the first true-family candidate from roughly rank 41 to beyond rank 500, while validation-selected EPFP stays near unguided retrieval. EPFP's exact@5 is essentially tied with that of unguided retrieval. The true-family row clarifies the ceiling: even knowing the true family leaves within-family premise and argument selection as the next problem.

Validation selection also prevents over-reading the family prior. The selected EPFP weights vary across splits, and the paired tests show only a small exact@1 gain. The mechanism is easy to see: when the top family is wrong, hard gating displaces the true family by hundreds of ranks, while EPFP usually moves it by only a few positions. Uncertain family predictions work better as soft preferences than as filters that delete plausible alternatives before Lean can check them.

Family prediction is most useful here as a soft ranking signal; trace-match gains are only the first part of the story.

\paragraph{RQ3: Lean checking changes the picture.}
With Lean 4.28, we check the top five tactics from each strategy on 500 held-out S4 intermediate states. Almost all candidates run normally; the remaining failures are state-reconstruction issues, mainly unresolved local identifiers or type-class elaboration after tactic transfer.

Trace matching and Lean checking answer different questions. Trace matching asks whether a retrieved tactic is the one recorded in the proof. Lean checking asks whether the tactic is a legal one-step continuation in the reconstructed state. Lean often accepts alternatives to the trace tactic, including different simplification, rewrite, closure or construction moves.

Table~\ref{tab:micro_execution} and Figures~\ref{fig:trace_lean} and~\ref{fig:trace_scatter} change the ranking-only story. Soft family guidance is tied with unguided retrieval on exact@5, but unguided retrieval is strongest after Lean checking: Accept@5 is 0.672 for unguided and 0.652 for soft family guidance. The paired comparison gives a \(-2.0\) percentage-point difference for soft guidance, with an interval crossing zero (Appendix Table~\ref{tab:execution_significance}). The execution result is therefore a tie with a lower point estimate for EPFP, rather than a top-five Lean-acceptance improvement.
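
The paired difference can be read as a per-state quantity. As a sketch in illustrative notation, let \(A_i^{(m)}\in\{0,1\}\) indicate whether strategy \(m\) places at least one Lean-accepted tactic in the top five for state \(i\), with \(n=500\).
\[
  \hat\Delta=\frac{1}{n}\sum_{i=1}^{n}\bigl(A_i^{(\mathrm{soft})}-A_i^{(\mathrm{unguided})}\bigr)
  =0.652-0.672=-0.020.
\]
If both rates are exact fractions of the 500 states, this is a net difference of ten states (326 versus 336). The net figure does not show how many states flip in each direction, which is the variation a paired interval accounts for.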

Accepted alternatives explain why trace matching understates what Lean can use. For unguided retrieval, 62.8\% of queries have a Lean-accepted top-five tactic different from the recorded tactic, and 53.4\% have Lean acceptance without any trace hit. These are one-step continuations rather than complete proofs, but they show why a trace miss can still be useful to Lean.
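
These percentages can be cross-checked by simple counting, under the assumption that each is taken over the same 500 queries for unguided retrieval.
\begin{align*}
  \text{some top-five tactic accepted:}\quad & 0.672\times 500=336,\\
  \text{some accepted tactic differs from the trace:}\quad & 0.628\times 500=314,\\
  \text{accepted with no trace hit in the top five:}\quad & 0.534\times 500=267.
\end{align*}
Under that reading, \(336-267=69\) queries (13.8\%) have both a Lean-accepted tactic and a trace hit in the top five, and \(336-314=22\) queries (4.4\%) are accepted only through the recorded tactic itself.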

Retrieval matches help; Lean acceptance tells us which candidates actually run.

The accepted non-trace cases are mostly ordinary proof engineering rather than surprising proof discoveries. Many are variants of simplification, rewriting, constructor use or closure by a nearby hypothesis. This is exactly the region where a ranking method can help an interactive prover: the trace gives one path through the proof, while Lean may accept several small local moves that keep the proof moving. A ranking table that sees only the traced string throws those alternatives away.

\paragraph{What the Lean check reveals.}
The 500-state run changes the interpretation in three ways. First, exact trace matching is a conservative reading of usefulness: it rewards recovering the logged proof step, but Lean often accepts a neighboring tactic that makes progress from the same state. Second, family-aware ranking mainly changes which alternatives survive to the top five. Soft guidance preserves many of the same alternatives as unguided retrieval, while hard routing replaces them with candidates from one family and loses useful off-family moves. Third, most accepted non-trace tactics are not exotic; they are the kind of local rewrite, simplification, constructor or closure move that an interactive user might try when the traced proof is unavailable. The practical message is that a prover needs several legal next moves, not only the exact move written in mathlib.

The failures are equally informative. Unknown identifiers point to local-context transfer, not to a failure of the mathematical idea behind the tactic. Elaboration failures often arise when the reconstructed state lacks the precise implicit arguments or instances present in the original proof script. Ordinary Lean rejections, by contrast, say the candidate is simply the wrong move for that state. Separating these cases makes the execution run more useful than a single accept/reject number.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/trace_vs_lean_acceptance.pdf}
\caption{Top-five trace exact match and Lean acceptance on the same 500 held-out states. Lean accepts many non-trace tactics, and unguided retrieval gives the highest top-five acceptance in this run.}
\label{fig:trace_lean}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=0.92\columnwidth]{figures/trace_lean_scatter.pdf}
\caption{Trace exact@5 versus Lean accept@5 by ranking strategy. Larger markers indicate more accepted non-trace top-five tactics; all strategies have many Lean-accepted tactics that are not exact trace matches.}
\label{fig:trace_scatter}
\end{figure}

\paragraph{Errors concentrate around overloaded families.}
The largest confusions identify the next problem after family exposure. Most errors center on \texttt{rw}, often confused with simplification-adjacent or construction-oriented families such as \texttt{simpa}, \texttt{constructor}, \texttt{cases} and arithmetic automation. These plausible alternatives point to the next modeling layer: within-family premise selection and tactic-argument generation.

Qualitative inspection gives the same pattern. Some failures are simplification/rewrite ambiguities, others are construction/refinement ambiguities, and a third class depends on exact local premises or arguments. These errors explain why even the true-family row in Table~\ref{tab:search} is far from perfect.
The family labels are therefore best read as a first routing layer, not as the final proof action. The \texttt{rw}/\texttt{simp}/\texttt{simpa}/\texttt{rwa}/\texttt{simp\_rw} group is especially important: these tactics often differ in how they combine rewriting, simplification and goal closure, but a theorem prover ultimately needs the right lemma list and local identifiers. Treating each surface family as fully separate can exaggerate some mistakes while hiding the harder argument-selection problem.

\section{Discussion and Implications}

\paragraph{Exposure is the handoff to Lean.}
The main design lesson is that proof-state representation is part of the search problem. In this dataset, state-visible representations carry usable lightweight family signal, while future-premise results show the headroom that remains when the next tactic's actual premises are known. The retrieval study favors using family predictions as ranking hints rather than unconditional filters when uncertainty is high. This matters for verifier-backed mathematical agents because correctness is established when Lean checks a candidate action, and Lean can check only actions that the ranking method exposes.

Representation errors are often silent. If a ranking hides a useful local hypothesis or overcommits to one tactic family, Lean only accepts or rejects the submitted tactics; it reports nothing about the candidates that were never sent. The ranking layer should therefore be judged by the legal moves it brings within reach, not only by imitation of the traced proof.

\paragraph{Trace matches are a starting point.}
Self-evolving proof agents need inexpensive measurements, and trace or family matches are useful because they are cheap and reproducible. Lean checking adds the missing view: it shows which retrieved tactics are accepted even when they differ from the trace. When trace matches and Lean acceptance diverge, the hard part has often moved from family exposure to argument choice, premise selection, local identifier recovery or proof-script transfer.

This also changes how to read a negative ranking result. If a method improves exact trace hits but lowers Lean-accepted top-five actions, it has become more faithful to the recorded proof without becoming more useful to the checker. If it preserves Lean acceptance while changing the trace-hit rate only slightly, it may still be a good search component. For proof agents, the ranking layer is successful when it gives Lean several plausible moves, not when it imitates the single logged move at all costs.

\paragraph{Hard symbolic gates can damage learning-to-rank.}
The retrieval result turns a common engineering choice into a visible design decision. A stronger family predictor can become a worse search rule when it is used as a hard symbolic gate: its mistakes can prevent the verifier from seeing candidates that similarity retrieval would have kept near the top. EPFP is a gentler option for mixed proof-state ranking because it lets family information influence rank while leaving lower-probability families in the list.

\paragraph{What the 500-state Lean run adds.}
The execution run is small enough to inspect and large enough to change the reading of the ranking tables. It shows that the best-looking family prior under trace matching need not produce the best list for Lean checking. It also shows that transferred tactics fail for understandable reasons: unresolved local names, elaboration failures after reconstruction, or a tactic that is simply inapplicable to the reconstructed state. These failure modes are exactly where a future prover needs tighter coupling between retrieval, local context repair and Lean feedback.

\section{Limitations}

The study uses a curated mathlib4 subset, coarse tactic-family labels and lightweight models. The curated subset supports manual inspection and rapid iteration; full-mathlib evaluation remains future work. Tactic-family prediction isolates only first-stage action routing; within-family premise selection, rewrite-list construction and local-identifier recovery are left to a later modeling layer.

The execution study measures one-step applicability in reconstructed intermediate states, not theorem-level solving. Full LeanDojo replay remains future work, as does tighter local-identifier and tactic-argument repair inside the proof-search loop.

\paragraph{Reproducibility.}
The artifact includes checked tables, splits, scripts, cached one-step Lean records and JSONL records for rebuilding aggregate tables. Reviewers can inspect accepted non-trace tactics and reconstruction failures from the cache.

\section{Conclusion}

We studied candidate exposure as the handoff between Lean proof-state representation and verifier-backed action ranking, instantiated with EPFP. On a 3,723-step mathlib4 subset, state-visible family signals and future-premise ceilings separate deployable representation quality from next-tactic leakage. The practical lesson is that representation decides which tactics Lean sees, so trace scores should be read together with the legal alternatives left in the submitted list at the verifier handoff.

\newpage
\bibliographystyle{icml2026}
\bibliography{references}

\appendix
\section{Classification Details}
\label{app:full_results}

Table~\ref{tab:classification_full} gives the full tactic-family classification table behind the compact main-text comparison. The main reading is the same across models: head families are learnable from visible proof-state text, while macro-F1 remains low because many tactic families appear only a few times. The future-premise rows are kept as a hindsight comparison rather than as deployable inputs.

\input{generated_appendix_tables}

\section{Execution Details}
\label{app:execution_details}

The tables below give the details behind the Lean checking experiment in RQ3. Table~\ref{tab:execution_significance} gives the paired comparison against unguided retrieval. Table~\ref{tab:accepted_alternatives} separates exact trace recovery from Lean acceptance, showing how often Lean accepts a different top-five tactic. Table~\ref{tab:execution_failure_classes} separates reconstruction failures from ordinary Lean rejections.

The main comparison counts every sampled state. This matches what a search system would experience: a tactic that is not reconstructed successfully for Lean checking provides no verifier feedback. Candidate-level rates over successfully reconstructed items are reported only for interpreting reconstruction quality.
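
A state-level analogue makes the relation between the two views explicit. The notation is assumed for illustration: $N$ is the number of sampled states, $R \subseteq \{1,\dots,N\}$ is the nonempty set of states reconstructed successfully, and $a_i \in \{0,1\}$ records whether Lean accepts at least one submitted candidate in state $i$, with $a_i = 0$ for $i \notin R$ because no verifier feedback is available there.
\[
\frac{1}{N}\sum_{i=1}^{N} a_i
\;=\;
\frac{|R|}{N}\cdot\frac{1}{|R|}\sum_{i \in R} a_i .
\]
The all-state rate is therefore the reconstruction rate times the rate among reconstructed states. It can never exceed either factor, which is why the conditional rates are useful for diagnosing reconstruction quality but would overstate what a search system experiences.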

\input{generated_execution_appendix_tables}

\section{S4-E Extension}
\label{app:s4e}

S4-E appends two additional dense-core modules. Classification metrics are six-seed by-theorem means with population standard deviations; retrieval metrics are exact-tactic success at rank 1 on the corresponding main split.
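
For reference, with per-seed values $m_1,\dots,m_6$ (symbols introduced here for illustration), the reported mean and population standard deviation are
\[
\bar m = \frac{1}{6}\sum_{j=1}^{6} m_j,
\qquad
\sigma = \sqrt{\frac{1}{6}\sum_{j=1}^{6}\bigl(m_j - \bar m\bigr)^2}.
\]
The population form divides by $6$ rather than $5$; the corresponding sample standard deviation would be larger by a factor of $\sqrt{6/5} \approx 1.095$.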

The extension checks whether the same pattern survives the added modules: state-visible features still carry family signal, future-premise inputs remain a hindsight comparison, and soft family guidance is best read as a mild ranking preference rather than a reliable Lean-acceptance improvement.

\begin{table}[!htbp]
\centering
\footnotesize
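\caption{S4-E comparison. Classification rows report six-seed by-theorem means $\pm$ population standard deviation; E@1 is exact-tactic success at rank 1. The extension preserves the classification pattern while leaving soft-guided retrieval tied with unguided retrieval.}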
\label{tab:s4e_appendix}
\begin{tabular}{lrr}
\toprule
Metric & S4-main & S4-main + E \\
\midrule
Steps & 3723 & 4464 \\
Theorems & 1702 & 1979 \\
Labels & 81 & 90 \\
NB accuracy & 0.352 $\pm$ 0.004 & 0.357 $\pm$ 0.010 \\
SVM macro-F1 & 0.129 $\pm$ 0.008 & 0.149 $\pm$ 0.017 \\
Unguided E@1 & 0.120 & 0.141 \\
Soft E@1 & 0.123 & 0.141 \\
Hard E@1 & 0.069 & 0.075 \\
\bottomrule
\end{tabular}
\end{table}

\section{Practical Design Notes}
\label{app:design_notes}

\paragraph{Keep the checker close to the ranker.}
Offline trace matches are useful for fast iteration, but a sample of ranked lists should be checked by Lean before small differences between methods are trusted. A small reordering can leave the traced tactic's rank unchanged while changing which alternative tactics Lean sees.

\paragraph{Use family predictions as nudges.}
Tactic families summarize the first choice a prover makes: simplify, rewrite, introduce, split, apply or close. As discrete routers, however, they are risky, because the surface form of a state does not fix the family of the tactic that closes it. A rewrite-looking state may still close by \texttt{simpa}; a construction-looking state may be finished by \texttt{exact}; and arithmetic automation can replace several local moves.

\paragraph{Make premise retrieval part of the input story.}
Premise information is useful only when it is obtained before the target tactic is known. Retrieved premises are therefore part of the prover's view, while tactic-annotated premises belong to hindsight comparison. The engineering target is to improve retrieval first, then spend model capacity on tactic arguments and local names.

\paragraph{Read accepted alternatives as search fuel.}
Accepted non-trace tactics are not proof completions, but they give an agent legal branches, Lean errors and chances to learn local repairs. A method that exposes several acceptable first moves may be more useful than one that ranks a single traced move higher while hiding nearby alternatives.

\end{document}
