Track: long paper (up to 10 pages)
Keywords: contrastive learning, prompt optimization, agentic retry loops
TL;DR: We use contrastive dyadic reasoning trace analysis for prompt optimization and outperform state-of-the-art methods.
Abstract: Prompt optimization methods either analyze individual failures in isolation or
compare prompt variants across examples; in both cases they operate on single
execution traces, with no access to the reasoning process that distinguishes
success from failure on the same input. We introduce \textbf{ContraPrompt},
built on the observation that when a language model fails on a task but
succeeds on a subsequent retry with feedback, the difference between its two
\emph{chain-of-thought traces} constitutes an optimization signal that prior
prompt-optimization methods, which compare final outputs or prompt variants,
do not capture. We instead compare complete intermediate reasoning processes:
the two traces share the same model, input, and base prompt, so the only
remaining differences are the reasoning strategy and, as a consequence of the
retry mechanism, the appended error feedback. We call this operation
\emph{dyadic reasoning trace analysis}. The multi-attempt solving phase is
structured as an instrumented agentic retry loop that generates this
contrastive data automatically, without human annotation. Extracted rules are
organized into an input-aware decision tree
that routes instructions by observable input characteristics. Evaluated on
four reasoning and compliance benchmarks, ContraPrompt outperforms
GEPA~\citep{agrawal2025gepa} on all four, with absolute gains of
$+8.29$ pp on HotPotQA ($+20.8\%$ rel.), $+2.21$ pp on GDPR-Bench
($+18.2\%$ rel.), $+7.14$ pp on GPQA~Diamond ($+10.6\%$ rel.), and
$+0.74$ pp on BBH ($+0.85\%$ rel.). Ablations confirm that the dyadic trace
contrast is the critical component: removing it causes a $16\%$ relative drop
in average performance. The mechanism generalizes beyond prompt
optimization: on 53 EvalSet black-box optimization problems, ContraPrompt beats
GEPA head-to-head on 11 problems, ties on 41, and loses on 1 at equal budget;
and on FiNER-139 financial named entity recognition~\citep{loukas2022finer}
(a 139-class high-cardinality classification task), ContraPrompt achieves
$+7.77$ pp over the unoptimized baseline ($+11.6\%$ rel.) and $+1.94$ pp
over GEPA ($+2.66\%$ rel.), with the input-aware tree producing branch
conditions that align with standard US GAAP financial-instrument categories.
We release reproduction artefacts, including the optimized
prompts\footnote{\href{https://github.com/rishvv/contraprompt_artefacts/}{https://github.com/rishvv/contraprompt\_artefacts/}}.
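The instrumented agentic retry loop that the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `collect_dyads`, `toy_model`, and the `grade`/`make_feedback` callables are all hypothetical names, and the real system presumably extracts traces from an LLM rather than a stub.

```python
# Hypothetical sketch of the instrumented retry loop: on a first-attempt
# failure, retry with appended error feedback; a later success yields a
# (failing trace, succeeding trace) dyad for contrastive analysis.

def collect_dyads(model, grade, make_feedback, prompt, examples, max_retries=2):
    """Collect (input, failing trace, succeeding trace) dyads without human annotation."""
    dyads = []
    for x in examples:
        trace, answer = model(prompt, x, feedback=None)
        if grade(x, answer):
            continue  # first-attempt success yields no contrastive pair
        failed_trace = trace
        fb = make_feedback(x, answer)
        for _ in range(max_retries):
            trace, answer = model(prompt, x, feedback=fb)
            if grade(x, answer):
                # Same model, input, and base prompt: the remaining difference
                # is the reasoning strategy plus the appended feedback.
                dyads.append((x, failed_trace, trace))
                break
            fb = make_feedback(x, answer)
    return dyads

# Toy stand-in for an LLM: wrong on the first attempt, correct once feedback is present.
def toy_model(prompt, x, feedback=None):
    if feedback is None:
        return ("guessed without checking", x - 1)
    return ("re-derived using the feedback", x)

dyads = collect_dyads(
    toy_model,
    grade=lambda x, ans: ans == x,
    make_feedback=lambda x, ans: f"expected {x}, got {ans}",
    prompt="solve:",
    examples=[1, 2, 3],
)
```

In this toy run every example fails once and succeeds on the retry, so each example contributes one dyad whose two traces differ only in reasoning strategy, mirroring the contrastive signal the method optimizes over.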
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 201