ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

Published: 01 Apr 2026 · Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: contrastive learning, prompt optimization, agentic retry loops
TL;DR: We use contrastive dyadic reasoning trace analysis for prompt optimization and outperform state-of-the-art methods.
Abstract: Prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples; in both cases they operate on single execution traces, with no access to the reasoning process that distinguishes success from failure on the same input. We introduce \textbf{ContraPrompt}, built on the observation that when a language model fails on a task but succeeds on a subsequent retry with feedback, the difference between its two \emph{chain-of-thought traces} constitutes an optimization signal that prior methods, which compare final outputs or prompt variants, do not capture. We instead compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so the differences that remain are the reasoning strategy and (as a consequence of the retry mechanism) the appended error feedback. We call this operation \emph{dyadic reasoning trace analysis}. The multi-attempt solving phase is structured as an instrumented agentic retry loop that generates this contrastive data automatically, without human annotation. Extracted rules are organized into an input-aware decision tree that routes instructions by observable input characteristics. Evaluated on four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA~\citep{agrawal2025gepa} on all four, with absolute gains of $+8.29$ pp on HotPotQA ($+20.8\%$ rel.), $+2.21$ pp on GDPR-Bench ($+18.2\%$ rel.), $+7.14$ pp on GPQA~Diamond ($+10.6\%$ rel.), and $+0.74$ pp on BBH ($+0.85\%$ rel.). Ablations confirm that dyadic trace contrastivity is the critical component: removing it causes a $16\%$ relative drop in average performance.
The mechanism generalizes beyond prompt optimization: on 53 EvalSet black-box optimization problems, ContraPrompt beats GEPA head-to-head on 11 problems, ties on 41, and loses on 1 at equal budget; and on FiNER-139 financial named entity recognition~\citep{loukas2022finer} (a 139-class high-cardinality classification task), ContraPrompt achieves $+7.77$ pp over the unoptimized baseline ($+11.6\%$ rel.) and $+1.94$ pp over GEPA ($+2.66\%$ rel.), with the input-aware tree producing branch conditions that align with standard US GAAP financial-instrument categories. We release artefacts such as optimized prompts for reproduction here\footnote{\href{https://github.com/rishvv/contraprompt_artefacts/}{https://github.com/rishvv/contraprompt$\textunderscore$artefacts/}}.
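The instrumented agentic retry loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`retry_loop`, `solve`, `make_feedback`, `DyadicPair`, etc.) are hypothetical, and the model is abstracted as a callable returning an answer plus its chain-of-thought trace.

```python
# Hedged sketch of an instrumented retry loop that yields dyadic
# (failed trace, successful trace) pairs on the same input.
# All interfaces here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    answer: str
    trace: str          # chain-of-thought text emitted by the model

@dataclass
class DyadicPair:
    task: str
    failed_trace: str   # trace from the unsuccessful earlier attempt
    success_trace: str  # trace from the successful retry (with feedback)

def retry_loop(task: str,
               solve: Callable[[str], Attempt],
               is_correct: Callable[[str], bool],
               make_feedback: Callable[[Attempt], str],
               max_retries: int = 2) -> Optional[DyadicPair]:
    """Run a task; on failure, retry with appended error feedback.
    A (failed, successful) trace pair on the same input is the
    contrastive signal that dyadic reasoning trace analysis mines."""
    first = solve(task)
    if is_correct(first.answer):
        return None  # no failure, so no contrastive dyad for this task
    prompt = task
    last_failed = first
    for _ in range(max_retries):
        prompt = prompt + "\n" + make_feedback(last_failed)
        attempt = solve(prompt)
        if is_correct(attempt.answer):
            return DyadicPair(task, last_failed.trace, attempt.trace)
        last_failed = attempt
    return None  # never succeeded; no pair to extract rules from
```

Because both traces share the model, input, and base prompt, a downstream analysis step (not shown) can attribute the success to the change in reasoning strategy plus the appended feedback, and distill that difference into prompt rules.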
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 201