Self-Improving Clinical Reasoning via Textual Gradients

Sean Wu; Fabien Scalzo; Ira Kurtz

Published: 05 Mar 2026; Last Modified: 05 Mar 2026; ICLR 2026 Workshop RSI Poster; CC BY 4.0

Keywords: Medical Question Answering, Prompt Optimization, Textual Gradients, Recursive Self-Improvement, Clinical Reasoning, Large Language Models, Llama 3, AutoMedPrompt

TL;DR: We introduce AutoMedPrompt, a framework that uses textual gradients to automatically optimize system prompts for medical QA, enabling open-source Llama 3 to outperform GPT-4 and achieve SOTA on PubMedQA.

Abstract: Large language models (LLMs) have demonstrated increasingly sophisticated performance in medicine and other fields of knowledge. Traditional methods of creating specialist LLMs require extensive fine-tuning and training on large datasets. Recently, prompt engineering, as an alternative to fine-tuning, has shown potential to boost the performance of general foundation models. However, prompting methods such as chain-of-thought (CoT) may not be suitable for all subspecialties, and k-shot approaches may introduce irrelevant tokens into the context. We present AutoMedPrompt, which explores the use of textual gradients to elicit medically relevant reasoning through system prompt optimization. AutoMedPrompt leverages TextGrad's automatic differentiation via text to improve the ability of general foundation LLMs. We evaluated AutoMedPrompt on Llama 3, an open-source LLM, using several QA benchmarks, including MedQA, PubMedQA, and the nephrology subspecialty-specific NephSAP. Our results show that prompting with textual gradients outperforms previous methods on open-source LLMs and surpasses proprietary models such as GPT-4, Claude 3 Opus, and Med-PaLM 2. AutoMedPrompt sets a new state of the art (SOTA) on PubMedQA with an accuracy of 82.6%, while also outperforming previous prompting strategies on open-source models for MedQA (77.7%) and NephSAP (63.8%).
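To make the idea of system prompt optimization with textual gradients concrete, the following is a minimal toy sketch, not the authors' implementation and not the TextGrad API. Every name in it (`optimize_prompt`, `toy_answer`, `toy_critique`, `toy_rewrite`) is hypothetical, the LLM calls are replaced by deterministic stand-in functions, and the rule of keeping a rewritten prompt only when validation accuracy improves is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)

# Hypothetical toy data standing in for a medical QA benchmark.
GOLD = {
    "Does drug X reduce blood pressure?": "yes",
    "Is biomarker Y linked to outcome Z?": "no",
}
DATA: List[Example] = list(GOLD.items())


def toy_answer(system_prompt: str, question: str) -> str:
    """Stand-in for the answering LLM: it is only correct when the
    system prompt tells it to consider the evidence."""
    if "evidence" in system_prompt.lower():
        return GOLD[question]
    return "maybe"


def toy_critique(system_prompt: str, question: str, pred: str, gold: str) -> str:
    """Stand-in for the critic LLM: returns natural-language feedback,
    which plays the role of the textual 'gradient'."""
    return "The prompt does not ask the model to weigh the evidence."


def toy_rewrite(system_prompt: str, feedback: str) -> str:
    """Stand-in for the optimizer LLM: edits the prompt using the feedback."""
    return system_prompt + " Weigh the evidence before answering yes or no."


def accuracy(answer_fn: Callable[[str, str], str], prompt: str,
             data: List[Example]) -> float:
    """Fraction of questions answered correctly under the given prompt."""
    return sum(answer_fn(prompt, q) == gold for q, gold in data) / len(data)


def optimize_prompt(prompt: str, train: List[Example], val: List[Example],
                    answer_fn, critique_fn, rewrite_fn) -> Tuple[str, float]:
    """One pass over the training set: critique each wrong answer, rewrite
    the prompt, and keep the rewrite only if validation accuracy improves
    (the acceptance rule is an assumption of this sketch)."""
    best_prompt, best_acc = prompt, accuracy(answer_fn, prompt, val)
    for question, gold in train:
        pred = answer_fn(best_prompt, question)
        if pred == gold:
            continue  # nothing to learn from a correct answer
        feedback = critique_fn(best_prompt, question, pred, gold)
        candidate = rewrite_fn(best_prompt, feedback)
        cand_acc = accuracy(answer_fn, candidate, val)
        if cand_acc > best_acc:
            best_prompt, best_acc = candidate, cand_acc
    return best_prompt, best_acc


best_prompt, best_acc = optimize_prompt(
    "You are a helpful assistant.", DATA, DATA,
    toy_answer, toy_critique, toy_rewrite,
)
print(best_prompt)
print(best_acc)
```

In a real setting the three stand-in functions would each be calls to an LLM, and the training and validation sets would be disjoint; the toy reuses one list for both only to stay short.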

Submission Number: 2
