Multi-Agent LLMs for Style-Controlled and Faithful Scientific Conclusion Generation
Keywords: Stylometry, Text Generation, Prompt-based Generation, Large Language Models (LLMs), Agentic AI, Multi-Agent
TL;DR: In this work, we present a modular, multi-agent LLM architecture for generating venue-aware conclusions for scientific papers.
Abstract: Generating a conclusion for a scientific paper is a challenging task: a strong conclusion must accurately reflect the paper’s goals, methods, and findings, while avoiding hallucinated or unsupported claims and matching the discourse conventions of the target venue. In practice, directly prompting a single large language model (LLM) with an entire paper often yields conclusions that omit key results, overstate contributions, or drift stylistically from venue expectations. These failure modes are especially consequential in scientific writing, where factual faithfulness, interpretability, and rhetorical precision are essential.
We present a modular, multi-agent architecture [1] for venue-aware conclusion generation for scientific papers. Our approach is implemented in DSPy [2], which enables LLM systems to be expressed as declarative programs and optimized with respect to task-specific metrics. Our input documents are processed using a two-stage extraction pipeline. For text, we use Science Parse [3] to convert PDFs into structured JSON containing title, abstract, and section-level content. For visual material, we use PyMuPDF [4] to identify and export embedded figures for downstream multimodal analysis. The retrieval component draws on a large scholarly corpus containing approximately 66,000 papers (about 200 GB). In the current study, this corpus is used primarily as a retrieval pool for venue-aware style modeling and as an evaluation resource.
To improve the pipeline, we integrate an LLM-as-a-Judge [5] objective that scores generated conclusions against gold conclusions in terms of writing-style similarity and content preservation. DSPy optimizers then compile the multi-agent program to maximize this evaluation signal. This design allows us to investigate a central research question: can a multi-agent architecture built on a lower-capacity model (e.g., GPT-4o-mini), compiled and optimized with DSPy, outperform a single higher-capacity model (e.g., GPT-5.2) prompted directly on venue-aware conclusion generation?
Preliminary experiments on a small set of research papers have produced acceptable results. In addition, the RAG-based [6] editorial agent provides explicit control over venue-specific writing style, making the system a useful testbed for studying multi-agent collaboration, multimodal evidence integration, automated evaluation, and style-aware scientific writing assistance. While our current extraction pipeline handles text and figures, table extraction and interpretation remain open problems. A natural extension is to add a table parser and a dedicated table-interpretation agent, allowing the conclusion generator to reason over quantitative results directly. On the retrieval side, we plan to move from lexical similarity to embedding-based retrieval using cosine similarity over vector representations of conclusions or entire papers. We also intend to explore fine-tuning of the underlying LLMs on our corpus, as well as systematic few-shot learning strategies, to further improve key-finding extraction, goal identification, and style transfer. Finally, we will expand the evaluation and incorporate a human-in-the-loop protocol to obtain more reliable results.
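The planned embedding-based retrieval step reduces to ranking corpus vectors by cosine similarity against a query vector. A minimal sketch, leaving the embedding model abstract (the arrays below stand in for its output):

```python
# Rank corpus embeddings by cosine similarity to a query embedding.
# The embedding model is left abstract; this shows only the ranking step.
import numpy as np

def top_k_by_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus rows most similar to `query`."""
    q = query / np.linalg.norm(query)                      # unit-normalize query
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize rows
    sims = c @ q                                           # cosine similarities
    return np.argsort(-sims)[:k]                           # indices, best first
```

The returned indices would select the most stylistically relevant conclusions from the ~66,000-paper corpus to condition the editorial agent.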
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 153