Hype versus reality of artificial intelligence (AI) platforms: unmasking the limitations of large language models in the use of scientific writing and reporting
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. However, their application to complex, domain-specific summarization, such as scientific conference presentations, remains constrained by limitations in long-context understanding, factual accuracy, and content attribution. In this study, we systematically evaluated five state-of-the-art LLMs (ChatGPT, DeepSeek, Gemini, Grok, and Qwen), each tested in both standard and reasoning-augmented configurations. All models were tasked with summarizing a full-length audio transcript of approximately 160,000 words from 64 speakers at the 2024 annual meeting of the American Association of Extracellular Vesicles (AAEV 2024). While the models were capable of extracting high-level themes and generating readable summaries, we observed persistent deficiencies in speaker coverage, affiliation attribution, and reference citation. Gemini 2.5 Pro achieved the best overall performance, yet even the top-performing models omitted up to one-third of the speakers and did not produce accurate or complete reference citations. Incorporating reasoning processes led to measurable improvements in summarization quality for most LLMs. These findings underscore that current LLMs are not yet capable of fully autonomous scientific summarization. Our results highlight the need for more advanced reasoning mechanisms and for multi-agent architectures composed of specialized modules for speaker classification, citation verification, and content synthesis. Until such systems mature, expert oversight remains essential to meet the rigorous standards of biomedical communication.