Keywords: Prompt Optimization, Large Language Models, Large Reasoning Models, Event Extraction, Natural Language Processing
TL;DR: Prompt optimization benefits all models, but reasoning models gain more: with optimized prompts, they outperform both their non-optimized counterparts and LLMs on Event Extraction.
Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities on various reasoning tasks. Their strong ability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we systematically study this open question, using the structured task of event extraction as a case study. We experiment with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), using each as either a task model or a prompt optimizer. Our results show that on tasks as complex as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Beyond event extraction, we replicate our findings on two very different tasks: Geometric Shapes and NCBI Disease NER. Prompt optimization improves all models, with LRMs benefiting most. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20842