
\documentclass{article} % For LaTeX2e
\usepackage{iclr2026_conference,times}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}

\usepackage{xcolor} 
\usepackage{tcolorbox}
\usepackage{listings}
\usepackage{caption}     
\usepackage{geometry}
\usepackage[utf8]{inputenc}

\usepackage{algorithm}
\usepackage{algpseudocode}

\usepackage{booktabs}
\usepackage{multirow}
\usepackage{array} 
\usepackage{siunitx} 

% Define colors
\definecolor{titlebg}{RGB}{0,0,0}      % black background
\definecolor{titletext}{RGB}{255,255,255} % white text
\definecolor{codebg}{RGB}{240,240,240}   % light gray background for code

% Configure listings
\lstset{
    backgroundcolor=\color{codebg},
    basicstyle=\ttfamily\footnotesize,
    breaklines=true,
    frame=none,    % remove the default frame
    columns=fullflexible
}

\title{Revolutionizing Event Detection:\\A Novel Prompt-Driven Method Enhanced by a Retrieval-Augmented Paradigm}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

\author{Antiquus S.~Hippocampus, Natalia Cerebro \& Amelie P. Amygdale \thanks{ Use footnote for providing further information
about author (webpage, alternative address)---\emph{not} for acknowledging
funding agencies.  Funding acknowledgements go at the end of the paper.} \\
Department of Computer Science\\
Cranberry-Lemon University\\
Pittsburgh, PA 15213, USA \\
\texttt{\{hippo,brain,jen\}@cs.cranberry-lemon.edu} \\
\And
Ji Q. Ren \& Yevgeny LeNet \\
Department of Computational Neuroscience \\
University of the Witwatersrand \\
Joburg, South Africa \\
\texttt{\{robot,net\}@wits.ac.za} \\
\AND
Coauthor \\
Affiliation \\
Address \\
\texttt{email}
}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to \LaTeX{} to determine where to break
% the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{}
% puts 3 of 4 authors names on the first line, and the last on the second
% line, try using \AND instead of \And before the third author name.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}

\maketitle

\begin{abstract}
The Event Detection (ED) task involves extracting event triggers from sentences and classifying them into predefined event types. Although large language models (LLMs) are widely adopted across NLP tasks, their application to ED remains relatively unexplored. Existing LLM-based approaches follow a conventional prompt-based paradigm that requires a distinct prompt for each event type. This strategy has a fundamental limitation: the number of prompts grows linearly with the number of event types, which incurs significant manual effort and computational cost. To overcome this limitation, we propose an approach that integrates a retrieval-augmented mechanism with a redesigned cascading prompt-based framework. Specifically, the prompt-based component extracts candidate triggers, while the retrieval-augmented module applies heuristic filtering strategies to coarsely eliminate irrelevant candidates. In addition, we introduce an automated prompt-design method that matches valid triggers with their event types based on the retrieved information. Experimental results on the ACE-05 benchmark show that our scheme achieves state-of-the-art performance. Furthermore, the approach remains effective with lightweight LLMs, which indicates its potential for efficient large-scale data processing and as a basis for future research.
\end{abstract}

\section{Introduction}
\label{inst}

Event Detection (ED)~\citep{ahn2006stages,wadden2019entity} plays an increasingly vital role as social media and other forms of online content expand rapidly. The task involves identifying the words that explicitly signal the occurrence of an event within a sentence, known as event triggers, and classifying them into predefined event types. Current research on event detection remains predominantly focused on supervised learning. For instance, \citet{liao2021learning} and \citet{liu2023learning} applied contrastive learning to better capture the discriminative features of trigger words. \citet{abdi2024enhancing} improved recognition performance by incorporating external corpora. \citet{kan2024lfde} proposed a pre-training strategy that enhances the ability of BERT-based models to identify triggers. However, all of these supervised approaches require large volumes of annotated data for effective training, which entails significant manual labeling effort. Moreover, such models have limited comprehension capacity, which leads to performance bottlenecks, particularly in challenging settings that require effective knowledge transfer and strong generalization.

By comparison, the existing LLM-based approaches, although few in number, share a common methodology. \citet{gao2023benchmarking} and \citet{ma2024star} adopted strategies that rely on careful prompt design. Their prompts primarily consist of event-type definitions and contextual examples. The definitions describe what each event type means: for instance, ``Business:Start-Org'' is defined as ``Involves an organizational business, recording the formation of an organization.'' This approach helps LLMs comprehend the semantics of structured triggers and their event types. A notable drawback, however, is that a large number of prompts must be written in advance to cover all event types. \citet{shiri2024decompose} proposed using cosine similarity combined with FAISS to select examples automatically, which are then incorporated into prompts as contextual few-shot demonstrations. Nevertheless, this approach focuses exclusively on sentence-level contextual semantics and overlooks the specific semantics of the extracted triggers themselves, which can lead to suboptimal performance.

Based on the above analysis, our study also builds on LLMs, the current mainstream research direction, and proposes a new paradigm that addresses these issues. Our pipeline contains three main modules: Candidate Trigger Extraction, Chroma Database Construction, and Trigger Judgment and Classification. The Candidate Trigger Extraction module employs a multi-step cascading prompt strategy. By iteratively refining the results of prior extractions, this module discovers candidate triggers that earlier steps failed to detect. It consists of three sub-modules: the Basic Extraction Module, the Supplementary Extraction Module, and the Iterative Refinement Module. These three components (Section~\ref{prompt_extraction}) follow systematic design principles, which keeps the architecture consistent while enabling comprehensive LLM-based extraction. Subsequently, we propose a retrieval-based method that uses the training dataset to construct two Chroma databases, built on word-level contextual semantics and sentence-level semantics, respectively. To improve retrieval quality, arguments and non-ground-truth candidate triggers are incorporated as negative samples, which increases the precision of candidate filtering. This is further supported by a parallel strategy that combines the sentence-level and contextual word-level similarities from both Chroma databases into a single composite score. Together, these components form the basis for accurately identifying and excluding non-trigger candidates. The remaining candidates are then judged by a three-stage filtering pipeline: an initial threshold filter (set at 0.7), two heuristic rules, and an iterative automated prompt-design process powered by Qwen3-1.7B, which determines the final results.
The experimental results demonstrate the effectiveness of our approach on the ACE-05 benchmark~\citep{lu2021text2event}: it outperforms previous state-of-the-art methods by 2.78 percentage points in trigger extraction and 4.74 percentage points in event type classification. Notably, the entire process requires only 9 prompts with a total of 44 pre-designed demonstrations, which highlights the efficiency and practicality of our method. Our evaluation criterion is the \textbf{strict matching F1-score}. To the best of our knowledge, only one recent state-of-the-art work~\citep{ma2024star} adopts the same criterion. Besides surpassing its performance, our experiments show that the framework remains competitive when deployed on lightweight LLMs, even Qwen3-1.7B. This finding is relevant to the efficient processing of large-scale and dynamically updated data, and may inspire further research in low-resource and time-sensitive scenarios.

Overall, the main contributions can be summarized as follows:
\begin{itemize}
    \item We propose a novel LLM-based approach for event detection (ED), which consists of a multi-step module for candidate extraction and a fine-grained retrieval-augmented mechanism. The experimental results demonstrate the effectiveness of the proposed method.

    \item We further introduce an automated prompt-design approach driven by LLMs. By using an LLM to judge the final outcomes iteratively, our method achieves notable improvements in candidate filtering.

    \item We show that our framework, when deployed on lightweight LLMs (such as the Qwen3 series), surpasses the state-of-the-art results obtained with larger proprietary models (GPT-4.0). The design principles underlying our approach pave the way for efficiently processing real-time data under resource constraints, which may benefit future research.
\end{itemize}


\section{Related Work}
\label{re_work}
There is a growing body of research applying supervised learning to the Event Detection (ED) task. DNR~\citep{liao2021learning} employs a contrastive learning scheme with a MixSpan strategy to improve the accuracy of boundary detection in event extraction. CorED~\citep{sheng2022cored} learns latent relationships among event types, capturing the underlying connections between them during training. It additionally employs a masked attention mechanism to strengthen the model's focus on masked triggers, which further improves overall performance. PromptLoc~\citep{liu2023learning} introduces a contrastive learning strategy regularized by a Gaussian distribution-based distance. It also adopts a self-correction mechanism based on MC dropout to increase the model's confidence in correct predictions. TAE~\citep{guan2023trigger} observes that existing methods primarily focus on learning the features of target triggers while overlooking the semantic relationships among all tokens in the sentence. It captures this associative knowledge and achieves competitive results. \citet{abdi2024enhancing} employ an ontology corpus and align event relations with ontological relations via optimal transport. The LFDe framework~\citep{kan2024lfde} introduces a pre-training strategy that equips BERT-based models with the ability to identify event triggers effectively and efficiently.

Research using LLMs for event detection remains relatively limited. \citet{gao2023benchmarking} use fine-grained enhanced instructions, designing tailored prompts for specific event types to improve the LLM's overall comprehension of triggers. \citet{shiri2024decompose} propose an automated technique for constructing in-context examples. Using FAISS retrieval, their work identifies training examples with high sentence-level semantic similarity to the samples under inference and integrates them into the prompt. However, a limitation of this approach is that similarity operates mainly at the sentence level and fails to capture trigger-specific contextual semantics. STAR~\citep{ma2024star} proposes a structure-to-text data generation framework accompanied by a self-refinement strategy. The synthesized data are combined with the original dataset to form $k$ representative examples per event type, which are incorporated into prompts sequentially as demonstrations to support event identification and classification. However, the length of the prompts and the number of their demonstrations scale linearly with the number of event types. Nearly all existing methods fundamentally rely on conventional prompt engineering strategies such as that of \citet{ma2024star}. As the number of predefined event types grows, these methods require considerable manual effort, which significantly constrains their scalability and their generalization to a broader set of event types.

\section{Methodology}

\label{methodology}
\begin{figure}[ht]
  \begin{center}
    \includegraphics[width=0.9\textwidth]{iclr2026/workflow.png} 
  \end{center}
  \caption{Pipeline of REVO-ED.}
  \label{fig:workflow}
\end{figure}

The pipeline of our proposed method, REVO-ED, is presented in Figure~\ref{fig:workflow}. It comprises three core modules: \textit{Prompts of Candidate Trigger Extractions}, \textit{Chroma Database Construction}, and \textit{Triggers Judgments and Classifications}. \textit{Prompts of Candidate Trigger Extractions} extracts potential candidate triggers through crafted prompts. It consists of three sub-modules: (1) the \textit{Basic Extraction Module}, which identifies general, special, and pronoun-based triggers in straightforward semantic contexts; (2) the \textit{Supplementary Extraction Module}, which recovers potential triggers missed by the \textit{Basic Extraction Module}; and (3) the \textit{Iterative Refinement Module}, which handles complex semantic and syntactic structures that remain unaddressed after the previous two stages. \textit{Chroma Database Construction} builds two vector databases from the training set. The final module, \textit{Triggers Judgments and Classifications}, applies filtering procedures to determine the final triggers and their event types.

\subsection{Prompt Extraction of Candidate Triggers}
\label{prompt_extraction}
Several studies~\citep{li2023evaluating,chen2024large} have shown that large language models have limited capabilities in information extraction tasks. This phenomenon is particularly evident in event detection, where the complex semantic environments surrounding trigger words make accurate identification by LLMs difficult. From a syntactic perspective, triggers can be verbs (e.g., ``kill'', ``sentence''), nouns (e.g., ``meeting'', ``summit''), pronouns (e.g., ``it'', ``this''), and multi-word expressions (e.g., ``World War Two'', ``smash through''). To handle this diversity, we first extract candidate triggers with the aim of high recall, covering nearly all ground-truth labels, and filter out negative samples afterwards. Note that the extracted results inevitably include arguments and other irrelevant words. In addition, to help LLMs comprehend the semantics of candidates and extract them effectively, we propose a hierarchically cascading prompt approach. In this framework, the extraction performed by each sub-module depends on the accumulated results of the preceding sub-modules, as described in Sections~\ref{basic_extraction}, \ref{supply_extraction}, and \ref{iter_extraction}.

\subsubsection{Basic Extraction Module}
\label{basic_extraction}
The Basic Extraction Module is designed to handle three categories of cases: ``general cases,'' ``special cases,'' and ``pronoun cases.'' General cases refer to candidate extraction in simple semantic scenarios, such as the sentence ``Giuliani, 58, proposed to Nathan, a former nurse, during a November business trip to Paris five months after he finalized his divorce from Donna Hanover after 20 years of marriage.'' In this case, the extracted triggers should be ``proposed, trip, finalized, divorce, marriage''. Special cases involve more complex linguistic constructs, including candidates enclosed in quotation marks, verb phrases, noun phrases, and instances where no trigger exists. For example, from the sentence ``As the US-led coalition troops are reportedly \textbf{thrusting into} Baghdad and the second Iraqi city of Basra, Blair and Bush agreed there would be a `vital \textbf{role}' for the United Nations in post-war Iraq.'', the extractions should be ``thrusting into, agreed, role, war''. To address the general and special cases, we design two separate prompts containing 7 and 6 demonstrations, respectively. These prompts help LLMs comprehend a wider range of semantic contexts. The prompt template is provided in Appendix~\ref{prompt_simple_special}. Finally, pronoun cases refer to sentences in which the trigger is expressed through a pronoun, such as ``it'' in the example ``Yeah, I heard something about it.'' A prompt with 8 demonstrations is designed for these cases (its instruction part is given in Appendix~\ref{prompt_pronoun}).

In this step, we encounter two main challenges. The first is that passing the large training dataset through the LLM three times is computationally expensive and time-consuming. To mitigate this issue, we skip the current LLM extraction step for instances in which all labeled triggers have already been extracted by earlier prompts. The second challenge concerns inconsistencies in tense or singular/plural form between the extracted triggers and the original sentences (especially with Qwen3-1.7B). To address this, we implement an etymological repair mechanism based on a heuristic strategy: if an extracted word does not appear verbatim in the sentence but shows at least $40\%$ character similarity to a word in the sentence, we compare the etymons (root forms) of both words. If the etymons are identical, the candidate trigger is replaced with the corresponding word from the original sentence. Applying the character-similarity check first avoids comparing etymons for all $O(N^2)$ word pairs and keeps the repair cost acceptable.
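
The repair heuristic can be illustrated with the following minimal Python sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: the helper names (\texttt{root\_form}, \texttt{char\_similarity}, \texttt{repair\_candidate}) are hypothetical, the root form is approximated by a deliberately crude suffix-stripping rule that stands in for a real lemmatizer, and character similarity is approximated with \texttt{difflib.SequenceMatcher} from the Python standard library.

\begin{lstlisting}[language=Python]
from difflib import SequenceMatcher

# Hypothetical, deliberately crude suffix list (a real system would
# use a proper lemmatizer or stemmer instead).
SUFFIXES = ("ing", "ed", "es", "s")


def root_form(word):
    """Toy stand-in for a root form: strip one inflectional suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def char_similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def repair_candidate(candidate, sentence_words, min_similarity=0.4):
    """Map a candidate onto a word of the sentence, or return None."""
    if candidate in sentence_words:
        return candidate  # already verbatim, nothing to repair
    for word in sentence_words:
        # Cheap check first, so root forms are compared only for
        # sufficiently similar pairs.
        if char_similarity(candidate, word) < min_similarity:
            continue
        if root_form(candidate) == root_form(word):
            return word
    return None


words = ["Troops", "attacked", "the", "city"]
print(repair_candidate("attack", words))  # prints: attacked
print(repair_candidate("banana", words))  # prints: None
\end{lstlisting}

A candidate that cannot be mapped to any word of the sentence is returned as \texttt{None} in this sketch; how such candidates are handled in practice is not specified above and is left open here.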

\subsubsection{Supplementary Extraction Module}
\label{supply_extraction}
After the previous step, we identify two common scenarios that lead to incomplete candidate extraction. First, LLMs may over-attend to certain parts of speech while overlooking other elements that are equally critical semantically. For instance, nouns might be extracted while significant verbs are neglected, or vice versa. Consider the sentence ``North Korea on Sunday rejected the U.N. Security Council's plan to discuss the standoff over its suspected nuclear weapons development, calling it `a prelude to war.'\,'' The initial extraction yielded ``nuclear, war, rejected, standoff'', but missed a key trigger: ``discuss''. Second, when a sentence contains multiple event statements linked by subordinate clauses or adverbial phrases, LLMs may fail to capture crucial candidate triggers. For example, for the sentence ``It was a false choice to debate whether Iraq should be run by coalition forces or the United Nations, said Blair, who was believed to be in favor of a stronger UN role in post-conflict Iraq than Bush.'', the previous extraction included ``said, UN, coalition, choice, run, was, debate, United Nations'', but omitted the trigger ``conflict'' and the candidate ``believed''. To mitigate these issues and improve the coverage of gold triggers, we design two additional prompts that specifically target these two scenarios (Appendix~\ref{prompt_supplementary_extraction_module}).

\subsubsection{Iterative Refinement Module}
\label{iter_extraction}
After the previous two sub-modules, we observe that certain important triggers, particularly those expressed as adverbial phrases or embedded within intricate clauses, are still missed. For example, in the sentence ``The scientific conference, attended by Nobel laureates and plagued by heated debates over ethics, reached a consensus with the lead researcher \textbf{releasing} a dataset that overturned prior assumptions.'', the trigger ``releasing'', which appears in an adverbial phrase, was overlooked. To address this limitation, we design three new prompts containing samples with complex syntactic and semantic structures. The LLM processes them sequentially, with each input built upon the previous outputs. The three prompts contain 6, 4, and 5 demonstrations, respectively, and share the same prompt template (presented in Appendix~\ref{prompt_iterative_refinement_module}).

Taken together, these sub-modules follow a hierarchical, step-wise methodology inspired by the ``simple-to-complex'' scaffolding approach, which expands the coverage of potential candidate triggers. Within the cascading scheme, each extraction step can recover gold triggers that the previous steps missed (Section~\ref{importance_extract_prompt} validates this point). In addition, the method offers a more structured and logical alternative to ad hoc prompt construction.
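
The control flow of the cascade can be summarized by the following minimal Python sketch, given for illustration only. The names \texttt{run\_cascade} and \texttt{keyword\_step} are hypothetical, and each LLM prompt is replaced by a stub that returns fixed keywords. The sketch also assumes that the training-time shortcut (skipping a step once all gold triggers are covered) is checked before every step except the first.

\begin{lstlisting}[language=Python]
def run_cascade(sentence, extractors, gold_triggers=None):
    """Run the prompt steps in order; each step sees earlier candidates."""
    candidates = []
    for i, extract in enumerate(extractors):
        # Training-time shortcut (assumption: checked after step one):
        # stop once every gold trigger has already been extracted.
        if i > 0 and gold_triggers is not None:
            if set(gold_triggers) <= set(candidates):
                break
        for cand in extract(sentence, list(candidates)):
            if cand not in candidates:
                candidates.append(cand)
    return candidates


def keyword_step(keywords):
    """Hypothetical stand-in for one LLM prompt (no model is called)."""
    def extract(sentence, previous):
        return [k for k in keywords
                if k in sentence and k not in previous]
    return extract


sentence = ("North Korea rejected the plan to discuss the standoff, "
            "calling it a prelude to war.")
steps = [keyword_step(["rejected", "standoff", "war"]),
         keyword_step(["discuss"])]

print(run_cascade(sentence, steps))
# prints: ['rejected', 'standoff', 'war', 'discuss']
print(run_cascade(sentence, steps, gold_triggers=["war"]))
# prints: ['rejected', 'standoff', 'war']
\end{lstlisting}

Passing the accumulated candidates to every step mirrors the dependency of each sub-module on the results of the preceding ones.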

\subsection{Chroma Database Construction}
\label{chroma_constraction}
This section describes the construction of two Chroma databases from the training dataset. The goal is to leverage both sentence-level semantic knowledge and word-level contextual knowledge to filter out irrelevant candidate triggers when processing new data. During retrieval, a fixed number of samples is retrieved for each candidate: seven from the word-contextual Chroma database and three from the sentence Chroma database. A hyperparameter $\alpha$ then combines the cosine similarities from both sources, as shown in Equation~\ref{equa_1}. Specifically, for samples retrieved from the word-contextual database, the sentence-level cosine similarity is additionally computed to form the composite score. Similarly, for samples retrieved from the sentence Chroma database, where word-level information is not stored, the cosine similarity of the relevant word pair is computed and incorporated.

\begin{equation}
\label{equa_1}
    \mathrm{Score} = \alpha \cdot \mathrm{Cosine\_similarity_{words}} + (1 - \alpha) \cdot \mathrm{Cosine\_similarity_{sentence}}
\end{equation}
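
As an illustrative calculation, take $\alpha = 0.8$ and the threshold of 0.7 reported in Section~\ref{exp_set}, and assume that the threshold is applied to this composite score. The similarity values below are hypothetical. A retrieved sample with word-level similarity 0.78 and sentence-level similarity 0.55 obtains
\[
    \mathrm{Score} = 0.8 \times 0.78 + 0.2 \times 0.55 = 0.624 + 0.110 = 0.734 \geq 0.7,
\]
and is kept, whereas a sample with word-level similarity 0.65 and sentence-level similarity 0.80 obtains
\[
    \mathrm{Score} = 0.8 \times 0.65 + 0.2 \times 0.80 = 0.520 + 0.160 = 0.680 < 0.7,
\]
and is excluded. With this weighting, a high sentence-level similarity alone does not compensate for a weak word-level match.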

As noted in the previous sections, the extractions inevitably contain arguments and other irrelevant words. Because all words in a sentence are semantically related to one another~\citep{vaswani2017attention}, some irrelevant words or arguments may exhibit strong semantic connections to triggers. This hampers the filtering of negative candidates that are highly similar to the retrieved real triggers. To mitigate this issue, we treat the arguments and negative candidate triggers in the training set as false samples and label them with the event type ``trigger:None'', which facilitates their removal. Furthermore, to strengthen the attention on the extracted words or triggers in their sentences, our method repeats each of these words six times (the setting that performs best) and places the repetitions at the beginning of the original sentence, separated by ``\texttt{<SEP>}''. We use the mean embeddings of these words within their sentences (skipping the six-word prefixes), combined with metadata, to construct the Word Contextual Chroma Database. Only sentences containing real triggers are stored in the Sentence Chroma Database, in the form of mean embeddings, with their triggers as one component of the metadata.
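
The input construction and pooling described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function names are hypothetical, a single \texttt{<SEP>} is assumed to separate the whole prefix from the sentence, tokens are obtained by whitespace splitting (a real encoder would operate on subword tokens), and the token vectors are toy values rather than model outputs.

\begin{lstlisting}[language=Python]
def build_prefixed_input(word, sentence, repeats=6, sep="<SEP>"):
    """Prepend `word` repeated `repeats` times, then the separator."""
    prefix = " ".join([word] * repeats)
    return f"{prefix} {sep} {sentence}"


def word_positions(tokens, word, skip):
    """Positions of `word`, ignoring the first `skip` tokens
    (the repeated prefix and the separator)."""
    return [i for i, tok in enumerate(tokens)
            if i >= skip and tok == word]


def mean_pool(token_vectors, positions):
    """Average the vectors found at the given token positions."""
    dim = len(token_vectors[0])
    return [sum(token_vectors[p][d] for p in positions) / len(positions)
            for d in range(dim)]


text = build_prefixed_input("divorce", "he finalized his divorce")
tokens = text.split()
positions = word_positions(tokens, "divorce", skip=6 + 1)
# Toy vectors standing in for contextual embeddings.
vectors = [[float(i), 1.0] for i in range(len(tokens))]

print(positions)                      # prints: [10]
print(mean_pool(vectors, positions))  # prints: [10.0, 1.0]
\end{lstlisting}

Only the in-sentence occurrence contributes to the pooled vector, which matches the stated choice of skipping the prefix when computing mean embeddings.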

\subsection{Triggers Judgments and Classifications}
\label{trigger_jud_cls}
This module comprises threshold filtering, two rule-based filters, and an automated filtering process powered by Qwen3-1.7B. Threshold filtering excludes retrieved samples whose similarity score falls below 0.7. The rule-based filters further eliminate unreasonable retrievals. The first rule discards candidate triggers whose retrieved samples fall into more than 4 distinct event-type clusters. It is motivated by the observation that a candidate that is semantically related to many event types is unlikely to belong to any specific one. The second rule discards candidate triggers for which the event type of the top-1 retrieval is ``trigger:None''. This eliminates candidates that are highly similar to arguments or negative triggers in the training set. The set of retrieved samples is fixed after these filtering steps. If all remaining retrievals belong to a single event type, the retrieval evidence is considered sufficiently confident: the candidate is predicted to be a trigger and is assigned that event type without further LLM judgment. Otherwise, prompts are generated automatically and sent to Qwen3-1.7B iteratively to obtain the final decision. The iteration follows the event-type clusters formed from the filtered retrievals, sorted by the number of elements in each cluster.
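
The decision logic of this stage can be written as the following minimal Python sketch, for illustration only. The function name, the representation of a retrieval as a \texttt{(score, event\_type)} pair, and the string-valued decisions are hypothetical. Two further points are assumptions that the description above leaves open: a candidate with no retrieval above the threshold is discarded, and ``trigger:None'' counts as a cluster of its own.

\begin{lstlisting}[language=Python]
from collections import Counter

NONE_TYPE = "trigger:None"


def filter_candidate(retrievals, threshold=0.7, max_clusters=4):
    """retrievals: list of (score, event_type) pairs for one candidate.
    Returns (decision, event types sorted by cluster size)."""
    # Threshold filter, keeping the best-scoring retrieval first.
    kept = sorted((r for r in retrievals if r[0] >= threshold),
                  reverse=True)
    if not kept:
        return "discard", []      # assumption, see the lead-in
    clusters = Counter(event_type for _, event_type in kept)
    if len(clusters) > max_clusters:   # rule 1: too many event types
        return "discard", []
    if kept[0][1] == NONE_TYPE:        # rule 2: top-1 is a negative
        return "discard", []
    ordered = [t for t, _ in clusters.most_common()]
    if len(ordered) == 1:
        return "accept", ordered       # unanimous, no LLM call
    return "ask_llm", ordered          # largest cluster first


print(filter_candidate([(0.9, "Conflict:Attack"),
                        (0.8, "Conflict:Attack"),
                        (0.5, "Life:Die")]))
# prints: ('accept', ['Conflict:Attack'])
print(filter_candidate([(0.9, "Conflict:Attack"),
                        (0.85, "Life:Die"),
                        (0.8, "Life:Die")]))
# prints: ('ask_llm', ['Life:Die', 'Conflict:Attack'])
\end{lstlisting}

In the second call the two event types disagree, so the decision is deferred to the LLM, and the larger cluster is listed first, in the order in which it would be queried.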

\subsubsection{Automated Process of Prompt Generation and LLM Judgment}
\label{llm_auto_filter}
The main idea of this process is to automatically generate the positive and negative samples of a prompt from the event-type clusters and to query the LLM iteratively to obtain the final judgment, as outlined in Algorithm~\ref{alg:my-algorithm}. First, the prompt is initialized with one instruction and two fixed examples that help Qwen3-1.7B grasp the basic task (line 1). Then, iterating from the largest cluster to the smallest, we automatically revise and augment the prompt with additional examples (lines 2--18). For the current cluster, the event type together with its sentences and triggers is treated as positive samples (lines 5--7). Negative samples are drawn from all clusters ranked lower than the current one (lines 8--12). Thus, each prompt comprises one instruction, two predefined examples, and the automatically selected positive and negative examples, and is sent to the LLM. If the LLM outputs ``Yes'', the candidate is predicted to be a trigger and assigned the event type of the current cluster (lines 14--16). Otherwise, the process continues to the next cluster with the prompt updated accordingly. If the candidate has not been associated with any event type after all clusters have been processed, it is discarded (line 19). The rationale is that the LLM has judged even the most similar sentences and their event types to be irrelevant to the candidate, which indicates that the candidate does not carry enough semantic information to be a trigger. The automated prompt is given in Appendix~\ref{prompt_auto}.

\begin{algorithm}[htbp]
    \caption{Automated Prompt Design for Trigger Judgment} 
    \label{alg:my-algorithm}
    \begin{algorithmic}[1] 
        \Require Clusters sorted by size $C = \{c_1, c_2, \ldots, c_m\}$; elements of cluster $c_i = \{e_1, e_2, \ldots, e_k\}$; main metadata of element $e_k = \{\mathit{sentence}_k, \mathit{trigger}_k, \mathit{event\_type}_k\}$; predefined instruction $p_{\mathrm{inst}}$; two predefined demonstrations $p_{\mathrm{demon}}$
        \Ensure True or False
        \State $p \gets p_{\mathrm{inst}} + p_{\mathrm{demon}}$
        \For{$l = 1 \to m$}
            \State $c_i \gets c_l$
            \State $p_{\mathrm{temp}} \gets \text{``''}$
            \For{$e_k \in c_i$}
                \State $p_{\mathrm{temp}} \mathrel{+}= \mathrm{form\_positive\_prompt\_demons}(\mathit{sentence}_k, \mathit{trigger}_k, \mathit{event\_type}_k)$
            \EndFor
            \For{$j = l + 1 \to m$}
                \For{$e_k \in c_j$}
                    \State $p_{\mathrm{temp}} \mathrel{+}= \mathrm{form\_negative\_prompt\_demons}(\mathit{sentence}_k, \mathit{trigger}_k, \mathit{event\_type}_k)$
                \EndFor
            \EndFor
            \State $p \mathrel{+}= p_{\mathrm{temp}}$
            \If{$\mathrm{LLM\_Judgment}(p)$}
                \State \Return True
            \EndIf
            \State $p \mathrel{-}= p_{\mathrm{temp}}$
        \EndFor
        \State \Return False
    \end{algorithmic}
\end{algorithm}
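
For readers who prefer code, Algorithm~\ref{alg:my-algorithm} can be rendered as the following minimal Python sketch. It is an illustration under stated assumptions: the demonstration formats produced by \texttt{positive\_demo} and \texttt{negative\_demo} are hypothetical, the LLM call is passed in as a function so that a stub can replace Qwen3-1.7B, and the candidate and its sentence are assumed to be part of the instruction string. Unlike the algorithm, which returns True or False, the sketch returns the accepted event type (or \texttt{None}) so that the classification result is explicit.

\begin{lstlisting}[language=Python]
def positive_demo(sentence, trigger, event_type):
    # Hypothetical demonstration format.
    return f"[+] {sentence} | {trigger} | {event_type}\n"


def negative_demo(sentence, trigger, event_type):
    # Hypothetical demonstration format.
    return f"[-] {sentence} | {trigger} | {event_type}\n"


def judge_candidate(clusters, instruction, fixed_demos, llm_judgment):
    """clusters: list of clusters, largest first; each cluster is a
    list of (sentence, trigger, event_type) sharing one event type.
    Returns the accepted event type, or None if all are rejected."""
    base = instruction + fixed_demos
    for l, current in enumerate(clusters):
        demos = ""
        for sentence, trigger, event_type in current:
            demos += positive_demo(sentence, trigger, event_type)
        # Negatives come from all clusters ranked below the current one.
        for lower in clusters[l + 1:]:
            for sentence, trigger, event_type in lower:
                demos += negative_demo(sentence, trigger, event_type)
        if llm_judgment(base + demos):
            return current[0][2]
        # `demos` is rebuilt from scratch for the next cluster, which
        # corresponds to removing p_temp from p in the algorithm.
    return None


def stub_llm(prompt):
    """Stand-in for the LLM: says yes only to Life:Die positives."""
    return any(line.startswith("[+]") and line.endswith("Life:Die")
               for line in prompt.splitlines())


clusters = [
    [("s1", "attack", "Conflict:Attack"),
     ("s2", "attack", "Conflict:Attack")],
    [("s3", "died", "Life:Die")],
]
print(judge_candidate(clusters, "instruction\n", "demos\n", stub_llm))
# prints: Life:Die
\end{lstlisting}

With the stub, the first (larger) cluster is rejected and the second is accepted, so the candidate receives the event type of the second cluster.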


\section{Experiments}
\label{experiment}

\subsection{Experimental Setup}
\label{exp_set}
Our experiments are conducted on the ACE-05 dataset, which was first introduced by \citet{ahn2006stages}. We follow the same data split as \citet{lin-etal-2020-joint}. The dataset comprises 33 distinct event types. For evaluation, we use the strict matching F1 score for both trigger identification and trigger classification. The hyperparameter $\alpha$, which controls the relative weight of the word-level and sentence-level cosine similarities, is set to 0.8. The score threshold is set to 0.7. Additionally, the number of prefix words prepended to each sentence is set to 6 (a comparison of different prefix counts is given in Appendix~\ref{diff_prefix_result}).
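
As a reference point for the metric, the following Python sketch computes an F1 score in which only exact matches count. It reflects a common reading of strict matching, assumed here rather than taken from the evaluation scripts: for identification, a predicted trigger is correct only if it matches a gold trigger exactly in the same sentence, and for classification the event type must match as well. The function name and the event-type labels in the example are illustrative.

\begin{lstlisting}[language=Python]
def strict_f1(gold, predicted):
    """gold, predicted: sets of hashable items; exact matches only."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Items are (sentence_id, trigger_text, event_type); labels are
# illustrative only.
gold = {(0, "thrusting into", "Movement:Transport"),
        (0, "war", "Conflict:Attack")}
pred = {(0, "thrusting", "Movement:Transport"),
        (0, "war", "Life:Die")}

# Identification ignores the event type.
gold_id = {(s, t) for s, t, _ in gold}
pred_id = {(s, t) for s, t, _ in pred}

print(strict_f1(gold_id, pred_id))  # prints: 0.5
print(strict_f1(gold, pred))        # prints: 0.0
\end{lstlisting}

In the example, the partial span ``thrusting'' receives no credit, and ``war'' counts for identification but not for classification because its predicted type is wrong.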

\subsection{Compared Baselines}
\label{comp_baseline}
We compare our experimental results with several state-of-the-art generative approaches, including three mainstream generative model-based methods, Text2Event, DEGREE, and DICE, as well as the LLM-based pipeline system STAR.

\begin{itemize}
    \item \textbf{Text2Event}~\citep{lu2021text2event}: A controllable event extraction framework implemented via a generative approach, with control achieved through a trie-based constrained decoding algorithm and curriculum learning.

    \item \textbf{DEGREE}~\citep{hsu2021degree}: A template-enhanced generative modeling approach that enables the extraction of event triggers and arguments with fewer samples.

    \item \textbf{DICE}~\citep{ma2023dice}: A template-based generative model that is similar to DEGREE but employs distinct queries for different argument roles.

    \item \textbf{STAR}~\citep{ma2024star}: A structure-to-text data generation framework accompanied by a self-refinement strategy for in-context learning with LLMs.
\end{itemize}

  
\subsection{Main Results}
\label{main_result}

Table~\ref{tab:main_results} compares our method with the baselines. The best and second-best results are both achieved under our proposed framework. Our top-performing model, based on Qwen3-4B, surpasses the previous state of the art by 2.78 percentage points in trigger identification (64.90 vs.\ 62.12) and 4.74 percentage points in trigger classification (61.20 vs.\ 56.46). With the exception of the identification performance of the Llama2-7B model, which falls slightly below the previous best, the other seven results exceed the previous state of the art. These results demonstrate the effectiveness of our approach. The improvement in classification performance, in particular, highlights the benefit of incorporating the dual retrieval mechanism. Furthermore, within our framework the much smaller Qwen3-1.7B outperforms GPT-3.5, which shows that the method does not depend on a large model. We also observe a performance gap between Qwen3 and Llama2, which suggests that Qwen3 has stronger comprehension capabilities.

\begin{table}[ht]
\centering
\captionsetup{justification=centering}
\caption{Main results. The baselines (first six rows) use GPT-3.5 and GPT-4.0 to generate data within the STAR architecture.}
\begin{tabular}{@{}c c c c c@{}}
\toprule
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Models} & \multirow{2}{*}{Method} & \multicolumn{2}{c}{Performance (\%)} \\
\cmidrule(lr){4-5}
& & & Trigger Identification & Trigger Classification \\
\midrule
\multirow{6}{*}{ACE-05} 
& \multirow{5}{*}{GPT-3.5} & Text2Event & 11.30 & 3.47 \\
& & DEGREE & 17.52 & 6.21 \\
& & DICE & 16.94 & 7.09 \\
& & Instruction & 18.31 & 8.37 \\
& & Instruction+Examples & 59.71 & 53.29 \\
& GPT-4.0 & Instruction+Examples & 62.12 & 56.46 \\
\midrule
\multirow{4}{*}{ACE-05} 
& Llama2-7B & \multirow{4}{*}{REVO-ED} & 61.92 & 58.25 \\
& GPT-3.5 & & 63.01 & 58.13 \\
& Qwen3-1.7B & & \underline{63.75} & \underline{59.93} \\
& Qwen3-4B & & \textbf{64.90} & \textbf{61.20} \\
\bottomrule
\end{tabular}
\label{tab:main_results}
\end{table}

\subsection{Importance of Extraction Prompt}
\label{importance_extract_prompt}
To assess the proposed candidate extraction pipeline, we store the extraction results of every sub-module and count the samples in which the gold triggers have not been identified; Figure~\ref{fig:chart_com} plots these counts as a line graph. Llama2-7B performs well in the Basic Extraction Module, but its extraction coverage barely changes over the three refinement steps. This seems to imply that Llama2-7B has strong general semantic comprehension but cannot fully exploit the demonstrations in semantically complicated scenarios, which restricts its final performance. Qwen3-1.7B shows roughly the opposite trend, which appears to indicate that it comprehends examples well but is weaker at understanding instructions. GPT-3.5 performs satisfactorily, but its final result is slightly below that of Qwen3-1.7B. The outstanding performance of Qwen3-4B confirms its superior ability under our scheme. Furthermore, comparing these coverage trends with the experimental results, we observe a positive correlation between the final evaluation metrics and the quality of the extracted candidates, which supports the effectiveness of our fine-grained retrieval process.

\begin{figure}[ht]
  \begin{center}
    \includegraphics[width=0.55\textwidth]{iclr2026/performance_comparison.png} 
  \end{center}
  \caption{Number of gold triggers left uncovered after each extraction prompt step. ``Basic'' denotes the Basic Extraction Module and ``Supplement'' denotes the Supplementary Extraction Module. ``Iter-1'', ``Iter-2'', and ``Iter-3'' denote the successive loops of the Iterative Refinement Module.}
  \label{fig:chart_com}
\end{figure}

\subsection{Ablation Study}
\label{abl_st}

We conducted six ablation experiments to validate the effectiveness of each component of the scheme; the results are reported in Table~\ref{tab:ablation_study}. Performance declines sharply when word embeddings, score threshold filtering, or negative case filtering is removed, confirming that these components are indispensable to the framework. They represent crucial and innovative contributions of our work. The smaller decrease after removing sentence embeddings suggests that word-level semantics alone are insufficient for optimal retrieval and that sentence-level semantics contribute as well. Removing the Cluster Count Limitation changes both scores by only $0.01$, so its effect in this setting is marginal; we expect it to matter more in scenarios that involve large-scale retrieval sets with high semantic diversity and multiple event types. Finally, the drop observed without LLM Automated Filtering highlights the value of LLMs in the final determination and the effectiveness of this automated process.

\begin{table}[ht]
\centering
\caption{Ablation Study Results on ACE-05 Dataset using Qwen3-1.7B Model}
\label{tab:ablation_study}
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Configuration} & \textbf{Trigger ID} & \textbf{Trigger CLS} \\
\midrule
\quad REVO-ED (Full Model) & 63.75 & 59.93 \\
\midrule
\multicolumn{3}{l}{\textit{Ablation Studies:}} \\
\midrule
\quad w/o Sentence Embeddings & 60.69 & 56.68 \\
\quad w/o Word Embeddings & 18.54 & 12.23 \\
\quad w/o Score Threshold Filtering & 31.26 & 30.60 \\
\quad w/o Cluster Count Limitation & 63.74 & 59.92 \\
\quad w/o Negative Cases Filtering & 27.50 & 25.64 \\
\quad w/o LLM Automated Filtering & 62.71 & 58.98 \\
\bottomrule
\end{tabular}
\end{table}
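
To make the relative size of each effect in Table~\ref{tab:ablation_study} explicit, the drops can be computed directly from the reported scores. Here $\Delta$ denotes the full-model score minus the ablated score, in percentage points; this notation is ours and is used only for this illustration.
\[
\begin{array}{lrr}
\mbox{Removed component} & \Delta_{\mathrm{ID}} & \Delta_{\mathrm{CLS}} \\
\hline
\mbox{Sentence Embeddings} & 63.75 - 60.69 = 3.06 & 59.93 - 56.68 = 3.25 \\
\mbox{Word Embeddings} & 63.75 - 18.54 = 45.21 & 59.93 - 12.23 = 47.70 \\
\mbox{Score Threshold Filtering} & 63.75 - 31.26 = 32.49 & 59.93 - 30.60 = 29.33 \\
\mbox{Cluster Count Limitation} & 63.75 - 63.74 = 0.01 & 59.93 - 59.92 = 0.01 \\
\mbox{Negative Cases Filtering} & 63.75 - 27.50 = 36.25 & 59.93 - 25.64 = 34.29 \\
\mbox{LLM Automated Filtering} & 63.75 - 62.71 = 1.04 & 59.93 - 58.98 = 0.95
\end{array}
\]
Read this way, word embeddings, negative case filtering, and score threshold filtering account for the largest drops, while the Cluster Count Limitation has almost no effect in this setting.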


\subsubsection*{Acknowledgments}
This work was supported by the Natural Science Foundation of China (No. 62372057), and the Key Laboratory of Trustworthy Distributed Computing and Service (MOE).


\bibliography{iclr2026_conference}
\bibliographystyle{iclr2026_conference}

\appendix
\section*{Appendix}
%\subsection{the prompt of basic extraction module}
%\label{prompt_basic_extraction_module}
%\subsubsection{simple cases and special cases}
%\label{prompt_simple_special}
\paragraph{\large A \hspace{0.5em} The Prompts of Candidate Extraction}
\label{append_candidate_extraction}
\paragraph{\large A.1 \hspace{0.5em} The Prompt of Basic Extraction Module}
\label{prompt_basic_extraction_module}
\paragraph{A.1.1 \hspace{0.5em} Simple Cases and Special Cases}
\label{prompt_simple_special} 
The prompt for extracting simple and special cases in the Basic Extraction Module is as follows:
\begin{tcolorbox}[
    title=Prompt of Simple Cases and Special Cases,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
Extract the event triggers in the input sentence. An event trigger is an indicative word or phrase in a sentence that clearly signals the occurrence of one event. It can be a verb (e.g., "explode"), A noun (e.g., "summit"), or A pronoun (e.g., "it", "this"), and it can also be a phrase (e.g., "War World II") or A combination of words with different parts of speech (e.g., "shoot at", "took over"). For each input sentence, determine whether it describes one or more events or contains event information(The more events in the sentence can be mentioned in clauses or conjunctions, etc). If there exists one or more events, extract the corresponding event triggers. If no event trigger exists or the sentence doesn't contain event information, output :"trigger 1: None". Otherwise, output the one or more event triggers in the input sentence in the form of the following output format.
Output Format:
If no event trigger exists, output:
trigger 1: None
Otherwise, list all triggers in order:
trigger 1: XXXX
(trigger 2: XXXX)
(trigger 3: XXXX)
(...)
Input Examples:
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}

\paragraph{A.1.2 \hspace{0.5em} Pronoun Cases} 
\label{prompt_pronoun} 
The prompt for extracting pronoun cases in the Basic Extraction Module is as follows:
\begin{tcolorbox}[
    title=Prompt of Pronoun Cases,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
Find implicit event-triggering pronouns (it, this) in the given sentence. The pronouns should semantically refer to an event. If the required pronouns don't exist, output :"trigger 1: None". Otherwise, output the pronoun in the form of the following format.
Output Format:
If no implicit event-triggering pronouns exist, output:
trigger 1: None
Otherwise, output the implicit event-triggering pronouns:
trigger 1: XXXX
Input Examples:
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}


\paragraph{\large A.2 \hspace{0.5em} The Prompt of Supplementary Extraction Module}
\label{prompt_supplementary_extraction_module}
\paragraph{A.2.1 \hspace{0.5em} Missed Events Retrospections} 
\label{prompt_miss_retro}
The prompt of the Missed Events Retrospections sub-module in the Supplementary Extraction Module is as follows:
% Create a code box with a black title bar
\begin{tcolorbox}[
    title=Prompt of Missed Events Retrospections,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
Extract the potential unextracted event structural words that are refering to the given event words(splitted by ",") in the input sentence. Such unextracted event structural words are important in indicating an event information. The corresponding number of such structural information may be one or more. Output the potential unextracted structural information in the following output format.
Output Format:
Information 1: XXXX
(Information 2: XXXX)
(Information 3: XXXX)
(...)
Input Examples:
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}

\paragraph{A.2.2 \hspace{0.5em} Event Information Completion} 
\label{prompt_info_completion}
The prompt of the Event Information Completion sub-module in the Supplementary Extraction Module is as follows:
\begin{tcolorbox}[
    title=Prompt of Event Information Completion,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
Extract the event structural words that are not refering to but possess connections with the given event words(splitted by ","). Such event structural words are important in indicating an event information. The corresponding number of such structural information may be one or more. Output these event structural information in the following output format.
Output Format:
Information 1: XXXX
(Information 2: XXXX)
(Information 3: XXXX)
(...)
Input Examples:
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}

\paragraph{\large A.3 \hspace{0.5em} The Prompt of Iterative Refinement Module}
\label{prompt_iterative_refinement_module}
The prompt for the Iterative Refinement Module is as follows:
\begin{tcolorbox}[
    title=Prompt of Iterative Refinement Module,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
It is known that event structural words are important informations that can indicate an event occurrence. Given the known structural words of one sentence, there may still exsit some potentially unextractced event structural words. The corresponding number of such unextracted structural words may be one or more. Extract them and output them in the following output format.
Output Format:
Information 1: XXXX
(Information 2: XXXX)
(Information 3: XXXX)
(...)
Input Examples:
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}

\paragraph{\large B \hspace{0.5em} The Automated Prompt Design}
\label{prompt_auto}
The prompt used for automated prompt design with fine-grained LLM judgment is as follows:

\begin{tcolorbox}[
    title=Prompt of Automated Prompt Design,
    colback=codebg,
    colframe=titlebg,
    colbacktitle=titlebg,
    coltitle=titletext,
    fonttitle=\bfseries\large,
    arc=0pt,
    outer arc=0pt,
    boxrule=0.8pt,
    titlerule=0pt,
    % Adjust the after skip parameter here to increase the spacing below
    after skip=15pt,  % increased from the original 5pt to 15pt or more
    top=2pt,
    bottom=2pt,
    left=3pt,
    right=3pt,
    before skip=5pt,
    boxsep=2pt,
    leftrule=0.8pt,
    rightrule=0.8pt
]
\begin{lstlisting}
<|im_start|>[user]:
{prompt}
Given an event statement and its corresponding event trigger word, where an event trigger refers to an indicative word or phrase that signals the occurrence of a specific event.Please determine whether the trigger word belongs to the specified event type based on its contextual meaning and the overall semantics of the statement. If it belongs, respond with "yes"; otherwise, respond with "no". Then, explain your answer.
Output Format:
Yes./No.
Input Examples:
### Input 1
Sam Waksal, founder of the US pharmaceutical company ImClone Systems was sentenced to 87 months in prison Tuesday for insider trading.
### Trigger
sentenced
### Event Type
Justice:Sentence
### Answer
Yes.
...
<|im_end|>
<|im_start|>assistant
<think>
\end{lstlisting}
\end{tcolorbox}

\paragraph{\large C \hspace{0.5em} Experimental Results with Different Prefix Numbers}
\label{diff_prefix_result}

We conduct a series of experiments with Qwen3-1.7B to evaluate the impact of the number of prefix candidates used when constructing the Word Contextual Chroma Database. Table~\ref{tab:prefix} reports the F1 scores for prefix counts from 0 to 8. The first column gives the prefix count; the second and third columns give the corresponding scores, with the difference from the optimal result (achieved with 6 prefixes) in parentheses. For prefix counts from 0 to 7, the variations are minor, with differences between $-1.10$ and $-0.31$ for identification and between $-0.92$ and $-0.29$ for classification, demonstrating the robustness of our approach to the prefix count. With 8 prefixes, however, both scores decline sharply, by $36.31$ points in identification and $32.89$ points in classification relative to the best results. We speculate that an excessive number of prefix words may disrupt the original sentence semantics and thereby degrade the retrieval process.
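
The parenthesized values in Table~\ref{tab:prefix} can be read as differences from the 6-prefix result. As a brief worked example, write $F_1(k)$ for the score obtained with $k$ prefixes (a notation introduced only for this illustration); the difference is then
\[
\Delta(k) = F_1(k) - F_1(6),
\]
so that for $k = 8$
\[
\begin{array}{rcl}
\Delta_{\mathrm{ID}}(8) & = & 27.44 - 63.75 = -36.31, \\
\Delta_{\mathrm{CLS}}(8) & = & 27.04 - 59.93 = -32.89,
\end{array}
\]
whereas the largest difference among the remaining counts is $\Delta_{\mathrm{ID}}(1) = 62.65 - 63.75 = -1.10$.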

\begin{table}[htbp]

\centering

\captionsetup{justification=centering}

\caption{Trigger Identification and Classification Results with Different Prefix Counts When Constructing the Word Contextual Chroma Database with Qwen3-1.7B. Values in parentheses are differences from the optimal result.}

\label{tab:prefix}
\begin{tabular}{c c c}
\toprule
\textbf{Prefix Count} & \textbf{Trigger ID (\%)} & \textbf{Trigger CLS (\%)} \\
\midrule
6 (Optimal Results) & 63.75 & 59.93 \\
\midrule
\multicolumn{3}{c}{\textit{Compared Results}} \\
\midrule
0 & 62.80 (-0.95) & 59.01 (-0.92) \\
1 & 62.65 (-1.10) & 59.17 (-0.76) \\
2 & 62.88 (-0.87) & 59.64 (-0.29) \\
3 & 63.40 (-0.35) & 59.40 (-0.53) \\
4 & 63.22 (-0.53) & 59.45 (-0.48) \\
5 & 63.44 (-0.31) & 59.64 (-0.29) \\
7 & 62.91 (-0.84) & 59.31 (-0.62) \\
8 & 27.44 (-36.31) & 27.04 (-32.89) \\
\bottomrule
\end{tabular}
\end{table}


\end{document}
