
\section{Acknowledgement}
The authors of this paper were supported by the NSFC Fund (U20B2053) from the NSFC of China, the RIF (R6020-19 and R6021-20) and the GRF (16211520 and 16205322) from RGC of Hong Kong, the MHKJFS (MHP/001/19) from ITC of Hong Kong and the National Key R\&D Program of China (2019YFE0198200) with special thanks to HKMAAC and CUSBLT, and the Jiangsu Province Science and Technology Collaboration Fund (BZ2021065). We also thank the support from the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21, RMGS23CR05, RMGS23EG08).
\section{Comparison between Different Prefix Prompts}\label{sec:prefix_comp}
In this section, we conduct experiments on ACE-2005 dataset to compare the effectiveness of using different prefix prompts in our models. We compare the following prefix prompts with the one discussed in Section \ref{sec:prompting_module}: (1) ``\textbf{This is a [] event whose trigger is "[]".}''; (2) ``\textbf{The event type is [], and its occurrence is most clearly expressed by "[]".}''; (3) ``\textbf{The event type is [] and the trigger is "[]".}''. The results are shown in Table \ref{tab:diff_prefix}, where ``Prefix (0)'' refers to the prefix prompt discussed in Section \ref{sec:prompting_module}, whereas ``Prefix (1)'' refers to the first prefix prompt described in this section, and so on.
\begin{table}[!h]
    \centering
    \begin{tabular}{cc}
    \toprule
        Prefix Prompt & F1\\
        \midrule
        Prefix (0) & \textbf{66.1}\\
        Prefix (1) & 65.2\\
        Prefix (2) & 65.6\\
        Prefix (3) & 63.0\\
        \bottomrule
    \end{tabular}
    \caption{Performance of different prefix prompts.}
    \label{tab:diff_prefix}
\end{table}
From the table we can see that the prefix prompt described in Section \ref{sec:prompting_module} is the most effective one, which might be due to the fact that the prefix prompt not only is based on the definitions of events and triggers \citep{grishman2005nyu}, but also has a natural and smooth expression. 

\section{Results of all Other Global Constraints}\label{sec:comp_other_cons}
\begin{table*}[!h]
    \centering
    \resizebox{2\columnwidth}{!}{
    \begin{tabular}{cc}
    \toprule
    Global constraint & Effect on overall performance \\
    \midrule
    There is at most one Time-Arg in each event. & 0.4  \\
    There is at most one Place-Arg in each event. & 0.1 \\
    A TRANSPORT event has at most one Destination argument. & -0.2 \\
    A TRANSPORT event has at most one ORIGIN argument. & -0.1 \\
    A START-POSITION event has at most one Person argument. & 0.2 \\
    A START-POSITION event has at most one Entity argument. & -0.1 \\
    A START-POSITION event has at most one Position argument. & 0.1 \\
    A End-POSITION event has at most one Person argument. & -0.2 \\
    If a Start-Position event and an End-Position event share & \multirow{5}{*}{
    0.1} \\
    arguments, then Start-Position.Person is the same as & \\
    End-Position.Person, and Start-Position.Entity is the same & \\
    as End-Position.Entity, Start-Position.Position is the same & \\
    as End-Position.Position. & \\
    If an Arrest-Jail event and a Charge-Indict event share arguments, & \multirow{3}{*}{0.3} \\
    Arrest-Jail.Person is the same as Charge-Indict.Defendant, they & \\
    share the same Crime argument. & \\
    If a Die event and an Attack event share arguments, then & \multirow{5}{*}{-0.2} \\
    Die.Place is the same as Attack.Place, Die.Victim is the &\\
    same as Attack.Target, Die.Instrument is the same as &\\
    Attack.Instrument, Die.Time is the same as Attack.Time, &\\
    Die.Agent is the same as Attack.Attacker. & \\
    \bottomrule
    \end{tabular}
    }
    \caption{Other global constraints and corresponding effects on overall performance.}
    \label{tab:comp_other_cons}
\end{table*}

In this section, we present the results of all other global constraints. The results are shown in Table \ref{tab:comp_other_cons}.

\section{Conclusion}
We propose a zero-shot EAC model using global constraints with prompting.
Compared with previous works, our model does not require any annotation or manual prompt design, and our constraint modeling method can be easily adapted to any other datasets.
Hence, our model can be easily generalized to any open-world event ontologies.
Experiments on two standard event extraction datasets demonstrate our model's effectiveness.

\section{Experiments}
We first present the experimental settings, baselines used for comparison, and some implementation details. Next, we show and analyze the experiment results. Then we present a detailed analysis of the prompting module and global constraints regularization module. Finally, we conduct an error analysis.

\subsection{Settings}\label{sec:settings}
We use ACE (2005-E$^+$)\footnote{https://www.ldc.upenn.edu/collaborations/past-projects/ace} \citep{DoddingtonMPRSW04,LinJHW20} and ERE(-EN) \citep{SongBSRMEWKRM15} as datasets. In total, ACE has 33 event types and 22 roles, whereas ERE has 38 event types and 21 roles. We pre-process all events to keep only the event subtypes whenever applicable, as done in \citep{LinJHW20}.
Following the pre-processing in \citep{ZhangWR21}, for each dataset, we merge all splits into one test set since our approach is zero-shot.
When argument spans are not given, we pipeline our model with an argument identification module adapted from \citep{LyuZSR20}. Specifically, we replace the QA model in \citep{LyuZSR20} with a more powerful PTLM with a span classification head on top, and the whole model has been fined-tuned for extractive QA tasks. Then for a passage, we prompt each role using the new QA model as in \citep{LyuZSR20}. We collect the prompt results for all roles (ignoring the ``None'' result) as candidate spans for the passage.
We use the F1 score for evaluation following \citep{JiG08}, where argument spans are evaluated on the head level when not given.
Regarding PTLMs, We use GPT-J (6B) \citep{gpt-j} instances from Huggingface \citep{WolfDSCDMCRLFDS20}, where an instance for causal language modeling is used for prompting, and an instance for QA is used for argument identification.
In all the following sections except Section \ref{sec:main_res}, we conduct experiments on ACE, assuming that argument spans are given.

\subsection{Main Results}\label{sec:main_res}
\begin{table*}[!ht]
    \centering
    \resizebox{2\columnwidth}{!}{
    \begin{tabular}{ccccc}
    \toprule
        \multirow{2}{*}{Model} & \multicolumn{2}{c}{ACE} & \multicolumn{2}{c}{ERE} \\%\cline{2-5}
        & argument span given & argument span not given & argument span given & argument span not given \\
        \midrule
        \citep{naacl2022degree} (supervised) & 79.3 & 71.8 & 79.8 & 72.5 \\
        \midrule
        \citep{LiuCLBL20} & 46.1 & 24.2 & 40.9 & 22.8 \\
        \citep{LyuZSR20} & 47.8 & 26.9 & 44.5 & 26.3 \\
        \citep{ZhangWR21}  & 53.6 & 23.5 & 51.9 & 20.2 \\
        Ours & \textbf{66.1} & \textbf{31.2} & \textbf{62.8} & \textbf{29.6} \\ \bottomrule
    \end{tabular}
    }
    \caption{Performance of supervised model, zero-shot baselines, and our model. The best scores among the ones of zero-shot methods are in bold font.}
    \label{tab:main_res}
\end{table*}
We report the main results comparing our models with three previous powerful zero-shot models \citep{LiuCLBL20,LyuZSR20,ZhangWR21}. Moreover, we also report the results of a SOTA supervised model \citep{naacl2022degree}. We obtain the results of all compared methods from our own experiments to ensure a fair comparison on the same datasets and same settings.
From Table \ref{tab:main_res}, we have the following observations:
\begin{itemize}
    \item Our model achieves superior performance on both datasets under both settings compared with all zero-shot baselines. Specifically, our model surpasses the best zero-shot baselines \citep{ZhangWR21} by 12.5\% and 10.9\% on ACE and ERE, respectively. Without argument spans, our model outperforms the respective best zero-shot baselines \citep{LyuZSR20} by 4.3\% and 3.3\% on ACE and ERE, respectively, which is also a noticeable gap. Such large performance improvements can be attributed to the following: (1) the prefix prompt guides the PTLM to effectively capture input's event-related perspective and trigger; (2) the cloze prompt leverages linguistic and commonsense knowledge stored in PTLM to improve its contextual understanding of event arguments; (3) the global constraints regularization incorporate global information and domain knowledge in inference. In Section \ref{sec:analyze_prompt}, we compare the effects of using different PTLMs like BERT in the prompting module, and the results show that our model consistently outperforms previous zero-shot models, as shown in Table \ref{tab:main_res} and Figure \ref{fig:analysis_prompt_ptlm}.
    \item Compared with the supervised SOTA model \citep{naacl2022degree}, there is still a significant gap between our model's performance and that it. Specifically, \citep{naacl2022degree} outperforms our model by 13.2\% and 17.0\% on ACE and ERE, respectively. When argument spans are not provided, \citep{naacl2022degree} outruns our model by 40.6\% and 42.9\% on ACE and ERE, respectively. We can see that the advantage of supervised SOTA over our zero-shot method is much more distinct when argument spans are not given in advance. This is probably because our zero-shot argument identification module described in Section \ref{sec:settings} is not powerful enough, which causes severe error propagation to our EAC model.
\end{itemize}


\subsection{Analysis of Prompting Module}\label{sec:analyze_prompt}
We conduct experiments to examine the effects of different configurations of prefix prompt templates. 
Specifically, we compare our model's complete prefix prompt with the following configurations: (1) removing event type information from the prefix; (2) removing trigger information from the prefix; (3) removing the whole prefix.
For instance, suppose the passage is ``In Baghdad, a bomb was fired at 17 people.'' mentioned in Section \ref{sec:prompting_module}, the prefix in configuration (1) would be \textbf{``This event's occurrence is most clearly expressed by `fired'.''}, the prefix in configuration (2) would be \textbf{``This is a Attack event.''}, and in configuration (3) there would be no prefix.
\begin{table}[t]
    \centering
    \begin{tabular}{ccc}
    \toprule
        Configurations & F1 & $\Delta$ \\
        \midrule
        complete prefix prompt & 66.1 & - \\
        \midrule
        w/o event type & 64.4 & -1.7 \\
        w/o trigger & 64.9 & -1.2 \\
        w/o prefix prompt & 62.8 & -3.3 \\
        \bottomrule
    \end{tabular}
    \caption{Results of using different configurations of prefix prompt.}
    \label{tab:prefix}
\end{table}
The corresponding results are shown in Table \ref{tab:prefix}, where we have the following observations. First, removing either event type or trigger from the prefix prompt will cause a performance drop, which indicates that both kinds of information have contributions to the prompting process. Second, event type plays a more significant role than trigger does in prefix prompt, and the joint effect of them is greater than the sum of their respective effects.

In addition, we examine the effects of using different PTLMs in the prompting module.
We compare the following PTLMs with GPT-J (6B): BERT (large, uncased) \citep{DevlinCLT19}, RoBERTa (large) \citep{roberta}, BART (large) \citep{LewisLGGMLSZ20}, GPT-2 (xl) \citep{radford2019language}, T5 (11B) \citep{RaffelSRLNMZLL20}. The results are shown in Figure \ref{fig:analysis_prompt_ptlm}, where we have the following observations. 
\begin{figure}[t]
    \centering
    \resizebox{\columnwidth}{!}{
    \includegraphics{Figures/analysis_prompt.png}
    }
    \caption{Comparison between the performance of using different PTLMs in prompting module.}
    \label{fig:analysis_prompt_ptlm}
\end{figure}
First, the instance using GPT-J has the best performance, surpassing other instances by 4.2\% to 7.9\%. This shows that GPT-J has a better ability to understand events and their associated arguments compared to other PTLMs. Second, as PTLMs are listed in ascending order based on their numbers of parameters, we can see that for the first five models, the performance increases as the sizes of PTLMs become larger, which is consistent with the widely accepted notion that the larger model has a better capability of solving language tasks. However, the instance using the largest PTLM, T5 (11B), has a worse performance than GPT-2 and GPT-J. This is probably because autoregressive language modeling is more suitable for capturing information related to event arguments than mask language modeling is. 


\subsection{Analysis of Global Constraints Regularization Module}\label{sec:analyze_constraint}
We conduct experiments to study the individual effect of each global constraint on the overall performance.
The results are shown in Table \ref{tab:used_constraints}, where we have the following observations. 
\begin{table}[!h]
    \centering
    \begin{tabular}{cccc}
    \toprule
        ~ & Model & F1 & $\Delta$ \\ 
        \midrule
        ~ & Full model & 66.1 & - \\ 
        \midrule
        ~ & w/o cross-task constraint & 60.5 & -5.6 \\
        ~ & w/o cross-argument constraint & 64.8 & -1.3 \\
        ~ & w/o cross-event constraint & 63.6 & -2.5 \\
        \bottomrule
    \end{tabular}
    \caption{Results of using different configurations of global constraints.}
    \label{tab:used_constraints}
\end{table}
First, every global constraint used by our model is beneficial to overall performance, which demonstrates that exploiting the domain knowledge about cross-task, cross-argument, and cross-event relations indeed provides our model with global understanding of event arguments. Second, the contribution of cross-task constraint is the most significant, which suggests that the global insights from the entity typing tasks are more effective in improving our model's reasoning ability about event arguments. Third, the cross-argument constraint is less effective than the other constraints, which shows that the global insights provided by the cross-argument constraint is less informative than those provided by the other constraints. 

Apart from the three global constraints described above, we have designed another 11 global constraints, which rely on cross-argument or cross-event relations. We add each of them into our model to check their respective effects on the overall performance.
The results of three of them are in Table \ref{tab:other_constraints}, whereas the results of all of them are in Section \ref{sec:comp_other_cons}. 
From the results, we can find that each of these constraints either brings minor improvement or even has a negative influence on the overall performance. 
Hence, we do not incorporate these constraints in our model to maintain our model's efficiency and effectiveness. %
\begin{table*}[t]
    \centering
    \resizebox{2\columnwidth}{!}{
    \begin{tabular}{cc}
    \toprule
    Global constraint & Effect on overall performance \\
    \midrule
    There is at most one Time-Arg in each event. & 0.4  \\
    A TRANSPORT event has at most one ORIGIN argument. & -0.1 \\
    If an Arrest-Jail event and a Charge-Indict event share arguments, & \multirow{3}{*}{0.3} \\
    Arrest-Jail.Person is the same as Charge-Indict.Defendant, they & \\
    share the same Crime argument. & \\
    \bottomrule
    \end{tabular}
    }
    \caption{Results of three other global constraints. Results of all other global constraints are in Section \ref{sec:comp_other_cons}}
    \label{tab:other_constraints}
\end{table*}



\subsection{Error Analysis}\label{sec:error_analysis}
We manually checked 100 wrong predictions of our model and found that most of the errors are caused by too general roles of some event types. Specifically, some roles' linguistic meanings are so general that a model, not knowing their detailed event-type-dependent semantics, tends to assign them to some arguments which should have been assigned other roles. An example is shown in Figure \ref{fig:error_analysis}.
\begin{figure}[t]
    \centering
    \resizebox{0.85\columnwidth}{!}{
    \includegraphics{Figures/Error_Analysis.png}
    }
    \caption{An Example of the wrong prediction caused by too general argument roles. The text in \textbf{bold face} denotes trigger and the \underline{underlined text} denotes target argument span.}
    \label{fig:error_analysis}
\end{figure}
The example describes a Justice:Arrest-Jail event, which is associated with the following roles: ``Person,'' ``Agent,'' ``Crime,'' ``Time,'' and ``Place.''  ``Person'' refers to the person who is jailed or arrested, whereas ``Agent'' refers to the jailer or the arresting agent. In the example, the argument span's true role should be ``Agent'' according to the detailed event-type-dependent semantics of ``Person'' and ``Agent.''
However, our approach is zero-shot and directly models all role labels as natural language words, without incorporating the detailed event-type-dependent semantics of those roles, which are too general (e.g., ``Person''). 
Therefore, our model assigns ``Person'' to ``Police'' since it is reasonable from the perspectives of linguistic and commonsense knowledge, and ``Person'' is much more common than ``Agent'' in the pre-training corpus of the PTLM in the prompting module, which makes it have much higher likelihood in the language modeling process.
Incorporating event-type-dependent semantics of the roles which are too general into our model is left as future work.







\section{Introduction}
Event Argument Classification\footnote{We focus on event argument because existing zero-shot trigger extraction models like \citet{ZhangWR21} are already strong enough, but the arguments remain a challenge. Our argument identification approach is described in Section \ref{sec:settings}.} (EAC), finding the roles of event arguments, is an important and challenging event extraction sub-task. 
\begin{figure}
    \centering
    \resizebox{0.85\columnwidth}{!}{
    \includegraphics{Figures/EAC_Example.png}
    }
    \caption{An example of EAC. The trigger is in \textbf{bold face}. Arguments are \underline{underlined} and connected to their roles by arrows.}
    \label{fig:eac_example}
\end{figure}
As shown in Figure \ref{fig:eac_example}, a ``Transfer-Money'' event whose trigger is ``acquiring'' has several argument spans (e.g., ``Daily Planet''). By determining the role of these arguments (e.g., ``Daily Planet'' as ``Beneficiary''), we can obtain a better understanding of the event, thus benefiting related applications like stock price prediction \citep{DingZLD15} and biomedical research \citep{ZhaoZYHML21}.

Many previous EAC works require numerous annotations to train their models \citep{LinJHW20,naacl2022degree,LiuHSW22}, which is not only costly as the annotations are labor-intensive but also difficult to be generalized to datasets of novel domains.
Accordingly, some EAC models adopt a few-shot learning paradigm \citep{MaW0LCWS22,naacl2022degree}. However, they are sensitive to the few-shot example selection and they still require costly task-specific training, which hinders their real-life deployment.
There have been some zero-shot EAC models based on transfer learning \citep{DaganJVHCR18}, or label semantics \citep{ZhangWR21, WangYCSH22}, or prompt learning \citep{LiuCLBL20,LyuZSR20,HuangHNCP22,abs-2204-02531}. 
However, these models' corresponding limitations impede their real-life deployment. The model based on transfer learning can be ineffective when new event types are very different from the observed one. As for models using label semantics, they require a laborious preparation process and have unsatisfactory performance. Regarding models adopting prompt learning, they need tedious prompt design customized to every new type of events and arguments, and their performance is also limited.

To address the aforementioned issues, we propose an approach using global constraints with prompting to tackle zero-shot EAC. 
Global constraints can be viewed as a type of supervision signal from domain knowledge, which is crucial for zero-shot EAC since supervision from annotations is inaccessible. Moreover, our model's constraints module provides abundant global insights across tasks, arguments, and events.
Prompting can also be regarded as a supervision signal as it induces abundant knowledge from Pre-Trained Language Models (PTLM). 
Unlike previous zero-shot EAC works, which need a tedious prompt design for every new type of events and arguments, the novel prompt templates of our model's prompting module can be easily adapted to all possible types of events and arguments in a fully automatic way.
Specifically, given an event and its passage, our model first adds prefix prompt, cloze prompt, and candidate roles into the passage, which creates a set of new passages. The Prefix prompt describes the event type and trigger span. Cloze prompt connects each candidate to the target argument span. Afterwards, our model adopts a PTLM to compute the language modeling loss for each of the new passages, whose negative value would be the respective prompting score.
The role with the highest prompting score is the initial prediction.
Then, our model uses global constraints to regularize the initial prediction. The global constraints are based on the domain knowledge of the following relations: (1) cross-task relation, where our model additionally performs another one or more classification task on target argument span, and our model's predictions on EAC and other task(s) should be consistent; (2) cross-argument relation, where arguments of one event should collectively abide by certain constraint(s); (3) cross-event relation, where some argument playing a certain role in one event should play a typical role in another related event.

We conduct comprehensive experiments to demonstrate the effectiveness of our model. Particularly, our approach surpasses all zero-shot baselines by at least 12.5\% and 10.9\% F1 on ACE and ERE, respectively. When argument spans are not given, our model outperforms the best zero-shot baseline by 4.3\% and 3.3\% F1 on ACE and ERE, respectively. 
Besides that, we also conduct experiments to show that both the prompting and constraints modules contribute to the final success.

\section{Limitations}
Our work has the following limitations. 
One limitation is that our model is not aware of the detailed event-type-dependent semantics of those roles which are too general, as discussed in Section 3.5. In the future, we will work on enabling our model to capture the event-type-dependent semantics of the roles which are too general.
Another limitation is that our model's performance is still unsatisfactory compared with SOTA supervised model when argument spans are not given, as discussed in Section 3.2. 
In the future, we will work on designing a more powerful zero-shot event argument identification module for our model, so that we can obtain satisfactory zero-shot EAC performance even when argument spans are not given.


\section{Methodology}
We first present an overview of our approach. Then we introduce the details by describing its prompting module and global constraints regularization module.
We follow \citep{DBLP:journals/corr/abs-2107-13586} to name a prompt inserted before input text as \textit{prefix prompt}, and a prompt with slot(s) to fill in and insert in the middle of input text as \textit{cloze prompt}.

\subsection{Overview}\label{sec:model_overview}
\begin{figure*}[t]
    \centering
    \resizebox{2\columnwidth}{150pt}{
    \includegraphics{Figures/Model_Overview.png}
    }
    \caption{Model overview using prediction for one argument span as an example. $[T]_1$ and $[T]_2$ are the parts of the input passage before and after the span, respectively. $k$ is the number of candidate roles of the event type.}
    \label{fig:model_overview}
\end{figure*}
As shown in Figure \ref{fig:model_overview}, given a passage with a target argument span, our model infers the target's role without annotation and task-specific training. Our model has two modules.
The first module is the prompting module that creates and scores several new passages. During creation, the model adds prefix prompt, cloze prompt, and candidate roles into the passage, where the prefix prompt contains information about event type and trigger, and the cloze prompt joins each candidate with a target argument span.\footnote{Since we focus on event argument classification, we assume that the event types and trigger spans are given. The settings without given argument spans will be discussed in Section \ref{sec:settings}.} Afterwards, the model uses a PTLM to score the new passages. 
Our novel prompt templates can easily adapt to all possible events and arguments without manual work.
Initial prediction is the role with the best prompting scores.
The second module is the global constraints regularization module, where the model regularizes the prediction by three types of global constraints: cross-task constraint, cross-argument constraint, and cross-event constraint. All global constraints are based on event-related domain knowledge about inter-task, inter-argument, and inter-event relations.

\subsection{Prompting Module}\label{sec:prompting_module}
In this section, we describe the prompting module in detail.
Given a passage, we first add a prefix prompt containing information about the event type and trigger span to the beginning. Such a prompt can guide a PTLM to: (1) accurately capture the input text's perspective related to the event; (2) have a clear awareness of the trigger. 
Based on the definitions of events and triggers \citep{grishman2005nyu}, we create the following prefix prompt: ``\textbf{This is a [] event whose occurrence is most clearly expressed by [].}'' where the first and second pairs of square brackets are the placeholders of event type and trigger span respectively. We also conducted some experiments comparing different prefix prompts in Section \ref{sec:prefix_comp}, and the results showed that the prefix above is the most effective.

Second, for each candidate role, the module inserts the cloze prompt behind the target argument span, and the role fills the prompt' slot. 
The cloze prompt adopts the hypernym extraction pattern ``\textbf{M and any other []}'' \citep{DaiSW20}, where ``M'' denotes the argument span and the square bracket is the placeholder of the candidate role. We did not try other hypernym extraction patterns as \citep{DaiSW20} had shown that our pattern is the most effective.
The motivation for adopting the hypernym extraction pattern for cloze prompt is that, to some extent a role can be regarded as a context-specific hypernym of the respective argument span of the associated event (e.g., ``Beneficiary'' can be seen as a context-specific hypernym of ``Daily Planet''of the Transfer-Money event described by the example in Figure \ref{fig:eac_example}). Hence, such a prompt induces the linguistic and commonsense knowledge stored in PTLM to help identify which candidate role is the most reasonable. 

After adding the previous two types of prompts, we get several new passages. For instance, suppose the passage is``In Baghdad, a bomb was fired at 17 people.'' whose event type is ``Conflict:Attack'', trigger is ``fired'', target argument span is ``bomb'', and candidate roles are \{``Attacker'', ``Instrument'', ``Place'', ``Time'', ``Target''\}. 
The created passages would be: (1) ``\textit{This is a Attack event whose occurrence is most clearly expressed by ``fired.''} In Baghdad, a bomb \textit{and any other \underline{attacker}} was fired at 17 people.''; (2) ``\textit{This is a Attack ... ``fired.''} ... bomb \textit{and any other \underline{instrument}} was ...''; and similar text for other roles.\footnote{We only use the subtype of all events following the pre-processing done by \citep{LinJHW20}}

For each new passage, we apply a PTLM to compute the language modeling loss.
The negative value of the loss would be the prompting score of the respective passage, where a higher value indicates higher plausibility according to the PTLM.
\textbf{Since our model's prompt templates are independent of event type and argument role, their adaptation to any new type of events and arguments is trivial and fully automatic.} Hence, our prompting method is more scalable and generalizable than those of previous zero-shot EAC models, since, for every new type of events and arguments they need to design a customized prompt. For instance, for every type of events/arguments, \citet{LyuZSR20} manually design a unique prompt as text entailment/question answering template.
The initial prediction would be the role with the highest prompting score.
Since the steps of obtaining scores for each candidate role are independent of other candidate roles, we implement the steps of different candidate roles in parallel. Such a parallel implementation significantly improves our model's efficiency.


\subsection{Global Constraints Regularization Module}\label{sec:constraints}
This module regularizes the prediction by the following three types of global constraints.\footnote{We designed 14 global constraints in total and we used preliminary experiments to choose the three most effective ones. In the preliminary experiments, we randomly sample 1k instances covering all trigger and argument types. We then evaluate each constraint on the sampled subset.}

\textbf{Cross-task constraint} exploits the label dependency between EAC and auxiliary task(s) so that our model can get global information from the auxiliary task(s) about event arguments. We use \textbf{Event Argument Entity Typing (EAET)} as the auxiliary task. The task aims to classify an argument into its context-dependent entity type (e.g., PER). \textbf{As specified in ACE2005 ontology, an argument of a certain role in an event can only be one of several respective entity types (e.g., an argument of ``Attack'' role in a Conflict:Attack event can only be ``ORG,'' ``PER,'' or ``GPE'').} Based on this domain knowledge, we design the cross-task constraint as follows: (1) For each input passage, our model performs prompting for EAET, where the prompting is the same as in Section \ref{sec:prompting_module} except that candidate entity types replace the candidate roles in cloze prompt.; (2) After obtaining the scores and prediction of EAET, the model check the consistency between the predictions of EAC and EAET; (3) If the consistency is violated and the score of EAC's predicted role is lower, then discard the current role, use the role with the highest score in the remaining ones, and check the consistency again; (4) The constraint ends when the labels of two tasks are consistent. An example illustrating this type of constraint is shown in Figure \ref{fig:cross-task_constraint}.
\begin{figure}[t]
    \centering
    \resizebox{0.8\columnwidth}{!}{
    \includegraphics{Figures/Cross-task_constraint.png}
    }
    \caption{An Example of cross-task constraint. The text in \textbf{bold face} is the trigger, \underline{underlined text} is target argument span, and a tuple denotes a predicted label with its prompting score (e.g.,  ``(Target, -3.5)’’ denotes the predicted label ``Target’’ with its prompting score``-3.5’’). Similar notations are adopted in all remaining figures.}
    \label{fig:cross-task_constraint}
\end{figure}

\textbf{Cross-argument constraint} is based on domain knowledge about relationships between arguments within an event. Specifically, our model constrains the number of particular arguments for some or all events. For instance, it is very unlikely that an event mentioned is associated with multiple ``Time'' arguments. Such constraints offer a global understanding of event arguments to our model. The cross-argument constraint we adopt is ``\textbf{A Personnel:End-POSITION event has at most one Position argument}.'' Given a Personnel:End-POSITION event, our model first checks the number of ``Position'' argument. If the number is more than one, then our model will first collect the arguments whose roles are ``Position'' and remove the one with the highest score among these arguments. Then for each remaining argument, our model would change the role to its candidate with the second highest score. An example illustrating this type of constraint is shown in Figure \ref{fig:cross-argument_constraint}.
\begin{figure}[t]
    \centering
    \resizebox{0.6\columnwidth}{!}{
    \includegraphics{Figures/Cross-argument_constraint.png}
    }
    \caption{An Example of cross-argument constraint.}
    \label{fig:cross-argument_constraint}
\end{figure}

\textbf{Cross-event constraint} regularizes predicted roles of arguments shared by related events. A model with such a constraint can have global insights into event arguments, because while they are making inferences for the arguments of one event, they are aware of the information of other related event(s) and cross-event relations.
The cross-event constraint we adopt is ``\textbf{If a Life:Injure event and a Conflict:Attack event share arguments, then Injure.Place is the same as Attack.Place, Injure.Victim is the same as Attack.Target, Injure.Instrument is the same as Attack.Instrument, Injure.Time is the same as Attack.Time, Injure.Agent is the same as Attack.Attacker}''. Given a passage containing an Injure and an Attack event sharing arguments, the model imposes the constraint by checking the consistency between the respective roles of each shared argument as specified in the constraint. Any inconsistency would be fixed by changing the role with a lower prompting score to the new one satisfying the consistency. An example illustrating this type of constraint is shown in Figure \ref{fig:cross-event_constraint}.
\begin{figure}[t]
    \centering
    \resizebox{\columnwidth}{!}{
    \includegraphics{Figures/Cross-event_constraint.png}
    }
    \caption{An Example of cross-event constraint.}
    \label{fig:cross-event_constraint}
\end{figure}

Our constraint modeling method can be easily generalized to other datasets/ontologies by simply using the knowledge about corresponding cross-task, cross-argument, and cross-event relations to design new constraints. The design processes are not costly as we could easily find such knowledge from the guidelines of the target dataset.


\section{Preliminaries}
In this paper, we denote the sets of event types and argument roles as $\mathcal{E}$ and $\mathcal{R}$ respectively. Given an event type $E \in \mathcal{E}$ like ``Justice:Sue,X'' $\mathcal{R}_E \subset \mathcal{R}$ denotes the set of roles the arguments in $E$ can possibly have.
Pre-trained language model is abbreviated as PTLM.
Given a passage $S$, an event trigger $g$ in $S$, the event type $E_g$ of $g$, argument spans $\mathcal{A}_g$ of $g$ in $S$, the aim of zero-shot EAC is: for every argument $a \in \mathcal{A}_g$, find its role $r_a \in \mathcal{R}_{E_g}$ without annotation and task-specific training. We follow \citep{DBLP:journals/corr/abs-2107-13586} to name a prompt inserted before input text as \textit{prefix prompt}, and a prompt with slot(s) to fill in and inserted in the middle of input text as \textit{cloze prompt}.
\section{Related Work}
In this section, we introduce related works about constraint modeling, event extractions, and prompt-based Information Extraction (IE).

\subsection{Constraint Modeling}
Constraint modeling, as an important technique in machine learning and NLP, aims to improve a model's performance by incorporating domain knowledge as constraints~\cite{ganchev2010posterior,chang2012structured,ChangSR13,DeutschUR19,ChangRRR08,CGRS10,GracaGT10}.
One of the most significant advantages of constrained modeling is that it enables a model to capture the expressive and complex dependency structure in structured prediction problems like EAC~\cite{chang2012structured}.
Especially in zero-shot scenarios, constrained modeling can provide useful indirect supervision to a model, which further boosts performance~\cite{ganchev2010posterior}.
Some previous works have adopted constraints based on event-related domain knowledge to classify event arguments~\cite{LinJHW20,ZhangWR21}.
However, their constraints either require labor-intensive annotations~\cite{LinJHW20} or consider limited global information (e.g., cross-event relations)~\cite{ZhangWR21}.
In this paper, our model uses global constraints to regularize prediction by incorporating global insights from cross-task, cross-argument, and cross-event relations.

\subsection{Event Extraction}

Event extraction is a fundamental information extraction task~\cite{DBLP:conf/muc/Sundheim92,DBLP:conf/coling/GrishmanS96,DBLP:conf/aaai/Riloff96,grishman2005nyu,chen2021event,DuC20,LiuCLBL20}, which can be further divided into four sub-tasks: trigger identification, trigger classification, argument identification, and argument classification.
Traditional efforts mostly focus on the supervised setting~\cite{JiG08,DBLP:conf/acl/LiaoG10,DBLP:conf/acl/LiuCHL016,DBLP:conf/acl/ChenXLZ015,DBLP:conf/naacl/NguyenCG16,DBLP:conf/emnlp/LiuLH18,DBLP:journals/dint/ZhangJS19,DBLP:conf/emnlp/WaddenWLH19,LinJHW20}.
However, these works could suffer from the huge burden of human annotation.
In this work, we focus on the argument classification task and propose a model using prompting and global constraints, without annotation and task-specific training.

\subsection{Prompt-based IE}

With the fast development of large PTLMs like T5~\cite{RaffelSRLNMZLL20}, GPT-3~\cite{DBLP:conf/nips/BrownMRSKDNSSAA20}, and Pathway Language models~\cite{DBLP:journals/corr/abs-2204-02311}, the prompt-based method has been an efficient tool of applying those giant models into downstream NLP tasks~\cite{DBLP:journals/corr/abs-2107-13586}. IE is not an exception.
People have been using leverage prompts and giant models to solve IE tasks like named entity recognition~\cite{DBLP:conf/acl/CuiWLYZ21}, semantic parsing~\cite{DBLP:conf/emnlp/ShinLTCRPPKED21}, and relations extraction~\cite{DBLP:journals/corr/abs-2202-04824,DBLP:journals/corr/abs-2105-11259} in a zero-shot or few-shot way.
However, previous prompting methods for IE need a tedious prompt design for every new type of events and arguments.
In contrast, our model's prompt templates can be adapted to all possible types of events and arguments in a fully automatic way.


