%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Listings support (used for the prompt listings)
\usepackage{listings}
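\usepackage{amsmath}
%% Table helpers used below (\toprule, \midrule, \bottomrule, \multirow)
\usepackage{booktabs}
\usepackage{multirow}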
%% auto break lines
\lstset{breaklines=true}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{ATE-IT at EVALITA 2026: Overview of the Automatic Term Extraction Italian Testbed Task}


%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1]{Nicola Cirillo}[%
orcid=0000-0002-2107-1313,
email=nicirillo@unisa.it,
]
\cormark[1]
%\fnmark[1]
\address[1]{Department of Political and Communication Sciences, University of Salerno,
  132 Via Giovanni Paolo II, Fisciano (SA), 84084, Italy}

\author[2]{Giorgio Maria Di Nunzio}[%
orcid=0000-0001-7116-9338,
email=giorgiomaria.dinunzio@unipd.it
]
\fnmark[1]
\address[2]{Department of Information Engineering, University of Padova, Via Gradenigo 6/b, 35131 Padova, Italy}

\author[3]{Federica Vezzani}[%
orcid=0000-0003-2240-6127,
email=federica.vezzani@unipd.it,
]
\fnmark[1]
\address[3]{Department of Linguistic and Literary Studies, University of Padova, Via Elisabetta Vendramini, 13 35137 Padova, Italy}


%% Footnotes
\cortext[1]{Corresponding author.}
\fntext[1]{These authors contributed equally.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
This paper presents an overview of the Automatic Term Extraction Italian Testbed (ATE-IT) shared task, organised within the EVALITA 2026 evaluation campaign. 
The task addresses the scarcity of benchmarks for Italian Automatic Term Extraction (ATE) by proposing a challenge focused on the domain of municipal waste management. 
Participants were invited to tackle two subtasks: (A) \textit{Term Extraction}, aiming to identify domain-specific terms in institutional texts, and (B) \textit{Term Variants Clustering}, focusing on grouping morphological and semantic term variants. 
Nine teams participated, submitting a total of 13 runs. The comparative analysis reveals that fine-tuned Transformer architectures generally outperform naive zero-shot Large Language Model (LLM) prompting, while hybrid approaches appear most effective for semantic clustering.
\end{abstract}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  Automatic Term Extraction \sep
  Terminology \sep
  Italian NLP \sep
  Shared Task \sep
  EVALITA 2026
\end{keywords}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Introduction}

Automatic Term Extraction (ATE) is a foundational task in NLP and terminology work~\cite{kageuraMethodsAutomaticTerm1996a,dinunzioSystematicReviewAutomatic2023}. Its goal is to identify domain-specific terms that designate key concepts within a specialised field of knowledge.

Although it shares some similarities with Named Entity Recognition (NER)~\cite{jehangirSurveyNamedEntity2023}, ATE is a distinct task. NER involves identifying and classifying mentions of named entities in running text (e.g., people, organisations, places, and dates). Its focus is usually on proper names or unique references that have a clear instance-level referent, and the output is often linked to knowledge bases (e.g., ``Barack Obama $\rightarrow$ Person'', ``Google $\rightarrow$ Organization'').
ATE, by contrast, aims to extract domain-specific terms from a corpus. This means identifying both multi-word and single-word terms that are relevant to a specialised field of knowledge (e.g., ``informed consent'', ``cryptic species'', ``blockchain consensus algorithm'').
The terms extracted through ATE serve as essential building blocks for downstream tasks such as information retrieval, machine translation, ontology construction, knowledge graph enrichment, and domain adaptation of large language models (LLMs).

In this paper, we describe the Automatic Term Extraction Italian Testbed (ATE-IT) shared task, organised in the context of EVALITA 2026, the 9th  evaluation campaign of NLP and speech tools for Italian~\cite{evalita2026overview}.
ATE-IT is the first large-scale evaluation campaign on Italian ATE, centred on a clearly defined real-world scenario: terminology extraction from institutional texts in the domain of waste management. This domain presents a wide variety of derived terms (e.g., ``ecodizionario'', ``biodigestore''), synonyms (e.g., ``indifferenziato'' and ``secco residuo''), abbreviations (``TARI'', ``RAEE''), and multiword expressions (``mastello contenitore'', ``raccolta porta a porta''), making it an important testbed for assessing the robustness of different approaches.

All datasets, evaluation scripts, and baseline code are publicly available at the task repository.\footnote{\url{https://github.com/nicolaCirillo/ate-it}}

The remainder of this paper is organized as follows: Section~\ref{sec:related} and Section~\ref{sec:motivation} discuss the related work and the motivation behind the task. Sections~\ref{sec:task} through \ref{sec:baseline} detail the experimental setup, including the task definition, dataset construction, evaluation measures, and the baseline system. Section~\ref{sec:systems} introduces the participating systems, while Section~\ref{sec:results} and Section~\ref{sec:discussion} present the official results and a discussion of the key findings. Finally, Section~\ref{sec:conclusion} concludes the paper.

\section{Related Work}
\label{sec:related}

Automatic Term Extraction (ATE) has evolved significantly over the last few decades, transitioning from rule-based and statistical pipelines to deep learning and, most recently, prompt-based approaches.
Traditional ATE systems generally follow a three-step pipeline: candidate extraction, feature calculation, and probability estimation~\cite{blandonandradeApproachesToolsAlgorithms2025,astrakhantsevMethodsAutomaticTerm2015}. Candidate extraction typically relies on linguistic filters~\cite{justesonTechnicalTerminologyLinguistic1995,meyersTermolatorTerminologyRecognition2018}, while feature calculation and probability estimation often exploit statistical measures~\cite{DBLP:conf/trec/AhmadGT99,drouinTermExtractionUsing2003}.
With the advent of deep learning, ATE has increasingly been framed as a sequence labeling task, bypassing the three-step pipeline. Early neural approaches utilized LSTMs~\cite{kuczaTermExtractionNeural2018}, but the state-of-the-art shifted rapidly toward Transformer-based models~\cite{hazemTermEval2020TALNLS2N2020,langTransformingTermExtraction2021}.
Despite their effectiveness, supervised deep learning models require substantial labeled datasets, which are often unavailable for specific languages or domains. This limitation has led to the exploration of LLMs and prompting techniques. Recent studies by~\cite{banerjeeLargeLanguageModels2024} and~\cite{tranPromptingWhatTerm2024} compared fine-tuned encoders (like XLM-RoBERTa) against generative models (like GPT-3.5) in few-shot scenarios. Their findings suggest a clear trade-off: while fine-tuned models excel when data is abundant, prompt-based approaches offer a more robust solution in low-resource and few-shot settings.

Despite the notable performances reached by state-of-the-art ATE techniques, there is still room for improvement, especially in multilingual and domain-specific contexts~\cite{dinunzioSystematicReviewAutomatic2023}. Moreover, performance remains modest on complex datasets, and most approaches still struggle with domain sensitivity.

Evaluation of ATE techniques relies on benchmark corpora such as GENIA~\cite{kim2003genia} and ACL RD-TEC~\cite{Qasemizadeh2016}, for English, or the multilingual ACTER~\cite{rigoutsterrynNoUncertainTerms2020}, for English, French, and Dutch. Although Italian has been underrepresented in ATE benchmarks, existing bilingual corpora such as BitterCorpus~\cite{Arcan2014} and MAGMATic~\cite{scansani2019magmatic} include annotated terms. However, they are primarily designed to evaluate domain-specific machine translation rather than term extraction.

Several shared tasks addressed ATE. Notably, the TermEval 2020 Shared Task on Automatic Term Extraction~\cite{rigoutsterrynTermEval2020Shared2020} compared a range of different techniques using the multilingual ACTER corpus. More recently, shared tasks have evolved beyond simple term recognition to emphasize deep semantic understanding and domain-specific relevance. For example, SimpleText Task 2 at CLEF 2024~\cite{ermakovaOverviewCLEF20242024,dinunzioOverviewCLEF20242024b} focused on identifying and explaining complex concepts in scientific abstracts. Participants were asked not only to extract challenging terms but also to generate informative definitions or explanations. Similarly, the ongoing GutBrain Interplay Task3~\cite{martinelliOverviewGutBrainIECLEF20252025} at BioASQ CLEF 2025~\cite{nentidisBioASQCLEF2025Thirteenth2025b} targets the extraction of structured biomedical knowledge related to the gut-brain axis. Its subtasks include term span classification and relation identification, emphasizing fine-grained categorisation and concept linking in highly specialized biomedical texts.

\section{Motivation}
\label{sec:motivation}

ATE systems support terminologists, translators, and technical communicators in building and maintaining controlled vocabularies and thesauri for various specialized domains. Moreover, these systems contribute to improving domain-specific language resources used by AI systems for regulatory compliance, automated indexing, and smart information retrieval.

The international interest in ATE is evidenced by the increasing number of European and global research initiatives. These range from conferences and summer schools, such as MDTT\footnote{\url{https://mdtt2026.dei.unipd.it/}} and TSS,\footnote{\url{https://www.termnet.org/english/products_service/summer_school.php}} to international ATE shared tasks~\cite{rigoutsterrynTermEval2020Shared2020,dinunzioOverviewCLEF20242024b}, and the development of numerous gold-standard datasets in English and other languages~\cite{kim2003genia,Qasemizadeh2016,rigoutsterrynNoUncertainTerms2020}.

Despite the growing momentum in multilingual terminology extraction, the availability of standardized evaluation benchmarks for Italian remains limited. While many shared tasks have been organized for English and select other languages, Italian still lacks a well-defined evaluation framework and publicly available datasets to foster comparative research.

Within this landscape, ATE-IT aims to advance ATE research by introducing a dedicated benchmark for the Italian language, focused on the domain of waste management.

Moreover, by combining term extraction with term variants clustering, ATE-IT aligns with ongoing efforts in terminology standardization, machine translation adaptation, and knowledge base population. Crucially, the proposed evaluation framework facilitates comparability and fosters the application of zero-shot and few-shot learning methods, which are fundamental for low-resource languages like Italian. The task also promotes methodological comparisons between rule-based systems, statistical models, and LLM-based architectures in the context of under-resourced institutional Italian.

\section{Definition of the Task}
\label{sec:task}

The ATE-IT shared task comprises two subtasks of increasing complexity: Term Extraction and Term Variants Clustering. Both subtasks are designed to be linguistically and computationally challenging. The former requires models to generalize from sparse domain-specific training examples. The latter requires semantic comparison and abstraction over morphologically and syntactically diverse variants.


\subsection{Subtask A: Term Extraction}

Participants receive a set of sentences drawn from a specialized corpus related to municipal waste management. For each sentence, the goal is to identify and extract the terms that are relevant to the waste management domain. Terms may consist of single words (single-word terms) or multiword expressions (multi-word terms), including nouns, verbs, and adjectives.

\subsection{Subtask B: Term Variants Clustering}

From the list of unique extracted terms, participants are then required to cluster together those terms that refer to the same underlying concept. For example, ``raccolta porta a porta'' and ``raccolta domiciliare'' should be placed in the same cluster. Each cluster should represent a single, coherent concept within the waste management domain. This subtask focuses on synonymy, lexical variation, and compositional semantics.

\section{Dataset}
\label{sec:dataset}

The dataset for the Term Extraction subtask comprises sentences paired with the corresponding waste management terms. This dataset is partitioned into a training set of 2,308 sentences, a development set of 577, and a test set of 1,142.

The dataset for the Term Variants Clustering subtask is derived directly from the unique terms identified in the extraction task. Within this dataset, each term is mapped to a cluster ID representing a single, coherent concept in the waste management domain. The clustering data includes 713 terms for training, 242 for development, and 378 for testing. Notably, 118 of the test terms (about 31\%) also appear in either the training or development sets.

Both the training and development sets are sourced from the publicly available ItaIst-TermRifiuti corpus and the ItaIst-WasteLexicon termbase~\cite{itaist_gru2025,cirillo2025tesi}. Conversely, the test set is specifically designed for the ATE-IT task. 

\begin{table*}[t]
\centering
\caption{ATE-IT dataset composition.}

\label{tab:dataset_summary}
\begin{tabular}{@{}llrlp{6cm}@{}}
\toprule
\textbf{Subtask} & \textbf{Split} & \textbf{Size} & \textbf{Items} & \textbf{Geographic and Temporal Coverage} \\ \midrule
\multirow{3}{*}{Term Extraction} & Training & 2,308 & Sentences & Campania 2005--2023 \\
 & Development & 577 & Sentences & Campania 2005--2023 \\
 & Test & 1,142 & Sentences & Italy 2025 \\
 \midrule
\multirow{3}{*}{Term Variants Clustering} & Training & 713 & Terms & Campania 2005--2023 \\
 & Development & 242 & Terms & Campania 2005--2023 \\
 & Test & 378 & Terms & Italy 2025 \\ \bottomrule
\end{tabular}
\end{table*}

\subsection{Training and Development Sets}

The training and development sets are derived from the ItaIst-TermRifiuti corpus and the ItaIst-WasteLexicon termbase~\cite{itaist_gru2025,cirillo2025tesi}. ItaIst-TermRifiuti is a stratified sample of the broader ItaIst-DdAC\_GRU corpus~\cite{vellutino2024corpus}, which contains institutional texts regarding municipal waste management. The corpus is carefully balanced between:

\begin{itemize}
    \item \textbf{Administrative acts:} Ordinances, service charters, and tenders.
    \item \textbf{Informative texts:} Public notices, guides, and press releases.
\end{itemize}

This balance facilitates the study of terminological variation across different registers and target audiences. 
Four trained annotators manually identified terminological units and categorised them into domains such as waste management, law/administration, and environment.

The ItaIst-WasteLexicon termbase provides the conceptual backbone for the dataset, containing about 950 terms organised within a framework that encodes relations such as generic/specific and comprehensive/partitive. Concepts are further enriched with definitions sourced from European and Italian legislation.

In the final Term Extraction dataset, terms are associated with a sentence only if they were identified by at least one annotator and exist within ItaIst-WasteLexicon. Consequently, the dataset may exhibit certain inconsistencies. These are intentionally preserved to provide a ``realistic'' manually annotated environment, challenging participating systems to demonstrate robustness and generalizability to real-world data.

\subsection{Test Set}
The test set was specifically curated as a benchmark for the ATE-IT task. To evaluate system generalisation, it incorporates more recent documents (primarily from 2025) and covers a broader geographic scope than the training data. Specifically, while the ItaIst-TermRifiuti dataset is composed primarily of documents from the Campania region, the test set consists of documents collected from 32 municipalities that were sampled with probability proportional to their population, promoting nationwide coverage of Italy.

The initial annotation of the test set was performed by 59 students from the Translation-oriented Terminography course of the Department of Linguistic and Literary Studies of the University of Padova. Each student was assigned a unique corpus segment and instructed to adhere to the annotation guidelines already established for the ItaIst-TermRifiuti dataset. Subsequently, sentences were sampled to ensure that the register balance remained consistent with the training and development sets. To finalize the corpus, the authors conducted a validation and correction phase. Through discussion, annotation principles were harmonized to resolve inconsistencies and establish a reliable gold standard.

The test set for the Term Variants Clustering subtask was generated from the list of unique terms identified within this corpus. The terms were then manually clustered by the three authors after a discussion phase.

\subsection{Format}
The dataset for the term extraction task is a collection of records where each entry maps a specific sentence to its corresponding domain-specific terms. Each record contains the source metadata (consisting of a unique \verb|document_id|, a \verb|paragraph_id|, and a \verb|sentence_id|) alongside the raw \verb|sentence_text|. The target annotations are encapsulated in a terms field, which lists the single-word and multi-word terms identified as domain-relevant concepts within that sentence. To maintain technical consistency, the dataset adheres to a ``longest match'' constraint, where nested terms are excluded in favour of the most complete expression (e.g., ``impianto di trattamento rifiuti'' is extracted while its constituent ``trattamento rifiuti'' is omitted). Furthermore, all terms are normalised to lowercase and appear without duplicates for any given sentence.

This schema is implemented in both CSV and JSON formats. Notably, in the CSV format, there is a separate row for each term, whereas in the JSON format, the terms are stored within a \verb|term_list| array.
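
As a rough illustration, a single record in the JSON format might look as follows. The field names are those described above; the identifier values, the example sentence, and the exact nesting are assumptions made for illustration only and are not taken from the released files.

\begin{lstlisting}[frame=single, basicstyle=\footnotesize\ttfamily, breaklines=true]
{
  "document_id": "doc_001",
  "paragraph_id": 3,
  "sentence_id": 2,
  "sentence_text": "Il nuovo impianto di trattamento rifiuti riceve anche i RAEE.",
  "term_list": ["impianto di trattamento rifiuti", "raee"]
}
\end{lstlisting}

In this hypothetical record, the nested candidate ``trattamento rifiuti'' is absent because of the longest-match constraint, and ``RAEE'' is stored in lowercase.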

\section{Evaluation Measures}
\label{sec:evaluation}

The Term Extraction subtask is evaluated using two separate scores: 
Micro F1 score~\cite{Verborgh-2018-gerbil}, which evaluates Precision and Recall across all term occurrences in the dataset, and Type F1 score, which assesses the ability to identify unique term types correctly. The Term Variants Clustering subtask is evaluated using BCubed F1~\cite{bagga-baldwin-1998-entity,amigoComparisonExtrinsicClustering2009}, a metric that assesses clustering quality.

\subsection{Evaluation of Subtask A}

To provide a comprehensive evaluation of Subtask A, we distinguished between two separate capabilities of the system: first, its ability to identify individual term mentions as they appear in running text, and second, its ability to successfully extract a unique set of terms from the corpus as a whole. Consequently, we produced two distinct rankings based on Micro F1 and Type F1 scores to reflect these separate goals.

Micro F1 is calculated by aggregating the counts of true positives, false positives, and false negatives across all sentences $s$ in the dataset $D$. Specifically:

\begin{itemize}
    \item $TP_s$: number of terms correctly extracted from sentence $s$;
    \item $FP_s$: number of terms incorrectly extracted from sentence $s$;
    \item $FN_s$: number of gold standard terms in $s$ that were missed.
\end{itemize}

The micro-averaged Precision, Recall, and F1 are defined as shown in equation (\ref{eq:micro_metrics}).

\begin{equation}
    P_m = \frac{\sum_{s \in D} TP_s}{\sum_{s \in D} (TP_s + FP_s)}, \quad
    R_m = \frac{\sum_{s \in D} TP_s}{\sum_{s \in D} (TP_s + FN_s)}, \quad
    F1_m = \frac{2 \cdot P_m \cdot R_m}{P_m + R_m}
    \label{eq:micro_metrics}
\end{equation}
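
As a hypothetical illustration (the sentences and system outputs are invented, not drawn from the dataset), consider three sentences. In $s_1$ the gold terms are ``raee'' and ``tari'' and the system extracts only ``raee''; in $s_2$ the gold term is ``raee'' and the system extracts ``raee'' and ``biodigestore''; in $s_3$ the gold term is ``raee'' and the system extracts exactly that. The per-sentence counts $(TP_s, FP_s, FN_s)$ are then $(1, 0, 1)$, $(1, 1, 0)$, and $(1, 0, 0)$, which gives
\begin{equation*}
    P_m = \frac{3}{3 + 1} = .75, \quad
    R_m = \frac{3}{3 + 1} = .75, \quad
    F1_m = \frac{2 \cdot .75 \cdot .75}{.75 + .75} = .75
\end{equation*}
Every correct mention of a recurring term such as ``raee'' adds to the true-positive count, so frequent terms weigh heavily on this score.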

The Type F1 score is computed over the set of unique term types (i.e., distinct term forms appearing at least once in the dataset). We define 
the following variables:

\begin{itemize}
    \item $TP_t$: number of unique extracted terms that match the gold standard;
    \item $FP_t$: number of unique extracted terms that do not appear in the gold standard;
    \item $FN_t$: number of unique gold standard terms that were not extracted.
\end{itemize}

The Precision, Recall, and F1 for term types are defined as in equation (\ref{eq:type_metrics}).
\begin{equation}
    P_t = \frac{TP_t}{TP_t + FP_t}, \quad
    R_t = \frac{TP_t}{TP_t + FN_t}, \quad
    F1_t = \frac{2 \cdot P_t \cdot R_t}{P_t + R_t}
    \label{eq:type_metrics}
\end{equation}
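
To illustrate with invented data, suppose the gold standard of a small corpus contains the types ``raee'' and ``tari'', and a system extracts ``raee'' in every sentence where it occurs, never extracts ``tari'', and extracts the spurious type ``biodigestore'' once. Then $TP_t = 1$, $FP_t = 1$, and $FN_t = 1$, which gives
\begin{equation*}
    P_t = \frac{1}{1 + 1} = .5, \quad
    R_t = \frac{1}{1 + 1} = .5, \quad
    F1_t = \frac{2 \cdot .5 \cdot .5}{.5 + .5} = .5
\end{equation*}
regardless of how many sentences contain ``raee''. Each type is counted once, so a missed rare term costs as much as a missed frequent one.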

\subsection{Evaluation of Subtask B}

To assess the quality of the clustering in Subtask B, we employed the BCubed metric. This measure is calculated by computing Precision and Recall at the item level and subsequently averaging these scores across all items. 
Crucially, participants were required to cluster the unique terms extracted by their own systems in Subtask A. Consequently, the set of items in the predicted clustering does not perfectly align with the gold standard. To address this issue, our evaluation framework explicitly accounts for the discrepancy between the two sets.

To compute the BCubed scores, we define the following variables:

\begin{itemize}
    \item $N_{pred}$: the total number of elements in the predicted clustering;
    \item $N_{gold}$: the total number of elements in the gold clustering;
    \item $C(x)$: the predicted cluster containing element $x$ (if $x$ is not present in the predicted clustering, $C(x) = \emptyset$);
    \item $L(x)$: the gold cluster containing element $x$ (if $x$ is not present in the gold clustering, $L(x) = \emptyset$).
\end{itemize}

For each item $x$, the item-level Precision and Recall are calculated as in equation (\ref{eq:bcubed_item}).

\begin{equation}
    P(x) = \frac{|\{ y \in C(x) : L(y) = L(x) \}|}{|C(x)|}, \qquad
    R(x) = \frac{|\{ y \in L(x) : C(y) = C(x) \}|}{|L(x)|}
    \label{eq:bcubed_item}
\end{equation}

Finally, the global scores are derived by averaging these item-level values over the respective total counts, and the harmonic mean is taken to produce the final F1 score, as shown in equation (\ref{eq:bcubed_global}).

\begin{equation}
    P_{b^3} = \frac{1}{N_{pred}} \sum_{x} P(x), \quad
    R_{b^3} = \frac{1}{N_{gold}} \sum_{x} R(x), \quad
    F1_{b^3} = \frac{2 \cdot P_{b^3} \cdot R_{b^3}}{P_{b^3} + R_{b^3}}
\label{eq:bcubed_global}
\end{equation}
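
A small invented example may help to trace the computation. It assumes that an item with an empty predicted cluster receives $R(x) = 0$; this is one reading of the definitions above, which leave that case implicit. Let the gold clustering consist of $\{a, b\}$ and $\{c, d\}$, with $a$ = ``raccolta porta a porta'', $b$ = ``raccolta domiciliare'', $c$ = ``indifferenziato'', and $d$ = ``secco residuo'', and let a system output the single cluster $\{a, b, c\}$ without extracting $d$, so that $N_{pred} = 3$ and $N_{gold} = 4$. The item-level values are $P(a) = P(b) = 2/3$ and $P(c) = 1/3$ for Precision, and $R(a) = R(b) = 1$, $R(c) = 1/2$, and $R(d) = 0$ for Recall. Hence
\begin{align*}
    P_{b^3} &= \frac{1}{3}\left(\frac{2}{3} + \frac{2}{3} + \frac{1}{3}\right) = \frac{5}{9} \approx .556, \\
    R_{b^3} &= \frac{1}{4}\left(1 + 1 + \frac{1}{2} + 0\right) = \frac{5}{8} = .625, \\
    F1_{b^3} &= \frac{2 \cdot \frac{5}{9} \cdot \frac{5}{8}}{\frac{5}{9} + \frac{5}{8}} = \frac{10}{17} \approx .588
\end{align*}
In this sketch, merging $c$ into the wrong cluster lowers Precision, while failing to extract $d$ lowers Recall.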

\section{Baseline}
\label{sec:baseline}

To quantify the task's complexity and provide a reference point for all participating systems, we provide a baseline system built on the \texttt{gemini-2.5-flash} model in a zero-shot setup.
The choice of this model is motivated by its status as a state-of-the-art LLM with robust zero-shot capabilities. It reflects current general-purpose AI performance, allowing for a clear comparison between zero-shot LLM prompting and the specialised, domain-tuned approaches that the task seeks to promote.

\begin{figure*}[t]
\centering
\begin{minipage}{0.48\textwidth}
\begin{lstlisting}[frame=single, basicstyle=\footnotesize\ttfamily, breaklines=true]
You are an automatic term extraction agent. You will receive a list of sentences as input. 
Your role is to extract waste management terms from the sentences. Output a list of terms for each sentence.

strictly adhere to the Example Output Format:
Sentence 1: [term1; term2]
Sentence 2: [term5; term6]
...

Instructions: 
* Extract only terms, ignore named entities; 
* Do not extract nested terms; 
* Extract only terms related to waste management; 
* If no terms, output an empty list []; 
* You must output 20 lists of terms.
\end{lstlisting}
\end{minipage}
\hfill
\begin{minipage}{0.48\textwidth}
\begin{lstlisting}[frame=single, basicstyle=\footnotesize\ttfamily, breaklines=true]
You are a term clustering agent. 
You will receive a list of term clusters and a list of unclustered terms related to municipal waste management. 
Your task is to cluster together exact synonyms. Each cluster must represent a single concept.

Output: 
Return the list of clusters with the newly added terms. Each cluster on a new line.

Instructions:
* Group terms by meaning, not form. Use their lemma.
* Focus on meaning within waste management context.
* If a term does not belong to a cluster, create a new cluster.
\end{lstlisting}
\end{minipage}
\caption{System prompts used in the baseline implementation for Subtasks A and B.}
\label{fig:prompts}
\end{figure*}

The Term Extraction baseline model was instructed to identify and extract domain-specific terms from batches of 20 sentences. The system prompt, illustrated in Figure~\ref{fig:prompts}, establishes the extraction rules and the required output structure, while the user prompt provides 20 sentences per call.

For Term Variants Clustering, the model was instructed to group synonyms by comparing batches of 20 terms against an existing set of clusters. The system prompt, illustrated in Figure~\ref{fig:prompts}, provides the clustering rules while the user prompt feeds the current state of the clusters, followed by the 20 unclustered terms to be processed.

\section{Participating Systems}
\label{sec:systems}

%Notes on the systems:
%Peacemaker: dbmdz/bert-base-italian-cased + fine-tuned linear layer
%TermNinjas: dbmdz/bert-base-italian-uncased + rule-based filtering
%SMTE: dbmdz/bert-base-italian-uncased + spaCy NER with merging and vocabulary filtering
%OA-TE: dbmdz/bert-base-italian-cased with CRF + Semantic Ranking
%TEXA: Italian BERT; zero-shot prompting
%MinseokKIM: CRF
%Valentinitalie: candidate extraction (rule-based+LLM) + Random  Forest
%Juliette Tonneau: CRF
%TrietNLP: XLM-RoBERTa + CRF

A total of 9 teams participated in the ATE-IT shared task. For Subtask A, 11 runs were submitted, while Subtask B saw a lower participation rate with only 2 submitted runs.

The majority of runs for Subtask A (7 out of 11) leveraged Transformer-based models, specifically the Italian versions of BERT and the multilingual XLM-RoBERTa. Within this group, two teams (TrietNLP~\cite{TrietNLP} and OA-TE~\cite{OA-TE}) placed a CRF (Conditional Random Field) layer on top of the model to better capture global dependencies in term spans, while the other four (SMTE~\cite{SMTE}, TEXA~\cite{TEXA}, Peacemaker~\cite{Pacemaker}, and TermNinjas~\cite{TermNinjas}) relied on a token classification layer.
The winning system, SMTE, combined BERT with a spaCy-based NER pipeline through a specific merging and vocabulary filtering strategy. In contrast, teams such as MinseokKIM~\cite{MinseokKIM} and Juliette Tonneau~\cite{JulietteTonneau} relied on traditional CRF classifiers, focusing on feature engineering. 
A hybrid approach was proposed by Valentinitalie~\cite{Valentinitalie}, which utilised a Random Forest classifier fed by candidates extracted through a combination of rule-based patterns and LLM prompting. Finally, zero-shot prompting was explored by the TEXA team.

Only two teams, TrietNLP and TermNinjas, participated in Subtask B. TrietNLP proposed a pipeline that integrates pre-clustering based on Levenshtein distance with prompting. Similarly, TermNinjas employed lemmatisation during pre-clustering and subsequently merged clusters based on word embedding similarity.

\section{Results}
\label{sec:results}

\begin{table*}[t]
\centering
\caption{Results for Subtask A: Term Extraction. Best scores are in bold; scores below the baseline are in italics.}

\label{tab:results_a}
\begin{tabular}{ll|ccc|ccc|cc}
\toprule
\textbf{Team} & \textbf{Method} & \textbf{$P_m$} & \textbf{$R_m$} & \textbf{$F1_m$} & \textbf{$P_t$} & \textbf{$R_t$} & \textbf{$F1_t$} & \textbf{Rank ($F1_m$)} & \textbf{Rank ($F1_t$)}\\
\midrule
SMTE &  BERT+spaCy & \textbf{.656} & .577 & \textbf{.614} & .645 & .529 & \textbf{.581} & 1 & 1 \\
TrietNLP & RoBERTa+CRF & .634 & .568 & .599 & .599 & \textbf{.545} & .571 & 2 & 2 \\
TEXA (run 1) & BERT & .617 & \textbf{.578} & .597 & .576 & \textit{.460} & .512 & 3 & 5 \\
OA-TE (run 2) & BERT+CRF & .581 & \textit{.522} & .550 & .569 & \textit{.492} & .528 & 4 & 4 \\
MinseokKIM & CRF & .569 & \textit{.476} & \textit{.519} & .654 & \textit{.444} & .529 & 5 & 3 \\
OA-TE (run 1) & BERT+CRF & .560 & \textit{.446} & \textit{.497} & .595 & \textit{.415} & .489 & & \\
Juliette Tonneau & CRF & .555 & \textit{.448} & \textit{.496} & .561 & \textit{.447} & .498 & 6 & 6 \\
TEXA (run 2) & Gemini (zero-shot) & \textit{.471} & \textit{.514} & \textit{.492} & \textit{.425} & \textit{.489} & \textit{.455} & & \\
Peacemaker & BERT & .497 & \textit{.476} & \textit{.486} & \textit{.430} & \textit{.455} & \textit{.442} & 7 & 8 \\
TermNinjas & BERT & \textit{.489} & \textit{.395} & \textit{.437} & .528 & \textit{.404} & \textit{.458} & 8 & 7 \\
Valentinitalie & Random Forest & \textit{.364} & \textit{.473} & \textit{.411} & \textbf{.707} & \textit{.262} & \textit{.382} & 9 & 9 \\
\midrule
\textit{baseline} & Gemini (zero-shot) & .497 & .559 & .526 & .435 & .508 & .469 & & \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}[t]
\centering
\caption{Results for Subtask B: Term Variants Clustering. Best scores are in bold; scores below the baseline are in italics.}

\label{tab:results_b}
\begin{tabular}{ll|ccc|c}
\toprule
\textbf{Team} & \textbf{Method} & \textbf{$P_{b^3}$} & \textbf{$R_{b^3}$} & \textbf{$F1_{b^3}$} & \textbf{Rank ($F1_{b^3}$)}\\
\midrule
TrietNLP & Levenshtein+Gemini & \textbf{.528} & \textit{.378} & \textbf{.441} & 1 \\
TermNinjas & lemmatization+embedding & .390 & \textit{.333} & .359 & 2 \\
\midrule
\textit{baseline} & Gemini (zero-shot) & .177 & \textbf{.396} & .245 & \\
\bottomrule
\end{tabular}
\end{table*}

The results for Subtask A (Term Extraction) and Subtask B (Term Variants Clustering) are presented in Table~\ref{tab:results_a} and Table~\ref{tab:results_b}, respectively.

\subsection{Results for Subtask A}

In Subtask A, four runs reached an $F1_m$ score of at least .55. The system submitted by SMTE achieved the highest performance across both micro- and type-based metrics ($F1_m = .614$, $F1_t = .581$). This result suggests that integrating BERT-based architectures with specialised NLP pipelines, such as spaCy, is effective for terminology extraction.

While TrietNLP and TEXA recorded comparable Micro F1 scores (.599 and .597, respectively), their performance on unique term types ($F1_t$) diverged by .059 (.571 versus .512). This discrepancy suggests that while standard BERT models are effective at identifying frequent terminological mentions, the RoBERTa+CRF architecture employed by TrietNLP is better suited for identifying the ``long tail'' of rare variants and unseen terms.

The performance of Valentinitalie is noteworthy; despite recording the lowest overall F1 scores ($F1_m = .411, F1_t = .382$), it achieved the highest Type Precision across all participants ($P_t = .707$). However, subsequent analysis reveals that the extracted terms were almost exclusively present in the training set. This indicates a conservative extraction strategy that prioritized precision over the ability to generalise to novel terminology in the test set.

General trends across the submissions indicate that supervised approaches typically achieved higher Precision than Recall. Conversely, zero-shot models favoured Recall, which may suggest a superior ability to identify term boundaries and adhere to the strict ``longest match'' constraint. Finally, the consistent disparity between Micro F1 and Type F1 scores across all teams confirms that identifying frequent term mentions remains significantly easier than discovering the full diversity of unique terms within a specialised corpus.

\subsection{Results for Subtask B}

Participation in the Term Variants Clustering subtask was markedly lower than in Subtask A, with only two teams submitting valid runs, which likely reflects the higher complexity of this subtask.

The best performance was achieved by TrietNLP, which recorded a BCubed F1 score of .441. Their hybrid approach, combining Levenshtein distance for morphological pre-clustering with a Gemini-based component for semantic aggregation, proved effective in maintaining high purity within clusters, as evidenced by the highest Precision score ($P_{b^3} = .528$).

TermNinjas ranked second with an F1 score of .359. Their strategy, which relied on lemmatization and word embedding similarity, struggled to match the Precision of the top system, achieving a $P_{b^3}$ of .390.
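
As a quick consistency check on Table~\ref{tab:results_b}, the reported $F1_{b^3}$ values can be reproduced from the reported Precision and Recall, assuming that $F1_{b^3}$ is the harmonic mean of the aggregate $P_{b^3}$ and $R_{b^3}$ (an assumption made here for illustration, although it agrees with all three rows to three decimals):
\begin{align*}
\text{TrietNLP:} \quad & \frac{2 \times .528 \times .378}{.528 + .378} \approx .441, \\
\text{TermNinjas:} \quad & \frac{2 \times .390 \times .333}{.390 + .333} \approx .359, \\
\text{baseline:} \quad & \frac{2 \times .177 \times .396}{.177 + .396} \approx .245.
\end{align*}
Under this reading, the gap between the two participating systems is driven mainly by Precision (.528 vs.\ .390) rather than by Recall (.378 vs.\ .333).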

Notably, the zero-shot baseline achieved the highest Recall ($R_{b^3} = .396$) but the lowest Precision ($P_{b^3} = .177$). This imbalance suggests that the baseline tended to over-cluster, aggressively grouping terms together. This behaviour underscores the primary challenge of the task: distinguishing exact synonyms from other semantic relations, so that hyponyms and hypernyms are not erroneously merged into the same concept cluster.
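
The effect of over-clustering on BCubed scores can be illustrated with a toy example, assuming the standard per-item formulation of BCubed Precision and Recall; the terms $a, b, c, d$ and the symbols $C(\cdot)$ and $G(\cdot)$ are hypothetical and introduced only for this illustration. Let the gold standard contain two clusters, $\{a, b\}$ and $\{c, d\}$, and let a system merge all four terms into a single cluster. Writing $C(t)$ for the system cluster and $G(t)$ for the gold cluster of a term $t$, every term obtains
\[
P(t) = \frac{|C(t) \cap G(t)|}{|C(t)|} = \frac{2}{4} = .5, \qquad
R(t) = \frac{|C(t) \cap G(t)|}{|G(t)|} = \frac{2}{2} = 1,
\]
so that averaging over the four terms gives a Precision of $.5$ and a Recall of $1$. Merging distinct concepts thus leaves Recall intact while lowering Precision, which is the pattern observed, in a less extreme form, for the baseline.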


\section{Discussion}
\label{sec:discussion}

The results of the ATE-IT shared task provide a snapshot of the current capabilities in Italian ATE. Beyond the individual rankings, several major trends emerged: the reliability of deep learning, the necessity of structural decoding for accurate term boundary detection, the challenge of generalisation over diatopic and diachronic variations, and the emergence of hybrid approaches for term variants clustering.

\subsection{Deep learning vs.\ LLM prompting}

The campaign confirms the resilience of standard supervised deep learning architectures in the era of LLMs, aligning with findings already established in the broader ATE literature~\cite{blandonandradeApproachesToolsAlgorithms2025}. While the naive zero-shot baseline achieved respectable performance, it was consistently outperformed by more lightweight systems based on BERT and RoBERTa architectures.

This result reinforces the view that, when a training set exists, fine-tuned models are a better alternative to prompting LLMs for ATE, which makes them well suited to real-world applications where speed and cost are key factors.

\subsection{Term boundary detection}

The results also suggest that Transformer architectures alone are often insufficient for accurate term boundary detection. The top-performing systems augmented them with structural decoding mechanisms, such as the spaCy-based pipeline of SMTE or the CRF layer integrated by TrietNLP.
These choices proved decisive in enforcing the ``longest match'' constraint.

Terminology in the waste management sector is highly compositional (e.g., \textit{``raccolta''} vs. \textit{``raccolta differenziata''} vs. \textit{``raccolta differenziata porta a porta''}). Therefore, purely statistical models may fragment these multi-word terms, whereas systems with explicit boundary modelling or post-processing heuristics successfully captured the full syntactic structure of these terminological units.
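
One possible way to make this constraint explicit, given here only as an illustrative sketch and not as the definition adopted in the task guidelines, is to treat it as a maximality filter over candidate spans. Let $S$ be a hypothetical set of candidate term spans found in a sentence, each identified by its token offsets. The retained set is
\[
S^{*} = \{\, s \in S : \text{there is no } s' \in S \text{ with } s \subset s' \text{ and } s' \neq s \,\},
\]
that is, the candidates that are not properly contained in a longer candidate. For a sentence containing \textit{``raccolta differenziata porta a porta''}, with $S$ consisting of the spans of \textit{``raccolta''}, \textit{``raccolta differenziata''}, and the full expression, only the full expression belongs to $S^{*}$.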

\subsection{Diatopic and diachronic variations}

Another crucial insight from this evaluation is the significant impact of temporal and geographic shifts on model performance. During the development phase, several systems reportedly achieved F1 scores exceeding .70; however, performance dropped noticeably on the test set ($F1_m \approx .60$ for the best system). This decline highlights the difficulty of the realistic scenario proposed by ATE-IT. 

This suggests that even robust deep learning models tend to overfit to specific regional denominations or temporal periods. The challenge for future Italian ATE systems lies in improving their generalisation capabilities, since terminology may exhibit significant diatopic variation and evolve diachronically.

\subsection{The ``long tail'' challenge}

The consistent disparity between Micro F1 and Type F1 scores across all teams indicates that identifying frequent term mentions is significantly easier than discovering the full diversity of unique terms.

This ``long tail'' problem is particularly relevant for the intended application of ATE systems. For a terminologist or a translator, high-frequency terms (e.g., \textit{``rifiuto''}, \textit{``organico''}) are often already known and documented. The real value of an extraction system lies in its ability to surface rare, highly specific, or emerging terms (e.g., \textit{``presidi sanitari monouso''}, \textit{``pseudo-edili''}).
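
A toy example illustrates why the two scores diverge, assuming that micro-level scores count every term occurrence while type-level scores count each unique term once; the counts below are hypothetical. Suppose the gold standard contains a frequent term annotated in eight sentences and two rare terms annotated in one sentence each, and that a system retrieves every occurrence of the frequent term, misses both rare terms, and produces no false positives. Precision is $1$ at both levels, whereas
\[
R_m = \frac{8}{10} = .8, \quad F1_m = \frac{2 \times 1 \times .8}{1 + .8} \approx .889, \qquad
R_t = \frac{1}{3}, \quad F1_t = \frac{2 \times 1 \times \tfrac{1}{3}}{1 + \tfrac{1}{3}} = .5 .
\]
Such a system looks strong at the mention level although it has missed two of the three terms that a terminologist would want to see.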

\subsection{Hybrid approaches for semantic clustering}

Finally, the results of Subtask B shed light on the complexity of Term Variants Clustering. The baseline performance was low ($F1_{b^3} = .245$), confirming that simple zero-shot prompting struggles to effectively group domain-specific synonyms without guidance. As observed in the results, the main challenge lies in the strict definition of synonymy required by the task; systems must navigate the subtle semantic boundary that separates synonyms from hypernyms and hyponyms, a distinction where the baseline frequently faltered.

The winning team, TrietNLP, achieved a markedly higher score ($F1_{b^3} = .441$) by employing a hybrid approach. This result suggests that an effective strategy for this task is a composite pipeline: using symbolic methods to handle surface-level morphological variations and leveraging the reasoning capabilities of LLMs to enforce semantic equivalence.

\section{Conclusion}
\label{sec:conclusion}

The ATE-IT shared task addressed the need for a dedicated benchmark for Automatic Term Extraction in Italian. By introducing a dataset characterized by marked terminological variation, the campaign allowed for a comparative analysis of diverse computational approaches.

The results from the participation of 9 teams and the evaluation of 13 runs point to three main findings. First, regarding Term Extraction, fine-tuned Transformer architectures outperformed the naive zero-shot baseline provided for the task. This indicates that while simple prompting strategies are insufficient for high-precision extraction, supervised models remain a reliable standard for this specific workload. However, given the rapid evolution of generative AI, these results should not discount the potential of LLMs; rather, they suggest that future research should investigate more advanced prompting techniques to bridge the gap with dedicated supervised systems.

Second, the performance gap observed between the development and test sets indicates that geographic and temporal generalisation remains a complex issue. Current models show a tendency to overfit to the specific term variants of the training data, resulting in lower F1 scores when applied to documents from different municipalities.

Finally, the low baseline results in Term Variants Clustering suggest that synonym clustering requires more than simple semantic similarity. The most effective system adopted a hybrid strategy, combining symbolic methods for morphological matching with LLMs for semantic disambiguation.

We hope the open release of the ATE-IT dataset will support the community in developing more robust Automatic Term Extraction methodologies specifically tailored to the complexities of the Italian language.

\section*{Declaration on Generative AI}

 During the preparation of this work, the authors used Gemini in order to: Drafting content, Improve writing style, and Abstract generation. Further, the authors used Gemini for equations 1 to 4 in order to: Formatting assistance. Finally, the authors used Gemini and Grammarly in order to: Grammar and spelling check. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content. 

\bibliography{ate_it_references}

\end{document}

%%
%% End of file
