%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Minted listings support 
%% Need pygment <http://pygments.org/> <http://pypi.python.org/pypi/Pygments>
\usepackage{listings}
\usepackage{multirow}
\usepackage{pifont}% http://ctan.org/pkg/pifont
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%

\usepackage{xcolor}
\usepackage{graphicx} % Required for inserting images
\usepackage{hyperref}
\usepackage{subcaption}
\usepackage{cleveref}
\usepackage[framemethod=tikz]{mdframed}
\usepackage{pgf} % for calculating the values for gradient
\definecolor{myblue}{RGB}{0, 105, 255}
\definecolor{mygray}{gray}{0.9}
\definecolor{myyellow}{RGB}{255, 143, 0}
\definecolor{mygreen}{RGB}{0, 168, 29}

\newmdenv[
    backgroundcolor=blue!30,
    linecolor=blue,
    linewidth=1pt,
    roundcorner=4pt, % Rounded corners
    innertopmargin=8pt,
    innerbottommargin=8pt,
    tikzsetting={fill=blue!5},
    frametitle={\textbf{Sub-Task A Example}} % Box title
]{genericprompt}

%% auto break lines
\lstset{breaklines=true}
\newcommand{\desegma}{DeSegMa-IT}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{DeSegMa-IT at EVALITA 2026: Overview of the ``Detection and Segmentation of Machine Generated Text in Italian'' Task}

% \tnotemark[1]
% \tnotetext[1]{You can use this document as the template for preparing your
%   publication. We recommend using the latest version of the ceurart style.}

%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1]{Giovanni Puccetti}[%
% orcid=0000-0002-0877-7063,
email=giovanni.puccetti@isti.cnr.it,
% url=https://gpucce.github.io/,
]
\cormark[1]
\fnmark[1]
\address[1]{Istituto di Scienza e Tecnologie dell'Informazione ``A. Faedo'' (CNR-ISTI),
  Via G. Moruzzi 1, Pisa, 56124, Italy}

\author[1]{Andrea Pedrotti}[%
% orcid=0000-0001-7116-9338,
email=andrea.pedrotti@isti.cnr.it,
% url=https://andreapdr.github.io/,
]
\fnmark[1]

\author[1]{Andrea Esuli}[%
% orcid=0000-0002-9421-8566,
email=andrea.esuli@isti.cnr.it,
% url=https://www.esuli.it/,
]

%% Footnotes
\cortext[1]{Corresponding author.}
\fntext[1]{These authors contributed equally.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
  The DeSegMa-IT shared task aims to test the robustness of machine-generated text (MGT) detectors by evaluating their performance in settings where the independent and identically distributed (IID) assumption does not hold. While state-of-the-art MGT detectors report high accuracy, such results often rely on unrealistic experimental settings: for example, they assume prior knowledge of the text generator, or fail to consider domain shifts and efficient fine-tuning (or post-tuning) strategies.
  In DeSegMa-IT, participants are challenged with two sub-tasks: \textit{(i)} document-level detection of MGT and \textit{(ii)} human-machine text segmentation.
  This paper describes the released dataset, discusses the systems submitted by participants, and provides an initial analysis of the obtained results.
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  Machine-Generated Text Detection \sep
  Text Segmentation \sep
  Text Classification
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Introduction and Motivation}


Recent advancements in Generative AI and Large Language Models (LLMs) have led to the development of systems, such as GPT-4 \citep{OpenAI_2023_GPT-4_Technical_Report}, Claude\footnote{\href{https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf}{anthropic.com/claude-model-card}}, Llama-3 \citep{dubey2024llama} and DeepSeek-V3 \citep{deepseekai2024deepseekv3technicalreport} among others, that can generate text that is often indistinguishable from human-written content \citep{Dugan_Ippolito_Kirubarajan_Shi_Callison-Burch_2023}.

This capability, along with the many beneficial applications of LLMs, also enables malicious actors to create Machine Generated Text (MGT) for deceptive purposes. For example, it can be used to manipulate online traffic and spread misinformation through content farms \citep{Puccetti_2024} or to influence human revisions of sensitive documents in critical domains, such as scientific peer review.\footnote{\href{https://www.aclweb.org/adminwiki/index.php/ACL_Policy_on_Publication_Ethics}{aclweb.org/genai-peerreview-guidelines}} 

Beyond malicious use, the widespread integration of LLMs into everyday tools raises concerns related to authorship attribution, intellectual property, and the transparency of human-AI collaboration. In journalistic \citep{Puccetti_2024}, educational \citep{KASNECI2023102274}, and governmental settings \citep{corazza2025hybrid}, the ability to distinguish between human-written and machine-generated content has become of the utmost importance to preserve trust in such sensitive domains, while also allowing for responsible AI deployment.

As a consequence, the task of MGT detection has received increasing attention in recent years.
With the rapid proliferation of AI-based assistants, the need for automatic tools to detect their outputs has become more urgent, leading to the proposal of numerous detection methods \citep{li-etal-2024-mage,DBLP:conf/nips/HuCH23,abassy-etal-2024-llm}. However, most existing work focuses on English, which highlights the importance of developing MGT detection systems for other languages as well.
Furthermore, the widespread availability of open-weight LLMs confronts existing detection methods with an ever-growing array of fine-tuned models, making the detection task increasingly non-IID (independent and identically distributed) and stressing the detectors' ability to generalize to subtle shifts in writing style \citep{pedrotti2025stresstestingmachinegeneratedtext}.

To address this gap, the community has proposed shared tasks on MGT detection, mainly focused on English or on multilingual settings \citep{DBLP:conf/semeval/WangMISSTAMPA24, wang-etal-2025-genai}, along with benchmark datasets \citep{wang2024m4gtbench}. To extend these efforts to the Italian language, the EVALITA 2026 \citep{evalita2026overview} shared task \desegma{} aims to foster research on Italian MGT by providing the research community with a benchmark dataset and a standardized evaluation framework.
% With the recent advancements of Generative AI and the proliferation of AI-based assistants, the need for automatic tools for the detection of their outputs has become even more urgent. 
% As a result, many detectors have been already proposed.

\section{Task Definitions}
% \desegma{}\footnote{\url{https://desegma.github.io/}} aims to test the robustness of machine-generated text detectors by evaluating their performance under settings where the IID assumption does not hold.
% While state-of-the-art MGT detectors have reported high accuracy, such results often stem from unrealistic experimental settings: for example, relying on prior knowledge of the text generator (e.g., \cite{mitchell_detectgpt_2023}), or failing to consider domain shifts and efficient fine-tuning (or post-tuning) strategies. \desegma{} is structured into two sub-tasks: (\textit{i}) MGT detection and (\textit{ii}) human-machine text segmentation. The first sub-task is evaluates the accuracy of detection systems at the document level, while the latter focuses on the fine-grained identification of a machine-generated text within a human-written document.
% While in the first setting, we simulate non-IID conditions by using different LLM generators for the training and testing datasets, in the latter the two dataset splits share the same LLM generators given the inherently difficulty of the segmentation task. 
The \desegma{}\footnote{\url{https://desegma.github.io/}} shared task is organized into two sub-tasks: \textit{(i)} MGT detection and \textit{(ii)} human--machine text segmentation. The first sub-task evaluates detection accuracy at the document level and aims to test the robustness of machine-generated text detectors under realistic, non-IID conditions. While state-of-the-art MGT detectors report high accuracy, such results often rely on unrealistic experimental settings: for example, they depend on prior knowledge of the text generator (e.g., \cite{mitchell_detectgpt_2023}) or ignore domain shifts and fine-tuning strategies. To address this, we simulate real-world non-IID conditions by generating the training and testing datasets with different LLMs.

In contrast, the segmentation sub-task focuses on identifying machine-generated text within a human-written document. Given the inherently higher difficulty of this sub-task, both dataset splits share the same LLM generators.


\subsection{Sub-task A: MGT Detection in the Wild}

In \textbf{sub-task A}, we simulate the challenge posed by the ever-shifting domain of MGT detection by sampling train and test documents from two disjoint sets of generating LLMs.
The task is structured as a binary-classification problem and defined as follows: \textit{``Given a piece of text $t$, assign it the label 0, if the text is written by a human, and 1 otherwise.''}
We provide an example in \Cref{fig:subtask_a}. 

\begin{figure}[t]
    \centering
\begin{genericprompt}
\textbf{Human Text:} Viktor Orban, da quando il leghista è diventato ministro dell'Interno, si mandano segnali di apprezzamento reciproco. Un'alleanza che potrebbe portare l'Italia al fianco dei Paesi di Visegrad e che già - in occasione della discussione sulla riforma di Dublino - ha messo in difficoltà gli altri partner dell'Ue. Il primo ad allacciare rapporti era stato Salvini. Da Frosinone il ministro aveva parlato dell'Ungheria come di un paese con cui l'Italia potrà cambiare l'Europa. Entrambi, in fondo, sono dichiaratamente euroscettici. E sia Orban che il leghista ...
\textbf{Label:} 0
\newline
\newline
\textbf{Machine Text:} Viktor Orban, dopo anni di duri scontri diplomatici, sono pronti a unire le loro forze per riscrivere l'agenda di Bruxelles. Il leader di Fratelli d'Italia Giorgia Meloni e il presidente del governo ungherese si sono visti a Vienna, in un vertice di "centrodestra, identità italiana e sovranità italiana". Dopo un incontro di ben tre ore i due hanno spiegato come si possano conciliare politicamente le due visioni d'Europa ...
\textbf{Label:} 1
\end{genericprompt}
    \vspace{-10px}
    \caption{Sub-task A Example.}
    \vspace{-10px}
    \label{fig:subtask_a}
\end{figure}

\begin{figure}[t]
    \centering
\begin{genericprompt}[frametitle={\textbf{Sub-task B Example}}]
\textbf{Text: }Il presidente del Tribunale internazionale del diritto del mare (Itlos), Vladimir Golitsyn, ha fissato \textit{al 10 agosto la data in cui il tribunale arbitrale di Amburgo esaminerà le informazioni che l'Italia intende raccogliere in India per scagionare i Marò . Nei giorni scorsi su vari quotidiani erano uscite indiscrezioni circa la data di un eventuale incontro tra i due fucilieri di Marina e i loro avvocati e i funzionari del ministero dell'Interno indiano, che dovrebbero rilasciare a loro una sorta di "licenza" temporanea così che i due marinai possano recarsi in India.} \newline\newline \textbf{Target Character Index:} 103
\end{genericprompt}
    \vspace{-10px}
    \caption{Sub-task B Example. Non-italic text represents the human segment, while the \textit{italic} text denotes the continuation generated by an LLM.}
    \label{fig:subtask_b}
    \vspace{-10px}
\end{figure}

\subsection{Sub-task B: Human-Machine Text Segmentation}
In the \textbf{second sub-task}, participants are required to detect the boundary between the human-written text and the machine-generated continuation by identifying the index of the character that marks the beginning of the MGT content. Each data sample consists of a variable-length human-written prompt, always followed by a variable-length continuation produced by the model.
Unlike traditional MGT detection tasks that require document-level binary classification, this sub-task focuses on segmentation: participants must pinpoint 
the beginning of the text generated by the LLM.

The task is defined as follows: \textit{``Given a piece of text $t$, return the index of the first character that is generated by an LLM.''} To ensure a statistically robust evaluation, the length of the human-written sub-string is uniformly sampled from a range of 64 to 512 characters.
This setup simulates real-world scenarios in which MGT may be inserted into otherwise human-written content. 
% The same techniques described for the previous sub-task are used to generate continuations of varying complexity.
We provide an example in \Cref{fig:subtask_b}. 
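
The target can also be stated compactly. The following is a sketch in notation introduced here, under the assumption (inferred from the example in \Cref{fig:subtask_b}, not stated explicitly in the task definition) that character indices are 0-based. Let $h$ be the human-written prefix and $m$ the machine-generated continuation, so that the input text is the concatenation $t = h \,\Vert\, m$. The target index $y$ then coincides with the length of the human-written prefix:
$$
y = |h|, \qquad 64 \le |h| \le 512 .
$$
In the example of \Cref{fig:subtask_b}, the human segment, up to and including the space that follows ``fissato'', spans 103 characters, which matches the reported target index of 103.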


\section{Dataset}
For each of the two sub-tasks, we provide participants with dedicated training and evaluation data.
In this section, we describe the two datasets in detail.

\subsection{Sub-task A: MGT Detection}
The dataset for sub-task A consists of both human-written and machine-generated texts. Human texts are sampled from the Change-IT dataset \cite{demattei-etal-2020-changeit}. The original dataset consists of news articles collected from two Italian newspapers: \textit{La Repubblica}\footnote{\url{https://www.repubblica.it/}} and \textit{Il Giornale}\footnote{\url{https://www.ilgiornale.it/}}. From the dataset, we retain the headline field, which stores the title of each news article, and the human-written article itself.
We use the headline to prompt a pool of LLMs to generate a synthetic version of the article. Furthermore, we provide the LLM with guidelines regarding the political leaning of the original news outlet. We also prompt the model to avoid any formatting (e.g., to refrain from using bullet points, markdown sections, etc.) so that its output adheres more closely to the style of the human-written articles. All models are prompted with their respective default generation parameters.
We report the prompts used in \Cref{tab:prompts}. We sample a random, balanced selection of human-written and machine-generated texts to construct the task's datasets.

% prompting strategy for sub-task A
\begin{table}[t]
    \centering
    \begin{tabular}{c|p{10cm}}
         Source & Prompt  \\
         \hline
         Il Giornale &  Sei un giornalista italiano che scrive sul giornale conservatore di destra "Il Giornale". Scrivi un articolo di giornale a partire da questo titolo: \texttt{<NEWS TITLE>}. Evita qualsiasi tipo di formattazione. Non generare il titolo, inizia direttamente dal corpo dell'articolo. \\
         La Repubblica & Sei un giornalista italiano che scrive sul  giornale progressista di sinistra "La Repubblica". Scrivi un articolo di giornale a partire da questo titolo: \texttt{<NEWS TITLE>}. Evita qualsiasi tipo di formattazione. Non generare il titolo, inizia direttamente dal corpo dell'articolo.\\
    \end{tabular}
    \caption{Prompts used for the generations.}
    \label{tab:prompts}
\end{table}

\paragraph{Training dataset}
We collect 33,138 news articles, equally split between human-written and machine-generated texts, while preserving a 1:1 ratio for each news outlet. The models used to create the machine-generated part of the training dataset are reported in Appendix \ref{sec:appendix_taskA}.

We sample texts from the generators so as to maintain a balanced distribution among LLMs. \Cref{subfig:trainA-len-by-class} reports the distribution of text lengths and \Cref{subfig:trainA-texts-per-generator} shows the distribution of generating LLMs. We include 15,135 articles, balanced according to the target classes and evenly split among generators.

\paragraph{Test dataset} We include 13,268 articles evenly balanced across news outlets. \Cref{subfig:testA-len-by-class} reports text length statistics and \Cref{subfig:testA-texts-per-generator} shows the distribution of texts with respect to the generating LLMs. The selected generators are reported in Appendix \ref{sec:appendix_taskA}.


\subsection{Filtering by Perplexity}
To make the classification more challenging, we use a held-out model, \texttt{gemma-3-12b-it}, to compute the mean perplexity of the human-written texts. We then filter out MGTs whose perplexity drifts too far from this mean value. This process selects documents that are similar in style according to the evaluator LLM, narrowing the gap between the two classes.
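
As a sketch of this filter, assume the standard token-level definition of perplexity. The symbols $N$, $p_\theta$, $\mu_H$, and the tolerance $\tau$ are introduced here for illustration and are not taken from the released pipeline. For a text $t = (w_1, \dots, w_N)$ scored by the evaluator model with parameters $\theta$,
$$
\mathrm{PPL}(t) = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(w_i \mid w_{<i})\Bigr).
$$
Writing $\mu_H$ for the mean perplexity of the human-written texts, a machine-generated text $m$ would then be retained only if
$$
\bigl|\mathrm{PPL}(m) - \mu_H\bigr| \le \tau ,
$$
where $\tau$ is a hypothetical tolerance; the exact criterion used to decide when a text is too far from the mean may differ.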

\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/text_length_distribution_by_class.png}
        \caption{Text length distribution by class (training set)}
        \label{subfig:trainA-len-by-class}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/number_of_texts_per_generator.png}
        \caption{Number of texts per generator (training set)}
        \label{subfig:trainA-texts-per-generator}
    \end{subfigure}
    \caption{Text statistics for sub-task A training data.}
    \label{fig:textlens-taskA}
\end{figure}


\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/test_text_length_distribution_by_class.png}
        \caption{Text length distribution by class (test set)}
        \label{subfig:testA-len-by-class}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/test_number_of_texts_per_generator.png}
        \caption{Number of texts per generator (test set)}
        \label{subfig:testA-texts-per-generator}
    \end{subfigure}
    \caption{Text statistics for sub-task A test data.}
    \label{fig:textlens-taskA-test}
\end{figure}


\subsection{Sub-task B: Human-MGT Segmentation}
Data for sub-task B consist of news articles that switch from human-written content to a machine-generated continuation. Each text is paired with an integer denoting the length of the human-written part. To generate plausible continuations, we select a human-written article and discard up to the first three sentences of the text. The remaining segment is used as an input prompt and fed to one of nine LLMs. The rationale for discarding the initial portion of the article is to present participants with a more challenging scenario, designed to mimic the detection of machine-generated text segments appearing at arbitrary positions within an article, rather than only at its beginning.

\paragraph{Training dataset} This split consists of 19,945 news articles. Generating LLMs are reported in Appendix \ref{sec:appendix_taskB}. We set the minimum length of the human-written segment to 64 characters, and the maximum length to 512 characters. The length distribution of human-written segments is shown in \Cref{subfig:trainB-human-len} and the length distribution of the full texts (human-written segment plus machine-generated continuation) is reported in \Cref{subfig:trainB-total-len}.

\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/train_human_len_histogram.png}
        \caption{Human text length distribution}
        \label{subfig:trainB-human-len}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/train_total_len_histogram.png}
        \caption{Total text length distribution}
        \label{subfig:trainB-total-len}
    \end{subfigure}
    \caption{Text length statistics for sub-task B training data.}
    \label{fig:textlens-taskB}
\end{figure}

% some statistics (figures) for the test dataset sub-task B
\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/test_human_len_histogram.png}
        \caption{Human text length distribution (test set)}
        \label{subfig:testB-human-len}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{assets/test_total_len_histogram.png}
        \caption{Total text length distribution (test set)}
        \label{subfig:testB-total-len}
    \end{subfigure}
    \caption{Text length statistics for sub-task B test data.}
    \label{fig:textlens-taskB-test}
\end{figure}

\paragraph{Test dataset} For the test set, we keep the same LLMs and select 23,211 new samples, while maintaining an even distribution among models. \Cref{subfig:testB-human-len} shows the length distribution of the human segments at the beginning of each sample and \Cref{subfig:testB-total-len} the length distribution of the joined human-written and machine-continued texts.


\section{Evaluation Metrics}
We define the following evaluation metrics for each sub-task:
\begin{itemize}
    \item For sub-task A: the main evaluation metric is the Accuracy obtained by each system on the test set. We also report the True Positive Rate (TPR) and False Positive Rate (FPR) for all systems.
    \item For sub-task B: the evaluation metric is the Mean Absolute Error (MAE), computed as follows:
    $$
    \text{MAE} = \frac{1}{n}\sum^{n}_{i=1}|y_i - x_i|
    $$
    where $y_i$ and $x_i$ denote the gold and predicted boundary character indices of the $i$-th test sample, and $n$ is the number of test samples.
\end{itemize}
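
For reference, the sub-task A metrics can be written in terms of confusion-matrix counts. This is a sketch in notation introduced here rather than in the task guidelines: the machine-generated class (label 1) is taken as the positive class, and $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
$$
\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}
$$
Under this convention, the FPR is the fraction of human-written texts that are wrongly labeled as machine-generated. For the MAE, a toy example with hypothetical values (not drawn from the dataset) may help: with gold boundary indices $(103, 250, 400)$ and predicted indices $(110, 240, 400)$,
$$
\text{MAE} = \frac{|103 - 110| + |250 - 240| + |400 - 400|}{3} = \frac{17}{3} \approx 5.67 ,
$$
i.e., the predicted boundary is off by fewer than six characters on average.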

\section{Participants}

We received a total of 35 submissions from six unique teams. Each team could submit up to 10 runs per sub-task, with the best-performing one automatically selected as the final score. To handle submissions we use the Codabench platform \cite{XU2022100543}.

Furthermore, the \desegma{} task was joined by Pangram Labs\footnote{\href{https://www.pangram.com/}{Pangram Labs}} as a non-competing industrial participant. In \Cref{tab:team_info_subtasks_runs_multirow}, we provide an overview of the participants' affiliations, team composition, joined sub-tasks, and number of test runs submitted.

\input{tables/participation}

\paragraph{Gradient Descenders} 
    For the \textbf{sub-task A}: 
    The team \cite{desegma2026gradientdescenders} employs the \texttt{UmBERTo} model,\footnote{\href{https://github.com/musixmatchresearch/umberto}{https://github.com/musixmatchresearch/umberto}} an encoder-only language model trained on Italian data. The authors add a Multi-Layer Perceptron (MLP) with two dense layers and tanh activation as a classification head that takes the model's \texttt{[CLS]} token representation as input.

    For the \textbf{sub-task B}: The team approaches segmentation as a token-level binary classification task. They assign each token a human-written or machine-generated label and train a binary classifier. At inference time, the first token in the sequence classified as machine-generated is selected, and the boundary index is obtained by mapping this token to the character at which it starts. For binary classification they rely on a \texttt{DeBERTa} model fine-tuned for the Italian language.\footnote{\href{https://huggingface.co/osiria/deberta-base-italian}{https://huggingface.co/osiria/deberta-base-italian}}

\paragraph{Kenji Endo} 
    For the \textbf{sub-task A}: 
    The team \cite{desegma2026kenjiendo} employs a decoder-only transformer architecture pre-trained from scratch on the Kenji-Endo dataset, exploring both a dense model and a Mixture-of-Experts (MoE) variant. The team investigates two classification paradigms for this sub-task: a discriminative fine-tuning approach, in which a classification head is trained on top of the dense model, and a generative, prompt-based approach, applicable to both the dense and MoE models.
    For discriminative training, the dense model is fine-tuned either for a single epoch on the full training set or for multiple epochs on a reduced subset, while the generative setting performs inference via prompting without additional fine-tuning. The dense discriminative model trained for a single epoch on the full dataset was selected for submission.

\paragraph{UniTor} 
    For the \textbf{sub-task A}:
    The system submitted by the team \cite{desegma2026unitor} consists of a fine-tuned version of \texttt{ModernBERT-large} trained on an augmented\footnote{Note that this was not allowed by the competition rules, as reported on the website (\url{https://desegma.github.io/}): ``Keep in mind that you should only use the training dataset we make available to train your detectors.''} dataset. This augmented dataset consists of 14,000 additional instances: 7,000 human-written articles from the CHANGE-IT \cite{changeit_2020} corpus of Italian newspaper articles and 7,000 synthetic samples generated by translating the English RAID benchmark \cite{dubey-2024-evaluating}.

    For the \textbf{sub-task B}: 
    The team leverages \texttt{ModernBERT-large} fine-tuned for token-level binary classification, where each token is labeled as human-authored or machine-generated. A two-layer MLP classification head produces per-token probabilities. Rather than thresholding individual token predictions, boundary localization is performed via a change point detection procedure: the boundary token is selected by maximizing a score that aggregates log-likelihood evidence for human-authored tokens before the boundary and machine-generated tokens after it.
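
    One plausible way to write such a change point score, offered here as an illustrative sketch and not necessarily the team's exact formulation, is the following. Let $p_j$ denote the predicted probability that token $j$ is machine-generated, for $j = 1, \dots, T$ (notation introduced here). The boundary token $\hat{b}$ is then
    $$
    \hat{b} = \arg\max_{1 \le b \le T} \Bigl[\, \sum_{j < b} \log (1 - p_j) + \sum_{j \ge b} \log p_j \Bigr],
    $$
    so that tokens before $b$ contribute evidence for the human class and tokens from $b$ onward contribute evidence for the machine class. Because the score pools evidence over the whole sequence, a single misclassified token has limited influence on the selected boundary.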

\paragraph{Nicla} 
    For the \textbf{sub-task A}: The team
    \cite{desegma2026nicla} addresses the task using \texttt{DistilBERT-base} fine-tuned for binary text classification. The model employs a standard sequence classification head, with hyperparameters selected via Bayesian optimization on a held-out validation subset to improve accuracy and generalization.

    For the \textbf{sub-task B}: The team addresses the task by training a \texttt{LightGBM} regressor on sentence embeddings produced by a frozen Sentence-BERT model based on \texttt{all-MiniLM-L6-v2}, directly predicting the character index of the human–machine boundary.

\paragraph{Stochastic Gradient Descenders}
    For the \textbf{sub-task A}: The team \cite{desegma2026stochasticgradientdescenders} frames sub-task A as a conditional single-token generation task using a decoder-only instruction-following LLM. The team fine-tunes \texttt{Qwen2.5-0.5B-Instruct} via supervised fine-tuning (SFT) with a conversation-style prompt, where the model outputs ``0'' for human-written or ``1'' for machine-generated text. Low-Rank Adaptation (LoRA) is applied to all linear layers to reduce computational cost and mitigate forgetting. At inference, predictions are obtained through greedy decoding of the first token, with a fallback that inspects the raw logits if the model outputs non-numeric text.

    For the \textbf{sub-task B}: The team reformulates the task as a token-level sequence labeling problem, labeling each token as human-written or machine-generated. The team fine-tunes \texttt{UmBERTo} \cite{musixmatch-2020-umberto} with a standard token-level classification head on top of the encoder. During inference, the model predicts per-token labels, and the boundary is extracted as the start character of the first token classified as machine-generated. Full fine-tuning is performed with standard optimization and mixed-precision training.

\paragraph{MINDS} 
    For the \textbf{sub-task B}: 
    The team \cite{desegma2026minds} framed the task as a token-level sequence labeling problem, using an encoder-only transformer architecture to predict human–LLM segment boundaries. They experimented with different pretrained backbones, including \texttt{BERT} and \texttt{RoBERTa} variants, under a shared training setup, finding that an Italian-specific \texttt{BERT}\footnote{\url{https://huggingface.co/dbmdz/bert-base-italian-cased}} model yielded the best performance. Token-level predictions are converted into hard labels via a post-training threshold selection step, where the decision threshold is tuned on a validation set. 

\input{tables/subtask-a-results}

\paragraph{Baseline} 
    For the \textbf{sub-task A}: 
    The baseline system is based on a multilingual version of \texttt{DeBERTa-v3} \cite{he2023debertav} with a classification head. During training, the backbone model is kept frozen, and only the head's parameters are updated. The baseline is trained for one epoch on the whole training set.

    For the \textbf{sub-task B}: 
    The baseline system is based on a multilingual version of \texttt{DeBERTa-v3} \cite{he2023debertav} with a token-level classification head. During training, the backbone model is kept frozen, and only the head's parameters are updated. The baseline is trained for one epoch on the whole training set. The switching token is the leftmost token predicted as machine-generated, which is then mapped to its starting character index.

\section{Results}
Most teams participated in both sub-tasks, with the exception of the \textit{Kenji-Endo} team, which participated only in sub-task A, and the \textit{MINDS} team, which participated only in sub-task B. We describe the results of the two sub-tasks separately.



\subsection{Sub-task A}

\input{tables/design-choices}

\Cref{tab:res-detection} reports the results obtained by participants on sub-tasks A and B.
Note that, for the ranking of submitted systems, we consider the accuracy score. All submitted systems outperform the baseline model. The best-performing system was submitted by the team \textit{Gradient Descenders}, which fine-tuned the Italian-only language model \texttt{UmBERTo} with a two-layer classification head. Notably, the second-best submission, by team \textit{Kenji-Endo}, is based on a decoder-only transformer pre-trained from scratch on an Italian-only corpus, with the goal of assessing the effectiveness of smaller language models trained on curated, language-specific data. This result underscores the potential for decoder-only models to be used for document-level tasks, such as machine-generated text detection.

The two best-performing submissions report an accuracy marginally above 0.94, with the highest being 0.9458. This indicates that, while the developed systems are effective at detecting machine-generated texts, approximately 5\% of the texts are misclassified, i.e., human-written texts are classified as machine-generated or vice versa. This is a significant limitation for a sensitive task such as MGT detection: for example, an error of this kind can affect school- or work-related grading or performance reviews. \Cref{tab:res-detection} also reports a late submission from the \textit{UniTor} team (which does not count for the ranking), showing that higher accuracy is attainable on our dataset: it achieves an accuracy of 0.9578, suggesting that novel techniques can further improve the leaderboard of the \desegma{} task.

To provide a deeper analysis of the kinds of errors made by participating teams, it is worth examining the models' predictions through metrics other than accuracy. For this reason, in \Cref{tab:res-detection} we report False Positive and True Positive Rates (FPR and TPR, respectively).

All systems have an FPR lower than 2\%, with the highest value, 0.0178, shown by \textit{Kenji-Endo} and the lowest, 0.0033, by \textit{Stochastic Gradient Descenders}. Based on these results, in the best case about 3 out of every 1,000 human-written texts would be wrongly attributed to generative AI, while in the worst case this would happen for about 18. These results highlight that participating systems occasionally attribute human-written texts to AI systems. This is undesirable, as it undermines the authorship of ``authentic'' content created by humans.

We also investigate which design choices resulted in better performance. Specifically, \Cref{tab:participants_choices} reports the main decisions made by each team when developing their system. As expected, all teams used language models to tackle \desegma{}; we therefore report which teams used encoder-only and/or decoder-only language models and the language each model was trained on. For sub-task A, we see an almost even split between encoder-only and decoder-only language models (three and two, respectively). Interestingly, the best and second-best results are obtained by teams using an encoder-only and a decoder-only language model, respectively. Concerning the model language, we see a clear benefit from using models trained on Italian texts, a choice made by the two best-performing submissions.


\subsubsection{Pangram Labs' Participation}
Pangram\footnote{\href{https://www.pangram.com/}{https://www.pangram.com/}} is an online tool for the automatic detection of machine-generated texts. It adopts a transformer-based classifier trained on a dataset developed by Pangram Labs to detect both closed-source generative AI systems, such as \texttt{GPT-4-0613}, and open-source ones, such as \texttt{Llama-2-70b-chat} \citep{emi2024technicalreportpangramaigenerated}. Its approach to the \desegma{} task differs from the others because the system did not have access to the training set, which was released when the Pangram detector was already available. For this reason, its results are reported in \Cref{tab:res-detection} as non-competing. Pangram has lower accuracy than models trained on the training set developed specifically for the \desegma{} task; however, it is the only system with an FPR of 0. On our test set, this means that it never attributes human-written texts to generative AI.

\subsection{Sub-task B}
\input{tables/subtask-b-results}

\Cref{tab:res-segmentation} reports the results for Sub-task B. Two out of the five submitted systems outperform the baseline. The best-performing system was submitted by the \textit{Stochastic Gradient Descenders} team, which fine-tuned the Italian-only language model \texttt{UmBERTo} and adopted a leftmost machine-generated token decision strategy. The second-best system, by team \textit{MINDS}, also relies on an Italian-specific variant of \texttt{BERT}, confirming the relevance of language-specific pre-training for fine-grained tasks such as segmenting human-written from machine-generated text. Among the remaining teams, the regression-based approach proposed by team \textit{Nicla} does not achieve competitive performance, which suggests that framing segmentation as a sequence-labeling problem is more effective.

\Cref{tab:participants_choices} also reports the design choices for sub-task B, where there is a clear preference for encoder-only language models. As for sub-task A, there is an advantage in using Italian-first models, the choice made by the two best-performing teams.

\section{Discussion}
\begin{figure}[t]
    \centering
    \includegraphics[width=1\linewidth]{assets/recall_by_generator_and_team.png}
    \caption{Sub-task A: Per-generator recall for each participant.}
    \label{fig:per-generator-acc}
\end{figure}

Detecting machine-generated texts is challenging, mostly because of the general-purpose abilities of modern LLMs and the large number of available models. \desegma{}'s sub-task A challenged participants to train detectors that address specifically this difficulty by including synthetic texts generated by different systems. In particular, while the training and test sets do not share any data, they share some of the LLMs used to generate synthetic texts. As a result, the accuracy of the detectors is high but not close to a perfect score, highlighting that sensitivity to the LLM used to generate texts is a key factor to account for when training MGT detectors.

To better understand how different generators affect detection performance, we analyze systems separately for each generator, as reported in \Cref{fig:per-generator-acc}. This analysis is conducted exclusively on the subset of machine-generated texts, with human-authored texts excluded. As no negative instances are present in this setting, we report recall (i.e., the true positive rate) for each generator.
This analysis reveals interesting patterns. All systems achieve high recall on texts generated by English-first pretrained models: on texts generated by \texttt{Llama-3.3-70B} and \texttt{Gemma-3-27B}, the average recall exceeds 90\%. Similar results are observed for English-pretrained models later fine-tuned on Italian: \texttt{ANITA-Next-24B} is easy to detect, with systems maintaining a recall above 90\%. In contrast, models pre-trained from scratch on Italian data are more challenging to detect, with recall scores close to 50\%. The only model that deviates from this pattern is \texttt{Llama-3.1-8B}, which, despite being pretrained on English, evades some of the detectors. We interpret this as evidence of unexpected generation patterns when the model produces Italian texts.
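
As a sketch of the quantity reported for each generator, let $D_g$ denote the set of test texts produced by generator $g$ and $\hat{y}(x) \in \{0, 1\}$ a system's prediction on text $x$, with $1$ denoting machine-generated (this notation is introduced here for illustration only). The per-generator recall can then be written as:
\[
\mathrm{Recall}_g = \frac{\left|\{x \in D_g : \hat{y}(x) = 1\}\right|}{\left|D_g\right|}.
\]
Since every text in $D_g$ is machine-generated, this value coincides with the accuracy of the system restricted to generator $g$.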

MGT detection admits two types of error: human-written texts classified as AI-generated and, conversely, AI-generated texts classified as human-written. In real applications, the first type of error can be more harmful, since it results in genuine human effort being misattributed to AI systems. To quantify how often this happens, we compute the FPR of participating systems and find that all of them fall between 0.3\% and 2\%. While this shows that the proposed systems rarely make this type of error, there are still instances of human-written texts attributed to AI.

The only system with an FPR of 0 is the Pangram detector, which is, however, less accurate than the others; this may be due to not relying on the \desegma{} training set. On our test set, Pangram never attributes human-written texts to AI, but it is more prone to attributing machine-generated texts to humans. Favoring this type of error is a desirable design choice for the responsible deployment of MGT detectors.

\begin{figure}[t]
    \centering
    \includegraphics[width=1\linewidth]{assets/mae_by_generator_and_team.png}
    \caption{Sub-task B: Per-generator MAE for each participant.}
    \label{fig:per-generato-mae}
\end{figure}


Sub-task B, segmenting human-written and machine-generated texts, requires participants to identify the character at which a text whose first part is human-written switches to machine-generated. For this task, we measure performance through character-level mean absolute error (MAE), i.e., the average number of characters between the predicted switch point and the true one. Achieving an error below 50 characters has proven challenging for participants, with the best system achieving a MAE of 52.54. Unlike in sub-task A, it is not evident which models are harder to segment. \Cref{fig:per-generato-mae} shows the MAE achieved by each submission restricted to single models. There are fewer clear patterns than in sub-task A. The main observation we draw is that \texttt{SmolLM3-3B} and \texttt{Minerva-7B} are more difficult to segment, with an average MAE above 70 characters, while \texttt{GPT-OSS-20B} is relatively easier, with an average MAE close to 40. On the remaining models, the average MAE of participants is evenly spread between 40 and 70, showing that the text generated by all models is comparably challenging to segment from human-written text.
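
For clarity, the metric can be written as follows, where $N$ is the number of test texts and $c_i$ and $\hat{c}_i$ are the true and predicted character indices of the switch point in the $i$-th text (these symbols are introduced here for illustration only):
\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{c}_i - c_i \right|.
\]
Under this definition, a MAE of 52.54 means that the predicted switch point lies, on average, about 53 characters away from the true one.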


\begin{acknowledgments}
  Giovanni Puccetti is fully funded by the Italian Ministry of University and Research under the PNRR project ITSERR (CUP B53C22001770006).
  Andrea Pedrotti is fully funded by the European Union - NextGenerationEU through PNRR (CUP B53C22001760006) ``SoBigData.it: Strengthening the Italian RI for Social Mining and Big Data Analytics'' (SoBigData.it).
\end{acknowledgments}


%% The declaration on generative AI comes in effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}
During the preparation of this work, the authors used GPT-4 in order to: Grammar and spelling check; Improve writing style. 

%%
%% Define the bibliography file to be used
\bibliography{custom, anthology_0, anthology_1}

%%
%% If your work has an appendix, this is the place to put it.
\appendix

\section{Generative Models Details}

\Cref{tab:training-models-subtaskA} reports the four models used to create the machine-generated part of the training set of Sub-task A. \Cref{tab:evaluation-models-subtaskA} reports the four models used to create the machine-generated part of the test set of Sub-task A.

\subsection{Sub-task A}
\label{sec:appendix_taskA}

\begin{table}[ht]
\centering
\begin{tabular}{ll}
\hline
\textbf{Model} & \textbf{Hugging Face URL} \\
\hline
\texttt{Qwen3-4B-Instruct} &
\href{https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507}{Qwen/Qwen3-4B-Instruct-2507} \\

\texttt{Gemma-3-4B-it} &
\href{https://huggingface.co/google/gemma-3-4b-it}{google/gemma-3-4b-it} \\

\texttt{Nemo-Instruct} &
\href{https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407}{mistralai/Mistral-Nemo-Instruct-2407} \\

\texttt{Gpt-oss-20B} &
\href{https://huggingface.co/openai/gpt-oss-20b}{openai/gpt-oss-20b} \\
\hline
\end{tabular}
\caption{Models used to generate the training set of sub-task A, with their Hugging Face URLs.}
\label{tab:training-models-subtaskA}
\end{table}


\begin{table}[ht]
\centering
\begin{tabular}{ll}
\hline
\textbf{Model} & \textbf{Hugging Face URL} \\
\hline
\texttt{ANITA-NEXT-24B} &
\href{https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA}{m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA} \\
\texttt{Llama-3.3-70B-Instruct} &
\href{https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct}{meta-llama/Llama-3.3-70B-Instruct} \\
\texttt{Minerva-7B} &
\href{https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0}{sapienzanlp/Minerva-7B-instruct-v1.0} \\
\texttt{Gemma-3-27B-it} &
\href{https://huggingface.co/google/gemma-3-27b-it}{google/gemma-3-27b-it} \\
\hline
\end{tabular}
\caption{Models used to generate the test set of sub-task A, with their Hugging Face URLs.}
\label{tab:evaluation-models-subtaskA}
\end{table}

\subsection{Sub-task B}
\label{sec:appendix_taskB}

\Cref{tab:training-models-subtaskB} reports the nine models used to generate the continuation of human-written news in the train and test sets of Sub-task B.

\begin{table}[ht]
\centering
\begin{tabular}{ll}
\hline
\textbf{Model} & \textbf{Hugging Face URL} \\
\hline
    \texttt{SmolLM3-3B} &
    \href{https://huggingface.co/HuggingFaceTB/SmolLM3-3B}{HuggingFaceTB/SmolLM3-3B} \\
    
    \texttt{Qwen3-4B-Instruct} &
    \href{https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507}{Qwen/Qwen3-4B-Instruct-2507} \\
    
    \texttt{Gemma-3-27B-it} &
    \href{https://huggingface.co/google/gemma-3-27b-it}{google/gemma-3-27b-it} \\
    
    \texttt{ANITA-NEXT-24B} &
    \href{https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA}{m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA} \\
    
    \texttt{Llama-3.1-8B-Instruct} &
    \href{https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct}{meta-llama/Llama-3.1-8B-Instruct} \\
    
    \texttt{Llama-3.3-70B-Instruct} &
    \href{https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct}{meta-llama/Llama-3.3-70B-Instruct} \\
    
    \texttt{Nemo-Instruct} &
    \href{https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407}{mistralai/Mistral-Nemo-Instruct-2407} \\
    
    \texttt{Gpt-oss-20B} &
    \href{https://huggingface.co/openai/gpt-oss-20b}{openai/gpt-oss-20b} \\
    
    \texttt{Minerva-7B} &
    \href{https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0}{sapienzanlp/Minerva-7B-instruct-v1.0} \\

\hline
\end{tabular}
\caption{Generator models used in sub-task B, with their Hugging Face URLs.}
\label{tab:training-models-subtaskB}
\end{table}



\end{document}
%%
%% End of file
