\documentclass[runningheads]{llncs}
%
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{amsfonts}
\usepackage{nicefrac}
\usepackage{microtype}
\usepackage{threeparttable}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage[hyphens]{url}
\urlstyle{rm}

\graphicspath{{./}{./figures/}}

\begin{document}

\title{Multilingual Evaluation of Human vs. AI Text Classification with Zero-Shot Analysis of Contemporary LLM Architectures}

\titlerunning{Multilingual Evaluation of Human vs.\ AI Text Classification}

\author{
    Pranamya Nilesh Deshpande\inst{1}\thanks{Corresponding author} \and
    Raj Abhijit Dandekar\inst{2} \and
    Rajat Dandekar\inst{2} \and
    Sreedath Panat\inst{2}
}

\authorrunning{P. N. Deshpande et al.}

\institute{
    GES's R. H. Sapat College of Engineering, Management Studies and Research,
    Nashik, MH, India \\
    \email{pranamyadeshpande14@gmail.com}
    \and
    Vizuara AI Labs, Pune, India \\
    \email{\{raj, rajatdandekar, sreedath\}@vizuara.com}
}

\maketitle

\begin{abstract}
Human-AI text recognition has emerged as an essential problem in maintaining
the authenticity of digital content worldwide. Despite advancements, current
detection tools largely cater to English texts only, causing a major lacuna in
covering multilingual scenarios. This paper introduces the first end-to-end
multilingual approach to human vs.\ AI text categorization for Hindi and Spanish
languages. We compare traditional machine learning classifiers and
state-of-the-art transformer models using three stages: baseline validation on
English data, multilingual evaluation on carefully filtered Hindi and Spanish
datasets, and zero-shot generalization from English outputs of various modern
large language models. In controlled experimental settings with curated datasets
(338 samples per class per language), models including XGBoost and T5 achieved
perfect classification performance (F1~=~1.00) on Hindi and Spanish test sets,
suggesting strong discriminative signals in morphologically rich languages. These
results represent diagnostic performance on formal text and require validation on
larger, more diverse datasets before deployment. Classical models held up well
across languages, with F1-scores up to 0.17 above their English baselines,
whereas RoBERTa lost 0.31 F1 on Hindi. Zero-shot experiments indicate inconsistent
detectability of current LLMs, with commercial models detected far more reliably
than smaller open-source models, some of which went undetected. This diagnostic study establishes
proof-of-concept for multilingual AI text detection and identifies critical gaps
including vulnerability to small optimized models, highlighting the need for
continual system updates and larger-scale validation before real-world deployment.

\keywords{AI text detection \and Multilingual NLP \and Zero-shot generalization
\and TF-IDF \and Transformer models \and Human vs.\ AI classification}
\end{abstract}

%
% ---- Introduction ----
%
\section{Introduction}

There has been unprecedented progress in Large Language Models such as
GPT-4~\cite{openai2023}, Claude~3~\cite{anthropic2024}, and Gemini~1.5
Pro~\cite{deepmind2024} that can produce text similar to human authors in terms
of coherence, context understanding, and fluency. While these abilities have
enabled major advances in education, the arts, and customer
service~\cite{brem2023}, they have also introduced serious risks,
including academic plagiarism, AI-driven disinformation campaigns, and
diminished trust in information on the web~\cite{zellers2019,gehrmann2019}.
Determining whether a text was written by a human or by an AI is therefore
emerging as a key research problem.

Previous detection methods relied on stylometric and statistical analysis along
with conventional machine learning classifiers such as Logistic Regression and
Random Forests~\cite{frenois2019,ippolito2020}. These models could learn
surface-level and syntactic patterns that distinguished early-generation neural
text from human writing. However, as large autoregressive LLMs emerged, such
methods no longer offered reliable accuracy, particularly when the generated text
was post-edited or adversarially paraphrased~\cite{bakhtin2021}. In contrast,
transformer-based detectors such as fine-tuned RoBERTa and T5 significantly
enhanced detection accuracy in monolingual English
settings~\cite{mitchell2023,kirchenbauer2023}.

Nevertheless, recent large-scale multilingual evaluation studies such as
MULTITuDE~\cite{macko2023} and MultiSocial~\cite{macko2024} have indicated
that English-trained models suffer drastic performance declines when evaluated
on typologically divergent or morphologically rich
languages~\cite{ruder2019,potthast2021}. Such limitations indicate that
cross-lingual robustness remains a key bottleneck in current detection
approaches.

Adversarial robustness is an equally pressing issue. Benchmarks such as
DetectRL~\cite{yang2024} and BUST~\cite{kreps2024} have shown that detection
performance can be drastically degraded by slight text manipulation, such as
paraphrasing, summarization, or superficial stylistic editing. In
real-life scenarios, such vulnerabilities can be exploited by malicious actors
attempting to evade detection in disinformation operations or plagiarism.

To address these issues, recent research has explored Explainable Artificial
Intelligence (XAI) to enhance interpretability and transparency in text
detection. Methods such as LIME~\cite{ribeiro2016} and
SHAP~\cite{lundberg2017} enable the extraction of discriminative lexical,
syntactic, or semantic information and expose model decision boundaries. In
multilingual NLP applications, XAI techniques have been shown to enhance trust
and debuggability and uncover model biases~\cite{arrieta2020,petrillo2024}.

The HULLMI approach~\cite{joshi2024} demonstrated that interpretable, simple
models such as XGBoost and LSTM over TF-IDF features could match or surpass
fine-tuned transformer detectors in binary human vs.\ LLM text classification
problems. However, HULLMI was limited to English and to a few classes of LLM
outputs, leaving open questions about multilingual adaptability and
generalizability to newer LLM architectures.

This paper addresses these shortcomings by evaluating AI text detection
in Hindi and Spanish. Our approach, which is inspired by the HULLMI
framework, makes three contributions:

\noindent\textbf{Multilingual Evaluation} -- We develop Hindi and Spanish
datasets consisting of equal numbers of human and AI-generated samples from
recent state-of-the-art LLMs, including GPT-4o, Gemini~1.5~Pro, and Claude~3
Opus.

\noindent\textbf{Zero-Shot Generalization} -- We test trained detectors on
outputs from state-of-the-art LLMs such as Gemini~2.0, Gemma~2B, and Phi-3
Mini to evaluate robustness to model advancement.

\noindent\textbf{Comparative Analysis} -- We compare traditional machine
learning classifiers with transformer architectures across languages and
combine this comparison with interpretability analysis, with the aim of
contributing to robust, transparent, and generalizable AI text detection
systems that can cope with the rapid pace of generative AI development.

\noindent\textbf{Scope and Limitations.} Our study employs controlled
experimental conditions with formal, topic-balanced text (338 samples per class
per language). While several models achieve perfect classification scores
(F1~=~1.00) on Hindi and Spanish, these results should be interpreted as
diagnostic findings demonstrating strong language-specific detection signals
rather than claims of deployment-ready performance. Real-world applications
would require validation on substantially larger datasets including informal
text, code-switched content, and adversarially modified samples. Our goal is to
establish methodological foundations and identify cross-lingual patterns rather
than create a production-ready benchmark.

%
% ---- Methodology ----
%
\section{Methodology}
\label{sec:methodology}

This section explains our end-to-end methodology for multilingual AI
vs.\ human text classification. Our method introduces three novel contributions:
(1)~\textbf{multilingual evaluation} beyond English to Hindi and Spanish,
(2)~\textbf{zero-shot generalization testing} on modern LLMs not encountered
during training, and
(3)~\textbf{comparative analysis} between traditional machine learning models
and modern transformer architectures across various linguistic contexts.

Our methodology is organized into three distinct phases: (i)~baseline
validation using established English datasets, (ii)~construction and evaluation
of multilingual benchmark datasets, and (iii)~zero-shot performance assessment
on outputs from state-of-the-art LLMs. All experiments follow a unified
preprocessing and modeling pipeline to ensure comparability across evaluations.

Figure~\ref{fig:methodology} illustrates the comprehensive three-phase
methodology employed in our study.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{HuLLMi_Schematic.png}
    \caption{Overview of the three-phase methodology: Phase~1 validates
    baseline performance, Phase~2 constructs multilingual benchmark datasets,
    and Phase~3 evaluates on advanced LLM outputs.}
    \label{fig:methodology}
\end{figure}

\subsection{Phase 1: Baseline Validation}

To establish a foundation for our multilingual extensions, we first validate
our approach using the OpenGPTText-Final dataset, which contains balanced
samples of human and LLM text. Human text samples are sourced from
OpenWebText, while LLM samples are paraphrased versions of the same paragraphs
generated by GPT-3.5-turbo.

Our preprocessing pipeline includes newline character removal, duplicate
elimination, and tokenization. For traditional ML models, we transform text
into normalized vector representations using CountVectorizer followed by
TF-IDF transformation. The data is split into 80-20 train-test partitions. We
evaluate various model architectures including Naive Bayes, Logistic Regression,
Random Forests, XGBoost, Multi-Layer Perceptron (MLP), and Long Short-Term
Memory (LSTM) networks. Performance is assessed using six standard metrics:
Accuracy, F1-Score, False Positive Rate (FPR), False Negative Rate (FNR), True
Negative Rate (TNR), and True Positive Rate (TPR). This phase serves to
validate our modeling approach and establish baseline performance metrics for
subsequent multilingual comparisons.
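
\noindent For concreteness, the TF-IDF weighting can be written out as follows.
This is a sketch that assumes the default settings of scikit-learn's
\texttt{TfidfTransformer} (smoothed inverse document frequency followed by
$\ell_2$ normalization); the exact configuration is an assumption and is not
fixed by the description above. With $n$ training documents,
$\mathrm{tf}(t,d)$ the count of term $t$ in document $d$, and $\mathrm{df}(t)$
the number of training documents containing $t$,
\[
  \mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1, \qquad
  w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
  \hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}.
\]
Under this weighting, a term that occurs in every document has
$\mathrm{idf}(t) = 1$ and keeps its raw count, rarer terms are up-weighted, and
every document vector $\hat{w}_{\cdot,d}$ has unit length.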

\subsection{Phase 2: Multilingual Benchmark Construction and Evaluation}

The core novelty of our research lies in extending AI text detection to Hindi
and Spanish languages. For human-generated content, we collect data from
publicly accessible repositories such as AI4Bharat for Hindi and
Wikipedia/government corpora for Spanish. Each dataset is manually validated
for linguistic correctness and topic variability, yielding 338 human-written
instances across 13 topics per language.

To generate corresponding AI samples, we employ three state-of-the-art LLMs:
GPT-4o (OpenAI), Gemini~1.5~Pro (Google), and Claude~3 Opus (Anthropic).
Together, the models generate 26 articles for each of the 13 topics, resulting
in $13 \times 26 = 338$ AI samples per
language. We ensure consistency across models using a standardized generation
prompt: ``You are a professional \textless LANGUAGE\textgreater{} journalist.
Write a 500--700 word article on \textless TOPIC\textgreater{}. Use an
encyclopedic neutral tone. Use one in-context example from the human corpus as
a style guide.'' This methodology ensures domain and stylistic consistency with
human-written text while capturing model-specific generation patterns.

\noindent\textbf{Dataset Scale and Scope.} Our study employs 338 samples per
class per language (676 in total), yielding test sets of approximately 135
samples (about 68 per class)
after the 80--20 split. This scale is appropriate for a diagnostic multilingual
study establishing proof-of-concept and identifying language-specific detection
patterns, but is substantially smaller than large-scale benchmarks like
MULTITuDE (n=74,756) and MultiSocial (n=50,000+). Our goal is to establish
methodological foundations for cross-lingual detection and reveal
language-specific phenomena rather than create a definitive production
benchmark. The controlled scale enables careful manual validation of data
quality and consistency but limits claims about robustness to diverse
real-world text.

Language-specific normalization and tokenization are applied to each dataset.
Traditional models use CountVectorizer and TF-IDF transformation, while LSTM,
T5, and RoBERTa models use their respective tokenizers. Data splits maintain
80:20 ratios with balanced class distributions. This multilingual extension
enables systematic evaluation of cross-lingual generalization capabilities and
identification of language-specific detection challenges.

\subsection{Phase 3: Zero-Shot Evaluation on Contemporary LLMs}

Our third major contribution involves evaluating model robustness against LLM
outputs unseen during training. We construct a custom test set
(custom\_test\_final.csv) containing 25 human-written and 25 AI-written samples
across diverse domains including literature, user reviews, recipes, forum posts,
and social media.

To minimize lexical overlap with training data, AI samples undergo a two-step
generation process using ChatGPT-4o: initial compression of source samples into
three-line abstracts, followed by expansion of abstracts into complete articles.
This approach reduces direct textual similarities while preserving semantic
content.

The test set includes outputs from six contemporary models spanning commercial
and open-source frameworks: Gemini~2.0, GPT-2 (Filtered), LLaMA~3.2~1B,
Qwen1.5~8B, Phi-3~Mini, and Gemma~2B. These models were deliberately excluded
from training data to simulate real-world deployment scenarios where content
originates from unknown or evolving model architectures. This zero-shot
evaluation assesses generalization capabilities and reveals potential
vulnerabilities in current detection approaches.

\subsection{Model Architecture and Training}

Our modeling approach encompasses both traditional machine learning and modern
deep learning architectures. Classical models---Naive Bayes, Logistic
Regression, Random Forests, XGBoost, and MLP---operate on TF-IDF vector
representations of text. These models provide high interpretability and
computational efficiency while maintaining competitive performance.

Deep learning models include LSTM networks, RoBERTa-Sentinel, and T5-Sentinel
models. RoBERTa-Sentinel employs a pre-trained RoBERTa encoder with a
classification head, while T5-Sentinel reformulates classification as a
sequence-to-sequence text generation task. All models use consistent
hyperparameter settings and loss functions to ensure fair comparisons.

Training is conducted independently for each language and experimental phase to
prevent data leakage. This independence enables isolation of language-specific
effects and model-specific biases across different linguistic contexts.

\subsection{Evaluation Framework}

Binary classification performance is assessed using standard metrics: Accuracy,
F1-score, True Positive Rate (TPR), True Negative Rate (TNR), False Positive
Rate (FPR), and False Negative Rate (FNR). These metrics are computed across
all models for English, Hindi, and Spanish datasets, as well as for
contemporary LLM outputs. ROC and DET curves provide additional visualization
of model discriminative ability and error trade-offs.
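
\noindent To make the metric definitions explicit, let $TP$, $FP$, $TN$, and
$FN$ denote the confusion-matrix counts. The formulas below assume that
AI-generated text is the positive class, a convention that matches the
reported tables but is an assumption rather than something stated above:
\[
  \mathrm{TPR} = \frac{TP}{TP+FN}, \qquad
  \mathrm{FNR} = \frac{FN}{TP+FN}, \qquad
  \mathrm{TNR} = \frac{TN}{TN+FP}, \qquad
  \mathrm{FPR} = \frac{FP}{TN+FP},
\]
\[
  \mathrm{Acc} = \frac{TP+TN}{TP+FP+TN+FN}, \qquad
  \mathrm{F1} = \frac{2\,TP}{2\,TP+FP+FN}.
\]
Because $\mathrm{TPR}+\mathrm{FNR}=1$ and $\mathrm{TNR}+\mathrm{FPR}=1$, only
two of the four rates are independent, and each reported pair should sum to one
up to rounding.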

In multilingual settings, we carefully examine class-wise and overall
performance metrics while controlling for potential linguistic confounds that
might influence detection accuracy. This comprehensive evaluation framework
enables identification of strengths and weaknesses across different model types
and linguistic environments.

\subsection{Interpretability Analysis}

To ensure transparency and identify potential biases, we apply Local
Interpretable Model-agnostic Explanations (LIME) across all models and
experimental phases. For traditional models trained on TF-IDF vectors, LIME
reveals the most influential tokens driving classification decisions. For deep
learning models (LSTM, RoBERTa, T5), we adapt LIME's perturbation mechanism to
accommodate model-specific tokenization and softmax outputs.

LIME analysis generates the top~10 features influencing each model's
predictions, providing insights into model behavior across English, Hindi, and
Spanish texts, as well as contemporary LLM outputs. This comprehensive
interpretability analysis helps verify whether models rely on valid linguistic
patterns rather than spurious correlations, domain-specific artifacts, or
language-specific biases. Such analysis is crucial for identifying overfitting,
feature leakage, and ensuring robust generalization across diverse linguistic
and generative scenarios.
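
\noindent As a brief reminder of what LIME computes, the formulation in the
original LIME paper~\cite{ribeiro2016} selects, for an input $x$ and a
classifier $f$, an interpretable surrogate $g$ from a class $G$ (typically
sparse linear models over token-presence indicators):
\[
  \xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g),
\]
where $\mathcal{L}$ measures how poorly $g$ approximates $f$ on perturbed
samples weighted by the proximity kernel $\pi_x$, and $\Omega(g)$ penalizes
surrogate complexity. Under this reading, the top~10 features would be the ten
largest-magnitude coefficients of $g$; that correspondence is an assumption
about the setup rather than a detail given above.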

%
% ---- Results ----
%
\section{Results}
\label{sec:results}

\subsection{Introduction}

In this section, we present the results for the three stages of the study:
(i)~baseline validation on the OpenGPTText corpus to establish methodological
soundness, (ii)~multilingual classification performance on the Hindi and
Spanish corpora with balanced human and machine-generated samples, and
(iii)~zero-shot generalization to text from modern LLMs not seen during
training. All models were evaluated with six common binary classification
metrics: Accuracy, F1~score, False Positive Rate (FPR), False Negative Rate
(FNR), True Positive Rate (TPR), and True Negative Rate (TNR). We also provide
ROC curves and Area Under the Curve (AUC) values to visualize the
discriminative ability of the models. Together, these results provide a basis
for contrasting conventional machine learning methods with current transformer
frameworks across linguistic contexts and generative model configurations.

\subsection{Phase 1: Baseline Validation Results}

Table~\ref{tab:baseline} reports the performance of all models on the
OpenGPTText-Final dataset, which serves as our baseline validation. The results
show that conventional ML models, given appropriate vectorization and
preprocessing, remain competitive in human vs.\ AI text classification.
XGBoost, Logistic Regression, and MLP achieved F1-scores between 0.88 and 0.91.
The LSTM model performed slightly better, with an F1-score of 0.92 and a TPR of
0.93, reflecting a strong ability to identify AI-generated text.

Naive Bayes performed substantially worse, with an FNR of 0.52: it missed
roughly half of the AI-generated texts while keeping a low error rate on
human-written data (FPR:~0.08). T5 had the best overall performance, with an
F1-score of 0.97 and low error rates (FPR:~0.05, FNR:~0.04), followed by
RoBERTa with an F1-score of 0.94. These baseline results confirm the
reproducibility of established detection approaches and provide a reference
point for the multilingual extensions.

\begin{table}[t]
    \caption{Baseline Validation Results on OpenGPTText Dataset.}
    \label{tab:baseline}
    \centering
    \small
    \begin{tabular}{lcccccc}
        \toprule
        Model & Acc. & F1 & FPR & FNR & TNR & TPR \\
        \midrule
        Naive Bayes    & 0.70 & 0.62 & 0.08 & 0.52 & 0.92 & 0.48  \\
        Logistic Reg.  & 0.90 & 0.90 & 0.12 & 0.08 & 0.88 & 0.92  \\
        Random Forests & 0.85 & 0.83 & 0.22 & 0.08 & 0.78 & 0.92  \\
        XGBoost        & 0.91 & 0.91 & 0.10 & 0.08 & 0.90 & 0.93  \\
        MLP            & 0.88 & 0.88 & 0.12 & 0.11 & 0.88 & 0.89  \\
        LSTM           & 0.93 & 0.92 & 0.08 & 0.06 & 0.92 & 0.93  \\
        RoBERTa        & 0.94 & 0.94 & 0.09 & 0.02 & 0.91 & 0.98  \\
        T5             & 0.97 & 0.97 & 0.05 & 0.04 & 0.94 & 0.995 \\
        \bottomrule
    \end{tabular}
\end{table}

\subsection{Phase 2: Multilingual Evaluation Results}

\subsubsection{Hindi Dataset Performance}

\noindent\textbf{Note on Perfect Classification Scores.} Several models achieve
F1~=~1.00 on Hindi and Spanish datasets. These results emerge from our
controlled diagnostic study with formal, well-structured text (338 samples per
class, 80--20 split yielding $\approx$135 test samples in total, or about 68
per class). Perfect scores
reflect strong discriminative signals in morphologically rich languages under
our experimental conditions but do not imply general real-world robustness.
Section~\ref{sec:discussion} provides detailed interpretation of these results
and their limitations.

We evaluated how well the classification models generalize to Hindi and
Spanish using datasets constructed according to our methodology.
Tables~\ref{tab:hindi_results} and~\ref{tab:spanish_results} show the results
for Hindi and Spanish, respectively. On Hindi,
Table~\ref{tab:hindi_results} shows performance trends that differ markedly
from the English baseline. Tree-based models (Random Forests, XGBoost) and
neural models (LSTM, T5) achieved perfect or near-perfect accuracy with very
low error rates, suggesting that, under our experimental conditions, Hindi
linguistic characteristics may facilitate rather than impede AI text detection.

Most striking is the performance of T5 and XGBoost, both with F1-scores of
1.00. T5 made no errors (FPR and FNR of 0.00), while XGBoost had an FNR of 0.00
and an FPR of 0.01. Both improve on their English performance, which suggests
possible language-specific advantages in this detection task. RoBERTa, by
contrast, performed much worse, with an F1-score of 0.63, a decline of 0.31
from its English baseline of 0.94.

\begin{table}[t]
    \centering
    \begin{threeparttable}
        \caption{Evaluation Results on Hindi Dataset (N=338 per class;
        80--20 split; test set $\approx$135 samples in total, about 68 per class).}
        \label{tab:hindi_results}
        \small
        \begin{tabular}{lcccccc}
            \toprule
            Model & Acc. & F1 & FPR & FNR & TNR & TPR \\
            \midrule
            Naive Bayes    & 0.92 & 0.92 & 0.16 & 0.00  & 0.84 & 1.00  \\
            Logistic Reg.  & 0.99 & 0.99 & 0.02 & 0.01  & 0.98 & 0.99  \\
            Random Forests & 0.99 & 1.00 & 0.02 & 0.00  & 0.98 & 1.00  \\
            XGBoost        & 1.00 & 1.00 & 0.01 & 0.00  & 0.99 & 1.00  \\
            MLP            & 0.98 & 0.98 & 0.02 & 0.02  & 0.98 & 0.98  \\
            LSTM           & 0.99 & 0.99 & 0.00 & 0.015 & 1.00 & 0.984 \\
            RoBERTa$^\dagger$ & 0.68 & 0.63 & 0.00 & 0.69  & 1.00 & 0.31  \\
            T5             & 1.00 & 1.00 & 0.00 & 0.00  & 1.00 & 1.00  \\
            \bottomrule
        \end{tabular}
        \begin{tablenotes}
            \small
            \item $^\dagger$RoBERTa achieves perfect rank-ordering (AUC~=~1.00,
            see Fig.~\ref{fig:hindi_roc}), but a sub-optimal decision threshold
            yields F1~=~0.63. This indicates that the model can separate the
            classes perfectly when the threshold is tuned, but that the default
            threshold is poorly calibrated for Hindi.
        \end{tablenotes}
    \end{threeparttable}
\end{table}
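
\noindent The gap between AUC and F1 noted for RoBERTa can be made precise.
Writing $s(x)$ for the detector score of a text $x$ and $\tau$ for the decision
threshold (both symbols are introduced here for illustration only), the
prediction rule and the AUC are
\[
  \hat{y}(x) = \mathbf{1}[\,s(x) \geq \tau\,], \qquad
  \mathrm{AUC} = \Pr\bigl(s(x^{+}) > s(x^{-})\bigr),
\]
where $x^{+}$ and $x^{-}$ are randomly drawn AI-generated and human-written
texts. The AUC does not depend on $\tau$. If $\mathrm{AUC}=1$ on the test set,
every AI score exceeds every human score, so any threshold $\tau^{\ast}$ with
\[
  \max_{x^{-}} s(x^{-}) < \tau^{\ast} \leq \min_{x^{+}} s(x^{+})
\]
gives $\mathrm{TPR}=\mathrm{TNR}=1$ on that set. A threshold set too high
instead yields $\mathrm{FPR}=0$ together with a large FNR, which is the pattern
in the RoBERTa row of Table~\ref{tab:hindi_results}. In practice such a
$\tau^{\ast}$ would need to be chosen on held-out validation data rather than
on the test set to avoid an optimistic estimate.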

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{H_ac.png}
    \caption{ROC and DET curves for Hindi dataset evaluation demonstrating
    superior performance for most models.}
    \label{fig:hindi_curves}
\end{figure}

\begin{figure}[t]
    \centering
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{rs_Hindi.png}
        \caption{RoBERTa on Hindi (AUC~=~1.00).}
        \label{fig:hindi_roc}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{t5_Hindi.png}
        \caption{T5 on Hindi (AUC~=~1.00).}
    \end{subfigure}
    \caption{Individual ROC curves for Hindi dataset models.}
\end{figure}

\subsubsection{Spanish Dataset Performance}

Results on the Spanish dataset, shown in Table~\ref{tab:spanish_results}, both
mirror and depart from the Hindi outcomes. Most models maintained very high
performance, with XGBoost, Random Forests, LSTM, and T5 reaching perfect
accuracy and F1-scores of 1.00. This consistency across both non-English
languages indicates that most architectures retain their detection ability
across languages.

RoBERTa's Spanish performance (F1-score: 0.962) was far better than its Hindi
performance (0.63) and slightly above its English baseline (0.94). The
difference between languages underscores the interaction between model
architecture, pre-training data, and target-language features. The strong
performance of classical ML models on both languages supports the efficacy of
TF-IDF-based feature extraction for cross-lingual AI text detection.

\begin{table}[t]
    \caption{Evaluation Results on Spanish Dataset (N=338 per class;
    80--20 split; test set $\approx$135 samples in total, about 68 per class).}
    \label{tab:spanish_results}
    \centering
    \small
    \begin{tabular}{lcccccc}
        \toprule
        Model & Acc. & F1 & FPR & FNR & TNR & TPR \\
        \midrule
        Naive Bayes    & 0.99 & 0.99  & 0.02 & 0.00 & 0.98 & 1.00 \\
        Logistic Reg.  & 0.98 & 0.99  & 0.02 & 0.00 & 0.98 & 1.00 \\
        Random Forests & 1.00 & 1.00  & 0.00 & 0.00 & 1.00 & 1.00 \\
        XGBoost        & 1.00 & 1.00  & 0.00 & 0.00 & 1.00 & 1.00 \\
        MLP            & 0.98 & 0.98  & 0.02 & 0.02 & 0.98 & 0.98 \\
        LSTM           & 1.00 & 1.00  & 0.00 & 0.00 & 1.00 & 1.00 \\
        RoBERTa        & 0.96 & 0.962 & 0.07 & 0.00 & 0.93 & 1.00 \\
        T5             & 1.00 & 1.00  & 0.00 & 0.00 & 1.00 & 1.00 \\
        \bottomrule
    \end{tabular}
\end{table}

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{S_ac.png}
    \caption{ROC and DET curves for Spanish dataset evaluation.}
\end{figure}

\begin{figure}[t]
    \centering
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{rs_Spanish.png}
        \caption{RoBERTa on Spanish (AUC~=~0.96).}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{t5_Spanish.png}
        \caption{T5 on Spanish (AUC~=~1.00).}
    \end{subfigure}
    \caption{Individual ROC curves for Spanish dataset models.}
\end{figure}

\subsubsection{Cross-Language Performance Analysis}

Table~\ref{tab:cross_language} compares F1-scores across all three languages.
XGBoost was consistent, reaching higher F1-scores on both Hindi (+0.09) and
Spanish (+0.09) than on the English baseline. T5 also showed modest
improvements on both non-English languages (+0.03 each).

The most salient result is RoBERTa's variability: it stayed close to its
English baseline on Spanish (a gain of 0.022) but dropped by 0.31 on Hindi.
This pattern suggests that the performance of transformer models on a new
language is sensitive to linguistic similarity with the pre-training data, and
that language-specific fine-tuning may be needed to achieve the best results.

\begin{table}[t]
    \caption{Cross-Language Performance Comparison (F1-scores across English,
    Hindi, Spanish).}
    \label{tab:cross_language}
    \centering
    \small
    \begin{tabular}{lccccc}
        \toprule
        Model & Eng. & Hin. & Spa. &
        \makecell{$\Delta$F1\\E$\rightarrow$H} &
        \makecell{$\Delta$F1\\E$\rightarrow$S} \\
        \midrule
        XGBoost    & 0.91 & 1.00 & 1.00  & $+0.09$ & $+0.09$  \\
        T5         & 0.97 & 1.00 & 1.00  & $+0.03$ & $+0.03$  \\
        RoBERTa    & 0.94 & 0.63 & 0.962 & $-0.31$ & $+0.022$ \\
        LSTM       & 0.92 & 0.99 & 1.00  & $+0.07$ & $+0.08$  \\
        Random F.  & 0.83 & 1.00 & 1.00  & $+0.17$ & $+0.17$  \\
        \bottomrule
    \end{tabular}
\end{table}
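
\noindent The last two columns of Table~\ref{tab:cross_language} are signed
differences relative to the English baseline, so a positive entry is a gain and
a negative entry is a drop. With $\ell$ denoting Hindi or Spanish (notation
introduced here for illustration),
\[
  \Delta_{\mathrm{E}\rightarrow\ell} = \mathrm{F1}_{\ell} - \mathrm{F1}_{\mathrm{Eng.}},
  \qquad \mbox{e.g.\ RoBERTa: } 0.63 - 0.94 = -0.31, \quad 0.962 - 0.94 = +0.022.
\]
Because training is conducted independently for each language, these
differences compare separately trained models on different datasets. They
describe how each architecture fares per language rather than the transfer of a
single trained model.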

\subsection{Phase 3: Zero-Shot Generalization Results}

\subsubsection{Performance on Contemporary LLM Outputs}

Zero-shot detection on outputs of contemporary LLMs showed large variation in
detection success, as shown in Table~\ref{tab:zero_shot}. The findings reveal
substantial weaknesses in current detection systems when faced with changing
generative architectures. Gemini~2.0 outputs were perfectly detected, with an
F1-score of 1.00 and zero error rates, indicating that large-scale commercial
models can retain detectable generative signatures.

In contrast, smaller open-source models proved far harder to detect: both
Gemma~2B and Phi-3~Mini evaded detection completely, with F1-scores of 0.00,
which means our trained classifiers labeled every AI sample from these sources
as human-written. This is a complete detection failure, with an FNR of 1.00 for
both models. Mid-size models yielded intermediate F1-scores, with
LLaMA~3.2~1B attaining 0.44 and Qwen1.5~8B attaining 0.67 at near-chance
accuracy (0.54 and 0.50), suggesting partial but unreliable detection ability.

\begin{table}[t]
    \caption{Zero-Shot Results on Modern LLM Outputs (25 human + 25 AI
    samples per model).}
    \label{tab:zero_shot}
    \centering
    \small
    \begin{tabular}{lcccccc}
        \toprule
        LLM Source    & Acc. & F1   & FPR  & FNR  & TNR  & TPR  \\
        \midrule
        Gemini~2.0    & 1.00 & 1.00 & 0.00 & 0.00 & 1.00 & 1.00 \\
        Gemma~2B      & 0.50 & 0.00 & 0.00 & 1.00 & 1.00 & 0.00 \\
        GPT-2 (Filt.) & 0.58 & 0.59 & 0.41 & 0.42 & 0.59 & 0.58 \\
        LLaMA~3.2~1B  & 0.54 & 0.44 & 0.28 & 0.64 & 0.72 & 0.36 \\
        Qwen1.5~8B    & 0.50 & 0.67 & 1.00 & 0.00 & 0.00 & 1.00 \\
        Phi-3~Mini    & 0.50 & 0.00 & 0.00 & 1.00 & 1.00 & 0.00 \\
        \bottomrule
    \end{tabular}
\end{table}
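
\noindent Two rows of Table~\ref{tab:zero_shot} can be checked by hand. The
worked example below assumes that AI-generated text is the positive class and
that each evaluation uses exactly 25 human and 25 AI samples. For Qwen1.5~8B,
$\mathrm{TPR}=\mathrm{FPR}=1$ means every sample was labeled as AI, so
$TP=25$, $FP=25$, and $FN=TN=0$:
\[
  \mathrm{F1} = \frac{2\,TP}{2\,TP+FP+FN} = \frac{50}{75} \approx 0.67, \qquad
  \mathrm{Acc} = \frac{TP+TN}{50} = \frac{25}{50} = 0.50.
\]
For LLaMA~3.2~1B, $\mathrm{TPR}=0.36$ and $\mathrm{FPR}=0.28$ correspond to
$TP=9$, $FN=16$, $FP=7$, and $TN=18$:
\[
  \mathrm{F1} = \frac{18}{18+7+16} = \frac{18}{41} \approx 0.44, \qquad
  \mathrm{Acc} = \frac{9+18}{50} = 0.54.
\]
An F1-score of 0.67 is exactly what a classifier that always predicts ``AI''
obtains on a balanced set, so this column is best read together with the
accuracy and FPR columns.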

\subsubsection{Detection Success Rate by Model Architecture}

Table~\ref{tab:detection_success} groups the zero-shot results by model
category and reveals concerning trends in detection reliability. The commercial
group (Gemini~2.0, GPT-2) had an average F1-score of 0.795, reported as a
79.5\% detection rate, although this average combines perfect detection of
Gemini~2.0 (1.00) with much weaker detection of GPT-2 (0.59). Small open-source
models evaded detection entirely, with a 0\% success rate and an average
F1-score of 0.00.

Mid-size models fell in between, with a 55.5\% mean detection rate, suggesting
that model size and optimization strategy affect detectability. This stratified
pattern has important implications for real-world deployment, where adversaries
may deliberately select ``undetectable'' models to circumvent security
mechanisms.

\begin{table}[t]
    \caption{Detection Success Rate by Model Architecture (averaged across
    zero-shot test set).}
    \label{tab:detection_success}
    \centering
    \small
    \begin{tabular}{lccc}
        \toprule
        Model Type  & \makecell{Comm.\\Models} &
                      \makecell{Small\\OS} &
                      \makecell{Mid\\Size} \\
        \midrule
        Avg F1      & 0.795  & 0.00 & 0.555  \\
        Detect Rate & 79.5\% & 0\%  & 55.5\% \\
        \bottomrule
    \end{tabular}
\end{table}
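
\noindent The group averages in Table~\ref{tab:detection_success} can be
reproduced from the per-model F1-scores in Table~\ref{tab:zero_shot}, assuming
an unweighted mean within each group:
\[
  \frac{1.00+0.59}{2} = 0.795, \qquad
  \frac{0.00+0.00}{2} = 0.00, \qquad
  \frac{0.44+0.67}{2} = 0.555,
\]
for the commercial, small open-source, and mid-size groups, respectively. The
``Detect Rate'' row matches these values expressed as percentages, so it
appears to be an average F1-score rather than the fraction of AI samples that
were flagged.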

\subsubsection{Error Pattern Analysis}

The zero-shot testing identified characteristic error trends among various LLM
types. Compact open-source models (Gemma~2B, Phi-3~Mini) caused systematic
false negative errors, with classifiers routinely classifying AI output as
human-written. This trend indicates such models create text that possesses
human-like statistical characteristics that mislead conventional TF-IDF-based
detection techniques.

In contrast, Qwen1.5~8B produced a false positive rate of 1.00: when evaluated
against this model's output, the classifiers labeled every human text as
AI-generated. This two-way error pattern illustrates how closely detector
weaknesses are tied to the generative model under test, and it highlights the
importance of adaptive training that continually incorporates outputs from new
LLM architectures.
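
\noindent For reference, the error rates quoted in this subsection can be
related to the F1-score as follows. This is a sketch under the assumption (not
stated explicitly above) that AI-generated text is the positive class; the
symbols $TP$, $FP$, $TN$, and $FN$ denote the usual confusion-matrix counts and
are introduced here only for illustration.
\[
    \mathrm{FNR} = \frac{FN}{TP + FN}, \qquad
    \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
    F_1 = \frac{2\,TP}{2\,TP + FP + FN}.
\]
Under this convention, $\mathrm{FNR} = 1$ implies $TP = 0$ and therefore
$F_1 = 0$ regardless of how the human samples are classified, which is why
systematic false negatives alone are enough to drive the F1-score to zero.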

\subsubsection{Analysis of Detection Failures and Unexpected Results}

The zero-shot evaluation reveals several counterintuitive patterns that
illuminate fundamental limitations in current detection approaches.

\noindent\textbf{Complete Detection Failure (Gemma~2B, Phi-3~Mini).} Both
models achieve F1~=~0.00 with FNR~=~1.00, indicating that classifiers trained
on commercial LLM outputs (GPT-4o, Gemini, Claude) completely fail to detect
text from these smaller open-source models. Several mechanisms may explain this
failure:

\textit{Low-Entropy Generation:} Smaller models ($\leq$3B parameters) with
constrained parameter spaces may produce more deterministic, lower-entropy text
that closely mimics simple human writing patterns. Our training data, derived
from large commercial models with higher temperature settings and greater
diversity, captures different statistical signatures. The compact models' output
may fall within the distributional range of human text in TF-IDF feature space.

\textit{Aggressive Optimization:} Models like Phi-3~Mini undergo extensive
optimization for human preference alignment, for example through RLHF
(Reinforcement Learning from Human Feedback) and DPO (Direct Preference
Optimization). Such optimization rewards text that humans prefer, which may
push surface-level features closer to those of high-quality human writing and
inadvertently improve evasion capability.

\textit{Training Data Mismatch:} Our classifiers learn to distinguish between
specific commercial LLM generation patterns and human text. Smaller models,
trained on different data distributions and objectives, produce orthogonal
generation patterns not represented in our training distribution. The feature
space learned from GPT-4o/Claude samples does not transfer to compact model
outputs.

\textit{Feature Space Misalignment:} TF-IDF features effective for detecting
commercial LLM patterns (specific phrase frequencies, lexical diversity markers,
sentence length distributions) fail to capture the distinctive characteristics
of smaller models. These models may use constrained vocabularies or different
tokenization strategies that create entirely different n-gram distributions.
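
\noindent To make the vocabulary argument concrete, one common TF-IDF weighting
is sketched below. The exact variant (smoothing, normalization) is not fixed
here, so the notation is illustrative rather than a description of a specific
implementation: $\mathrm{tf}(t,d)$ is the count of n-gram $t$ in document $d$,
$N$ is the number of training documents, $\mathrm{df}(t)$ is the number of
training documents containing $t$, and $V$ is the vocabulary fixed at training
time.
\[
    w(t,d) = \mathrm{tf}(t,d)\,\log\frac{N}{\mathrm{df}(t)}, \qquad t \in V.
\]
Any n-gram $t \notin V$ that appears only in the output of an unseen model
receives no coordinate in the feature vector, so the classifier cannot use it
as evidence.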

\noindent\textbf{Systematic False Positives (Qwen1.5~8B).} The FPR~=~1.00 for
Qwen1.5~8B indicates that all human samples were misclassified as AI-generated.
This suggests that Qwen's output distribution overlaps heavily with our human
training samples, possibly because (1)~Qwen may have been trained on corpora
similar to our human text sources (Wikipedia, government documents), (2)~its
generation strategy closely mimics formal encyclopedic style, or (3)~it lacks
the distinctive model-specific artifacts present in GPT-4o/Claude/Gemini
outputs.

\noindent\textbf{The Size-Detectability Paradox.} The finding that smaller
models are harder to detect (commercial models: 79.5\% detection; small models:
0\% detection) has critical implications. Adversaries seeking to evade
detection can deliberately select small, highly optimized models such as
Gemma~2B or Phi-3~Mini. Current detection systems, trained primarily on large
commercial model outputs, are blind to these alternatives. This finding calls
for either (a)~continual training on diverse LLM outputs, including those of
small models, or (b)~architecture-agnostic detection methods based on deeper
linguistic patterns rather than surface-level TF-IDF statistics.

\noindent\textbf{Perfect Scores for Some Models but Not Others.} The contrast
between perfect scores on Hindi/Spanish
(Tables~\ref{tab:hindi_results}--\ref{tab:spanish_results}) and catastrophic
failure on small LLMs (Table~\ref{tab:zero_shot}) illustrates the brittleness
of TF-IDF-based detection. When training and test distributions align (same
commercial models, same languages), performance is excellent. When distributions
diverge (different model architectures), detection fails completely. This
brittleness underscores the need for more robust, generalizable detection
approaches.

%
% ---- Discussion and Conclusion ----
%
\section{Discussion and Conclusion}
\label{sec:discussion}

\subsection{Discussion}

In this study, we implemented a three-stage pipeline for the multilingual
evaluation of human versus AI-generated text categorization. The approach
yielded findings on cross-lingual detection ability, zero-shot generalization,
and the relative performance of classical and modern architectures.

\subsubsection{Cross-Language Performance Analysis}

Multilingual evaluation results show striking trends that contradict common
assumptions about cross-lingual AI text detection. Counterintuitively, most
classifiers performed better on the Hindi and Spanish datasets than on the
English baseline. XGBoost and T5 achieved perfect F1-scores (1.00) on both
non-English languages, improvements of +0.09 and +0.03, respectively, over
their English results.

This strong performance may stem from linguistic properties specific to each
language. Hindi's morphological complexity and distinctive syntactic patterns
may produce greater stylistic contrasts between machine and human writing.
Likewise, Spanish's morphological redundancy and regular orthography may yield
strong statistical signals that classical TF-IDF-based methods readily exploit.

Yet, the dramatic performance decline of RoBERTa on Hindi text (F1~=~0.63
compared to 0.94 on English) reveals a weakness of transformer-based detection
methods. This $-0.31$ decline suggests that alignment between the pre-training
languages and the target language remains a significant bottleneck for
cross-lingual generalization, highlighting the need for multilingual
pre-trained models in trustworthy detection systems.

\subsubsection{Interpretation of Perfect Classification Scores}

The perfect classification performance (F1~=~1.00) observed for multiple models
on Hindi and Spanish datasets warrants careful interpretation, as these results
emerge from specific experimental conditions that may not generalize to all
real-world scenarios.

\noindent\textbf{Controlled Experimental Setting.} Our datasets comprise formal,
well-structured articles (500--700 words) generated using standardized prompts
with consistent temperature settings. This controlled generation process may
create detectable regularities not present in naturally diverse AI-generated
content or adversarially modified text. The formal encyclopedic style and
topic-balanced structure reduce the natural linguistic variation found in social
media, conversational text, or domain-specific jargon.

\noindent\textbf{Language-Specific Discriminative Features.} Hindi's Devanagari
script and agglutinative morphology, combined with Spanish's regular orthography
and morphological patterns, appear to provide stronger statistical signatures
than English for TF-IDF-based detection. Our LIME analysis reveals that
character n-grams, morphological markers, and script-specific patterns serve as
highly discriminative features. Hindi's conjunct consonants and case markers
(ने, को, से), along with Spanish's consistent diacritic usage and gender
agreement patterns, create distributional differences that classical ML methods
readily capture.

\noindent\textbf{Dataset Scale Considerations.} With 338 samples per class and
80--20 splits, each test set contains approximately 135 samples in total (about
68 per class). While perfect scores at this scale are a strong proof of
concept, they require validation on substantially larger benchmarks (e.g.,
10,000+ samples) to establish robust generalization. Recent large-scale studies
such as MULTITuDE (n=74,756) and MultiSocial (n=50,000+) demonstrate that
performance can degrade with increased dataset diversity and scale.
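
\noindent As a rough guide to what a perfect score on a test set of this size
can support, the following bound may help. It is a sketch that assumes the $n$
test samples are independent draws from the deployment distribution (an
assumption that the controlled setting described above does not guarantee); the
symbol $p$ for the true error rate is introduced here for illustration. If zero
errors are observed, a one-sided 95\% upper confidence bound on $p$ satisfies
\[
    (1 - p)^{n} = 0.05
    \quad\Longrightarrow\quad
    p = 1 - 0.05^{1/n} \approx \frac{3}{n}.
\]
Taking $n \approx 135$ test samples gives $p \approx 0.022$, so an observed
F1-score of 1.00 remains compatible with a true error rate of roughly 2\%, even
before any distribution shift is considered.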

\noindent\textbf{Generation Protocol Consistency.} All AI samples were generated
using identical prompts and temperature parameters across three commercial
models (GPT-4o, Gemini~2.0 Flash, Claude~3 Opus). This consistency, while
ensuring experimental control, may create model-specific artifacts that
facilitate detection. Production systems would encounter diverse generation
strategies, temperatures, and prompting techniques that could reduce detection
accuracy.

\noindent\textbf{Implications for Deployment.} These results establish that
multilingual AI text detection is feasible and that morphologically rich
languages may offer detection advantages. However, deployment would require:
(1)~validation on 10--100$\times$ larger datasets, (2)~inclusion of informal
and code-switched text, (3)~adversarial robustness testing, (4)~continual
updates as LLM architectures evolve, and (5)~evaluation across diverse domains
and text types. Our perfect scores represent upper-bound performance in
controlled conditions rather than expected real-world accuracy.

\subsubsection{Language-Specific Detection Patterns and RoBERTa's Differential Performance}

The dramatic performance variation across languages, particularly RoBERTa's
decline on Hindi (F1~=~0.63) versus strong Spanish performance (F1~=~0.96),
reveals fundamental interactions between model architecture, pre-training data,
and target language characteristics.

\noindent\textbf{RoBERTa's Cross-Lingual Performance Breakdown:}
\begin{itemize}
    \item English (baseline): F1~=~0.94
    \item Spanish: F1~=~0.96 (+0.02 gain, essentially maintained)
    \item Hindi: F1~=~0.63 ($-0.31$ drop, catastrophic degradation)
\end{itemize}

This pattern reflects three critical factors:

\noindent\textbf{1.~Pre-training Language Distribution.} RoBERTa's pre-training
corpus is predominantly English with substantial European language
representation. Spanish, as a high-resource Romance language with extensive web
presence (4.9\% of web content), likely has better representation in RoBERTa's
training data than Hindi (0.1\% of web content). This imbalance creates stronger
learned representations for Spanish, enabling better transfer to detection tasks.

\noindent\textbf{2.~Script Divergence and Tokenization Mismatch.} Hindi uses
Devanagari script, fundamentally different from RoBERTa's Latin-script-focused
Byte-Pair Encoding (BPE) tokenization. This creates several problems:
\textit{Over-segmentation}---Hindi words fragment into excessive subword units.
For example, the word for ``Prime Minister'' in Hindi may segment into 8--10 BPE
tokens versus 2--3 for Spanish ``presidente''. \textit{Loss of morphological
boundaries}---Devanagari's complex ligatures (conjunct consonants) and diacritics
(matras) are improperly segmented, losing grammatically meaningful boundaries.
\textit{Embedding space sparsity}---Devanagari tokens occupy sparse,
poorly-learned regions of RoBERTa's embedding space, resulting in weaker
representations. Spanish's Latin script aligns well with RoBERTa's tokenization,
preserving morphological boundaries and utilizing well-learned embedding regions.
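
\noindent The over-segmentation effect can be expressed through tokenizer
fertility, the average number of subword tokens per word. The definition below
is a standard one; no measured fertility values are implied here, and the
numbers used are only the illustrative ranges from the example above. For a
text with $W$ words that the tokenizer splits into $T$ subword tokens,
\[
    \phi = \frac{T}{W}.
\]
With the illustrative figures above, a single Hindi word splitting into 8--10
tokens corresponds to $\phi \approx 8$--$10$, against $\phi \approx 2$--$3$ for
the Spanish example, so a fixed-length input window would cover several times
fewer Hindi words than Spanish words.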

\noindent\textbf{3.~Morphological Complexity and Learning Requirements.} Hindi's
agglutinative morphology (case marking, postpositions, complex verb
conjugations) requires more training examples for transformer models to learn
detection-relevant patterns. Spanish's relatively regular morphology (predictable
gender/number agreement, consistent conjugation paradigms) is more accessible
to transfer learning.

\noindent\textbf{Classical ML Robustness Across Scripts.} XGBoost, Random
Forests, and LSTM achieve F1~=~1.00 on both Hindi and Spanish, demonstrating
script-agnostic robustness. Character-level TF-IDF features capture
distributional patterns independent of script: character n-gram frequencies
(2--5 grams), lexical diversity metrics, punctuation and whitespace patterns,
and word length distributions. These features generalize across scripts because
they measure statistical regularities rather than semantic content.
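
\noindent As an illustration of why these features do not depend on script,
consider character n-gram extraction. The sketch assumes that the text is
treated as a sequence of $m$ Unicode code points (an assumption about
preprocessing; the symbol $m$ is introduced here for illustration). The number
of n-grams of length $n$ is $m - n + 1$, so for the 2--5 range
\[
    \sum_{n=2}^{5} (m - n + 1) = 4m - 10 \qquad (m \ge 5).
\]
The count depends only on the length of the sequence, not on whether the code
points are Latin or Devanagari, which is why the same extraction procedure
applies unchanged to all three languages.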

\noindent\textbf{LIME-Revealed Language-Specific Features.} Our LIME analysis
(top-10 influential features) reveals distinct detection patterns per language:

\textit{Hindi Detection Features:} Devanagari character n-grams (conjunct
consonants), halant usage patterns; morphological markers (case suffixes),
postpositions; punctuation (Devanagari danda vs.\ period usage frequency); AI
text shows more uniform morphological complexity; human text has higher variance
in case marker distribution.

\textit{Spanish Detection Features:} Orthographic regularity (accent mark
patterns \'{a}, \'{e}, \'{i}, \'{o}, \'{u}); AI shows hyper-consistent diacritic
usage; gender/number agreement (AI exhibits perfect agreement; human text
occasionally violates in colloquialisms); verbal mood markers (subjunctive usage
differs in frequency); lexical diversity (AI over-relies on high-frequency
synonyms; human text uses wider synonym variation).

\textit{English Detection Features:} Discourse markers (AI overuses transitions
like ``furthermore,'' ``moreover,'' ``however,'' ``nevertheless''); sentence
structure (AI prefers balanced clause lengths; human text shows more structural
variety); lexical sophistication (AI consistently uses higher-register
vocabulary; human text mixes registers).

\noindent\textbf{Why TF-IDF Succeeds Cross-Lingually.} These patterns explain
why TF-IDF-based classical ML methods outperform transformers in cross-lingual
settings. TF-IDF directly captures character and word-level distributional
differences that manifest language-specifically but remain detectable through
frequency analysis. Transformer models, relying on learned embeddings optimized
for semantic tasks, struggle when target languages diverge from pre-training
distributions. The surface-level statistical signatures that distinguish AI from
human text appear to be more linguistically universal than the deep semantic
representations transformers learn.

\subsubsection{Zero-Shot Generalization Challenges}

The zero-shot test uncovers weaknesses with far-reaching deployment
implications. Gemma~2B and Phi-3~Mini evaded detection entirely (F1~=~0.00), a
100\% failure rate. This gap (0\% detection for small models versus 79.5\% for
commercial ones) points to a fundamental flaw: lightweight, tuned architectures
can produce text whose statistical features TF-IDF-based detectors cannot
distinguish from human writing.

The lower detectability of smaller models ($\leq$3B parameters) suggests that
their output patterns may be inherently more human-like, possibly because
(1)~a smaller parameter space forces reliance on basic linguistic patterns,
(2)~aggressive optimization removes detectable artifacts found in larger
models, and (3)~training practices favor natural language generation over raw
capacity.

\subsubsection{Traditional vs.\ Transformer Model Performance}

The most unexpected result is the overall dominance of classical machine
learning methods over transformer-based detectors in multilingual conditions.
XGBoost performed strongly on all three languages, with perfect scores on Hindi
and Spanish, while the transformer models showed high variance and
language-dependent failures. This contradicts the assumption that more advanced
models automatically perform better on AI text detection tasks.

The success of TF-IDF-based feature extraction indicates that surface-level
statistical patterns could be more linguistically universal than deep semantic
representations induced by transformer models. Conventional methods seem to
encode strong detection signals that cut across linguistic boundaries and
provide computational efficiency benefits for real-world deployment.

\subsection{Conclusion}

This work shows that multilingual AI text detection is not just possible but
potentially holds surprising benefits over monolingual methods. Our exhaustive
analysis across Hindi and Spanish languages, as well as zero-shot testing on
modern LLM output, has uncovered both promising strengths and crucial weaknesses
in existing detection methods.

\subsubsection{Performance Benchmarking}

In our controlled diagnostic study, models achieved perfect classification
(F1~=~1.00) on formal Hindi and Spanish text (338 samples per class), compared
to prior English-only work obtaining F1-scores of 0.91--0.97. While these
results demonstrate strong proof-of-concept for morphologically rich languages,
they represent diagnostic performance on controlled data rather than
deployment-ready systems. Validation on larger, more diverse datasets is
essential before real-world application. Conventional classifiers outperformed
transformers across all languages consistently, with Random Forests showing an
improvement of 0.17 (17 percentage points) compared to English baselines.

Critically, our zero-shot evaluation exposes detection gaps not previously
considered: a 79.5\% success rate on commercial models contrasts starkly with
0\% on small optimized models, providing what is, to our knowledge, the first
such vulnerability assessment across varied LLM architectures.

\subsubsection{Key Findings}

Our research produces several key results: (1)~Traditional machine learning
models exhibit better cross-lingual detection than transformer-based detectors,
with perfect F1-scores on both non-English languages. (2)~Some linguistic
properties may aid AI text detection, with models detecting more accurately in
morphologically rich languages. (3)~Compact, tuned LLM architectures pose
unprecedented difficulties, with some models evading detection entirely across
all detection methods employed.

The perfect F1-scores on the Hindi and Spanish datasets, together with the
strong performance of interpretable models such as XGBoost, provide a sound
basis for multilingual detection systems. Nevertheless, the total evasion by
smaller open-source models reveals critical gaps that demand adaptive training
practices and ongoing model update protocols.

\subsubsection{Limitations}

\noindent\textbf{Dataset Scale and Controlled Conditions.} Our study employs
relatively modest datasets (338 samples per class per language) in controlled
experimental settings with formal, topic-balanced text. Perfect classification
scores observed for several models reflect strong discriminative signals under
these conditions but may not generalize to larger-scale, diverse real-world
scenarios. The controlled generation process (standardized prompts, consistent
temperature) may create detectable artifacts not present in naturally occurring
AI text. Future work requires validation on 10--100$\times$ larger datasets
spanning informal text, code-switched content, social media, and adversarially
modified samples before claims of deployment readiness can be made.

\noindent\textbf{Language Coverage and Model Drift.} Our evaluation covered
only Hindi and Spanish; extending it to more typologically diverse languages
would strengthen claims of generalizability. It also focused on formal,
well-structured text, so performance on informal social media content or
domain-specific jargon remains unexplored. Finally, the rapid evolution of LLM
architectures means our zero-shot evaluation may quickly become outdated as new
models emerge.

\subsubsection{Future Research Directions}

Future work should develop adaptive training systems that continually integrate
outputs from newer LLM architectures. Hybrid detection methods that combine the
robustness of classical statistical features with the semantic understanding of
transformers show promise for achieving both interpretability and performance.
Multi-class attribution systems that identify the specific source model would
offer finer detection granularity. Cross-domain robustness testing and
proactive defense mechanisms are further priorities for building reliable,
transparent, and globally deployable AI text detection systems.

%
% ---- Bibliography ----
%
\bibliographystyle{splncs04}
\bibliography{references}

\end{document}