\documentclass[runningheads]{llncs}

% Basic packages commonly allowed/used with LNCS
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{multirow}
\usepackage{url}
\usepackage{xcolor}
\usepackage{subcaption}
\usepackage{array}
\usepackage{tabularx}
\usepackage{hyperref}

\usepackage{float}
\usepackage{cleveref}

\usepackage{natbib}
\newcommand{\codewrap}[1]{\parbox[t]{\linewidth}{\ttfamily\small #1}}

\begin{document}

\title{Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment}

\author{Vanya Bannihatti Kumar\inst{1} \and Divyanshu Goyal\inst{2} \and Akhil Eppa\inst{3} \and Neel Bhandari\inst{4}}

\authorrunning{V. Bannihatti Kumar et al.}

\institute{Adobe} 
% \email{author1@example.com}
% \and
% Institution Two, City, Country \\
% \email{author2@example.com}}

\maketitle

\begin{abstract}
%New Abstract:

% Large language models (LLMs) demonstrate strong performance on evaluating objective tasks such as mathematical reasoning and factual verification, but they struggle with the inherently subjective challenge of evaluating creativity. A key difficulty lies in the fact that individual preferences shape creativity and rarely align uniformly across people.

% In this work, we introduce a curiosity-driven LLM-as-a-judge framework for assessing creative writing. Our approach incorporates annotator-specific preferences, enabling the model to adapt to each individual’s creative judgments. To evaluate our method, we leverage the Torrance Test of Creative Thinking (TTCW) benchmark introduced in \cite{chakrabarty2024artartificelargelanguage}, which contains expert-annotated stories evaluated across subjective dimensions such as fluency, originality etc. We show that our approach allows models of varying sizes to better capture the subtleties of individual creative assessments, achieving consistent improvements over baseline supervised finetuning (SFT) across multiple metrics, including Pearson correlation, Cohen’s $\kappa$, and F1 scores. Our method is particularly well-suited for subjective evaluation settings where annotator distributions are heavy-tailed.

Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing that is personalized to each individual's creative judgments. To test our hypothesis, we use the Torrance Test of Creative Writing (TTCW) benchmark introduced in \cite{chakrabarty2024artartificelargelanguage}, which contains stories annotated by human experts across subjective dimensions such as \emph{Originality}. We show that our method enables models of various sizes to learn the nuanced creative judgments of different individuals, improving over a supervised fine-tuning (SFT) baseline on evaluation metrics including Pearson correlation, Cohen's $\kappa$, and F1. Our method is especially useful in subjective evaluations where annotators do not all agree with each other.

%offering a more interpretable and human-aligned pathway for assessing creative expression. 


% Old Abstract
% \\
% Modern large language models (LLMs) have shown strong performance in evaluating objective tasks such as mathematical reasoning and factual accuracy. However, assessing subjective dimensions like creativity remains a significant challenge. In this work, we address this gap by proposing scalable reward models tailored for subjective evaluation, with a focus on creative writing. Using the benchmark introduced in \cite{chakrabarty2024artartificelargelanguage}, which adapts the Torrance Test of Creative Thinking to writing tasks, we introduce a novel method inspired by curiosity-driven reinforcement learning \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}, enabling models to evaluate creativity by measuring intrinsic reward using Bayesian Surprise. Our approach outperforms baseline supervised methods, highlighting the promise of curiosity-based reward modeling in capturing and improving creative expression.
\end{abstract}

\keywords{Curiosity-driven learning \and Creativity evaluation \and Personalization}


\section{Introduction}

Rigorous, standardized evaluation has repeatedly catalyzed progress in machine learning: ImageNet~\cite{russakovsky2015imagenetlargescalevisual} and GLUE~\cite{wang2019gluemultitaskbenchmarkanalysis} drove leaps in computer vision and natural language processing, respectively. The same effect is evident in objective mathematical reasoning, where benchmarks like GSM8K~\cite{cobbe2021gsm8k} have accompanied RL-trained reasoning models such as OpenAI’s o1~\cite{openai2024openaio1card} and DeepSeek-R1~\cite{deepseekai2025deepseekr1incentivizingreasoningcapability}, which have obtained strong results on hard contests like AIME and IMO.

While robust evaluation metrics exist for objective tasks such as mathematical reasoning and factual verification, subjective qualities like creativity remain difficult to assess reliably. Several prior works~\cite{panickssery2024llm, wataoka2025selfpreferencebiasllmasajudge} show that LLMs used as judges prefer their own generations, which makes them unreliable.
Despite the success of LLMs on objective benchmarks, they still struggle to evaluate creativity in a manner aligned with human judgment. As shown in \cite{chakrabarty2024artartificelargelanguage} and in Tables~\ref{tab:llm-vs-expert-kappa} and~\ref{tab:gpt5}, even state-of-the-art models do not evaluate the subjective dimensions of a story as consistently as a human expert. This can be attributed to the fact that individual preferences shape judgments of creativity and rarely align uniformly across people.

% To address this gap, we explore scalable methods to better personalise the . Reliable evaluation signals can improve generation quality in domains such as marketing, storytelling, and design, where originality and novelty are essential. Our work takes a step toward building human-aligned, creativity-aware language models.
% To close this gap, we introduce an enhanced LLM-as-a-judge that adapts its scoring to individual annotators, enabling more faithful, preference-aware creativity assessment. To this end, we propose a curiosity-driven LLM-as-a-judge for evaluating creativity in text generation, deriving inspiration from a curiosity-based Reinforcement Learning framework \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}. Our intuition is that if the model is aware of how surprised it is given the explanation by the annotator and adapts to that surprise using a self-supervised signal, it can generalize to the annotator's preference on unseen datasets. To this end, we first train an Intrinsic Curiosity Model(ICM) that computes the surprise of the model to the explanation of the annotator while guessing which of the annotators might have given that explanation. We then use the curiosity score from this model as an additional signal to improve the supervised finetuning baseline. The supervised finetuning baseline is a simple sequence to sequence task which predicts the judgment of the annotator given the story and the question on evaluating creativity. We have explained more details of the method in section \ref{sec:method}.

To address this gap, we present an enhanced LLM-as-a-judge that not only learns from a diverse pool of annotations but also adapts its scoring to align with individual annotators or experts. This allows for more faithful and preference-aware evaluation of creativity. We emphasize personalization in our framework because the assessment of subjective criteria inherently varies across individuals. To this end, we propose a curiosity-driven LLM-as-a-judge for evaluating creativity in text generation, drawing inspiration from the curiosity-based Reinforcement Learning (RL) framework of \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}. However, unlike the RL setting in \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}, we reinterpret curiosity as a \emph{belief-shift signal} for creative evaluation. Specifically, when the model is “surprised” by an expert’s explanation, it signals a mismatch between the LLM’s prior belief and the expert’s preference; conversely, low surprise indicates alignment between the LLM and the expert (see Fig.~\ref{fig:curiosity_score_against_base_model_pred}).
To implement this, we first train an Intrinsic Curiosity Model (ICM) that measures the LLM’s surprise at a given explanation while simultaneously predicting which expert or annotator produced the explanation. The intuition behind predicting the annotator is that the model can learn which annotator caused the belief shift, allowing it to calibrate the curiosity signal for each annotator individually, thereby improving personalization. The resulting \emph{curiosity score} is then fed as an auxiliary, self-supervised signal to improve a supervised fine-tuning (SFT) model (see Fig \ref{fig:overview_diagram_train}).

In our experiments, we establish a baseline using an SFT model that predicts annotators’ binary judgments from the story and question (see Fig.~\ref{fig:baseline_a}). To evaluate the effect of curiosity, we enhance this baseline with an ICM-derived curiosity score. More concretely, we append the curiosity score to the story and question that form the input of the baseline model. Because the curiosity score is the only change to the input, this gives a fair comparison that isolates the effect of the curiosity signal on the final judgment and lets us measure the lift in performance our methodology provides over the baseline.


% The SFT model predicts an annotator’s binary judgment from the story and the question (a standard sequence-to-sequence or classification head). Our curiosity-driven judge augments this baseline with the ICM-derived score, yielding a preference-aware predictor that adapts across annotators and transfers to unseen rater mixes. Method details are in Section~\ref{sec:method}.

We conduct extensive experiments across various model sizes to check that our method scales with model size. Since the TTCW dataset is extremely small, we use 5-fold cross-validation to obtain more reliable performance estimates. We also test our method in out-of-distribution scenarios to assess how well it generalizes. Averaged across model sizes, the ICM-augmented model substantially improves Pearson correlation and F1 scores over the SFT baseline. More details about the results can be found in Fig.~\ref{fig:id-ood-pearson-f1}.
% Across four Qwen sizes(0.5B, 1.5B, 3B and 7B), ICM beats SFT by large margins: on in-distribution data, the mean per-model relative gains are +277.4\% Pearson, +172.0\% $\$\$\kappa$$$, +66.9\% F1; on out-of-distribution, these expand to +1274.5\% Pearson, +980.2\% $\$\$\kappa$$$, +104.9\% F1.
% For generalization, the per-model ID to OOD shifts show SFT drops -24.9\%, -31.9\% and -2.6\% (Pearson, $\$\$\kappa$$$, and F1), while ICM improves +7.4\%, +18.1\% and +12.9\%.

%From our experiments \ref{tab:experiments} we see that the Pearson correlation metrics significantly improve over the baseline, from 208\% for smaller models like Qwen-2.5-0.5B to 278\% for larger models like Qwen-2.5-7B.

% Our main contributions is as follows:
% \begin{itemize}
%     \item Novel curiosity-driven method for evaluating subjective tasks
%     % \item 
% \end{itemize}

% by modeling how language models respond to expert explanations. Using the TTCW dataset, we focus on five creativity dimensions and define a two-part learning signal: a forward score measuring belief shift via mean squared error  between model predictions with and without an explanation, and a backward score via cross-entropy loss for predicting the authoring expert. These components form a curiosity signal, which is then used as an additional conditioning input in a supervised fine-tuning (SFT) setup. Specifically, the creativity score is appended to the question and story using a special token, and the model is trained to predict binary creativity verdicts. We compare this to a baseline SFT model that receives the question, story, and expert identity as input, and predicts both the verdict and explanation. This setup enables us to evaluate whether explicitly modeling curiosity improves creativity judgment alignment with human annotators.
% The ability to evaluate is a critical driver of progress in machine learning, enabling the development and refinement of models through consistent feedback and benchmarking. a wide range of robust evaluation metrics has been established for objective tasks—such as mathematics, factual reasoning, and code generation. Allowing models to be measured precisely and improved systematically. However, designing similarly reliable evaluation methods for subjective tasks, such as creativity, remains a significant challenge. Subjective outputs lack clear ground truth and often require nuanced human judgment, making it difficult to develop standardized metrics \cite{chakrabarty2024artartificelargelanguage}. This work aims to address that gap by exploring scalable and effective evaluation approaches for creative generation.



% Despite the remarkable capabilities of modern large language models (LLMs) across a range of objective tasks, they continue to fall short of human-level performance when it comes to evaluating subjective qualities such as creativity. As shown in \cite{chakrabarty2024artartificelargelanguage}, even state-of-the-art models struggle to consistently assess whether a piece of text is creative, often failing to align with human judgments. This highlights a fundamental limitation in current LLM evaluation mechanisms and underscores the need for more nuanced, scalable approaches tailored to subjective evaluation.



% There is a pressing need for reliable evaluation methods for creativity in language models. Robust evaluation frameworks can guide the development of models that perform better in creative domains such as marketing, storytelling, content generation, and design. Without such metrics, current LLMs often produce outputs that lack originality, coherence, or novelty—manifesting as generic, uninspired, or "sloppy" generations. Addressing this gap is essential not only for improving model outputs in subjective tasks but also for fostering more human-aligned and high-quality creative generation at scale.




% We propose a curiosity-driven framework for evaluating creativity in text generation by modeling how language models respond to expert-provided explanations. Our method leverages the TTCW dataset, focusing on 5 out of its 14 annotated creativity dimensions:

% \begin{enumerate}
%     \item \textit{Originality in Theme}
%     \item \textit{Originality in Form}
%     \item \textit{Cliché Avoidance}
%     \item \textit{Surprising \& Appropriate Turns}
%     \item \textit{Perspective Flexibility}
% \end{enumerate}

% Each example in the dataset consists of a story $S$, a creativity-focused question $Q_d$ for a specific dimension $d$, a set of three expert explanations $\mathcal{E} = \{e_1, e_2, e_3\}$, and corresponding binary verdicts $V_i \in \{\texttt{yes}, \texttt{no}\}$ for each explanation.

% Our model processes this input in two stages to compute a composite curiosity signal:

% \begin{itemize}
%     \item \textbf{Forward Loss:} We define two states—State A: $(S, Q_d)$ (story and question), and State B: $(S, Q_d, e_i)$ (story, question, and expert explanation $e_i$). The model is trained to predict the MSE loss between these states, effectively learning to estimate how much its prediction changes due to the inclusion of an explanation:

%     \[
%     \mathcal{L}_{\text{forward}} = \left(f_\theta(S, Q_d, e_i) - f_\theta(S, Q_d)\right)^2
%     \]

%     \textit{This loss captures the model's belief shift regarding creativity when it is exposed to human explanation.}

%     \item \textbf{Backward Loss:} In this stage, the model attempts to predict which expert authored a given explanation $e_i$. Given $(S, Q_d, e_i)$, a classifier $g_\phi$ outputs a distribution over the three possible experts:

%     \[
%     p_\phi(z_i \mid S, Q_d, e_i) = \text{softmax}(g_\phi(S, Q_d, e_i))
%     \]

%     The backward loss is computed using cross-entropy with respect to the true expert label $z_i$:

%     \[
%     \mathcal{L}_{\text{backward}} = -\log p_\phi(z_i \mid S, Q_d, e_i)
%     \]

%     \textit{This encourages the model to internalize the distinct evaluative styles of individual experts.}
% \end{itemize}

% The final \textbf{curiosity score} is defined as a weighted sum of the forward and backward components:

% \[
% \mathcal{L}_{\text{curiosity}} = \mathcal{L}_{\text{forward}} + \lambda \cdot \mathcal{L}_{\text{backward}}
% \]

% where $\lambda$ is a tunable hyperparameter controlling the influence of expert attribution. This combined loss acts as a scalable and interpretable proxy for subjective creativity evaluation, grounded in model belief dynamics and expert reasoning recognition.



% \begin{figure}[h]
%     \centering
%     \includegraphics[width=0.8\textwidth]{AuthorKit26/AnonymousSubmission/LaTeX/train.png}
%     \caption{Overview of Architecture during training for Curiosity Driven LLM-as-a-judge}
%     \label{fig:overview_diagram_train}
% \end{figure}

\begin{figure}[t]  % top-of-column is most reliable for AAAI floats
  \centering
  \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/train.png}
  \caption{Overview of Architecture during training for Curiosity-Driven LLM-as-a-judge}
  \label{fig:overview_diagram_train}
\end{figure}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/test.png}
    \caption{Overview of Architecture during inference for Curiosity-Driven LLM-as-a-judge}
    \label{fig:overview_diagram_test}
\end{figure}

% \begin{figure}[t]
%   \centering
%   \begin{minipage}[t]{0.48\textwidth}
%     \centering
%     \includegraphics[width=\linewidth]{train.jpg}
%     \caption*{(a) Training}
%     \label{fig:overview_diagram_train}
%   \end{minipage}\hfill
%   \begin{minipage}[t]{0.48\textwidth}
%     \centering
%     \includegraphics[width=\linewidth]{test.jpg}
%     \caption*{(b) Inference}
%     \label{fig:overview_diagram_test}
%   \end{minipage}
%   \caption{Overview of the curiosity-driven framework during (a) training and (b) inference.}
%   \label{fig:overview_diagram_sidebyside}
% \end{figure}



% \begin{figure}[h]
%     \centering
%     \includegraphics[width=0.8\textwidth]{iclr2026/baseline_classification.jpg}
%     \caption{Baseline without explanations}
%     \label{fig:baseline_classification}
% \end{figure}



% \begin{figure}[h]
%     \centering
%     \includegraphics[width=0.8\textwidth]{iclr2026/baseline_language_modeling.jpg}
%     \caption{Baseline with explanations}
%     \label{fig:baseline_language_modeling}
% \end{figure}

% \begin{figure}[t]
%   \centering
%   \begin{minipage}[t]{0.48\textwidth}
%     \centering
%     \includegraphics[width=\linewidth]{iclr2026/baseline_classification.jpg}
%     \caption*{(a) Baseline without explanations}
%     \label{fig:baseline_classification}
%   \end{minipage}\hfill
%   \begin{minipage}[t]{0.48\textwidth}
%     \centering
%     \includegraphics[width=\linewidth]{iclr2026/baseline_language_modeling.jpg}
%     \caption*{(b) Baseline with explanations}
%     \label{fig:baseline_language_modeling}
%   \end{minipage}
%   \caption{Comparison of baselines with and without explanations.}
%   \label{fig:baseline_comparison}
% \end{figure}

% \begin{figure}[t]
%   \centering
%   \begin{subfigure}
%     \centering
%     \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/baseline_classification.png}
%     \subcaption{Baseline without using explanations}\label{fig:baseline_a}
%   \end{subfigure}\hfill
%   \begin{subfigure}
%     \centering
%     \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/baseline_language_modeling.png}
%     \subcaption{Baseline using explanations}\label{fig:baseline_b}
%   \end{subfigure}
%   \caption{Comparison of baselines with and without using explanations.}
%   \label{fig:baseline_comparison}
% \end{figure}

\begin{figure}[t]
  \centering
  \begin{subfigure}[t]{0.48\linewidth}
    \centering
    \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/baseline_classification.png}
    \subcaption{Baseline without using explanations}
    \label{fig:baseline_a}
  \end{subfigure}\hfill
  \begin{subfigure}[t]{0.48\linewidth}
    \centering
    \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/baseline_language_modeling.png}
    \subcaption{Baseline using explanations}
    \label{fig:baseline_b}
  \end{subfigure}
  \caption{Comparison of baselines with and without using explanations.}
  \label{fig:baseline_comparison}
\end{figure}








\section{Methodology}
\label{sec:method}

In this section, we describe our curiosity-driven LLM-as-a-judge for evaluating creativity in text generation, which combines belief shift estimation with expert attribution. Our method leverages the TTCW dataset \cite{chakrabarty2024artartificelargelanguage}, which is based on the Torrance Test of Creative Thinking \cite{torrance1966ttct} but adapted to creative writing. We focus on a subset of five creativity dimensions that are particularly relevant for evaluating the creative judgments of generative language models. We detail the dataset structure, model architecture, loss functions, and the formulation of our curiosity signal.

\subsection{Dataset}
\label{sec:creativity dimension}
The TTCW dataset\footnote{\href{https://huggingface.co/datasets/Salesforce/ttcw_creativity_eval}{Huggingface TTCW dataset}} provides expert human-annotated creativity judgments across 14 distinct dimensions, all of which are listed in Appendix \ref{subsec:dataset}. For this study, we restrict our analysis to five of them: the three dimensions categorised under \textit{Originality} (\textit{Originality in Thought}, \textit{Originality in Form}, and \textit{Originality in Theme and Content}) and two representative dimensions from \textit{Flexibility} (\textit{Structural Flexibility} and \textit{Perspective and Voice Flexibility}). We picked these 5 dimensions among the 14 (Table \ref{tab:cre-dim}) because they are the most subjective in nature and hence best suited to evaluating our methodology. We defer exploration of the remaining dimensions to future work. The questions associated with each dimension can be found in Table \ref{tab:creativity_eval} in the appendix.

% \begin{table}[ht]
% \centering
% \caption{Dimensions of TTCW dataset}
% \label{tab:cre-dim}
% \small
% \setlength{\tabcolsep}{3.5pt}
% \renewcommand{\arraystretch}{1.2}
% \rowcolors{2}{lightgray}{white}
% \begin{tabular}{|l|l|}
% \hline
% \textbf{Dimensions} & \textbf{Facets} \\ \hline
% \multirow{\textbf{Fluency}} & Understandability \& Coherence  \\ \cline{2-5} 
% & Narrative Pacing \\ \cline{2-5} 
% & Scene vs Exposition\\ \cline{2-5} 
% & Literary Devices \& Language Proficiency \\ \cline{2-5} 
% & Narrative Ending  \\ \hline
% \multirow{\textbf{Flexibility}} & Emotional Flexibility \\ \cline{2-5} 
% & Perspective \& Voice Flexibility \\ \cline{2-5} 
% & Structural Flexibility  \\ \hline
% \multirow{\textbf{Originality}} & Originality in Form  \\ \cline{2-5} 
% & Originality in Thought \\ \cline{2-5} 
% & Originality in Theme \& Content  \\ \hline
% \multirow{\textbf{Elaboration}} & World Building \& Setting \\ \cline{2-5} 
% & Character Development  \\ \cline{2-5} 
% & Rhetorical Complexity  \\ \hline
% \end{tabular}
% \end{table}

\subsection{Data Format and Task Setup}

% Each example in the dataset consists of:

% \begin{itemize}
%     \item A story $S$
%     \item A creativity-focused question $Q_d$ specific to dimension $d$ \\
%     \item Expert id for each of the annotator, $z_i$
%     \item Three expert-provided explanations $\mathcal{E} = \{e_1, e_2, e_3\}$
%     \item Corresponding binary verdicts $V_i \in \{\texttt{yes}, \texttt{no}\}$ for each explanation
% \end{itemize}

Each example in the dataset consists of a story $S$, a creativity-focused question $Q_d$ specific to dimension $d$, and three expert annotations. The $i$-th annotation, with $i \in \{1, 2, 3\}$, comprises an expert ID $z_i$, an expert-provided explanation $e_i$ (so that $\mathcal{E} = \{e_1, e_2, e_3\}$), and the corresponding binary verdict $V_i \in \{\texttt{yes}, \texttt{no}\}$.

% The task involves training a model to evaluate whether a given story satisfies a specific creative criterion (based on the dimension), and to incorporate explanation-based guidance in doing so. Additionally, the model must learn to distinguish between different experts based on their explanations.

The task is to produce judgments that match those of a particular expert when the model is presented with the story and the creativity question.
\subsection{Intrinsic Curiosity Model Overview}

Our model operates in two stages:

\begin{enumerate}
    \item \textbf{Belief Shift Estimation (Forward Score)}: The model measures the impact of an expert explanation on its prediction of creativity.
    \item \textbf{Expert Attribution (Backward Score)}: The model identifies which expert wrote a given explanation.
\end{enumerate}

%Together, these stages quantify the model's curiosity as a function of belief update and expert interpretability.

\subsubsection{Forward Score: Belief Shift via Cosine Loss}

We define two states:

\begin{itemize}
    \item \textbf{State A}: Input consisting of the story, the question, and a one-hot vector of the expert ID $z_i$, represented as $(S, Q_d, \mathrm{onehot}(z_i))$ where $i \in \{1, 2, 3\}$, since each story-question pair is annotated by 3 experts.
    \item \textbf{State B}: Input augmented with one expert explanation $(S, Q_d, e_i)$ where $i \in \{1, 2, 3\}$.
\end{itemize}

% The model produces creativity scores in both states:

% \[
% f_\theta^{(A)} = f_\theta(S, Q_d) \quad\quad f_\theta^{(B)} = f_\theta(S, Q_d, e_i)
% \]
Let $f_\theta^{(A)} = f_\theta(S, Q_d, \mathrm{onehot}(z_i))$ and $f_\theta^{(B)} = f_\theta(S, Q_d, e_i)$, where $f_{\theta}$ denotes the judge’s scoring function (logit head) with parameters $\theta$ that maps the input to a scalar judgment logit.

The forward loss is defined as the cosine loss between these two predictions:

% \[
% \mathcal{L}_{\text{forward}} = \left(f_\theta^{(B)} - f_\theta^{(A)}\right)^2
% \]
\[
\mathcal{L}_{\text{forward}} = 1 - \frac{f_\theta^{(A)} \cdot f_\theta^{(B)}}{\|f_\theta^{(A)}\| \|f_\theta^{(B)}\|}
\]

This loss captures how much the model’s belief about the creativity of the story shifts when it incorporates the annotator's explanation, which we define as the intrinsic curiosity measure.
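
As an illustrative sketch of this loss, assume (hypothetically) that $f_\theta$ returns a two-dimensional output vector; the values below are invented for illustration and are not taken from our experiments. If $f_\theta^{(A)} = (2, 1)$ and $f_\theta^{(B)} = (1, 2)$, then
\[
\mathcal{L}_{\text{forward}} = 1 - \frac{2 \cdot 1 + 1 \cdot 2}{\sqrt{5}\,\sqrt{5}} = 1 - \frac{4}{5} = 0.2 ,
\]
which corresponds to a moderate change in direction. If the explanation instead reverses the output, e.g.\ $f_\theta^{(B)} = (-2, -1)$, the cosine similarity is $-1$ and the loss attains its maximum value of $2$; identical directions give a loss of $0$. Note that for scalar outputs the cosine term reduces to the sign of the product $f_\theta^{(A)} f_\theta^{(B)}$, so the loss can only take the values $0$ and $2$; this is why the sketch assumes vector-valued outputs.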

\subsubsection{Backward Score: Expert Attribution via Cross-Entropy}

To help the model capture the distinct reasoning styles of different experts, we introduce an auxiliary classification task. Given $(S, Q_d, e_i)$, the model predicts the identity of the expert $z_i \in \{1, 2, 3\}$ who authored explanation $e_i$:

\[
p_\phi(z_i \mid S, Q_d, e_i) = \text{softmax}(g_\phi(S, Q_d, e_i))
\]

The backward loss is the cross-entropy between the predicted and true expert label:

\[
\mathcal{L}_{\text{backward}} = -\log p_\phi(z_i \mid S, Q_d, e_i)
\]
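
As a worked example with hypothetical numbers (not taken from our experiments), suppose the classifier outputs the logits $g_\phi(S, Q_d, e_i) = (2, 1, 0)$ and the true author is expert $z_i = 1$. Then
\[
p_\phi(z_i = 1 \mid S, Q_d, e_i) = \frac{e^{2}}{e^{2} + e^{1} + e^{0}} \approx \frac{7.389}{11.107} \approx 0.665,
\qquad
\mathcal{L}_{\text{backward}} = -\log 0.665 \approx 0.408 .
\]
For reference, a classifier that assigns uniform probability $1/3$ to each of the three experts incurs $\mathcal{L}_{\text{backward}} = \log 3 \approx 1.099$, so values well below this indicate that the explanation carries information about its author.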

%This encourages the model to encode expert-specific reasoning, providing a measure of its capacity to
%distinguish interpretive nuance.
\subsubsection{Loss Function of the Intrinsic Curiosity Model (ICM)}

We define the ICM loss as a weighted combination of the forward and backward components:

\[
\mathcal{L}_{\text{curiosity}} = \mathcal{L}_{\text{forward}} + \lambda \cdot \mathcal{L}_{\text{backward}}
\]

where $\lambda$ is a tunable hyperparameter that balances the two objectives. In our experiments we set $\lambda = 1$.
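
To illustrate how $\lambda$ trades off the two terms, take the hypothetical values $\mathcal{L}_{\text{forward}} = 0.2$ and $\mathcal{L}_{\text{backward}} = 0.408$ (chosen for illustration only). Then
\[
\lambda = 1:\quad \mathcal{L}_{\text{curiosity}} = 0.2 + 1 \cdot 0.408 = 0.608,
\qquad
\lambda = 0.5:\quad \mathcal{L}_{\text{curiosity}} = 0.2 + 0.5 \cdot 0.408 = 0.404 .
\]
Since the cosine-based forward term is bounded in $[0, 2]$ while the cross-entropy term is non-negative and unbounded above, $\lambda$ also controls how strongly a poorly attributed explanation can dominate the total loss.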

\subsubsection{Incorporating the Curiosity Signal into SFT}

To evaluate the utility of the learned curiosity signal, we use it as a conditioning input to a supervised fine-tuning (SFT) model trained to predict expert verdicts. For each instance, we append the scalar curiosity score to the original input using a special delimiter token \texttt{<CREAT>}, resulting in the following input format:

% \[
% \texttt{Input:} \quad Q_d + S + \texttt{<CREAT>} + {Curiosity}_{\text{Score}} \quad \longrightarrow \quad \texttt{Target: } V_i
% \]

\begin{align}
\text{Input:} \quad & Q_d + S + \texttt{<CREAT>} \nonumber \\
& + \text{Curiosity}_{\text{score}}
\quad \longrightarrow \quad \text{Target: } V_i
\end{align}


% \[
% \text{Curiosity}_{\text{score}}
% \;=\;
% f_{\theta}\!\big(S,\,Q_d,\,e_i\big)
% \;-\;
% f_{\theta}\!\big(S,\,Q_d,\,\mathrm{onehot}(\text{expert\_idx})\big)
% \]
\begin{align}
\text{Curiosity}_{\text{score}}
&= f_{\theta}\big(S,\,Q_d,\,e_i\big) \nonumber\\
&\quad - f_{\theta}\big(S,\,Q_d,\,\mathrm{onehot}(z_i)\big)
\end{align}

$V_i \in \{\texttt{yes}, \texttt{no}\}$ is the binary verdict associated with explanation $e_i$. The model uses $\text{Curiosity}_{\text{score}}$ as a signal to predict the verdict of the given annotator. We train this classifier with a cross-entropy loss.
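
As a small worked example with hypothetical logits (illustrative values only), suppose $f_{\theta}(S, Q_d, e_i) = 1.3$ and $f_{\theta}(S, Q_d, \mathrm{onehot}(z_i)) = -0.4$. The curiosity score is then
\[
1.3 - (-0.4) = 1.7 ,
\]
and the classifier input becomes $Q_d + S + \texttt{<CREAT>} + 1.7$. A positive score indicates that the explanation raises the judgment logit relative to the annotator-conditioned prediction, a negative score that it lowers it, and a score near zero that the explanation leaves the prediction largely unchanged. How the scalar is serialized after the delimiter (for example, the number of decimal places) is an implementation detail that this sketch does not fix.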

% This setup allows the model to condition its verdict prediction not just on the narrative content, but also on an interpretable creativity signal representing model belief shift and expert alignment.

\subsection{Inference}
During inference (see Fig.~\ref{fig:overview_diagram_test}), the story and creativity-focused question are first passed through the intrinsic curiosity model (ICM) to compute a curiosity score. This score reflects the model's internal belief shift in response to the input for that particular annotator. The resulting curiosity score is then appended to the original input, using the special delimiter token \texttt{<CREAT>}, and passed to the SFT classifier model. This classifier then predicts the binary creativity verdict (\texttt{yes} or \texttt{no}) for the given story-question pair.
%It should be noted that our model works for the annotators that are in the training set, the model might not work well for an unseen annotators.
% The full inference pipeline is illustrated in Figure~\ref{fig:inference-pipeline}.

% \subsection{Baseline: Supervised Fine-Tuning without Curiosity}

% As a comparison point, we employ a standard supervised fine-tuning (SFT) baseline that conditions directly on the provided explanation but omits any explicit curiosity signal. The model input is structured as:

% \[
% \texttt{Input:} \quad Q_d + S + z_i \quad \longrightarrow \quad \texttt{Target: } \{label: V_i, 
%  explanation: e_i\}
% \]

% In this setup, the model is trained to predict the expert verdict solely based on the question, plot, and one explanation $e_i$. Here, $S$ is the full story, $z_i$ is the identity of the annotator. The Supervised Fine-Tuned model is trained with the target text represented in a JSON format containing the verdict $V_i$ and explanation $e_i$.

% During inference, we pass the plot, question, and the annotator index to get the explanation and label in JSON format, which is then parsed to get the binary verdict to compare against the ground truth labels. This comparison enables us to assess the extent to which the curiosity signal improves verdict prediction, especially in capturing the implicit reasoning embedded in explanations.

% Another point to note is that, during the training of the Supervised Fine-Tuned model, the LoRA $\alpha$ / Rank had to be increased to 256/256 to allow the model to learn and produce valid JSON structure.

\subsection{Baseline with explanations}

For the baseline comparison, we use a standard SFT model that generates both the explanation and the binary verdict given the input (see \cref{fig:baseline_b}). The model input is structured as:

\[
\texttt{Input:} \quad Q_d + S + z_i \quad \longrightarrow \quad \texttt{Target:} \{ V_i, e_i \}
\]

% \begin{equation*}
% \texttt{Input:}\; Q_d {+} S {+} z_i \;\longrightarrow\; \texttt{Target:}\;\{ V_i,\, e_i \}
% \end{equation*}
% Here, \(Q_d\) is the evaluation question, \(S\) is the full story, \(z_i\) is the annotator identity, and \(e_i\) is the annotator’s explanation associated with the binary verdict \(V_i\).


At inference time, we provide \(Q_d\), \(S\), and \(z_i\) as input, and the model outputs a JSON structure from which the predicted verdict is parsed and compared to the ground truth. This baseline is trained with a language modeling loss.
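
A minimal sketch of this objective, assuming the standard token-level negative log-likelihood; the symbols $y_{1:T}$ (the tokens of the serialized target $\{V_i, e_i\}$) and $p_\theta$ (the model distribution) are notation introduced here for illustration:
\[
\mathcal{L}_{\text{LM}}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid y_{<t},\; Q_d + S + z_i\bigr).
\]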

% Notably, we found that increasing the LoRA configuration to \(\alpha\) / Rank = 256 / 256 was necessary for the SFT baseline to reliably produce syntactically valid JSON outputs.

\subsection{Baseline without explanations}
\label{sec:baseline classifcation}
To ensure a fair comparison, we also compare our method against an SFT baseline trained in a classification setting rather than a causal language modeling setting (see \cref{fig:baseline_a}).
Because this baseline is a classifier, the explanations appear in neither its input nor its output: the input consists of the question, the story, and the annotator identity, and the output is the verdict.

\[
\texttt{Input:} \quad Q_d + S + z_i \quad \longrightarrow \quad \texttt{Target:} \{ V_i\}
\]
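As a sketch, assuming the classification loss is the usual binary cross-entropy (an assumption; the exact form is not stated here), with $V_i \in \{0,1\}$ and $\hat{p}_i$ denoting the predicted probability of a \texttt{yes} verdict (notation introduced for illustration), the per-example objective would read
\[
\ell_i \;=\; -\bigl[\,V_i \log \hat{p}_i + (1 - V_i)\log(1 - \hat{p}_i)\,\bigr].
\]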




\subsection{Evaluation}
\label{sec:eval}
Evaluating subjective tasks like creativity presents unique challenges, as even human annotators often disagree on what constitutes a ``correct'' judgment. Rather than attempting to define a universal metric for creativity, our approach embraces this subjectivity by focusing on personalization. We aim to adapt evaluation signals to individual experts by learning from a small number of their labeled examples. This allows us to model subjective preferences more faithfully and use this personalized model to assess creativity in a user-aligned manner. To quantify model performance in capturing individual judgments, we report \textbf{Pearson Correlation} \cite{benesty2009pearson} and \textbf{Cohen's $\kappa$} \cite{cohen1960coefficient}, along with \textbf{Precision}, \textbf{Recall}, and \textbf{F1-score}. These metrics enable us to assess both the predictive accuracy and ranking consistency of our models in aligning with subjective human evaluations.
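
For reference, these metrics take their standard forms. Let $p_o$ be the observed agreement between predicted and expert verdicts, $p_e$ the agreement expected by chance from the marginal label frequencies, and $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{FN}$ the counts of true positives, false positives, and false negatives, taking \texttt{yes} as the positive class (an assumption made here for illustration). Then
\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad
R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
\mathrm{F1} = \frac{2PR}{P+R}.
\]
As a worked illustration with hypothetical counts (not results from our experiments), take $\mathrm{TP}=40$, $\mathrm{FP}=20$, $\mathrm{FN}=10$, and $\mathrm{TN}=30$. Then $P = 40/60 \approx 0.667$, $R = 40/50 = 0.800$, and $\mathrm{F1} = 80/110 \approx 0.727$. The observed agreement is $p_o = 70/100 = 0.7$ and the chance agreement is $p_e = 0.6 \cdot 0.5 + 0.4 \cdot 0.5 = 0.5$, so $\kappa = (0.7-0.5)/(1-0.5) = 0.4$.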

% \section{Theory}
% \paragraph{Why curiosity beats using explanation text directly.}
% Let $E$ denote the annotator's explanation, and let $s_{\text{base}}(x)$ and $s_{\text{expl}}(x,a)$ be pre/post-explanation logits. We use the \emph{curiosity score} $c(x,a)=s_{\text{expl}}(x,a)-s_{\text{base}}(x)$ and \emph{discard} $E$. This has three advantages grounded in standard theory:

% \emph{(1)}
% In logit/Bayesian updates, explanations act additively on log-odds via a log-likelihood ratio (``weight of evidence'') \citep{agresti2013}:
% \[
% \log \frac{\Pr(V=1\mid x,E)}{\Pr(V=0\mid x,E)}
% =\log \frac{\Pr(V=1\mid x)}{\Pr(V=0\mid x)}
% +\underbrace{\log\frac{p(E\mid V=1,x)}{p(E\mid V=0,x)}}_{\text{weight of evidence}}.
% \]
% Our $c(x,a)=s_{\text{expl}}-s_{\text{base}}$ is an empirical estimate of this increment. Thus $c$ captures (near-)sufficient information from $E$ for decision-making on the log-odds scale without carrying lexical nuisance.

% \emph{(2)}Since $c$ correlates with per-example loss/gradients, it functions as a \emph{control variate}, reducing gradient and risk variance by a factor $(1-\rho^2)$ at the optimal control weight \citep{owen2013monte}.

% \[
% \rho \;\triangleq\; \mathrm{Corr}(Z,C)
% \;=\;
% \frac{\mathrm{Cov}(Z,C)}{\sqrt{\mathrm{Var}(Z)\,\mathrm{Var}(C)}} \in [-1,1].
% \]

% where Z is per-example cross entropy loss and C is the curiosity score


% \emph{(3)}Text features in $E$ entangle content with annotator idiosyncrasies; $c$ focuses on the \emph{effect} of $E$ (how belief moves), which transfers across raters and reduces pooling bias from heterogeneous annotator effects (a known issue in multi-rater settings) \citep{Dawid1979MaximumLE}.

% % \emph{Net effect.}
% % Conditioning on $c$ approximates the theoretically right sufficient update (Good’s weight of evidence), achieves lower variance than text concatenation (Hastie–Tibshirani–Friedman; control variates), and is robust to heterogeneity (Huber/Catoni; Dawid–Skene). Empirically, this yields accuracy at least on par with text-as-input baselines and better stability under rater mix shifts.



\section{Theory: Why Curiosity Beats Using Explanation Text Directly}
\label{sec:theory-curiosity}

Let $e_i$ denote the expert's explanation, let $x = Q_d + S$, let $s_{\text{base}}(x) = f_{\theta}\!\big(S,\,Q_d,\,\mathrm{onehot}(z_i)\big)$ be the pre-explanation logit, and let
$s_{\text{expl}}(x,e_i)=f_{\theta}\!\big(S,\,Q_d,\,e_i\big)$ be the post-explanation logit produced by the model when conditioned on $e_i$.
We define $\text{Curiosity}_{\text{Score}}$ as the belief shift
\[
\text{Curiosity}_{\text{Score}}\;=\;f_{\theta}\!\big(S,\,Q_d,\,e_i\big)-f_{\theta}\!\big(S,\,Q_d,\,\mathrm{onehot}(z_i)\big),
\]
and \emph{discard} $e_i$ thereafter. We train a predictor $\hat p_\theta(V{=}1\mid x,\text{Curiosity}_{\text{Score}})=\sigma\!\big(h_\theta(x,\text{Curiosity}_{\text{Score}})\big)$, where $V$ is the verdict, $h_\theta$ is the LLM judge model, and $\sigma$ is the logistic sigmoid (the two-class case of softmax).
This yields three advantages grounded in standard theory.

\paragraph{(1) Weight-of-evidence sufficiency.}
In logit/Bayesian updates, additional information acts \emph{additively} on log-odds via a log-likelihood ratio
(\emph{weight of evidence}) \citep{agresti2013}:
% \[
% \log \frac{\Pr(V=1\mid x,e_i)}{\Pr(V=0\mid x,e_i)}
% \,=\,
% \log \frac{\Pr(V=1\mid x)}{\Pr(V=0\mid x)}
% \,+\,
% \underbrace{\log\frac{p(e\mid V=1,x)}{p(e\mid V=0,x)}}_{\text{weight of evidence}}.
% \]
\begin{align}
\log \frac{\Pr(V=1 \mid x, e_i)}{\Pr(V=0 \mid x, e_i)}
&= \log \frac{\Pr(V=1 \mid x)}{\Pr(V=0 \mid x)} \nonumber\\
&\quad + \underbrace{\log \frac{p(e_i \mid V=1, x)}{p(e_i \mid V=0, x)}}_{\text{weight of evidence}}.
\end{align}

In our methodology, $\text{Curiosity}_{\text{Score}}=s_{\text{expl}}-s_{\text{base}}$ is an \emph{empirical estimate} of this increment on the log-odds
scale, so it preserves the decision-relevant effect of $e$ while removing lexical/style nuisance.
Consequently, conditioning on $\text{Curiosity}_{\text{Score}}$ approximates the theoretically ``right'' sufficient update in a logistic
decision rule \citep{agresti2013}.
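
The identity above follows in one step from Bayes' rule. For $v \in \{0,1\}$,
\[
\Pr(V=v \mid x, e_i) \;=\; \frac{p(e_i \mid V=v, x)\,\Pr(V=v \mid x)}{p(e_i \mid x)},
\]
so taking the ratio of the $v=1$ and $v=0$ cases cancels the normalizer $p(e_i \mid x)$, and taking logarithms turns the remaining product into the sum of the prior log-odds and the weight of evidence.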

\paragraph{(2) Variance reduction via a control-variate effect.}
Let $Z$ be the random quantity we wish to estimate more stably (e.g., per-example loss),
and let $C = \text{Curiosity}_{\text{Score}}$ be the control signal.
With Pearson correlation
\[
\rho \;=\; \mathrm{Corr}(Z,C)\;=\; \frac{\mathrm{Cov}(Z,C)}{\sqrt{\mathrm{Var}(Z)\,\mathrm{Var}(C)}}\in[-1,1],
\]
the classic control-variate construction implies that the optimally adjusted estimator
$Z^\star = Z - \alpha^\star (C-\mathbb{E}[C])$
achieves
\[
\mathrm{Var}(Z^\star) \;=\; \mathrm{Var}(Z)\,\bigl(1-\rho^2\bigr)
\quad\text{at}\quad
\alpha^\star=\frac{\mathrm{Cov}(Z,C)}{\mathrm{Var}(C)}.
\]
Thus any nonzero correlation with $C$ strictly reduces variance \citep[Ch.~8]{owen2013monte}. In our setting,
$Z=\ell_i(\theta)$ is the per-example cross-entropy loss, so the adjustment reduces the variance of the empirical risk. Lower variance improves sample efficiency and stabilizes training.
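
For completeness, the variance expression follows from expanding the variance of the adjusted estimator for a generic coefficient $\alpha$:
\[
\mathrm{Var}\bigl(Z-\alpha(C-\mathbb{E}[C])\bigr)
\;=\;\mathrm{Var}(Z)-2\alpha\,\mathrm{Cov}(Z,C)+\alpha^{2}\,\mathrm{Var}(C).
\]
Assuming $\mathrm{Var}(C)>0$, this is a convex quadratic in $\alpha$ whose minimizer is $\alpha^\star=\mathrm{Cov}(Z,C)/\mathrm{Var}(C)$. Substituting gives
\[
\mathrm{Var}(Z^\star)
\;=\;\mathrm{Var}(Z)-\frac{\mathrm{Cov}(Z,C)^{2}}{\mathrm{Var}(C)}
\;=\;\mathrm{Var}(Z)\,\bigl(1-\rho^{2}\bigr).
\]
As an illustrative number (not a measured value), a correlation of $\rho=0.5$ would remove $25\%$ of the variance.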

% \paragraph{(3) Invariance to rater wording and robustness to rater-mix shifts.}
% Text features in $e$ entangle item content with annotator idiosyncrasies (style, verbosity).
% By construction, $\text{Curiosity}_{\text{Score}}$ focuses on the \emph{effect} of $e$ (how belief moves), which is more invariant across raters and
% reduces pooling bias from heterogeneous annotator effects---a known issue in multi-rater settings \citep{Dawid1979MaximumLE}.
% Hence, conditioning on $\text{Curiosity}_{\text{Score}}$ improves transfer under shifts in the rater mix or with unseen raters.

\paragraph{(3) Curiosity as a model of annotator behaviour and generalization.}
\label{sec:annotator-behavior}

Subjective labels reflect both item difficulty and rater idiosyncrasy. A classic way to formalize this is a random-effects logit \citep{Dawid1979MaximumLE,agresti2013}:
\begin{equation}
\label{eq:rand-effects}
\mathrm{logit}\,\Pr(V{=}1\mid x,z_i)\;=\;f(x)\;+\;b_{z_i}(x),
\end{equation}
where $f(x)$ captures item evidence and $b_{z_i}(x)$ represents the (possibly context-dependent) strictness or leniency of annotator $z_i$. Because the curiosity score models annotator behaviour without relying on the idiosyncrasies of the explanation text, we expect it to generalize better to out-of-distribution dimensions for that annotator.
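
As a purely illustrative instance of Eq.~\eqref{eq:rand-effects} (all numbers are hypothetical), suppose an item has evidence $f(x)=0.5$. A strict annotator with $b_{z_i}(x)=-1$ and a lenient annotator with $b_{z_j}(x)=+1$ would then assign
\[
\Pr(V{=}1\mid x,z_i)=\frac{1}{1+e^{0.5}}\approx 0.38,
\qquad
\Pr(V{=}1\mid x,z_j)=\frac{1}{1+e^{-1.5}}\approx 0.82,
\]
so the same story receives opposite most-likely verdicts depending on the rater. This is the annotator-specific offset that a personalized signal needs to capture.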

% \section{Experiments}
% \label{sec:expt}
% We evaluate our proposed ICM method against SFT baseline, as detailed in Section~\ref{sec:method}, across various model sizes. To ensure a fair comparison, both models are trained and evaluated using the same information in input and output, during both training and test time. Each instance from the TTCW dataset~\cite{chakrabarty2024artartificelargelanguage} includes a story plot, a creativity-focused question, and the annotator identity. During training, models additionally receive the corresponding expert explanation and binary verdict. At inference time, however, these signals are withheld to ensure that evaluation reflects only the models' internalized understanding, without auxiliary guidance.

% Given the limited size of the TTCW dataset with 48 stories annotated across 5 dimensions, with 3 expert judgments per story-dimension pair, we obtain a total of 720 data points. To ensure robustness in evaluation, we adopt a 5-fold cross-validation strategy with an 80:20 train-test split in each fold, yielding approximately 144 test and 576 train samples per fold. Since the individual folds are too small to support statistically significant conclusions, we report the mean of all evaluation metrics across the five folds (see Table \ref{tab:experiments} and Section \ref{sec:eval}).

% For training, we ensure convergence of the loss in all runs to maintain consistency across models. Although the baseline with explanations uses a causal language modeling objective and our method employs a classification-based setup, we align key hyperparameters, such as learning rate, LoRA \cite{hu2022lora} rank, and batch size, wherever applicable to enable a fair comparison. We set the parameter $\lambda$ as 1 while training the ICM. All the finetuning made in our experiments, both in ICM and SFT baselines were based on LoRA\cite{hu2022lora} finetuning. All the hyperparameters used for our training are detailed in Table \ref{tab:train-hparams}. For comparison of baseline without explanations we ensure that same hyperparameters were used as in the ICM setup since both use classification loss.

% All experiments were run on a single NVIDIA A100\,$(80\,\mathrm{GB})$ GPU. Mixed precision with \textbf{bfloat16} was enabled wherever supported (i.e., \texttt{bf16{=}True}, \texttt{fp16{=}False}). When the base model was loaded with 8-bit quantization, matrix multiplies inside bitsandbytes kernels computed in FP16 while LoRA/heads operated in bf16.


% We ensure that the distribution of positive and negative class labels in the train and test sets remain the same.
% % % ----------------------------------------------


% % For the experiments, we compare our method of using curiosity metric against the bseline method as explained in section \ref{sec:method}. We ensure that both the baseline and our method has access to the same inputs and outputs i.e, for each model we provide the input of plot, question and annotator for each story and dimension type as given in ttcw dataset\cite{chakrabarty2024artartificelargelanguage}. While training both supervised and our method sees the explanation and binary verdict label in order to ensure the model learns from it. But during the inference time they are not available. This way we ensure fairness in comparing our method to the baseline. 

% % In our experiments we only include the 5 dimensions as explained in section \ref{sec:creativity dimension} beacuse among the 14 dimesion, these 5 are the most subjective in nature and since our methodology is to improve the evluation of subjective criterias, we include those 5 dimensions.

% % Since the dataset size is extremely small i.e 48 stories annotated across the 5 dimensions and each story+dimension pair annotated by 3 annotator, making the total number of data as 720 samples. We use a train test split of 80:20, which make the number of test samples as 144, which is not statistically significant to come to any conclusion. So we train it across 5 folds and report the mean and standard deviation for each of the metrics described in section \ref{sec:eval} in table \ref{tab:experiments}.

% % For the hyperparameter selection, we ensure that in each of our run, the loss converges to ensure fair comparison. Since the baseline method uses a causal language modeling and our method uses classification, same configs could not be used.

% % The main changes in the hyperparameters between the curiosity and the baseline models are that the rank of the curiosity model is lower(16) compared to the baseline model(256), since the former has a classification setup for inference while the latter has causal language model setup for inference.


% \begin{table}[ht]
% \centering
% \caption{Experimental Results on Creativity Evaluation Tasks}
% \label{tab:experiments}
% \begin{tabular}{|l|l|c|c|c|c|c|c|c|}
% \hline
% \textbf{Model Name} & \textbf{Experiment Type} & \textbf{LoRA $\alpha$ / Rank} & \textbf{LR} & \textbf{Pearson} & \textbf{Spearman} & \textbf{F1} & \textbf{Precision} & \textbf{Recall} \\
% \hline
% Qwen2.5-0.5B & SFT & 16 / 8 & 5e-5 & 0.42 & 0.39 & 0.58 & 0.60 & 0.61 \\
% Qwen2.5-0.5B & ICM (RL) & 16 / 8 & 5e-5 & $0.2252 \pm 0.1227$ & $0.2102 \pm 0.1011 $ & $0.4848 \pm 0.0421$ & $0.3889 \pm 0.0358$ & $0.6454 \pm 0.0732$ \\
% \hline
% Qwen2.5-1.5B & SFT & 32 / 16 & 3e-5 & 0.53 & 0.51 & 0.67 & 0.69 & 0.68 \\
% Qwen2.5-1.5B & ICM (RL) & 32 / 16 & 3e-5 & $0.2682 \pm 0.3596$ & $0.2178 \pm 0.3683$ & $0.4713 \pm 0.1791$ & $0.3778 \pm 0.1452$ & $0.6288 \pm 0.2462$ \\
% \hline
% Qwen2.5-3B & SFT & 32 / 16 & 3e-5 & 0.61 & 0.59 & 0.74 & 0.76 & 0.75 \\
% Qwen2.5-3B & ICM (RL) & 32 / 16 & 3e-5 & 0.65 & 0.63 & 0.77 & 0.79 & 0.78 \\
% \hline
% Qwen2.5-7B & SFT & 64 / 32 & 2e-5 & 0.67 & 0.66 & 0.79 & 0.80 & 0.81 \\
% Qwen2.5-7B & ICM (RL) & 64 / 32 & 2e-5 & 0.72 & 0.70 & 0.83 & 0.85 & 0.84 \\
% \hline
% \end{tabular}
% \end{table}
% \definecolor{lightgray}{gray}{0.9}
% \begin{table}[ht]
% \centering
% \caption{Experimental Results on Creativity Evaluation Tasks}
% \label{tab:experiments}
% \hspace*{-2cm}
% \rowcolors{2}{lightgray}{white}
% \renewcommand{\arraystretch}{1.5}
% \resizebox{1.3\textwidth}{!}{%
% \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
% \hline

% \textbf{Model Name} & \textbf{Experiment Type} & \textbf{LoRA $\alpha$ / Rank} & \textbf{LR} & \textbf{Pearson} & \textbf{Spearman} & \textbf{F1} & \textbf{Precision} & \textbf{Recall} \\
% \hline
% Qwen2.5-0.5B & SFT & 256 / 256 & 5e-5 & 0.170238 & 0.170238 & 0.381961 & 0.451527 & 0.334390 \\

% \hdashline
% Qwen2.5-0.5B & ICM (RL) & 16 / 8 & 5e-5 & $\textbf{0.2252} $ & $\textbf{0.2102} $ & $\textbf{0.4848} $ & $\textbf{0.3889} $ & $\textbf{0.6454}$ \\
% \hline
% Qwen2.5-1.5B & SFT & 256/256 & 5e-5 & 0.169955 & 0.169955 & 0.401907 & \textbf{0.432148} & 0.383313 \\

% \hdashline
% Qwen2.5-1.5B & ICM (RL) & 32 / 16 & 3e-5 & $\textbf{0.2682} $ & $\textbf{0.2178}$ & $\textbf{0.4713} $ & $0.3778 $ & $\textbf{0.6288} $ \\
% \hline
% Qwen2.5-3B & SFT & 256 / 256 & 5e-5 & $0.1131 $ & $0.1131 $ & $0.3385 $ & $0.4013 $  & $0.2977 $  \\
% \hdashline
% Qwen2.5-3B & ICM (RL) & 32 / 16 & 1e-5 & \textbf{0.5152} & \textbf{0.4874} & \textbf{0.5981} & \textbf{0.4806} & \textbf{0.7943} \\
% \hline
% Qwen2.5-7B & SFT & 128 / 128 & 2e-4 & 0.160434 & 0.160434 & 0.370631 & 0.442986 & 0.323687 \\ 
% \hdashline
% Qwen2.5-7B & ICM (RL) & 64 / 32 & 2e-5 & \textbf{0.7213} & \textbf{0.7076} & \textbf{0.8392} & \textbf{0.8548} & \textbf{0.8437} \\
% \hline
% \end{tabular}
% }
% \end{table}

% Requires in preamble:
% \usepackage[table]{xcolor}
% \usepackage{arydshln}
% \definecolor{lightgray}{gray}{0.9}

% \begin{table}[ht]
% \centering
% \caption{Experimental Results on Creativity Evaluation Tasks}
% \label{tab:experiments}
% \small
% \setlength{\tabcolsep}{3.5pt}
% \renewcommand{\arraystretch}{1.2}
% \rowcolors{2}{lightgray}{white}
% \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
% \hline
% Model & Experiment & LoRA $\alpha$ / Rank  & Pearson & Spearman & F1 & Precision & Reccall \\
% \hline
% Qwen0.5B & SFT & 256/256 & 0.170 & 0.170 & 0.382 & 0.452 & 0.334 \\ \hdashline
%          & ICM & 16/8   & \textbf{0.524} & \textbf{0.484} & \textbf{0.616} & \textbf{0.494} & \textbf{0.818} \\
% \hline
% Qwen1.5B & SFT & 256/256  & 0.170 & 0.170 & 0.402 & \textbf{0.432} & 0.383 \\ \hdashline
%          & ICM & 32/16  & \textbf{0.268} & \textbf{0.218} & \textbf{0.471} & 0.378 & \textbf{0.629} \\
% \hline
% Qwen3B   & SFT & 256/256 & 0.113 & 0.113 & 0.339 & 0.401 & 0.298 \\ \hdashline
%          & ICM & 32/16   & \textbf{0.515} & \textbf{0.487} & \textbf{0.598} & \textbf{0.481} & \textbf{0.794} \\
% \hline
% Qwen7B   & SFT & 128/128 & 0.160 & 0.160 & 0.371 & 0.443 & 0.324 \\ \hdashline
%          & ICM & 64/32   & \textbf{0.721} & \textbf{0.708} & \textbf{0.839} & \textbf{0.855} & \textbf{0.844} \\
% \hline
% \end{tabular}
% \end{table}


\begin{table}[t]
\centering
\caption{ICM method results against the SFT baseline with explanations}
\label{tab:experiments}
\setlength{\tabcolsep}{2pt}
\renewcommand{\arraystretch}{1.1}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{c c c c c c c c}
\toprule
Model & Exp. & LoRA $\alpha$/R & Pearson & Cohen's $\kappa$ & F1 & Precision & Recall \\
\midrule
Qwen0.5B & SFT & 256/256 & 0.170 {\scriptsize$\pm$0.049} & 0.155 {\scriptsize$\pm$0.046} & 0.382 {\scriptsize$\pm$0.049} & 0.452 {\scriptsize$\pm$0.059} & 0.334 {\scriptsize$\pm$0.060} \\
         & ICM & 32/16    & \textbf{0.524 {\scriptsize$\pm$0.092}} & \textbf{0.383 {\scriptsize$\pm$0.076}} & \textbf{0.616 {\scriptsize$\pm$0.048}} & \textbf{0.494 {\scriptsize$\pm$0.046}} & \textbf{0.818 {\scriptsize$\pm$0.067}} \\
\midrule
Qwen1.5B & SFT & 256/256 & 0.170 {\scriptsize$\pm$0.048} & 0.155 {\scriptsize$\pm$0.048} & 0.402 {\scriptsize$\pm$0.049} & 0.432 {\scriptsize$\pm$0.020} & 0.383 {\scriptsize$\pm$0.083} \\
         & ICM & 32/16   & \textbf{0.587 {\scriptsize$\pm$0.061}} & \textbf{0.406 {\scriptsize$\pm$0.065}} & \textbf{0.629 {\scriptsize$\pm$0.045}} & \textbf{0.506 {\scriptsize$\pm$0.045}} & \textbf{0.836 {\scriptsize$\pm$0.056}} \\
\midrule
Qwen3B   & SFT & 256/256 & 0.113 {\scriptsize$\pm$0.083} & 0.110 {\scriptsize$\pm$0.081} & 0.339 {\scriptsize$\pm$0.051} & 0.401 {\scriptsize$\pm$0.067} & 0.298 {\scriptsize$\pm$0.060} \\
         & ICM & 32/16   & \textbf{0.540 {\scriptsize$\pm$0.057}} & \textbf{0.356 {\scriptsize$\pm$0.081}} & \textbf{0.598 {\scriptsize$\pm$0.054}} & \textbf{0.481 {\scriptsize$\pm$0.050}} & \textbf{0.794 {\scriptsize$\pm$0.070}} \\
\midrule
Qwen7B   & SFT & 128/128 & 0.160 {\scriptsize$\pm$0.050} & 0.168 {\scriptsize$\pm$0.085} & 0.371 {\scriptsize$\pm$0.021} & 0.443 {\scriptsize$\pm$0.050} & 0.324 {\scriptsize$\pm$0.038} \\
         & ICM & 32/16   & \textbf{0.605 {\scriptsize$\pm$0.083}} & \textbf{0.429 {\scriptsize$\pm$0.082}} & \textbf{0.643 {\scriptsize$\pm$0.053}} & \textbf{0.518 {\scriptsize$\pm$0.051}} & \textbf{0.850 {\scriptsize$\pm$0.072}} \\
\bottomrule
\end{tabular}%
}
\vspace{-3mm}
\end{table}






% \begin{table}[htbp]
% \centering
% \small
% \setlength{\tabcolsep}{6pt}
% \begin{tabular}{lrrrrrrrrrrrr}
% \hline
% Model & Experiment type & pearson & precision & recall & f1 \\
% \hline
% Qwen-0.5B(SFT-Classifcation) &ID & \textbf{0.586} & \textbf{0.769} & 0.461 & 0.551 \\
% Qwen-0.5B(ICM) &ID& 0.524 & 0.494 & \textbf{0.818} & \textbf{0.616}\\
% Qwen-0.5B(SFT-Classifcation) & OOD &0.433 & 0.000 & 0.000 & 0.000 \\
% Qwen-0.5B(ICM) & OOD &\textbf{0.563} & \textbf{0.625} & \textbf{0.790} & \textbf{0.698}\\
% Qwen-1.5B(SFT-Classifcation) &ID & \textbf{0.586} & \textbf{0.769} & 0.461 & 0.551 \\
% Qwen-1.5B(ICM) &ID& 0.524 & 0.494 & \textbf{0.818} & \textbf{0.616}\\
% Qwen-1.5B(SFT-Classifcation) & OOD &0.433 & 0.000 & 0.000 & 0.000 \\
% Qwen-1.5B(ICM) & OOD &\textbf{0.563} & \textbf{0.625} & \textbf{0.790} & \textbf{0.698}\\
% Qwen-3B(SFT-Classifcation) & ID &0.482 & \textbf{0.670} & 0.573 & 0.556 \\
% Qwen-3B(ICM) & ID &\textbf{0.540 } & 0.481 & \textbf{0.794 } & \textbf{0.598 } \\
% Qwen-3B(SFT-Classifcation) & OOD &0.546 & \textbf{0.933} & 0.246 & 0.389   \\
% Qwen-3B(ICM) & OOD & \textbf{0.582}  & 0.597 & \textbf{0.754} & \textbf{0.667} \\
% Qwen-7B(SFT-Classifcation) & ID &0.482 & \textbf{0.670} & 0.573 & 0.556 \\
% Qwen-7B(ICM) & ID &\textbf{0.540 } & 0.481 & \textbf{0.794 } & \textbf{0.598 } \\
% Qwen-7B(SFT-Classifcation) & OOD &0.546 & \textbf{0.933} & 0.246 & 0.389   \\
% Qwen-7B(ICM) & OOD & \textbf{0.582}  & 0.597 & \textbf{0.754} & \textbf{0.667} \\
% \hline
% \end{tabular}
% \caption{Baseline classification experiment}
% \label{tab:means_per_file}
% \end{table}

\begin{figure}[t]
  \centering
  % Row 1: ID
  \begin{subfigure}{0.49\linewidth}
    \includegraphics[width=\linewidth]{AuthorKit26/AnonymousSubmission/LaTeX/id_three_pearson_named_corrected (1).png}
    \subcaption{ID Pearson}\label{fig:id-pearson}
  \end{subfigure}\hfill
  \begin{subfigure}{0.49\linewidth}
    \includegraphics[width=\linewidth]{AuthorKit26/AnonymousSubmission/LaTeX/id_three_f1_named_1.png}
    \subcaption{ID F1}\label{fig:id-f1}
  \end{subfigure}

  % Row 2: OOD
  \vspace{0.6em}
  \begin{subfigure}{0.49\linewidth}
    \includegraphics[width=\linewidth]{AuthorKit26/AnonymousSubmission/LaTeX/ood_three_pearson_named (1).png}
    \subcaption{OOD Pearson}\label{fig:ood-pearson}
  \end{subfigure}\hfill
  \begin{subfigure}{0.49\linewidth}
    \includegraphics[width=\linewidth]{AuthorKit26/AnonymousSubmission/LaTeX/ood_three_f1_named (1).png}
    \subcaption{OOD F1}\label{fig:ood-f1}
  \end{subfigure}

  \caption{Three-way comparison across model sizes for \textbf{ICM (ours)}, 
  \textbf{SFT baseline (classification, no explanations)}, and 
  \textbf{SFT baseline (with explanations)}. Panels show Pearson correlation and F1 for the in-distribution (ID, top) and out-of-distribution (OOD, bottom) settings. For exact ID and OOD results of the \emph{baseline without explanations} (classification), see Table~\ref{tab:baseline_classification_expt_id} and Table~\ref{tab:baseline_classification_expt_ood}.}
  \label{fig:id-ood-pearson-f1}
\end{figure}




% \begin{table}[h]
% \centering
% \caption{Comparison of ICM method against GPT-5 one-shot }
% \label{tab:gpt5}
% \setlength{\tabcolsep}{2pt}
% \renewcommand{\arraystretch}{1.1}
% \begin{tabular}{c c c c c c}
% \toprule
% Model & Exp. & Pearson & F1 & Precision & Recall \\
% \midrule
% Qwen0.5B & ICM & 0.524 {\scriptsize$\pm$0.092} & 0.616 {\scriptsize$\pm$0.048} & 0.494 {\scriptsize$\pm$0.046} & 0.818 {\scriptsize$\pm$0.067} \\
% Qwen1.5B & ICM & 0.587 {\scriptsize$\pm$0.061} & 0.629 {\scriptsize$\pm$0.045} & 0.506 {\scriptsize$\pm$0.045} & 0.836 {\scriptsize$\pm$0.056} \\
% Qwen3B   & ICM & 0.540 {\scriptsize$\pm$0.057} & 0.598 {\scriptsize$\pm$0.054} & 0.481 {\scriptsize$\pm$0.050} & 0.794 {\scriptsize$\pm$0.070} \\
% Qwen7B   & ICM & 0.605 {\scriptsize$\pm$0.083} & 0.643 {\scriptsize$\pm$0.053} & 0.518 {\scriptsize$\pm$0.051} & 0.850 {\scriptsize$\pm$0.072} \\
% \midrule
% GPT-5    & ICM & 0.2409 {\scriptsize$\pm$0.1379} & 0.3467 {\scriptsize$\pm$0.1592} & 0.5698 {\scriptsize$\pm$0.2305} & 0.2608 {\scriptsize$\pm$0.1378} \\
% \bottomrule
% \end{tabular}
% \vspace{-3mm}
% \end{table}

\begin{table}[t]
\centering
\caption{Comparison of ICM method against GPT-5 one-shot}
\label{tab:gpt5}
\setlength{\tabcolsep}{2pt}
\renewcommand{\arraystretch}{1.05} % slightly tighter
\resizebox{\columnwidth}{!}{%
\begin{tabular}{c c c c c c}
\toprule
Model & Exp. & Pearson & F1 & Precision & Recall \\
\midrule
Qwen0.5B & ICM & 0.524 {\scriptsize$\pm$0.092} & 0.616 {\scriptsize$\pm$0.048} & 0.494 {\scriptsize$\pm$0.046} & 0.818 {\scriptsize$\pm$0.067} \\
Qwen1.5B & ICM & 0.587 {\scriptsize$\pm$0.061} & 0.629 {\scriptsize$\pm$0.045} & 0.506 {\scriptsize$\pm$0.045} & 0.836 {\scriptsize$\pm$0.056} \\
Qwen3B   & ICM & 0.540 {\scriptsize$\pm$0.057} & 0.598 {\scriptsize$\pm$0.054} & 0.481 {\scriptsize$\pm$0.050} & 0.794 {\scriptsize$\pm$0.070} \\
Qwen7B   & ICM & 0.605 {\scriptsize$\pm$0.083} & 0.643 {\scriptsize$\pm$0.053} & 0.518 {\scriptsize$\pm$0.051} & 0.850 {\scriptsize$\pm$0.072} \\
\midrule
GPT-5    & 1-shot & 0.2409 {\scriptsize$\pm$0.1379} & 0.3467 {\scriptsize$\pm$0.1592} & 0.5698 {\scriptsize$\pm$0.2305} & 0.2608 {\scriptsize$\pm$0.1378} \\
\bottomrule
\end{tabular}%
}
\end{table}


\section{Experiments}
\label{sec:experiments}

We evaluate our Intrinsic Curiosity Modeling (ICM) approach against supervised fine-tuning (SFT) baselines (see Section~\ref{sec:method}) across multiple model sizes. To compare models with identical inputs and outputs, we evaluate ICM against the SFT \emph{baseline with explanations}. To compare models trained with the same classification loss, we also evaluate ICM against the SFT \emph{baseline without explanations}.

\paragraph{Dataset.}
TTCW contains 48 stories annotated on 5 dimensions with three expert judgments per story--dimension pair, yielding 720 examples. We use 5-fold cross-validation with an 80/20 split, giving approximately 576 training and 144 test items per fold. Because individual folds are small, we report means across folds for all metrics (Table~\ref{tab:experiments}; see also Section~\ref{sec:eval}). Splits are stratified to preserve the positive/negative label ratio.
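
The counts above follow directly from the dataset dimensions and the split ratio:
\[
48 \times 5 \times 3 = 720, \qquad 0.8 \times 720 = 576, \qquad 0.2 \times 720 = 144.
\]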

\paragraph{Training setup.}
The \emph{baseline with explanations} uses a causal language modeling objective and our ICM model uses a classification objective. We align shared hyperparameters---learning rate, LoRA~\cite{hu2022lora} rank, and batch size---wherever applicable to ensure comparability. The ICM combined loss uses $\lambda=1$. All fine-tuning (ICM and SFT baselines) uses LoRA; full details are in Table~\ref{tab:train-hparams}. For the \emph{baseline without explanations}, which also uses a classification loss, we match all of the ICM hyperparameters.

\paragraph{Compute and precision.}
All runs use a single NVIDIA A100 (80\,GB) GPU. Mixed precision with \textbf{bfloat16} is enabled when supported. When base models are loaded with 8-bit quantization, matrix multiplies in bitsandbytes execute in FP16 while LoRA heads operate in bfloat16.

\paragraph{Convergence and reproducibility.}
We train to loss convergence in all runs and fix random seeds for data splits and initialization. Hyperparameters and implementation details appear in Table~\ref{tab:train-hparams}.


\section{Analysis}



%Models are evaluated using Pearson correlation and Cohen’s $\$\kappa$$ for alignment with human ratings, as well as F1, precision, and recall for binary verdict prediction.

\subsection{Effect of model scale}
As Fig.~\ref{fig:id-ood-pearson-f1} shows, the performance of our ICM method improves with model size, whereas the \emph{baseline without explanations} (classification) degrades as model size increases in both the ID and OOD settings. A likely reason is that larger models overfit more readily on this small dataset. The \emph{baseline with explanations} remains uniformly low compared to ICM at every model size.
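
To make the gap concrete, the relative Pearson gain of ICM over the \emph{baseline with explanations} can be computed from the fold means in Table~\ref{tab:experiments}:
\[
\frac{0.524-0.170}{0.170}\approx 2.08 \;\;(\text{Qwen-0.5B}),
\qquad
\frac{0.605-0.160}{0.160}\approx 2.78 \;\;(\text{Qwen-7B}),
\]
that is, gains of roughly $208\%$ and $278\%$, so the advantage does not shrink with scale. These figures are ratios of fold means and do not account for the fold-to-fold standard deviations.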
% Table \ref{tab:experiments} and Table \ref{tab:baseline_classification_expt} reports results for four Qwen model sizes trained using \emph{baseline SFT with explanation}, \emph{baseline SFT without explanations} and our proposed ICM approach. We see that our ICM setup still beats the baseline classification setting except for the Qwen-0.5B and Qwen-1.5B model on in-distribution task(for pearson correlation metric) because we had ensure that the distribution of positive and negative class labels in the train and test sets remain the same so the baseline models are overfitting on in-distribution data and fail to generalise to OOD data. This shows that our approach of modeling the user behavior and then using that signal to help the downstream model to predict the verdict generalizes better. All the details are provided in the table \ref{tab:means_per_file}, where ID stands for in-distribution experiments(results similar to table \ref{tab:experiments_model_size})  and OOD stands for out-of-distribution experiments, as explained in section \ref{sec:ood section}(and results similar to table \ref{tab:ood_model_size})
%\subsection{Effect of Model Scale}
% Performance generally improves with model size in both SFT and ICM settings. Under ICM, Pearson correlation rises from 0.524 for Qwen-0.5B to 0.605 for Qwen-7B, while F1 increases from 0.616 to 0.643. Cohen’s $\$\kappa$$ also shows consistent gains, suggesting that larger models capture subjective evaluation signals more reliably and produce more consistent verdicts. In contrast, SFT models remain clustered at low correlation values (≈0.11–0.17), indicating limited scalability in this setup.

% From table \ref{tab:experiments} we see that our proposed approach scales well with model size, with Pearson correlation value increasing by almost 15\%, F1 score by 4\% and Cohen's $\kappa$ by 12\%. We also see that the ICM approach improves the Pearson correlation over the baseline by 208\% in 0.5B to \textbf{278}\% in 7B, Cohen's $\kappa$ from 147\% in 0.5B to \textbf{155}\% in 7B and F1-score from 61\% in 0.5B and \textbf{73}\% in 7B. This shows that our proposed method does not vanish with scale and we could see even bigger gains with larger models. We also test our method against \textbf{GPT-5} and find that even Qwen-0.5B model in ICM setup is able to beat it, more details are provided in appendix \ref{sec:gpt5}.

% \subsection{Impact of Curiosity-Driven Learning}
%  For every model size, ICM substantially outperforms SFT across all metrics. The improvement is most striking in correlation measures: Qwen-3B’s Pearson jumps from 0.113 (SFT) to 0.540 (ICM), and Qwen-7B reaches the highest overall correlation (0.605). F1-scores also show large boosts—e.g., Qwen-0.5B improves from 0.382 (SFT) to 0.616 (ICM), driven primarily by sharp increases in recall (from 0.334 to 0.818). This indicates that curiosity-driven learning better captures nuanced human reasoning, enabling higher sensitivity to positive cases without sacrificing precision.

% \subsection{Interpretability}

% Adding a scalar curiosity signal makes the model’s reasoning legible: verdicts are decoupled from belief updates, so we can report how much the model learned from an explanation and how closely it followed the expert’s rationale. This structure also enables stronger personalization, as belief-update dynamics transfer better across annotators than end-to-end fine-tuning on small data. Overall, coupling belief-shift estimation with expert attribution improves performance and yields a more human-aligned, robust, and interpretable evaluator.
\subsection{Generalization}



To assess how well the baselines and the ICM models generalize, we keep the earlier setup but train each method on four dimensions (\textit{Originality in Form}, \textit{Originality in Theme and Content}, \textit{Structural Flexibility}, and \textit{Perspective and Voice Flexibility}) and test the trained models on the held-out dimension, \textit{Originality in Thought}. Because the test dimension is never seen during training, there is no leakage of that dimension into the training data. Figure~\ref{fig:id-ood-pearson-f1} shows that the gains of the ICM method over both SFT baselines are larger in the out-of-distribution (OOD) setting than in the in-distribution (ID) setting. This suggests that our method generalizes better: it lets the model first understand the annotator's behavior before predicting a verdict, which transfers across dimensions more readily than either SFT baseline.
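
The held-out split described above can be restated compactly. The symbols $\mathcal{D}$, $d^{\star}$, $\mathcal{S}_{\text{train}}$, and $\mathcal{S}_{\text{test}}$ are notation introduced here only for illustration. Let $\mathcal{D}$ be the set of five creativity dimensions, let $d^{\star} \in \mathcal{D}$ be the held-out dimension (\textit{Originality in Thought}), and write each instance as a tuple $(x, d, a, v)$ of story, dimension, annotator, and verdict. Then
\begin{gather*}
\mathcal{S}_{\text{train}} = \{(x, d, a, v) : d \in \mathcal{D} \setminus \{d^{\star}\}\}, \\
\mathcal{S}_{\text{test}} = \{(x, d, a, v) : d = d^{\star}\},
\end{gather*}
so no dimension appears in both sets. This sketch constrains only the dimension; it makes no assumption about how stories or annotators are distributed across the two sets.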


\subsection{Comparison with GPT-5}
\label{sec:gpt5}
Table~\ref{tab:gpt5} compares the ICM setup against GPT-5. Even the Qwen-0.5B model outperforms GPT-5 on every evaluation metric except precision. GPT-5 was prompted with the same story, question, and annotator index, along with a one-shot example by the same annotator (randomly picked from the training set). GPT-5 was biased towards the answer ``no'', and whenever it predicted ``yes'', it was almost always wrong. This further supports the effectiveness of our method.

%\subsection{Efficiency}




%From table \ref{tab:experiments} and table \ref{tab:ood}, we see that Pearson correlation and Cohen's $\kappa$ values decrease by 24.9\% and 31.9\% respectively for the SFT setup, averaged across model sizes. Whereas it increases by 7.4\% and 18.1\% respective in the ICM setup, demonstrating the strong generalization ability of our ICM method.






% In your preamble: \usepackage{booktabs}
% Requires: \usepackage{booktabs} (optional), \usepackage[table]{xcolor} for \rowcolors, and \usepackage{arydshln} for \hdashline
% \begin{table}[ht]
% \centering
% \caption{SFT vs. ICM results by model size on \textbf{Out-of-distribution} data}
% \label{tab:ood}
% \small
% \setlength{\tabcolsep}{3.5pt}
% \renewcommand{\arraystretch}{1.2}
% \rowcolors{2}{lightgray}{white}
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% Model & Experiment & LoRA $\alpha$ / Rank & Pearson & Cohen's $\kappa$ & F1 & Precision & Recall \\
% \hline

% Qwen0.5B & SFT & 256/256 & 0.188 & 0.147 & 0.316 & 0.632 & 0.211 \\
% \hdashline
%  & ICM & 16/32 & \textbf{0.563} & \textbf{0.458} & \textbf{0.698} & \textbf{0.625} & \textbf{0.790} \\
% \hline

% Qwen1.5B & SFT & 256/256 & 0.026 & 0.023 & 0.265 & 0.423 & 0.193 \\
% \hdashline
%  & ICM & 16/32 & \textbf{0.655} & \textbf{0.486} & \textbf{0.713} & \textbf{0.639} & \textbf{0.807} \\
% \hline

% Qwen3B & SFT & 256/256 & 0.024 & 0.024 & 0.369 & 0.413 & 0.333 \\
% \hdashline
%  & ICM & 16/32 & \textbf{0.582} & \textbf{0.403} & \textbf{0.667} & \textbf{0.597} & \textbf{0.754} \\
% \hline

% Qwen7B & SFT & 128/128 & 0.245 & 0.237 & 0.490 & 0.585 & 0.421 \\
% \hdashline
%  & ICM & 16/32 & \textbf{0.623} & \textbf{0.514} & \textbf{0.729} & \textbf{0.653} & \textbf{0.825} \\
% \hline
% \end{tabular}
% \end{table}

\begin{table}[t]
\centering
\caption{ICM results against the SFT baseline (with explanations) on out-of-distribution data.}
\label{tab:ood}
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.1}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l l c c c c c c}
\toprule
Model & Experiment & LoRA $\alpha$/Rank & Pearson & Cohen's $\kappa$ & F1 & Precision & Recall \\
\midrule
Qwen0.5B & SFT & 256/256 & 0.188 & 0.147 & 0.316 & 0.632 & 0.211 \\
         & ICM & 32/16   & \textbf{0.563} & \textbf{0.458} & \textbf{0.698} & \textbf{0.625} & \textbf{0.790} \\
\midrule
Qwen1.5B & SFT & 256/256 & 0.026 & 0.023 & 0.265 & 0.423 & 0.193 \\
         & ICM & 32/16   & \textbf{0.655} & \textbf{0.486} & \textbf{0.713} & \textbf{0.639} & \textbf{0.807} \\
\midrule
Qwen3B   & SFT & 256/256 & 0.024 & 0.024 & 0.369 & 0.413 & 0.333 \\
         & ICM & 32/16   & \textbf{0.582} & \textbf{0.403} & \textbf{0.667} & \textbf{0.597} & \textbf{0.754} \\
\midrule
Qwen7B   & SFT & 128/128 & 0.245 & 0.237 & 0.490 & 0.585 & 0.421 \\
         & ICM & 32/16   & \textbf{0.623} & \textbf{0.514} & \textbf{0.729} & \textbf{0.653} & \textbf{0.825} \\
\bottomrule
\end{tabular}%
}
\vspace{-2mm}
\end{table}


\section{Conclusion and Future Work}

We introduced a curiosity-driven LLM-as-a-judge for evaluating creativity in text generation, addressing the limitations of baseline SFT on inherently subjective tasks. Our approach uses a two-part curiosity signal: it captures belief shifts through the model's responses to expert explanations, and it incorporates expert attribution through a backward prediction task. Adding this signal to an SFT setup leads to stronger alignment with human judgments across multiple creativity dimensions in the TTCW dataset. Experiments show that curiosity-based modeling consistently improves performance across model scales, surpassing standard SFT baselines in both correlation with human ratings and classification performance. Beyond scaling with model size, the method also improves performance in out-of-distribution scenarios, where we train the models on four creativity dimensions and test them on the remaining held-out dimension. Future work includes extending the curiosity-driven LLM-as-a-judge to other domains, such as marketing and evaluating the novelty of scientific ideas. We also plan to use the curiosity signal as a reward in a reinforcement learning (RL) setup to further improve our current results.


\section{Literature Review}
% The evaluation of creativity in language models builds upon decades of work in creativity research, where the Torrance Tests of Creative Thinking (TTCT) assess fluency, flexibility, originality, and elaboration
% \cite{torrance1966ttct}, and the Consensual Assessment Technique (CAT) uses aggregated expert judgments, a reliable but labour‑intensive process \cite{patterson2024audra}. The autors of \cite{chakrabarty2024artartificelargelanguage} adapted TTCT into the Torrance Tests for Creative Writing (TTCW), designing fourteen binary tests and enlisting creative‑writing experts to evaluate 48 stories; their study showed that large language models pass these tests three to ten times less often than human writers \cite{chakrabarty2024artartificelargelanguage}, highlighting a sizable gap in creative competence. Alternative evaluation paradigms, such as the Leap‑of‑Thought (LoT) framework for humorous, associative reasoning, argue that step‑by‑step chain‑of‑thought prompting limits creativity and instead encourages models to make non‑sequential “leaps” \cite{Zhong_2024_CVPR}. Efforts to automate creativity scoring have used distributional semantics to approximate novelty, but these metrics often align weakly with expert judgments, reinforcing the need for human‑aligned signals. 

% Intrinsic-motivation signals from reinforcement learning offer a principled lens on novelty seeking. Information-gain and prediction-error formulations, e.g., VIME \citep{houthooft2017vimevariationalinformationmaximizing}, ICM \citep{pathak2017curiositydrivenexplorationselfsupervisedprediction}, and Random Network Distillation \citep{burda2018rnd}, have proven effective for exploration under sparse extrinsic reward. By analogy, curiosity-style signals can inform language evaluation by rewarding “useful novelty” (divergent yet coherent), complementing semantic-distance and rater-based methods. Our work draws inspiration from this intrinsic curiosity principle to model belief shifts when a language model incorporates expert explanations, combining it with expert attribution to offer a more interpretable and personalized measure of creativity.


The evaluation of creativity in language models builds upon decades of work in creativity research, where the Torrance Tests of Creative Thinking (TTCT) assess fluency, flexibility, originality, and elaboration \cite{torrance1966ttct}, and the Consensual Assessment Technique (CAT) uses aggregated expert judgments, a reliable but labour-intensive process \cite{patterson2024audra}. The authors of \cite{chakrabarty2024artartificelargelanguage} adapted TTCT into the Torrance Tests for Creative Writing (TTCW), designing fourteen binary tests and enlisting creative-writing experts to evaluate 48 stories; their study showed that large language models pass these tests three to ten times less often than human writers \cite{chakrabarty2024artartificelargelanguage}, highlighting a sizable gap in creative competence. Alternative evaluation paradigms, such as the Leap-of-Thought (LoT) framework for humorous, associative reasoning, argue that step-by-step chain-of-thought prompting can limit creativity and instead encourage non-sequential “leaps” \cite{Zhong_2024_CVPR}. Efforts to automate creativity scoring (e.g., distributional-semantics proxies for novelty) often align weakly with expert judgments, reinforcing the need for human-aligned signals.

Because creativity judgments are \emph{subjective}, collapsing rater perspectives via majority vote can erase systematic, meaningful disagreement. Following work on multi-annotator modeling, we treat annotators as distributions to be modeled rather than aggregated away \cite{davani-etal-2022-dealing}, in contrast to classical aggregation methods that infer a single latent “truth” \cite{NIPS2009_f899139d,hovy-etal-2013-learning}. In parallel, recent results caution against naïve \emph{LLM-as-a-judge} usage: evaluators can recognize and prefer their own generations, introducing self-preference bias \cite{panickssery2024llmevaluatorsrecognizefavor}. Calibrated autoraters offer a partial mitigation via broad multi-task training and bias auditing \cite{vu2024foundationalautoraterstaminglarge}. These findings motivate rater-aware or human-anchored evaluation signals for creativity.

Intrinsic-motivation signals from reinforcement learning offer a principled lens on novelty seeking. Information-gain and prediction-error formulations—VIME \cite{houthooft2017vimevariationalinformationmaximizing}, ICM \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}, and Random Network Distillation \cite{burda2018rnd}—are effective for exploration under sparse extrinsic reward. By analogy, curiosity-style signals can inform language evaluation by rewarding “useful novelty” (divergent yet coherent), complementing semantic-distance and rater-based methods. Our work instantiates this by modeling belief shifts when a language model incorporates expert explanations (a prediction-error–like signal) and combining it with expert attribution, yielding a more interpretable and \emph{personalized} measure of creativity.
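
For reference, the prediction-error form of curiosity in ICM \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction} can be sketched as follows. The notation ($\phi$, $f$, $\theta_F$, $\eta$) follows that reinforcement-learning formulation and is shown only for orientation; it is not the notation of our own signal. Given a state embedding $\phi(s_t)$, an action $a_t$, and a learned forward model $f$ that predicts the next embedding, the intrinsic reward is the scaled squared prediction error:
\begin{gather*}
\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big), \\
r^{i}_{t} = \frac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2 ,
\end{gather*}
where $\eta > 0$ is a scaling factor. The reward is large when the forward model is surprised by the observed transition, which is the sense in which a belief shift after reading an expert explanation is prediction-error--like.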

%In reinforcement learning, curiosity‑driven exploration provides an intrinsic reward based on the squared error of a forward prediction model, encouraging agents to explore states where their predictions are uncertain \cite{pathak2017curiositydrivenexplorationselfsupervisedprediction}

% \section{Literature Review}
% \paragraph{Small Reward Models}: Recent advances in compact reward modeling have demonstrated that high-quality alignment signals can be obtained without large-scale models. \citet{pan2025tinyrewardmodels} show that bidirectional MLM-based reward models with as few as 400M parameters, trained with partial layer freezing and DoRA, achieve competitive results on RewardBench. Earlier work such as \citet{jiang2023llmblenderensemblinglargelanguage} established the effectiveness of small (0.4B) DeBERTa-v3 models for pairwise ranking in best-of-$N$ generation, providing a strong lightweight baseline. Parameter-efficient training strategies, as explored by \citet{sidahmed2024parameterefficientreinforcementlearning}, further reduce computational and memory requirements by applying LoRA to reward model tuning. Beyond architecture and parameter savings, several studies focus on distilling reward functions from larger “oracle” models: \citet{fisch2025robustpreferenceoptimizationreward} improve robustness in preference optimization via reward model distillation, \citet{nath2025simultaneousrewarddistillationpreference} unify reward distillation with preference learning in a single training loop, and \citet{zhang2025distilldatarewardssmaller} jointly transfer responses and reward signals, enabling smaller models to match or surpass their larger counterparts.

% \noindent \textbf{Curiosity-driven learning.} We build on the Intrinsic Curiosity Module (ICM) of \citet{pathak2017curiositydrivenexplorationselfsupervisedprediction}, which operationalizes curiosity as forward-dynamics prediction error in a learned feature space and enables exploration even without extrinsic rewards, while noting its sensitivity to stochastic “noisy-TV” phenomena. Further inspiration is derived from the intrinsic-motivation theory of \citet{10.1109/TAMD.2010.2056368} that rewards learning progress, and sits alongside key algorithms such as density-model pseudo-count novelty by \citet{bellemare2016unifyingcountbasedexplorationintrinsic}, information gain about a Bayesian dynamics model in VIME by \citet{houthooft2017vimevariationalinformationmaximizing}, and prediction error against a fixed random target in RND by \citet{burda2018explorationrandomnetworkdistillation}. In NLP, intrinsic rewards have been used to mitigate sparse or delayed supervision in interactive language settings: \citet{madotto2020explorationbasedlanguagelearning} show exploration bonuses improve state discovery and downstream success in text-based games, while \citet{wan2025enhancingpersonalizedmultiturndialogue} demonstrate curiosity rewards can enhance personalization in multi-turn dialogue by encouraging information-seeking behaviors that better model user preferences.

% \paragraph{Creativity and evaluation}



\nocite{*}
\bibliographystyle{splncs04}
\bibliography{AuthorKit26/AnonymousSubmission/LaTeX/aaai2026}

\appendix
\section{Appendix}
\label{sec:appendix}
\subsection{Dimensions in dataset}
\label{subsec:dataset}
Table~\ref{tab:cre-dim} lists all dimensions of the TTCW dataset together with their facets.


% Preamble requirements:
% \usepackage{multirow}
% \usepackage[table]{xcolor} % for \rowcolors (optional)


\begin{table}[t]
\centering
\caption{Dimensions of TTCW dataset}
\label{tab:cre-dim}
\small
\setlength{\tabcolsep}{3.5pt}
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|p{2.8cm}|p{0.68\linewidth}|}
\hline
\textbf{Dimension} & \textbf{Facets} \\ \hline
\multirow{5}{*}{\textbf{Fluency}}
  & Understandability \& Coherence \\ \cline{2-2}
  & Narrative Pacing \\ \cline{2-2}
  & Scene vs Exposition \\ \cline{2-2}
  & Literary Devices \& Language Proficiency \\ \cline{2-2}
  & Narrative Ending \\ \hline
\multirow{3}{*}{\textbf{Flexibility}}
  & Emotional Flexibility \\ \cline{2-2}
  & Perspective \& Voice Flexibility \\ \cline{2-2}
  & Structural Flexibility \\ \hline
\multirow{3}{*}{\textbf{Originality}}
  & Originality in Form \\ \cline{2-2}
  & Originality in Thought \\ \cline{2-2}
  & Originality in Theme \& Content \\ \hline
\multirow{3}{*}{\textbf{Elaboration}}
  & World Building \& Setting \\ \cline{2-2}
  & Character Development \\ \cline{2-2}
  & Rhetorical Complexity \\ \hline
\end{tabular}
\end{table}


\subsection{More experiment and compute details}
\label{sec:more expt}

% \begin{table}[htbp]
% \centering
% \small
% \caption{Core hyperparameters used in all runs.}
% \label{tab:train-hparams}
% \begin{tabular}{@{}>{\ttfamily}l l@{}}
% \toprule
% max\_length & 4096 \\
% lora\_dropout & 0.1 \\
% target\_modules & \verb|["q_proj","k_proj","v_proj","o_proj",| \\
% &\verb|"gate_proj","up_proj","down_proj"]|\\
% lr\_scheduler & cosine (warmup\_ratio $=0.1$) \\
% per\_device\_train\_batch\_size & 4 \\
% gradient\_accumulation\_steps & 8 \\
% weight\_decay & 0.01 \\
% max\_grad\_norm & 0.5 \\
% num\_train\_epochs & 3 \\
% seed & 42 \\
% \bottomrule
% \end{tabular}
% \vspace{-2mm}
% \end{table}

% \begin{table}[t]
% \centering
% \footnotesize
% \caption{Core hyperparameters used in all runs.}
% \label{tab:train-hparams}
% \setlength{\tabcolsep}{4pt}
% \renewcommand{\arraystretch}{1.05}
% \begin{tabularx}{\columnwidth}{@{}>{\ttfamily}l >{\raggedright\arraybackslash}X@{}}
% \toprule
% max\_length & 4096 \\
% lora\_dropout & 0.1 \\
% target_modules & \seqsplit{\texttt{["q_proj","k_proj","v_proj",
% "o_proj","gate_proj","up_proj","down_proj"]}}  \\
% lr\_scheduler & cosine (warmup\_ratio =0.1) \\
% per\_device\_train\_batch\_size & 4 \\
% gradient\_accumulation\_steps & 8 \\
% weight\_decay & 0.01 \\
% max\_grad\_norm & 0.5 \\
% num\_train\_epochs & 3 \\
% seed & 42 \\
% \bottomrule
% \end{tabularx}
% \end{table}

% \begin{tabular}{@{}>{\ttfamily}l p{0.5\linewidth}@{}}
% \toprule
% \label{tab:train-hparams}
% max\_length & 4096 \\
% lora\_dropout & 0.1 \\
% target\_modules & \texttt{["q\_proj","k\_proj",\newline
% "v\_proj","o\_proj",\newline
% "gate\_proj",\newline 
% "up\_proj",\newline
% "down\_proj"]} \\
% lr\_scheduler & cosine (warmup\_ratio =0.1) \\
% per\_device\_train\_batch\_size & 4 \\
% gradient\_accumulation\_steps & 8 \\
% weight\_decay & 0.01 \\
% max\_grad\_norm & 0.5 \\
% num\_train\_epochs & 3 \\
% seed & 42 \\
% \bottomrule
% \end{tabular}
\begin{table}[t]
\centering
\small
\caption{Core hyperparameters used in all runs.}
\label{tab:train-hparams}
\setlength{\tabcolsep}{4pt}
\renewcommand{\arraystretch}{1.05}
\begin{tabularx}{\columnwidth}{@{}>{\ttfamily}l >{\raggedright\arraybackslash}X@{}}
\toprule
max\_length & 4096 \\
lora\_dropout & 0.1 \\
target\_modules &
\codewrap{["q\_proj","k\_proj","v\_proj","o\_proj","gate\_proj","up\_proj","down\_proj"]} \\
lr\_scheduler & cosine (warmup\_ratio $=0.1$) \\
per\_device\_train\_batch\_size & 4 \\
gradient\_accumulation\_steps & 8 \\
weight\_decay & 0.01 \\
max\_grad\_norm & 0.5 \\
num\_train\_epochs & 3 \\
seed & 42 \\
\bottomrule
\end{tabularx}
\end{table}




% \begin{table}[t]
% \centering
% \footnotesize
% \caption{Core hyperparameters used in all runs.}
% \label{tab:train-hparams}
% \setlength{\tabcolsep}{4pt}
% \renewcommand{\arraystretch}{1.05}
% \begin{tabularx}{\columnwidth}{@{}>{\ttfamily}l >{\raggedright\arraybackslash}X@{}}
% \toprule
% max\_length & 4096 \\
% lora\_dropout & 0.1 \\
% target\_modules &
% \codewrap{["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]} \\
% lr\_scheduler & cosine (warmup\_ratio $=0.1$) \\
% per\_device\_train\_batch\_size & 4 \\
% gradient\_accumulation\_steps & 8 \\
% weight\_decay & 0.01 \\
% max\_grad\_norm & 0.5 \\
% num\_train\_epochs & 3 \\
% seed & 42 \\
% \bottomrule
% \end{tabularx}
% \end{table}


\subsection{Limitations}
\label{sec:limitation}
Our study has several limitations that we hope to address in future work. First, the empirical scope is narrow: we evaluate only on the TTCW dataset, and our current method is text-only; extending it to richer modalities and to subjective tasks beyond TTCW remains future work. Second, the dataset is small (48 stories $\times$ 5 dimensions with three expert judgments per story--dimension pair, totaling 720 instances); we therefore rely on 5-fold cross-validation and report means and standard deviations across the five folds. Finally, model coverage is limited to one family (Qwen2.5, 0.5B--7B), so generalization across architectures remains untested; we aim to address this in future work.


\subsection{Questions for each dimension}

\begin{table}[H]
\centering
\caption{Creativity evaluation categories and questions}
\label{tab:creativity_eval}
% \begin{tabularx}{\textwidth}{@{}lX@{}}
\begin{tabular}{p{3cm}p{7cm}}

\toprule
\textbf{Category} & \textbf{Question} \\
\midrule
Originality in Thought & Is the story an original piece of \newline writing without any cliches? \\
Originality in Form and Structure & Does the story show originality in its form \newline and/or   structure? \\
Originality in Theme and Content & Will an average reader of this\newline  story obtain a unique and original\newline  idea from reading it? \\
Perspective and Voice Flexibility & Does the story provide diverse \newline perspectives, and if there are  \newline unlikeable characters, are their\newline  perspectives presented convincingly \newline and accurately? \\
Structural Flexibility & Does the story contain turns that\newline  are both surprising and appropriate? \\
\bottomrule
\end{tabular}
\end{table}



\subsection{Statistical significance testing}
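
The $p$-values in the tables of this subsection are from paired $t$-tests over the five cross-validation folds. As a brief reminder of the standard statistic (the symbols $d_k$, $\bar{d}$, and $s_d$ are notation introduced here for illustration), let $d_k = m_k^{\text{ICM}} - m_k^{\text{SFT}}$ be the difference in a metric $m$ on fold $k$, with $n = 5$ folds. Then
\begin{gather*}
\bar{d} = \frac{1}{n}\sum_{k=1}^{n} d_k, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\big(d_k - \bar{d}\big)^2}, \\
t = \frac{\bar{d}}{s_d/\sqrt{n}},
\end{gather*}
and $t$ is compared against a Student's $t$ distribution with $n - 1 = 4$ degrees of freedom. The $\Delta$ column corresponds to $\bar{d}$. Note that $s_d$ is the standard deviation of the per-fold differences, which differs from the per-method SDs shown in the tables, so the statistic cannot be recomputed from the reported means and SDs alone.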



% In preamble:
% \usepackage{booktabs}


\begin{table}[H]
\centering
\caption{Statistical significance test across 5 folds for the Qwen-0.5B model.}
\label{tab:icm_sft_0.5b_sig}
\small
\begin{tabular}{lrrrrc}
\toprule
Metric & SFT(with expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (paired $t$) & Statistically significant? \\
\midrule
Pearson   & 0.160 $\pm$ 0.055 & 0.524 $\pm$ 0.092 & 0.364 & 0.002 & Yes \\
Spearman  & 0.160 $\pm$ 0.055 & 0.484 $\pm$ 0.078 & 0.324 & $<\!0.001$ & Yes \\
F1        & 0.371 $\pm$ 0.054 & 0.616 $\pm$ 0.048 & 0.245 & $<\!0.001$ & Yes \\
\bottomrule
\end{tabular}

% \vspace{0.35em}
% \raggedright\footnotesize
% \textbf{Notes.} Stars: $^{*}p<0.05$, $^{**}p<0.01$, $^{***}p<0.001$. SD (not SE) is reported; if figures use error bars, they should match this definition. If desired, 95\% CIs for means can be provided as $m \pm t_{0.975,4}\cdot\mathrm{SE}$ with $\mathrm{SE}=\mathrm{SD}/\sqrt{5}$.
\end{table}



% \usepackage{booktabs}

\begin{table}[H]
\centering
\caption{Statistical significance test across 5 folds for the Qwen-1.5B model.}
\label{tab:icm_sft_1.5b_sig}
\small
\begin{tabular}{lrrrrc}
\toprule
Metric & SFT(with expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (paired $t$) & Statistically significant? \\
\midrule
Pearson   & 0.170 $\pm$ 0.058 & 0.586 $\pm$ 0.064 & 0.416 & $<\!0.001$ & Yes \\
Spearman  & 0.170 $\pm$ 0.058 & 0.522 $\pm$ 0.069 & 0.352 & $<\!0.001$ & Yes \\
F1        & 0.402 $\pm$ 0.050 & 0.629 $\pm$ 0.045 & 0.227 & $<\!0.001$ & Yes \\
\bottomrule
\end{tabular}

% \vspace{0.35em}
% \raggedright\footnotesize
% \textbf{Notes.} SD (not SE) is reported. Error bars reflect variability across folds. If normality of per-fold paired differences is doubtful, complement with a Wilcoxon signed-rank test in text.
\end{table}





% \vspace{0.35em}
% \raggedright\footnotesize
% \textbf{Notes.} Error bars capture variability across 5 cross-validation folds. Confidence intervals computed in closed form as $m \pm t_{0.975,4}\cdot\mathrm{SE}$; $\mathrm{SE}=\mathrm{SD}/\sqrt{5}$. Paired tests assume approximate normality of per-fold differences; if violated, report Wilcoxon signed-rank in the text. SD (not SE) is reported in parentheses. 
% \end{table}

% \usepackage{booktabs}


% \usepackage{booktabs}

\begin{table}[ht]
\centering
\caption{Statistical significance test across 5 folds for the Qwen-3B model.}
\label{tab:icm_sft_3b_sig}
\small
\begin{tabular}{lrrrrc}
\toprule
Metric & SFT(with expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (paired $t$) & Statistically significant? \\
\midrule
Pearson   & 0.113 $\pm$ 0.092 & 0.540 $\pm$ 0.074 & 0.427 & $<\!0.001$ & Yes \\
Spearman  & 0.113 $\pm$ 0.092 & 0.494 $\pm$ 0.091 & 0.381 & $<\!0.001$ & Yes \\
% Accuracy  & 0.650 $\pm$ 0.046 & 0.688 $\pm$ 0.040 & 0.038 & 0.070 & No \\
% Precision & 0.401 $\pm$ 0.074 & 0.481 $\pm$ 0.056 & 0.080 & 0.028 & Yes \\
% Recall    & 0.298 $\pm$ 0.061 & 0.795 $\pm$ 0.092 & 0.497 & $<\!0.001$ & Yes \\
F1        & 0.339 $\pm$ 0.053 & 0.618 $\pm$ 0.061 & 0.279 & $<\!0.001$ & Yes \\
\bottomrule
\end{tabular}

% \vspace{0.35em}
% \raggedright\footnotesize
% \textbf{Notes.} SD (not SE) is reported; error bars reflect variability across folds. If normality of per-fold paired differences is doubtful, complement with a Wilcoxon signed-rank test in text.
\end{table}


\begin{table}[ht]
\centering
\caption{Statistical significance test across 5 folds for the Qwen-7B model.}
\label{tab:icm_sft_7b_sig}
\small
\begin{tabular}{lrrrrc}
\toprule
Metric & SFT(with expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (paired $t$) & Statistically significant? \\
\midrule
Pearson   & 0.170 $\pm$ 0.058 & 0.606 $\pm$ 0.084 & 0.436 & $<\!0.001$ & Yes \\
Spearman  & 0.170 $\pm$ 0.058 & 0.542 $\pm$ 0.089 & 0.373 & $<\!0.001$ & Yes \\
F1        & 0.381 $\pm$ 0.029 & 0.663 $\pm$ 0.058 & 0.282 & $<\!0.001$ & Yes \\
\bottomrule
\end{tabular}

% \vspace{0.35em}
% \raggedright\footnotesize
% \textbf{Notes.} SD (not SE) is reported; error bars reflect variability across folds. If normality of per-fold paired differences is doubtful, complement with a Wilcoxon signed-rank test in text.
\end{table}

% \begin{table}[h]
% \centering
% \caption{Qwen-0.5B: SFT (no explanations) vs ICM across 5 folds. $p$ from Welch’s $t$ (approx.).}
% \label{tab:sig_noexp_0.5b}
% \small
% \begin{tabular}{lrrrrc}
% \toprule
% Metric & SFT(no expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (Welch $t$) & Significant? \\
% \midrule
% Pearson   & 0.586 $\pm$ 0.085 & 0.524 $\pm$ 0.092 & $-0.062$ & 0.300 & No \\
% Spearman  & 0.545 $\pm$ 0.065 & 0.484 $\pm$ 0.078 & $-0.061$ & 0.218 & No \\
% F1        & 0.551 $\pm$ 0.198 & 0.616 $\pm$ 0.048 & \phantom{$-$}0.065 & 0.508 & No \\
% \bottomrule
% \end{tabular}
% \end{table}
% \begin{table}[h]
% \centering
% \caption{Qwen-1.5B: SFT (no explanations) vs ICM across 5 folds. $p$ from Welch’s $t$ (approx.).}
% \label{tab:sig_noexp_1.5b}
% \small
% \begin{tabular}{lrrrrc}
% \toprule
% Metric & SFT(no expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (Welch $t$) & Significant? \\
% \midrule
% Pearson   & 0.602 $\pm$ 0.064 & 0.586 $\pm$ 0.064 & $-0.016$ & 0.709 & No \\
% Spearman  & 0.543 $\pm$ 0.070 & 0.522 $\pm$ 0.069 & $-0.021$ & 0.651 & No \\
% F1        & 0.663 $\pm$ 0.070 & 0.629 $\pm$ 0.045 & $-0.034$ & 0.395 & No \\
% \bottomrule
% \end{tabular}
% \end{table}
% \begin{table}[h]
% \centering
% \caption{Qwen-3B: SFT (no explanations) vs ICM across 5 folds. $p$ from Welch’s $t$ (approx.).}
% \label{tab:sig_noexp_3b}
% \small
% \begin{tabular}{lrrrrc}
% \toprule
% Metric & SFT(no expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (Welch $t$) & Significant? \\
% \midrule
% Pearson   & 0.482 $\pm$ 0.160 & 0.540 $\pm$ 0.074 & \phantom{$-$}0.058 & 0.492 & No \\
% Spearman  & 0.454 $\pm$ 0.139 & 0.494 $\pm$ 0.091 & \phantom{$-$}0.040 & 0.611 & No \\
% F1        & 0.556 $\pm$ 0.094 & 0.618 $\pm$ 0.061 & \phantom{$-$}0.062 & 0.253 & No \\
% \bottomrule
% \end{tabular}
% \end{table}
% \begin{table}[h]
% \centering
% \caption{Qwen-7B: SFT (no explanations) vs ICM across 5 folds. $p$ from Welch’s $t$ (approx.).}
% \label{tab:sig_noexp_7b}
% \small
% \begin{tabular}{lrrrrc}
% \toprule
% Metric & SFT(no expl) (mean$\pm$SD) & ICM (mean$\pm$SD) & $\Delta$ (ICM$-$SFT) & $p$ (Welch $t$) & Significant? \\
% \midrule
% Pearson   & 0.441 $\pm$ 0.130 & 0.606 $\pm$ 0.084 & \phantom{$-$}0.165 & 0.049 & Yes \\
% Spearman  & 0.400 $\pm$ 0.111 & 0.542 $\pm$ 0.089 & \phantom{$-$}0.142 & 0.057 & No \\
% F1        & 0.383 $\pm$ 0.251 & 0.663 $\pm$ 0.058 & \phantom{$-$}0.280 & 0.066 & No \\
% \bottomrule
% \end{tabular}
% \end{table}

% =======================
% Table 5 (recreated)
% =======================
\clearpage
\begin{table}[H]
\centering
\caption{Average passing rate (\%) on individual TTCW tests, based on annotations by 10 creative writing experts across 48 stories, recreated from \cite{chakrabarty2024artartificelargelanguage}; the last column reports Fleiss' $\kappa$ (expert agreement).}
\label{tab:ttcw-pass-rates}
\small
\begin{tabular}{llrrrrr}
\toprule
\textbf{Dimension} & \textbf{Test} & \textbf{GPT-3.5} & \textbf{GPT-4} & \textbf{Claude~v1.3} & \textbf{New Yorker} & \textbf{Expert $\kappa$} \\
\midrule
\multirow{5}{*}{Fluency}
 & Understandability \& Coherence & 22.2 & 33.3 & 55.6 & 91.7 & 0.27 \\
 & Narrative Pacing               & 8.3  & 52.8 & 61.1 & 94.4 & 0.39 \\
 & Scene vs Exposition            & 8.3  & 50.0 & 58.3 & 91.7 & 0.27 \\
 & Literary Devices \& Language   & 5.6  & 36.1 & 13.9 & 88.9 & 0.37 \\
 & Narrative Ending               & 8.3  & 19.4 & 33.3 & 91.7 & 0.48 \\
\midrule
\multirow{3}{*}{Flexibility}
 & Emotional Flexibility                 & 16.7 & 19.4 & 36.1 & 91.7 & 0.32 \\
 & Perspective \& Voice Flexibility      & 8.3  & 16.7 & 19.4 & 72.2 & 0.44 \\
 & Structural Flexibility                & 11.1 & 19.4 & 30.6 & 88.9 & 0.39 \\
\midrule
\multirow{3}{*}{Originality}
 & Originality in Form                   & 2.8  & 8.3  & 0.0  & 63.9 & 0.41 \\
 & Originality in Thought                & 2.8  & 44.4 & 19.4 & 91.7 & 0.40 \\
 & Originality in Theme \& Content       & 0.0  & 19.4 & 11.1 & 75.0 & 0.66 \\
\midrule
\multirow{3}{*}{Elaboration}
 & World Building \& Setting             & 16.7 & 41.7 & 58.3 & 94.4 & 0.33 \\
 & Character Development                 & 8.3  & 16.7 & 16.7 & 61.1 & 0.31 \\
 & Rhetorical Complexity                 & 2.8  & 11.1 & 5.6  & 88.9 & 0.66 \\
\midrule
\textbf{Average} &  & \textbf{8.7} & \textbf{27.9} & \textbf{30.0} & \textbf{84.7} & \textbf{0.41} \\
\bottomrule
\end{tabular}
\end{table}

% =======================
% Table 9 (recreated)
% =======================
\begin{table}[H]
\centering
\caption{Agreement (Cohen's $\kappa$) between LLM-administered TTCW tests and expert annotations on all 48 stories.}
\label{tab:llm-vs-expert-kappa}
\small
\begin{tabular}{llrrr}
\toprule
\textbf{Dimension} & \textbf{Test} & \textbf{GPT-3.5} & \textbf{GPT-4} & \textbf{Claude} \\
\midrule
\multirow{5}{*}{Fluency}
 & Understandability \& Coherence & -0.01 & -0.01 & -0.17 \\
 & Narrative Pacing               &  0.05 &  0.00 & -0.22 \\
 & Scene vs Exposition            & -0.03 & -0.08 & -0.23 \\
 & Literary Devices \& Language   &  0.04 & -0.09 & -0.11 \\
 & Narrative Ending               & -0.02 &  0.02 &  0.02 \\
\midrule
\multirow{3}{*}{Flexibility}
 & Emotional Flexibility          & -0.04 &  0.00 &  0.09 \\
 & Perspective \& Voice           &  0.00 &  0.26 &  0.14 \\
 & Structural Flexibility         & -0.04 &  0.00 & -0.07 \\
\midrule
\multirow{3}{*}{Originality}
 & Originality in Form            &  0.08 &  0.09 &  0.03 \\
 & Originality in Thought         &  0.19 &  0.31 &  0.15 \\
 & Originality in Theme \& Content&  0.06 & -0.01 &  0.18 \\
\midrule
\multirow{3}{*}{Elaboration}
 & World Building \& Setting      &  0.00 &  0.00 &  0.09 \\
 & Character Development          & -0.08 &  0.02 &  0.00 \\
 & Rhetorical Complexity          &  0.00 &  0.00 &  0.02 \\
\midrule
\textbf{Average} & & \textbf{0.016} & \textbf{0.035} & \textbf{-0.006} \\
\bottomrule
\end{tabular}
\end{table}


% \paragraph{Setup.}
% Let $\mathcal{D}_s,\mathcal{D}_t$ differ only in rater mix/style (distribution of $A$ and hence $E$). Assume a random–effects model
% $\mathrm{logit}\,P(V{=}1\mid X,A)=f(X)+b_A$ \citep{Dawid1979MaximumLE,agresti2013}.
% Let $c(X,A)=s_{\text{expl}}(X,A)-s_{\text{base}}(X)$.

% \textbf{Assumption (effect sufficiency).} There exists $\beta(X)$ with
% $\mathbb{E}[\,b_A \mid X,C\,]=\beta(X)C$ and $V \perp A \mid (X,C)$
% (i.e., $C$ is a balancing score for $A$) \citep{RosenbaumRubin1983}.

% \textbf{Lemma (conditional invariance).} Under the assumption,
% $P_s(V\mid X,C)=P_t(V\mid X,C)$ for all $(X,C)$.
% \emph{Sketch.} If $V \perp A \mid (X,C)$ then rater mix changes $P(A)$ (and $P(E)$) but not $P(V\mid X,C)$. \citep{Peters2016,arjovsky2019invariant}

% \textbf{Proposition (target-risk bound).}
% Let $R_d(h)=\mathbb{E}_{\mathcal{D}_d}[\ell(h(X,C),V)]$.
% By the lemma,
% \[
% R_t(h)-R_s(h)
% \;\le\;
% L\,\mathrm{IPM}_{\mathcal{F}}\!\big(P_s^{X,C},P_t^{X,C}\big)
% \;+\;\text{complexity}(h),
% \]
% i.e., only a covariate-shift term in $(X,C)$ remains \citep{BenDavid2010,Sugiyama2007}. In contrast, conditioning on $E$ introduces an additional labeling-function discrepancy term because $P(V\mid X,E)$ changes with the rater mix.

% \textbf{Variance remark.}
% Feeding $C$ acts as a control variate: for per-example loss $Z=\ell_i$ and control $C$, the optimally adjusted variance satisfies
% $\mathrm{Var}(Z^\star)=\mathrm{Var}(Z)\,(1-\rho^2)$ with $\rho=\mathrm{Corr}(Z,C)$ \citep[Ch.~8]{Owen2013}.
% Lower variance tightens generalization bounds and stabilizes SGD \citep{Bottou2018}.


% % In your preamble:
% % \usepackage{mdframed}   % for boxed environments

% \appendix
\clearpage
\subsection{ICM results against SFT baseline without explanations}
% % in document
% \begin{figure}[t]
%   \centering
%   % Row 1: ID
%   \begin{subfigure}{0.49\linewidth}
%     \includegraphics[width=\linewidth]{iclr2026/id_three_pearson_named_corrected.png}
%     \subcaption{ID Pearson}\label{fig:id-pearson}
%   \end{subfigure}\hfill
%   \begin{subfigure}{0.49\linewidth}
%     \includegraphics[width=\linewidth]{iclr2026/id_three_f1_named (1).png }
%     \subcaption{ID F1}\label{fig:id-f1}
%   \end{subfigure}

%   % Row 2: OOD
%   \vspace{0.6em}
%   \begin{subfigure}{0.49\linewidth}
%     \includegraphics[width=\linewidth]{iclr2026/ood_three_pearson_named.png}
%     \subcaption{OOD Pearson}\label{fig:ood-pearson}
%   \end{subfigure}\hfill
%   \begin{subfigure}{0.49\linewidth}
%     \includegraphics[width=\linewidth]{iclr2026/ood_three_f1_named.png}
%     \subcaption{OOD F1}\label{fig:ood-f1}
%   \end{subfigure}

%   \caption{Three-way comparison across model sizes for \textbf{ICM (ours)}, 
%   \textbf{SFT baseline (classification, no explanations)}, and 
%   \textbf{SFT baseline (with explanations)}. Panels show Pearson and F1 for in-distribution (top) and out-of-distribution (bottom).}
%   \label{fig:id-ood-pearson-f1}
% \end{figure}
\begin{table}[H]
\centering
\caption{ICM results against the SFT baseline without explanations (SFT-Cls, classification) on in-distribution (ID) data. Pearson and F1 are reported as mean $\pm$ SD over 5 folds; per-fold SDs were not available for precision and recall.}
\label{tab:baseline_classification_expt_id}
\small
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.1}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{llrrrr}
\toprule
Model & Exp.\ type & Pearson & Precision & Recall & F1 \\
\midrule
Qwen-0.5B (SFT-Cls) & ID & \textbf{0.586 {\scriptsize$\pm$0.085}} & \textbf{0.769} & 0.461 & 0.551 {\scriptsize$\pm$0.198} \\
Qwen-0.5B (ICM)     & ID & 0.524 {\scriptsize$\pm$0.092}          & 0.494          & \textbf{0.818} & \textbf{0.616 {\scriptsize$\pm$0.048}} \\
Qwen-1.5B (SFT-Cls) & ID & \textbf{0.602 {\scriptsize$\pm$0.064}} & \textbf{0.787} & 0.602          & \textbf{0.663 {\scriptsize$\pm$0.070}} \\
Qwen-1.5B (ICM)     & ID & 0.586 {\scriptsize$\pm$0.064}          & 0.481          & \textbf{0.794} & 0.629 {\scriptsize$\pm$0.045} \\
Qwen-3B (SFT-Cls)   & ID & 0.482 {\scriptsize$\pm$0.160}          & \textbf{0.670} & 0.573          & 0.556 {\scriptsize$\pm$0.094} \\
Qwen-3B (ICM)       & ID & \textbf{0.540 {\scriptsize$\pm$0.074}} & 0.481          & \textbf{0.794} & \textbf{0.618 {\scriptsize$\pm$0.061}} \\
Qwen-7B (SFT-Cls)   & ID & 0.441 {\scriptsize$\pm$0.130}          & \textbf{0.535} & 0.342          & 0.383 {\scriptsize$\pm$0.251} \\
Qwen-7B (ICM)       & ID & \textbf{0.606 {\scriptsize$\pm$0.084}} & 0.518          & \textbf{0.850} & \textbf{0.663 {\scriptsize$\pm$0.058}} \\
\bottomrule
\end{tabular}%
}
\end{table}

\begin{table}[H]
\centering
\caption{ICM results against the SFT baseline without explanations (classification) on out-of-distribution (OOD) data.}
\label{tab:baseline_classification_expt_ood}
\small
\setlength{\tabcolsep}{6pt}
\begin{tabular}{llrrrr}
\toprule
Model & Exp.\ type & Pearson & Precision & Recall & F1 \\
\midrule
Qwen-0.5B (SFT-Classification) & OOD & 0.433          & 0.000          & 0.000          & 0.000 \\
Qwen-0.5B (ICM)                & OOD & \textbf{0.563} & \textbf{0.625} & \textbf{0.790} & \textbf{0.698} \\
Qwen-1.5B (SFT-Classification) & OOD & 0.604          & \textbf{0.962} & 0.439          & 0.602 \\
Qwen-1.5B (ICM)                & OOD & \textbf{0.655} & 0.639          & \textbf{0.807} & \textbf{0.713} \\
Qwen-3B (SFT-Classification)   & OOD & 0.546          & \textbf{0.933} & 0.246          & 0.389 \\
Qwen-3B (ICM)                  & OOD & \textbf{0.582} & 0.597          & \textbf{0.754} & \textbf{0.667} \\
Qwen-7B (SFT-Classification)   & OOD & 0.435          & \textbf{0.800} & 0.211          & 0.333 \\
Qwen-7B (ICM)                  & OOD & \textbf{0.623} & 0.653          & \textbf{0.825} & \textbf{0.729} \\
\bottomrule
\end{tabular}

\end{table}
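
As a consistency check on Table~\ref{tab:baseline_classification_expt_ood}, and assuming F1 is the usual harmonic mean of precision $P$ and recall $R$, the Qwen-7B ICM row gives
\[
\mathrm{F1} = \frac{2PR}{P+R} = \frac{2 \times 0.653 \times 0.825}{0.653 + 0.825} \approx 0.729,
\]
which matches the reported value. The same relation explains the low F1 of the SFT classification baselines despite their high precision: for Qwen-3B, $P = 0.933$ and $R = 0.246$ yield $\mathrm{F1} \approx 0.389$, so the score is dominated by the low recall.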

\subsection{Curiosity scores when the non-finetuned Qwen-0.5B base model's predictions match or mismatch the ground truth}
\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth,keepaspectratio]{AuthorKit26/AnonymousSubmission/LaTeX/curiosity_score_against_base_model_pred.png}
    \caption{Curiosity scores grouped by whether predictions from the non-finetuned Qwen-0.5B base model match the ground truth.}
    \label{fig:curiosity_score_against_base_model_pred}
\end{figure}


\subsection{Why is the inverse model necessary?}
When we ablate the inverse model in our ICM setup on the expert-annotated data, we see no meaningful difference in results with or without it (F1 of 0.628 versus 0.627). The inverse model becomes necessary, however, when the annotator pool includes a non-expert annotator such as GPT-2, since it helps to separate such outliers: with GPT-2 annotations, F1 falls to 0.002 without the inverse model, compared with 0.185 with it, although the variance across runs is large. This suggests that the forward model of the ICM is sufficient to distinguish between multiple expert annotators, but that the inverse model is needed for outlier cases. Details are given in Table~\ref{tab:icm_inverse_gpt2_flag}; we used the Qwen-0.5B model for this experiment.

\clearpage
\begin{table}[H]
\centering
\caption{Inverse model ablation on Qwen-0.5B, with and without GPT-2 annotations.}
\label{tab:icm_inverse_gpt2_flag}
\small
\setlength{\tabcolsep}{2pt}
\renewcommand{\arraystretch}{1.1}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{llccccc}
\toprule
\textbf{Method} & \textbf{Annotations} & \textbf{Pearson} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} & \textbf{Cohen's $\kappa$} \\
\midrule
ICM with Inverse    & Without GPT-2 & 0.503 {\scriptsize$\pm$0.014} & 0.552 {\scriptsize$\pm$0.014} & 0.728 {\scriptsize$\pm$0.017} & 0.628 {\scriptsize$\pm$0.015} & 0.347 {\scriptsize$\pm$0.027} \\
ICM without Inverse & Without GPT-2 & 0.500 {\scriptsize$\pm$0.027} & 0.551 {\scriptsize$\pm$0.011} & 0.727 {\scriptsize$\pm$0.009} & 0.627 {\scriptsize$\pm$0.010} & 0.346 {\scriptsize$\pm$0.017} \\
\midrule
ICM with Inverse    & With GPT-2    & 0.151 {\scriptsize$\pm$0.300} & 0.153 {\scriptsize$\pm$0.265} & 0.233 {\scriptsize$\pm$0.403} & 0.185 {\scriptsize$\pm$0.320} & 0.093 {\scriptsize$\pm$0.166} \\
ICM without Inverse & With GPT-2    & 0.002 {\scriptsize$\pm$0.041} & 0.333 {\scriptsize$\pm$0.577} & 0.001 {\scriptsize$\pm$0.002} & 0.002 {\scriptsize$\pm$0.004} & 0.000 {\scriptsize$\pm$0.004} \\
\bottomrule
\end{tabular}%
}
\end{table}

\clearpage
\newpage



\end{document}
