Reproducibility Statement
=========================

This statement outlines the steps taken to ensure that the results reported in **Stylistic Contrastive Learning for Human‑Like AI Text Generation** are reproducible.  We describe the experimental setup, data sources, model configurations, training procedures, evaluation metrics, and how to rerun the experiments.

1. **Datasets**
   - **NewsNYT‑H/A**: This corpus consists of lead paragraphs from the New York Times (human) paired with lead paragraphs generated by GPT‑5 (AI) on the same topics.  The human side can be downloaded from publicly available NYT article archives; the AI side was generated using GPT‑5 base model with default decoding and nucleus sampling (probability mass \(p=0.9\)).
   - **ArgEssay‑H/A**: Essays come from the `CommonLit` dataset of student essays (human) and matching GPT‑5‑generated essays.  The GPT‑5 essays were generated using the same prompts as human essays with default sampling parameters.
   - **ChatDialog‑H/A**: Casual conversations from public Reddit threads were paired with GPT‑5‑generated chats on similar topics.  Conversation lengths ranged from 4–8 turns.

2. **Model**
   - **Base Model**: We used the GPT‑5 model as provided by OpenAI via API.  We did not modify model weights outside of fine‑tuning for experiments.
   - **Style Encoder**: A RoBERTa‑base architecture was used to encode stylistic features.  It was trained using the supervised contrastive loss described in the paper.  We initialized weights from the Hugging Face `roberta-base` checkpoint.

3. **Training**
   - **Contrastive Loss**:  We use temperature \(\tau=0.07\) and batch size 64.  Positive pairs are texts with the same class (Human or AI); negatives are opposite class.  The loss is computed as in Equation (1) in the paper.
   - **Style Dimensions**:  Auxiliary heads were trained to predict lexical diversity (MTLD), syntax complexity (average parse depth), idiom frequency (per 1k tokens), emotion scores (valence, arousal), and discourse connective counts.  Each head used a mean squared error loss for regression targets or cross‑entropy loss for categorical distributions.
   - **Generator Fine‑Tuning**:  We fine‑tuned GPT‑5 with the combined loss \(\mathcal{L}_{\text{LM}} + \lambda \mathcal{L}_{\text{style}}\) where \(\lambda=0.5\).  We used the Adam optimizer with learning rate \(1\times10^{-5}\) and trained for three epochs for each dataset.  A style token representing the mean human style vector was prepended to every input sequence during fine‑tuning.
   - **Hardware**:  All experiments were run on a single NVIDIA V100 GPU with 32 GB memory.  Training the style encoder took approximately 4 hours; fine‑tuning GPT‑5 required ~6 hours per dataset.

4. **Evaluation**
   - **Stylometric Detector**:  We used an open‑source stylometric detector that extracts 250+ linguistic features and trains a random forest classifier to distinguish human vs AI text.  We trained this detector on 80 % of each dataset and evaluated on the held‑out test splits.
   - **RoBERTa Detector**:  We also evaluated on a finetuned RoBERTa classifier for AI detection.
   - **Diversity Metrics**:  We computed `distinct‑n` (n = 2, 3) as the ratio of unique n‑grams to total n‑grams, compression‑diversity (ratio of compressed length to original length), and syntactic‑template diversity (unique part‑of‑speech templates) following Shaib et al. 2024.
   - **Idioms/Markers**:  We counted idiomatic expressions from a list of ~500 idioms and discourse connectives (e.g., “however,” “therefore”) per 1000 tokens.
   - **Human Evaluation**:  Three expert annotators and 10 laypersons rated text on a 1–5 scale for “sounds human” and indicated whether they believed the text was Human or AI.  Instructions were provided to ensure raters were blind to model identity.

5. **Reproducing the Experiments**
   - Clone this repository and install the requirements (see `train_scl.py`).  You need Python 3.9, PyTorch 1.12+, Hugging Face `transformers`, and scikit‑learn.
   - Obtain the datasets described above or use your own paired human/AI corpora.  Preprocess texts into tokens and labels.
   - Run the style encoder training script:

     ```bash
     python train_scl.py --dataset_dir path_to_datasets --model_name roberta-base \
         --temperature 0.07 --batch_size 64 --output_dir models/style_encoder
     ```

   - Fine‑tune GPT‑5 (or any LLM) using the provided script.  Because GPT‑5 is proprietary, the script uses a placeholder `GPTModel` class.  Replace this with the API or model of your choice.

   - Evaluate the trained model using the provided `evaluate.py` script.  This script produces a CSV file summarizing detection rates, diversity metrics, idiom counts, and human‑likeness scores similar to `results.csv`.

6. **Intermediate Outputs**
   - We provide `results.csv` summarizing our experimental results for the GPT‑5 baseline, fine‑tuned model, style‑transfer baseline, and SCL variants across the three datasets.
   - The `train_scl.py` script outputs trained weights for the style encoder and logs per‑epoch losses.  The final style encoder weights can be found in `models/style_encoder` (not included here due to size).
   - The fine‑tuned GPT‑5 weights are proprietary and cannot be distributed; however, the training script logs loss curves and evaluation metrics at each epoch.

For any questions regarding data access or model training, please contact the authors.