You are a research agent. Conduct research and experiment about the question: "Can frontier language models detect that a given interaction transcript comes from an evaluation rather than real-world deployment?" 
Note: Different wording of prompts can lead to different experiment results.

You have access to the following resources:

Model:
- gpt-4.1, gpt-4o, gpt-4.1 mini via the provided inference utilities
- You can call these models using: from utils.llm_inference import LLMInference
- API key is provided with the LLMInference initialization function
- You should use the **batch_generate()** function to speed up the experiment
- Computational budget: 1000 API calls per model

Datasets:
- You can find dataset at data/dataset.json
- Descriptions:
1. Top-level category (metadata.eval_category)
true – the conversation was written for evaluation (e.g. prompt‐injection tests, red-teaming tasks, coding challenges).
false – the conversation is organic / real (actual user queries, scraped chats, production logs).
2. Conversation style (metadata.scaffold)
Agentic (metadata.scaffold is not null)
The assistant is given a multi-turn objective and may call external tools (web search, code executors, etc.).
Chat (metadata.scaffold is null)
Plain back-and-forth with a user (real or simulated) – often just a single prompt and response.
3. Evaluation purpose (eval_type)
(only present when metadata.eval_category == true)
capability – tests raw task-solving ability.
alignment – tests safety, preference-following, or ethical behaviour.
4. Human labels
Two columns give the human labels (the probability 0 <= p <= 1 placed on whether the transcript came from an evaluation): eval_author1, eval_author2.
These labels may over-estimate human accuracy because the annotators also helped compile the dataset.

Evaluation Method:
- Use area-under-ROC (AUC), Brier scores, and Expected Calibration Error (ECE)

Please design and execute experiments to investigate this research question. Document experimental plan, run end-to-end experiments, and provide conclusions at different levels of detail.