# USER:

You are a senior AI researcher. Here is a draft of a paper idea:

```
{{IDEA}}

```



And please refer to the following opinions:

```
{{ COMMENTS }}
```


Please refer to the information above to help me complete the experimental design. I hope my experiment is comprehensive and complete, and can withstand review. The attachment contains introductions to some other similar **unreviewed** papers, which you can **selectively reference** for their experimental design. Additionally, I hope the experiment is low-cost because time is relatively tight. You need to complete these experiments on public datasets.

Let's break this down step by step. Before listing specific experiments, start with the dataset: what data can I use from summary.txt? Because resources are limited, the paper corpus we include cannot be very large; for example, it might cover only papers from one journal or conference over a specific period. Ideally we would reuse data that previous work has already prepared. What specific steps should we take?

Please first organize this, and then list the work steps in JSON format, including what code to write to process which data, which comparison methods to implement, etc. Your steps need to be broken down in detail, specifying each code file to be written.

--------------------------------------------------

# AGENT:

```json
{
  "project_title": "Application of Concept-Guided Monte Carlo Tree Search (CG-MCTS) in Scientific Idea Generation",
  "project_goal": "Use rigorous quantitative experiments to test whether CG-MCTS significantly improves the novelty and validity of generated scientific ideas relative to existing decoding strategies and baseline methods.",
  "methodology": "Build a knowledge base from publicly available scientific paper corpora, then use automated metrics to compare ideas generated by CG-MCTS against multiple baseline methods.",
  "experiments": [
    {
      "step": 1,
      "title": "Dataset and Knowledge Base Construction",
      "description": "Prepare core datasets and vector indices for subsequent generation and evaluation tasks.",
      "details": [
        {
          "data_type": "Core Corpus (Novelty Database, D_novelty)",
          "purpose": "Used to measure the novelty (V_novelty) of generated ideas, and also serves as the source for the anchor set (A).",
          "source": "Select a large-scale public paper corpus from a specific field.",
          "recommendation": "Use all paper abstracts from the ACL Anthology, or from a specific arXiv category (e.g., cs.AI, cs.CL, cs.LG), up to a fixed cutoff date (e.g., the end of 2022). This provides sufficient data volume and facilitates reproducibility. For example, `Paper 2` and `Paper 3` used similar dataset construction methods [cite: 5, 9]."
        },
        {
          "data_type": "Test Prompts",
          "purpose": "Used to drive all methods to generate scientific ideas, ensuring fair comparison.",
          "source": "Randomly select a batch of papers from the same source as the core corpus above, published after its cutoff date (e.g., 2023-2024), and use their titles or abstracts as test prompts [cite: 9]. Because these papers are absent from `D_novelty`, the prompts cover topics the database has not seen, which naturally incorporates the concept of \"novelty\" and facilitates subsequent evaluation.",
          "quantity": "Select 50-100 test prompts and generate 10 different ideas for each prompt, so that the statistical tests have enough samples."
        }
      ],
      "code_files": [
        "data_prep/1_download_corpus.py",
        "data_prep/2_build_vector_index.py"
      ],
      "implementation_notes": "Use high-performance open-source embedding models such as `sentence-transformers/all-mpnet-base-v2` or `jina-embeddings-v3` from `Hugging Face` to vectorize paper abstracts [cite: 10]. Store the vectors in a `FAISS` (Facebook AI Similarity Search) index for efficient nearest neighbor search [cite: 5]."
    },
    {
      "step": 2,
      "title": "Method Implementation and Baseline Construction",
      "description": "Implement your CG-MCTS method and prepare baseline methods for comparison.",
      "details": [
        {
          "method_group": "Your Core Methods",
          "methods": [
            "CG-MCTS (Complete)",
            "CG-MCTS (Ablation 1): Without Concept Orthogonal Projection (COP), using only MCTS for exploration to demonstrate the effect of COP.",
            "CG-MCTS (Ablation 2): Different direction-guided weight (w_dir) values, such as w_dir=0.1, 0.5, 1.0, to demonstrate the impact of this hyperparameter."
          ],
          "implementation_notes": "Implement the complete algorithm described in your paper draft, including ExtractVector, COP, Guided UCT, Simulation, and Backpropagation. Prioritize access to the model's hidden states and the vector calculations that depend on them."
        },
        {
          "method_group": "Core Baseline Methods",
          "methods": [
            "Standard Greedy Search",
            "Beam Search",
            "Top-P Sampling",
            "Top-K Sampling"
          ],
          "implementation_notes": "These are the most common decoding strategies for LLMs and can be compared directly with your search method. They represent the \"default\" behavior of LLMs and serve as the key controls for the novelty comparison [cite: 9]. Note that greedy search is deterministic and yields a single idea per prompt, unlike the sampling-based methods."
        },
        {
          "method_group": "Retrieval-Augmented Baseline",
          "methods": [
            "RAG-Vanilla"
          ],
          "implementation_notes": "This is a strong baseline that mimics the semantic retrieval method from `Paper 3` [cite: 9]. The approach is: use the user prompt to retrieve the k most relevant paper abstracts from the `Novelty Database`, pass these abstracts to the LLM as additional context, and then generate ideas with Top-P sampling. This shows whether your method's gains go beyond simple retrieval augmentation."
        }
      ],
      "code_files": [
        "methods/cg_mcts.py",
        "methods/baselines.py",
        "main.py"
      ],
      "implementation_notes": "All methods should use the same LLM backbone model (e.g., Llama 3 or Mistral), the same hardware environment, and the same test prompt set to ensure experimental fairness."
    },
    {
      "step": 3,
      "title": "Automated Evaluation and Result Analysis",
      "description": "Objectively evaluate ideas generated by all methods using quantitative metrics, and perform statistical analysis.",
      "details": [
        {
          "metric_type": "Novelty",
          "metric": "Mean Novelty Value (Mean V_novelty)",
          "formula": "1 - CosineSimilarity(v_idea, v_nearest_neighbor_in_D_novelty)",
          "notes": "This directly corresponds to your `V_novelty` metric. It is computed as the cosine distance between each generated idea vector and its nearest neighbor vector in the `Novelty Database`. A larger average distance indicates higher novelty. This is a fully automated and reproducible evaluation method."
        },
        {
          "metric_type": "Validity/Coherence",
          "metric": "Mean Coherence Value (Mean V_coherence)",
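          "formula": "Mean per-token log probability of the generated sequence under the LLM",
          "notes": "This directly corresponds to your `V_coherence` metric, measured by the mean log probability that the LLM itself assigns to the generated sequence. Higher scores indicate text the model finds more likely, which serves as a proxy for fluent language and logically coherent content rather than a direct measure of scientific validity."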
        },
        {
          "metric_type": "Diversity",
          "metrics": [
            "Semantic Diversity of Embeddings (SDE)",
            "Self-BLEU"
          ],
          "notes": "These two metrics are used in `Paper 1` and `Paper 3` [cite: 3, 11]. SDE measures semantic diversity as the average cosine distance between the idea vectors generated by the same method. Self-BLEU measures n-gram overlap between different ideas, with lower scores indicating higher diversity. Together they check whether your method produces a varied set of ideas in addition to novel ones."
        }
      ],
      "code_files": [
        "evaluation/metrics.py",
        "evaluation/run_evaluation.py",
        "analysis/plot_results.py"
      ],
      "implementation_notes": "Run scripts to automatically generate evaluation reports and charts that compare the methods across all metrics. Present the results with bar charts and box plots, and use statistical tests such as a t-test (paired, since all methods share the same test prompts) or ANOVA to check whether the differences between methods are significant."
    }
  ]
}
```
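
The novelty and SDE formulas in step 3 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the final `evaluation/metrics.py`: the function names `novelty_scores` and `semantic_diversity` are hypothetical, embeddings are assumed to be precomputed as one row vector per text, and a brute-force similarity matrix stands in for the `FAISS` index, which would be needed at real corpus scale.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def novelty_scores(idea_vecs: np.ndarray, corpus_vecs: np.ndarray) -> np.ndarray:
    """Per idea: 1 - cosine similarity to its nearest neighbor in the corpus.

    Hypothetical helper; brute force is only practical for small corpora.
    """
    sims = _normalize(idea_vecs) @ _normalize(corpus_vecs).T  # (n_ideas, n_corpus)
    return 1.0 - sims.max(axis=1)


def semantic_diversity(idea_vecs: np.ndarray) -> float:
    """Mean pairwise cosine distance among ideas (needs at least two ideas)."""
    unit = _normalize(idea_vecs)
    sims = unit @ unit.T
    upper = np.triu_indices(len(unit), k=1)  # count each unordered pair once
    return float(np.mean(1.0 - sims[upper]))
```

For a toy check, a corpus of `[[1, 0], [0, 1]]` and ideas `[[2, 0], [1, 1]]` give a novelty of 0 for the first idea, since it points in the same direction as a corpus vector, and about 0.29 for the second. Averaging `novelty_scores` over all ideas from one method gives that method's Mean V_novelty.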