{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4b560aba58dfb0d4",
   "metadata": {},
   "source": [
    "# Tutorial 1: Introduction to Automated Research Generation with CycleResearcher\n",
    "\n",
    "CycleResearcher is a framework for automating the research lifecycle, from literature review through paper generation to simulated peer review. This tutorial loads the model and generates papers guided by a reference list.\n",
    "\n",
    "## Key Objectives\n",
    "- Understand the CycleResearcher framework\n",
    "- Learn how to generate research papers across different domains\n",
    "- Explore iterative research generation techniques\n",
    "- Gain insights into AI-assisted scientific discovery\n",
    "\n",
    "## Theoretical Background\n",
    "\n",
    "CycleResearcher is built on several key principles:\n",
    "1. **Iterative Research Generation**: Using a novel SimPO (Simple Preference Optimization) framework\n",
    "2. **Comprehensive Literature Integration**: Leveraging extensive reference materials\n",
    "3. **Automated Peer Review Simulation**: Employing CycleReviewer for quality assessment\n",
    "\n",
    "## Research Lifecycle Simulation\n",
    "The framework mimics the traditional academic research process:\n",
    "- Literature Review\n",
    "- Hypothesis Formulation\n",
    "- Experimental Design\n",
    "- Manuscript Preparation\n",
    "- Peer Review Simulation\n",
    "- Iterative Refinement\n",
    "\n",
    "## Getting Started"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "887b935308fdecf3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 02-25 20:16:59 __init__.py:183] Automatically detected platform cuda.\n",
      "INFO 02-25 20:18:16 config.py:135] Replacing legacy 'type' key with 'rope_type'\n",
      "INFO 02-25 20:18:24 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.\n",
      "WARNING 02-25 20:18:24 arg_utils.py:1119] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.\n",
      "INFO 02-25 20:18:24 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.\n",
      "INFO 02-25 20:18:24 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/zhuminjun/llama70/CycleResearcher_14B', speculative_config=None, tokenizer='/zhuminjun/llama70/CycleResearcher_14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=60000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/zhuminjun/llama70/CycleResearcher_14B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={\"splitting_ops\":[],\"compile_sizes\":[],\"cudagraph_capture_sizes\":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],\"max_capture_size\":256}, use_cached_outputs=False, \n",
      "INFO 02-25 20:18:25 cuda.py:235] Using Flash Attention backend.\n",
      "INFO 02-25 20:18:26 model_runner.py:1111] Starting to load model /zhuminjun/llama70/CycleResearcher_14B...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bcd4a32fa95f46c1b707f4ae47f119c5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 02-25 20:18:57 model_runner.py:1116] Loading model weights took 27.4462 GB\n",
      "INFO 02-25 20:18:58 worker.py:266] Memory profiling takes 0.84 seconds\n",
      "INFO 02-25 20:18:58 worker.py:266] the current vLLM instance can use total_gpu_memory (79.14GiB) x gpu_memory_utilization (0.95) = 75.18GiB\n",
      "INFO 02-25 20:18:58 worker.py:266] model weights take 27.45GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.94GiB; the rest of the memory reserved for KV Cache is 46.71GiB.\n",
      "INFO 02-25 20:18:58 executor_base.py:108] # CUDA blocks: 15304, # CPU blocks: 1310\n",
      "INFO 02-25 20:18:58 executor_base.py:113] Maximum concurrency for 60000 tokens per request: 4.08x\n",
      "INFO 02-25 20:19:00 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:18<00:00,  1.88it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 02-25 20:19:19 model_runner.py:1563] Graph capturing finished in 19 secs, took 0.48 GiB\n",
      "INFO 02-25 20:19:19 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 22.06 seconds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "# Import necessary libraries\n",
    "from ai_researcher import CycleResearcher\n",
    "from ai_researcher.utils import print_paper_summary\n",
    "\n",
    "# Initialize CycleResearcher with default 12B model\n",
    "researcher = CycleResearcher(model_size=\"12B\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5191ba95ec493b0",
   "metadata": {},
   "source": [
    "## Advanced Generation with References\n",
    "\n",
    "Pass the contents of a BibTeX file as `references` to steer generation toward that literature. The `n` argument requests several candidate papers in one call (10 in the cell below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "dc9486112acb1242",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['',\n",
      " 'incollection{Bengio+chapter2007,\\n'\n",
      " 'author = {Bengio, Yoshua and LeCun, Yann},\\n'\n",
      " 'booktitle = {Large Scale Kernel Machines},\\n'\n",
      " 'publisher = {MIT Press},\\n'\n",
      " 'title = {Scaling Learning Algorithms Towards {AI}},\\n'\n",
      " 'year = {2007}\\n'\n",
      " '}\\n',\n",
      " 'article{Hinton06,\\n'\n",
      " 'author = {Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye},\\n'\n",
      " 'journal = {Neural Computation},\\n'\n",
      " 'pages = {1527--1554},\\n'\n",
      " 'title = {A Fast Learning Algorithm For Deep Belief Nets},\\n'\n",
      " 'volume = {18},\\n'\n",
      " 'year = {2006}\\n'\n",
      " '}\\n']\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 10/10 [10:59<00:00, 65.99s/it, est. speed input: 186.01 toks/s, output: 175.74 toks/s]\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "# Load the BibTeX reference list as raw text\n",
    "with open('cycleresearcher_references.bib', 'r', encoding='utf-8') as f:\n",
    "    references_content = f.read()\n",
    "\n",
    "# Preview the first entries; splitting on '@' leaves an empty leading string\n",
    "pprint(references_content.split('@')[:3])\n",
    "\n",
    "# Generate 10 candidate papers guided by the references\n",
    "referenced_paper = researcher.generate_paper(\n",
    "    topic=\"AI Researcher\",\n",
    "    references=references_content,\n",
    "    n=10,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c5cc5e09-f79d-4a97-b550-3a5c3e7289f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 Research Paper Summary:\n",
      "Title: Scientific Peer Reviewer Agents}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "Scientific peer review is fundamental to the scientific method: it aims to evaluate the quality of research, ensuring that published work is  original, important, and methodologically sound. Although large language models (LLMs) have excelled in understanding and generating scientific content, their application to simulating and optimizing the dynamics of scientific peer review remains under-explored. In this paper, we introduce the Scientific Peer Reviewer Agent (SPRA), an LLM-based system configured using a JSON plan to simulate peer reviews.\n",
      "We further develop the Human-LLM Reviewer Agent (HLL-RA) to iteratively optimize the SPRAs by selecting reviews that are then used to fine-tune the LLMs, better approximating human judgment of peer reviews. The impact of these contributions is validated through extensive experiments, including the creation of a new synthetic dataset (Syn-RS) of high-quality reviews rated by humans, and the achievement of state-of-the-art performance on the RewardBench benchmark. Our approach advances the capabilities of LLMs in simulating and enhancing the scientific discovery process through a novel application of agent technology. \n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review processes are often criticized for biases, errors, and the burden placed on researchers. While LLMs have shown promise in understanding and generating scientific content, their application in simulating and optimizing peer review dynamics remains under-explored. The research aims to leverage LLMs to simulate the iterative peer review process, optimize the selection of reviews, and fine-tune LLMs to better approximate human judgment. By doing so, the paper seeks to enhance the scientific discovery process, making it more fair, accurate, and efficient. The significance of this research lies in its potential to transform the peer review system, which is fundamental to the scientific method, and to advance the capabilities of LLMs in sophisticated reasoning and decision-making tasks.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces the Scientific Peer Reviewer Agent (SPRA), an LLM-based system designed to simulate and optimize the peer review process. SPRA is configured using a JSON plan to simulate peer reviews, and the Human-LLM Reviewer Agent (HLL-RA) is developed to iteratively optimize these reviews. The paper also proposes a method for fine-tuning LLMs to better align with human judgment of scientific papers, using synthetic data generated from the optimized reviews. The impact of these contributions is validated through extensive experiments, including the creation of a new synthetic dataset (Syn-RS) and the achievement of state-of-the-art performance on the RewardBench benchmark.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Multi-Agent Review: Simulating Human Reviewers for Scientific Peer Review with Large Language Models}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "The integration of artificial intelligence, particularly large language models (LLMs), into scientific research has become increasingly prominent. LLMs have been employed at various stages of research, from hypothesis generation to literature review and idea evaluation. One critical aspect of research that is pivotal for ensuring quality and facilitating improvement is peer review. This paper introduces a novel approach, Multi-Agent Review (MARP), which utilizes LLMs to simulate the traditional peer review process. MARP comprises multiple LLMs that independently review a paper, provide feedback to one another, and iteratively refine their reviews. This process diverges into multiple perspectives and then converges into a consensus conclusion. Additionally, we enhance the accuracy of the final review by employing an iterative reasoning preference optimization algorithm that fine-tunes a reward model based on the quality of the review. Our qualitative and quantitative experiments demonstrate the superiority of MARP over reviews generated by a single LLM, underscoring the potential of MARP to assist and improve the peer review process in the era of AI researchers.\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the integration of large language models (LLMs). Traditional peer review processes are often hindered by human biases, statistical errors, and the inherent difficulty in achieving consensus among reviewers. The integration of AI, particularly LLMs, into the peer review process has the potential to enhance objectivity and efficiency. However, existing approaches that directly use LLMs to generate reviews or conclusions lack a robust mechanism for aggregating these insights into a reliable and accurate assessment. The paper identifies the need for a more structured and iterative approach to simulate the human peer review process, refine reviews through feedback, and ultimately converge on a consensus conclusion. This leads to the development of a method that leverages LLMs to perform multi-agent review, incorporating iterative reasoning and preference optimization to enhance the accuracy and reliability of peer review in scientific research.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces MARP (Multi-Agent Review for Paper Review), a novel approach that utilizes LLMs to simulate the traditional peer review process. MARP consists of multiple LLM agents that independently review a paper, provide feedback to one another, and iteratively refine their reviews. The process diverges into multiple perspectives and then converges into a consensus conclusion. To further enhance the accuracy of the final review, MARP employs an iterative reasoning preference optimization (IRPO) algorithm, which fine-tunes a reward model based on the quality of the reviews. The method is validated through both qualitative and quantitative experiments, demonstrating its superiority over individual LLM reviews and its potential to assist in the peer review process.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Referees in AI for Scientific Peer Review}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "Large language models (LLMs) have revolutionized how we study, practice, and communicate about science. \n",
      "In this work, we explore the potential of LLS in automating and optimizing scientific peer review, the social process via which research quality is ensured. \n",
      "We propose \\textbf{RALS} (\\textbf{R}eferees in \\textbf{A}I for \\textbf{S}cientific Peer \\textbf{S}eer Review), which automates scientific peer review using LLM agents and iterative self-refinement. RALS's experimental results on 169 reviewed AI/ML papers suggest that RALS can generate useful peer reviews and has the potential to be used as a reward model for scientific research. Based on the empirical results, we further discuss the technical insights revealed by RALS and the broader impact of RALS on scientific quality control, discovery, and research paradigm shifts. We hope that RALS starts a crucial conversation about LLMs' ability to simulate and automate the scientific peer review process. \n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the potential of large language models (LLMs) to automate scientific peer review, a critical process in academic research. Traditional peer review is essential for ensuring the quality and integrity of published research, but it faces several challenges, including inefficiencies, biases, and the potential for conflicts of interest. The advent of LLMs, with their ability to process and analyze large volumes of text, offers a promising solution to these issues. However, existing studies on LLM-based peer review have been limited in scope, often focusing on narrow aspects such as extracting feedback from individual sections or providing simple critiques. These approaches fail to capture the full complexity and depth of the peer review process, which involves a nuanced understanding of the entire paper, a critical evaluation of its contributions, and the generation of constructive feedback. The paper aims to develop a more comprehensive and automated system for simulating and optimizing the scientific peer review process using LLMs, thereby enhancing the quality and efficiency of academic research.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces RALS (Referees in AI for Scientific Peer Review), an LLM-based system designed to automate and optimize the scientific peer review process. RALS consists of four stages: reading the paper, analyzing the paper, launching criticisms, and self-refining the review. The system uses iterative self-refinement, where the same LLM agent plays both the reviewer and the critic, providing and incorporating feedback to improve the review quality. RALS is evaluated on a dataset of 169 reviewed AI/ML papers, demonstrating its ability to generate high-quality reviews that are preferred over original human reviews by both domain experts and the general public. The paper also explores the potential of RALS for reward modeling and review-to-decision modeling, suggesting that LLMs can simulate and scale the scientific peer review process, leading to broader impacts on scientific discovery and research paradigm shifts.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Evaluating LLM-based AI Reviewer Agent for Scientific Peer Review}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "The emergence of AI reviewer and review generation systems offers the potential to transform the scientific peer review process, but also raises questions about acceptance criteria, evaluation metrics, and implementation mechanisms.\n",
      "This study introduces the AI Reviewer Agent (AIRA) which simulates human-AI collaboration in the review process. \n",
      "We further evaluate AIRA's performance and compare it with human reviewers across various academic communities. \n",
      "To facilitate the evaluation, we introduce a novel benchmark, AI Review Assessment (AIRA), which includes real-world reviews of 94 AI papers, provided by both human and AI for comparison. \n",
      "Our analysis demonstrates that AIRA's reviews are perceived to be on par with those of human reviewers in the field of AI, particularly when aligned with advanced models and collaborative prompting strategies. However, outside the AI domain, our findings indicate that human reviewers are still preferred by the academic community. \n",
      "We also contribute a comprehensive dataset of 640+ reviews generated by both human and AI for further analysis. \n",
      "Our work has broad implications for the role of AI in peer review and scientific progress. \n",
      "\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review systems face numerous issues, including insufficient feedback quality, reviewer fatigue, and potential biases. The integration of LLMs into the review process offers a promising solution by automating and enhancing various aspects of the review workflow. However, the effectiveness of LLMs in this domain has not been comprehensively evaluated. The paper aims to fill this gap by introducing the AI Reviewer Agent (AIRA), which simulates human-AI collaboration to generate detailed and helpful reviews. The research also seeks to understand the limitations of LLMs in different academic communities and the impact of model size and alignment techniques on review quality.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces AIRA, an LLM-based system designed to assist in the peer review process by generating detailed reviews. AIRA simulates human-AI collaboration, performing tasks such as initial reading, note-taking, feedback summarization, and review generation. The system is evaluated using a novel benchmark, AI Review Assessment (AIRA), which includes detailed reviews of 94 AI papers by both human and AI reviewers. The evaluation focuses on four dimensions: review quality, thoughtfulness, constructiveness, and overall effectiveness. The paper also explores the impact of different LLMs, alignment techniques, and prompt strategies on review quality, providing insights for future enhancements in AI-assisted peer review.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: AI-Powered Peer Review Can Help Scientific Progress}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "In this paper, we explore the potential of AI-powered language models in simulating and optimizing the scientific peer review process. As the research community increasingly relies on AI for various stages of the research cycle, including idea generation, experimentation, and analysis, there is a corresponding need for peer review systems to evolve and effectively assess AI-generated work. \n",
      "We introduce \\textit{AI reviewer} systems, designed to simulate human peer review dynamics using order-\\(n\\) reasoning and action execution based on provided review criteria. We further develop advanced versions, \\(\\textit{AI reviewer}^\\star\\) and \\textit{AI reviewer}^{++}, which utilize reward and feedback signals to optimize the review quality.\n",
      "Our evaluation shows that \\(\\textit{AI reviewer}^\\star\\) can generate detailed, consistent reviews with high potential for real-world application, outperforming existing LLM-based review systems.\n",
      "Interestingly, \\textit{AI reviewer}^{++} demonstrates an accelerated feedback loop between reviewers and authors, suggesting a shift towards a more collaborative and iterative review process. \n",
      "These findings contribute to the broader goal of creating a unified knowledge market where machines can assume traditional human roles, thereby enhancing the scientific research process.\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review processes are often criticized for inconsistencies, biases, and a lack of detailed feedback, which can hinder the progress of scientific research. While LLMs have shown promise in understanding and generating scientific content, their potential in simulating and optimizing peer review systems remains under-explored. The research aims to leverage LLMs to simulate human peer reviews and use these simulations to train more advanced review systems, ultimately leading to a self-optimizing review framework. This approach not only seeks to enhance the quality and consistency of reviews but also to accelerate the feedback loop between reviewers and authors, fostering a more collaborative and iterative review process. The significance of this research lies in its potential to address the limitations of human-only review systems and to contribute to the broader goal of creating a unified knowledge market where machines can assume traditional human roles, thereby enhancing the scientific research process.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces AI-powered peer review systems, specifically Reviewer (AIReviewer), Reviewer+ (AIReviewerStar), and Reviewer++ (AIReviewerPlus), to simulate and optimize the peer review process. AIReviewer generates reviews based on the state of the paper and predefined review criteria. AIReviewerStar extends this by incorporating reward signals based on the similarity of reviews to human expert reviews. AIReviewerPlus further enhances the system by using author feedback on reviews to optimize the review quality iteratively. The systems are evaluated on their ability to generate consistent and high-quality reviews, their generalization to unseen papers, and their potential for real-world application. The findings indicate that AIReviewerStar outperforms existing LLM-based review systems, and AIReviewerPlus shows promise in accelerating the feedback loop between reviewers and authors.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Optimizing AI Agents for Simulated Peer Review}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "Naturally, scientific progress relies heavily on peer review.\n",
      "Large language models (LLMs) have emerged as powerful tools for understanding, generating, and potentially enhancing scientific knowledge. Their potential to contribute to scientific progress extends to  the critical process of peer review, which ensures the quality and integrity of scientific research. However, the effectiveness of LLMs in simulating and optimizing  peer review dynamics remains under-explored.\n",
      "In this paper, we introduce the Simulated Peer Review (SimPEER) dataset, which consists of 2,000 simulated peer review interactions based on ML papers from NeurIPS 2021-2023 and ICLR 2024.  The simulations feature single-agent and multi-agent dynamics that closely resemble real-world reviews. We then automatically train an AI researcher agent in two stages. First, we apply supervised fine-tuning (SFT) on high-quality simulations from SimPEER. Then, we perform preference optimization (PO) on human preferences using state-of-the-art techniques.\n",
      "Our resulting agent, named ARA-PO, outperforms commercial LLMs such as GPT-4o and Qwen2.5 in generating high-quality reviews, as assessed by human annotators using a detailed scoring metric. Moreover,  ARA-PO demonstrates its utility in identifying accept/merge decisions for ML papers, achieving competitive performance with expert-based scores.\n",
      "Overall, our work advances the understanding and optimization of  peer review processes, ultimately contributing to the betterment of scientific research.\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review processes are often criticized for delays, inconsistent feedback, and biases, which can hinder the progress of scientific research. While LLMs have shown promise in understanding and generating scientific content, their potential in simulating and optimizing peer review dynamics remains under-explored. The research aims to leverage LLMs to create realistic and interactive peer review simulations, which can be used to train automated agents for reviewing scientific papers. By optimizing these agents using human preferences and state-of-the-art techniques, the study seeks to enhance the quality of reviews and reduce the workload on human reviewers. The significance of this research lies in its potential to transform the peer review process, making it more efficient and reliable, and ultimately accelerating scientific discovery.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces the Simulated Peer Review Dataset (SimPEER) and the Automated Reviewing Agent (ARA). SimPEER consists of 2,000 simulated peer review interactions based on ML papers from NeurIPS 2021-2023 and ICLR 2024, featuring both single-agent and multi-agent dynamics. ARA, instantiated from a Mistral-7B base model, is trained in two stages: supervised fine-tuning (SFT) on high-quality simulations and preference optimization (PO) on human preferences. The resulting ARA-PO model outperforms commercial LLMs such as GPT-4o and Qwen2.5 in generating high-quality reviews, as assessed by human annotators using a detailed scoring metric. Additionally, ARA-PO demonstrates its utility in identifying accept/merge decisions for ML papers, achieving competitive performance with expert-based scores.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Multi-Paper Benchmarking Review: Enhancing Scientific Peer Review with Multi-Agent Competition}\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "\n",
      "We introduce Multi-Paper Benchmarking Review (MPBR) as a novel path for creating scientific peer review with language models. While LLMs have begun to support scientific review, they have not yet fully captured the competitive nature of real-world review, where reviewers not only evaluate each paper in detail but also benchmark each submission against other related work. To this end, we formalize the multi-paper benchmarking review task, where each review is conditioned on multiple benchmarking papers. Compared to previous settings without benchmarking papers, our findings demonstrate that multi-paper benchmarking incentivizes agents to provide more comprehensive and detailed reviews. Additionally, we propose a collective review setting to further aggregate the generated scores from multiple reviews, updating the language model's capabilities and evolving the agent's performance. We construct an MPBR benchmark dataset using real-world reviews from top AI conferences, and evidence from both the benchmark dataset and a real-world study suggest that MPBR yields higher-quality reviews than traditional peer review. Moreover, reviews from MPBR are more likely to identify and explain flaws in research methodology, which can help improve the integrity and reproducibility of scientific knowledge. We also discover that benchmarking reviews exhibit greater originality and significance than the original submissions, suggesting that benchmarking reviews have the potential to facilitate broader knowledge creation. We establish an agent system to facilitate our MPBR method, which uses low-cost training and inference strategies to reduce expenses. Additionally, our results suggest that our system is self-sustaining, requiring minimal human intervention.\n",
      "\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review processes are often hindered by human limitations such as limited time, cognitive biases, and a lack of deep expertise in multiple fields. These issues can lead to suboptimal reviews that fail to thoroughly evaluate the originality, significance, and methodological rigor of scientific research. The emergence of LLMs, with their ability to process and generate human-like text, offers a promising solution to these problems. However, existing LLM-based review systems have not fully captured the competitive and cooperative dynamics inherent in real-world peer review, where reviewers not only evaluate individual papers but also benchmark them against others. This paper aims to enhance LLM-based review systems by formalizing a new task called Multi-Paper Benchmarking Review (MPBR), which involves reviewing a target paper in the context of multiple benchmarking papers. The goal is to create a more comprehensive and competitive review process that can better identify the strengths and weaknesses of scientific research, ultimately leading to higher-quality reviews and more reproducible and accessible scientific knowledge.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces Multi-Paper Benchmarking Review (MPBR), a novel task that enhances LLM-based scientific peer review by incorporating multiple benchmarking papers. MPBR formalizes two settings: individual review and collective review. In the individual review setting, each review is conditioned on multiple benchmarking papers, encouraging a competitive dynamic where the target paper is evaluated against relevant studies. In the collective review setting, the scores and preferences generated from reviewing a batch of papers are aggregated to update the LLM's capabilities. The paper constructs a benchmark dataset using real-world reviews from top AI conferences and demonstrates that the MPBR system, particularly when using the collective review setting with heterogeneous papers, significantly improves the originality and significance scores of target papers. The system is designed to be self-sustaining, requiring minimal human intervention, and shows potential for broader applications in scientific research and knowledge creation.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Scientific Review Protocol as Multi-agent Competition Game\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "\n",
      "Large language models (LLMs) have recently been proposed to accelerate research in various domains of science. One way is to assist peer review to improve its quality and efficiency. Prior work primarily focused on emulating human feedback rather than harnessing LLMs to augment the scientific method. Additionally, they often relied on a single model to simulate the review process, overlooking the inherent discrepancies between single-agent and multi-agent review settings.  This paper introduces \\method, an innovative framework that models the review process as a competition game among multiple LLM agents, incorporating author feedback to optimize review quality. We also connect \\method\\ with the scientific method, and demonstrate how \\method\\ can help AI systems make scientific discoveries.  We empirically evaluate \\method\\ on the ICLR 2024 submissions and show that \\method\\ outperforms GPT-4o and the chain-of-thought (CoT) method. The reviews generated by \\method\\ are rated higher in quality and are more preferred by authors. Moreover, \\method\\ more accurately reflects the outcomes of reviews in the multi-agent setting than the single-agent one, as the average scores across all reviews exhibit a stronger correlation with the decision taken by the Program Chairs. \\method's superiority in predicting actual review outcomes demonstrates its potential to optimize the peer review process in scientific research. \n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review systems face numerous issues, including bias, lack of clarity, and insufficient detail, which can hinder the progress of scientific research. While LLMs have shown promise in various domains, their application to peer review and scientific methodology is still in its early stages. Existing approaches to LLM-based peer review often rely on a single model to simulate human feedback, which oversimplifies the review process and does not capture the diversity of opinions typically seen in multi-reviewer settings. Additionally, these methods focus on emulating human feedback rather than leveraging LLMs to augment the scientific method. To overcome these limitations, the paper introduces a novel framework that simulates the scientific method using multi-agent competition and iteration, aiming to optimize the quality of feedback and more accurately reflect the peer review process.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper presents \\method\\, an innovative framework for enhancing scientific peer review using LLMs. \\method\\ models the review process as a multi-agent competition game, where LLMs act as agents that iteratively generate and critique reviews, incorporating author feedback to refine their outputs. The framework leverages a reward modeling mechanism to optimize the quality of feedback, ensuring that the reviews improve over time. \\method\\ is designed to simulate the scientific method, emphasizing refutation and iterative improvement, and it outperforms existing models in both review quality and alignment with human reviews.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: AI Researchers: AI-Powered Agents for Scientific Peer Review and Idea Generation\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "Large language models (LLMs) are demonstrating an emerging capability: they can now assist researchers in generating ideas, conducting experiments, and writing papers. In this paper, we study how LLM-powered agents can automate scientific peer review and idea generation. We introduce AIReviewer, an AI agent that conducts multi-round scientific peer reviews by directly interacting with authors to provide in-depth and refined feedback on scientific papers. We also introduce AIResearch, a setting where two AI agents collaboratively generate scientific ideas. These ideas are then refined through author choice and iteration, and the best concepts are ranked using our AI4Science Reward Model, trained on preference data from AIReview. A benchmark of reviews from AIReview is also proposed. Our results show that AIReview can produce reviews that are comparable in quality to those of intermediate reviewers. Furthermore, AIResearch demonstrates the capability to efficiently generate novel and high-quality scientific ideas, highlighting the potential of AI agents as research collaborators.\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review processes are often criticized for biases, errors, and the burden placed on researchers. While LLMs have shown promise in various natural language processing tasks, their application in scientific peer review and idea generation is still in its early stages. The paper aims to enhance the capabilities of LLMs in these areas by introducing AI-powered agents that can perform multi-round reviews and engage in collaborative idea generation. The significance of this research lies in its potential to accelerate scientific research and reduce the costs associated with traditional peer review processes. The paper highlights the need for more sophisticated and nuanced feedback from AI agents, which can be achieved through multi-round interactions. Additionally, it emphasizes the importance of collaboration and debate in scientific idea generation, leading to the development of a debate-style idea generation setting.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces AIReview, an AI-powered agent for scientific peer review, and AIResearch, a system for collaborative scientific idea generation. AIReview conducts multi-round reviews by interacting with authors to provide in-depth and refined feedback on scientific papers. AIResearch involves two AI agents engaging in debate-style conversations to generate and evaluate scientific ideas, which are then refined through author choice and iteration. The paper also presents the AI4Science Reward Model, trained on preference data from AIReview, to evaluate the quality of scientific ideas. The study demonstrates that AIReview can produce reviews comparable to intermediate reviewers, and AIResearch can generate novel and high-quality scientific ideas, highlighting the potential of AI agents as research collaborators.\n",
      "\n",
      "...\n",
      "📄 Research Paper Summary:\n",
      "Title: Enhancing the Quality of LLM-Based Scientific Peer Review Through Interactive Learning\n",
      "\n",
      "\n",
      "\n",
      "📝 Abstract:\n",
      "\n",
      "Large language models (LLMs) have exhibited remarkable capabilities across a wide range of tasks, leading to discussions on the role of AI researchers in the future of scientific research. However, AI systems currently face challenges in engaging in scientific discourse, particularly in peer review processes which are crucial for ensuring the quality and integrity of scientific literature. This paper introduces a novel framework, \\textbf{P}eer \\textbf{R}eview \\textbf{a}s \\textbf{I}terative \\textbf{R}efinement (PAIR), which employs a two-player zero-sum game to simulate scientific peer reviews. In PAIR, a critic LLM agent collaborates with a reviewer agent, providing constructive feedback that enhances the quality of reviews and guides authors in improving their work. We conduct extensive experiments with various LLMs and develop a specialized model fine-tuned on high-quality reviews from Computer Physics Communications, which are transformed into comparison pairs to optimize for preferences. Our results demonstrate that PAIR can generate feedback as useful as that of human reviewers, marking the first instance of LLM-based review systems achieving such a level of performance. Remarkably, using PAIR, GPT-4L can provide an average usefulness score of 4.84, an improvement of 0.52 points over its previous score of 4.32.\n",
      "\n",
      "\n",
      "🔍 Key Sections:\n",
      "\n",
      "Motivation:\n",
      "\n",
      "\n",
      "The paper addresses the challenge of improving the quality and efficiency of scientific peer review through the use of large language models (LLMs). Traditional peer review systems face numerous issues, including bias, inadequate feedback, and low agreement rates between reviewers and editors. The integration of LLMs into the review process has the potential to enhance feedback quality and reduce the burden on human reviewers. However, current LLM-based review systems often provide generic and superficial feedback, lacking the depth and constructiveness of human reviews. Additionally, these systems do not effectively differentiate between positive and negative feedback, which is crucial for guiding authors in improving their work. The paper aims to develop a more sophisticated and interactive LLM-based review system that can generate feedback as useful as that of human reviewers, thereby addressing the limitations of both traditional and existing LLM-assisted review systems.\n",
      "\n",
      "...\n",
      "\n",
      "Idea:\n",
      "\n",
      "\n",
      "The paper introduces PAIR (Peer Review as Iterative Refinement), a novel framework for simulating scientific peer review using LLMs. PAIR employs a two-player zero-sum game where a critic LLM agent provides constructive feedback to a reviewer agent, which iteratively refines the review. The critic agent is trained to discern and critique the quality of reviews, while the reviewer agent incorporates the critic's feedback to improve its reviews. The system is fine-tuned using a dataset of high-quality reviews from Computer Physics Communications, which are transformed into comparison pairs to optimize for preferences. The goal is to enhance the quality of LLM-generated reviews to a level comparable to human reviews, and the paper demonstrates the effectiveness of PAIR through extensive experiments with state-of-the-art and open-source LLMs.\n",
      "\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "# Print a summary of each referenced paper\n",
    "for paper in referenced_paper:\n",
    "    print_paper_summary(paper)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f90eb224-7c02-493f-b418-b834e6704abf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\\title{Enhancing the Quality of LLM-Based Scientific Peer Review Through Interactive Learning}\n",
      "\n",
      "\\begin{abstract}\n",
      "Large language models (LLMs) have exhibited remarkable capabilities across a wide range of tasks, leading to discussions on the role of AI researchers in the future of scientific research. However, AI systems currently face challenges in engaging in scientific discourse, particularly in peer review processes which are crucial for ensuring the quality and integrity of scientific literature. This paper introduces a novel framework, \\textbf{P}eer \\textbf{R}eview \\textbf{a}s \\textbf{I}terative \\textbf{R}efinement (PAIR), which employs a two-player zero-sum game to simulate scientific peer reviews. In PAIR, a critic LLM agent collaborates with a reviewer agent, providing constructive feedback that enhances the quality of reviews and guides authors in improving their work. We conduct extensive experiments with various LLMs and develop a specialized model fine-tuned on high-quality reviews from Computer Physics Communications, which are transformed into comparison pairs to optimize for preferences. Our results demonstrate that PAIR can generate feedback as useful as that of human reviewers, marking the first instance of LLM-based review systems achieving such a level of performance. Remarkably, using PAIR, GPT-4L can provide an average usefulness score of 4.84, an improvement of 0.52 points over its previous score of 4.32.\n",
      "\\end{abstract}\n",
      "\n",
      "\\section{Introduction}\n",
      "\n",
      "\n",
      "\n",
      "\\label{sec:intro}\n",
      "Peer review constitutes a cornerstone of the scientific process \\citep{oberg2022teaching}. Despite its widespread use, this method presents several issues, such as bias, inadequate feedback, and a frequent disconnect between reviewers and editors, with agreement rates as low as 10\\% \\citep{smith2006peer,boughton2018research}. The introduction of large language models (LLMs) offers a potential shift in this paradigm. These models, including GPT-4 \\citep{achiam2023gpt} and Gemini \\citep{reid2024gemini}, have transformed various domains, including research, by assisting in tasks such as summarization and idea generation. The integration of LLMs into the review process could potentially improve feedback quality and overcome human reviewer limitations. However, there is also concern about over-reliance on LLMs, which may lead to a decline in critical thinking skills.  This study examines the effectiveness of LLMs in enhancing the peer review process, aiming to harness their capabilities to improve review quality while acknowledging and mitigating potential biases and limitations. \n",
      "\n",
      "\\begin{figure}[t]\n",
      "    \\centering\n",
      "    \\includegraphics[width=0.95\\linewidth]{figs/pair-frame.pdf}\n",
      "    \\caption{Framework of PAIR: a two-player zero-sum game between a critic agent and a reviewer agent. The critic agent provides feedback on the review quality, and the reviewer agent refines the review based on this feedback.}\n",
      "    \\label{fig:pair-frame}\n",
      "\\end{figure}\n",
      "\n",
      "Recent studies have explored the potential of LLMs in writing scientific peer reviews \\citep{liang2024can,du2024llms}. ReviewGPT \\citep{liang2024can} evaluates the usefulness of LLM-generated reviews against those written by human reviewers. The findings reveal that while LLMs can produce reviews of comparable usefulness to human-written ones, they tend to be more generic and less specific to the individual paper. This paper aims to explore more sophisticated approaches to generate more detailed and constructive reviews. We introduce a novel framework, \\textbf{P}eer \\textbf{R}eview \\textbf{a}s \\textbf{I}terative \\textbf{R}efinement (PAIR), for scientific peer review through interactive learning. PAIR employs two LLM agents: a critic agent and a reviewer agent. The critic agent is designed to collaborate with the reviewer agent, providing constructive feedback to enhance the review quality. This feedback helps authors identify both the merits and areas for improvement in their work, thereby increasing the review's usefulness. The reviewer agent refines the review by incorporating the feedback from the critic agent, which allows it to learn from its mistakes and improve over time. This dynamic interaction not only boosts the review quality but also provides valuable guidance for authors to refine their work. The outline of PAIR is shown in Figure \\ref{fig:pair-frame}.\n",
      "\n",
      "The critic agent plays a crucial role in PAIR by discerning and providing feedback on the quality of reviews. Initially, we experiment with two modes of feedback: \\textit{ranking} and \\textit{rejection}. In the ranking mode, the critic agent evaluates reviews and selects the better one based on specific criteria. This approach, however, results in only marginal improvements in review quality, as it does not provide detailed reasons for its preferences. In the rejection mode, the critic agent assesses whether a review meets the required standards and provides feedback for revisions if necessary. This mode significantly improves the review quality but is limited by its inability to differentiate between positive and negative feedback, often focusing only on areas for improvement. To address these limitations, we introduce a new feedback mode called \\textit{construction} feedback, which aims to recognize positive aspects of a review and provide constructive feedback for further improvements. Initially, we directly train the critic agent using human reviews transformed into feedback triplets comprising a positive and a negative review along with the critic’s feedback. However, this approach does not improve the review quality. Subsequently, we explore refining our critic agent through reinforcement learning, including both rejection and construction feedback. This integrated approach significantly enhances review quality, leading to more detailed and helpful feedback. Finally, we combine the refined critic agent with a reviewer agent in an iterative review refinement process, which further improves review quality.\n",
      "\n",
      "To advance our research, we have curated a high-quality dataset of human reviews from the Computer Physics Communications (CPC) journal published by Elsevier. These reviews, which have undergone a meticulous selection process, form the basis for training our critic agent. To optimize for preferences, we transform these reviews into comparison pairs. We then test various reviewer agents on this dataset, including state-of-the-art models such as GPT-4T and GPT-4L, as well as open-source models like Qwen 2.5 and Mistral. Additionally, we develop a specialized CPC agent fine-tuned from GPT-3.5, which significantly improves review quality when used as the reviewer agent. However, it shows limited effectiveness as a critic agent, primarily providing rejection feedback. To address this, we introduce the PAIR-CLZ variant, which integrates reinforcement learning with the CPC agent, enabling it to provide constructive feedback. Our comprehensive experiments demonstrate that the PAIR framework can generate feedback as useful as human reviewers, with GPT-4L achieving an average usefulness score of 4.32. By incorporating constructive feedback and iterative refinement, PAIR significantly enhances the quality of peer reviews, with GPT-4L in the PAIR-CLZ setting raising the average usefulness score to 4.84, an improvement of 0.52 points. This makes it the first LLM-based review system to achieve such a high level of performance. Our contributions are summarized as follows: \n",
      "\\begin{itemize}\n",
      "    \\item We introduce a dataset of high-quality human reviews from the Computer Physics Communications (CPC) journal, which can be used to train and evaluate LLMs as critic or reviewer agents.\n",
      "    \\item We propose the PAIR framework, which employs a two-player zero-sum game between a critic agent and a reviewer agent to enhance the quality of LLM-generated reviews.\n",
      "    \\item We present extensive experiments with various LLMs, including GPT-4T, GPT-4L, Qwen 2.5, and Mistral, and introduce the PAIR-CLZ variant that integrates reinforcement learning with the CPC agent, enabling it to provide constructive feedback.\n",
      "\\end{itemize}\n",
      "\n",
      "\n",
      "\\section{Related Work}\n",
      "\n",
      "\\label{sec:related}\n",
      "\\paragraph{LLM for Peer Reviewing.}\n",
      "The application of large language models (LLMs) in simulating and enhancing the peer review process has garnered increasing attention \\citep{liang2024can,liu2023reviewergpt,hosseini2023fighting,zhang2022investigating,jin2024agentreviewexploringpeerreview,d2024marg,tan2024peerreviewmultiturnlongcontext,tyser2024ai}.\n",
      "ReviewGPT \\citep{liang2024can} presents a notable example, offering a benchmark dataset of human-written reviews and demonstrating that LLMs can generate reviews of comparable usefulness. However, these reviews often lack depth, focusing on generalities rather than the specific merits and improvements of the paper. LLMs are further utilized to aid human reviewers by pre-filling review sections \\citep{liang2024can}, suggesting review decisions \\citep{tyser2024ai}, and identifying potential biases \\citep{zhang2022investigating}.\n",
      "LLM-based review systems also face challenges, including statistical errors in analysis \\citep{nuijten2016prevalence} and limitations in summarizing key aspects of papers \\citep{collins2017supervised}. Our research aims to develop a sophisticated LLM-based review system that provides feedback as useful as that of human reviewers, enhancing the peer review process.\n",
      "\\paragraph{LLM for Scientific Discovery.}\n",
      "The role of LLMs in scientific research has expanded significantly, influencing various aspects of the research cycle \\citep{wang2023scientific,yang2024collaborative}, from hypothesis generation and experimental design to the writing process \\citep{ai4science2023impact,li2024ai4r,liang2024can,yakaboski2023ai}. These models have made substantial contributions across numerous scientific fields, including the advancement of deep learning \\citep{hu2024automated,li2024mlrcopilotautonomousmachinelearning,huang2024mlagentbench,baek2024researchagent}, the development of novel theories and hypotheses \\citep{wang2024autosurvey,radensky2024scideator,si2024can,taniguchi2024collectivepredictivecodingmodel}, and significant strides in materials science \\citep{merchant2023scaling,pyzer2022accelerating}, evolutionary biology \\citep{hayes2024simulating}, bioinformatics \\citep{jumper2021highly}, and more \\citep{langley2024integrated}. AI systems, such as Eureqa \\citep{buchanan1981dendral}, EURISKO \\citep{lenat1983eurisko}, and SciLab \\citep{langley1987scientific}, have been developed to discover new scientific concepts and theories, although the quality of these discoveries has been a subject of debate \\citep{hutter2001towards}. Scimon \\citep{wang2023scimon}, an open-source system, generates research ideas by integrating LLMs for natural language processing and reinforcement learning for optimization. Similarly, the AI scientist \\citep{lu2024ai} utilizes machine learning and chemical inspiration techniques to autonomously conduct scientific research, though its applicability may be limited to the field of chemistry. These systems represent early efforts in automating scientific inquiry, with limitations in scalability and adaptability across different scientific domains. Our work extends beyond idea generation, focusing on the critical evaluation of scientific work and providing constructive feedback, which is essential for ensuring the quality and integrity of scientific research.\n",
      "\\paragraph{Preference Optimization.}\n",
      "Preference optimization has emerged as a critical approach in refining the capabilities of language models, with direct preference optimization (DPO) playing a pivotal role \\citep{rafailov2023direct,meng2024simposimplepreferenceoptimization,liu2024iterativelengthregularizeddirectpreference}. Recent advancements have expanded the scope of preference optimization, incorporating AI feedback \\citep{lee2024rlaifvsrlhfscaling}, iterative preference optimization strategies \\citep{pang2024iterative,xiong2024iterativepreferencelearninghuman}, and the exploration of self-reward mechanisms \\citep{shinn2023reflexion,an2023learning,yuan2024self,zhang2024generative}, which leverage the model's own feedback for improvements. These developments underscore the growing complexity and effectiveness of preference optimization techniques in enhancing language model performance. Our work extends preference optimization into the domain of scientific peer review, a complex field that demands high-level reasoning and a deep understanding of scientific principles.\n",
      "Our approach involves the creation of a specialized dataset from human reviews published in the Computer Physics Communications journal. These reviews, which include recommendations for acceptance or rejection, are transformed into comparison pairs, enriching the dataset with diverse perspectives for preference learning. By employing preference optimization, we aim to elevate the feedback quality provided by LLMs to a level of usefulness comparable to that of human reviewers.\n",
      "\n",
      "\n",
      "\\section{Peer Review as Iterative Refinement}\n",
      "\n",
      "\\label{sec:method}\n",
      "\n",
      "\\subsection{Problem Formulation}\n",
      "\\label{sec:framework}\n",
      "In this study, we introduce a novel framework, \\textbf{P}eer \\textbf{R}eview \\textbf{a}s \\textbf{I}terative \\textbf{R}efinement (PAIR), for simulating scientific peer review through interactive learning. PAIR employs two LLM agents: a critic agent and a reviewer agent. The critic agent collaborates with the reviewer agent, providing constructive feedback to enhance the quality of reviews and guide authors in improving their work. This feedback helps authors identify both the merits and areas for improvement in their work, thereby increasing the review's usefulness. The reviewer agent refines the review by incorporating the feedback from the critic agent, which allows it to learn from its mistakes and improve over time. This dynamic interaction not only boosts the review quality but also provides valuable guidance for authors to refine their work.\n",
      "\n",
      "Formally, consider an input paper denoted as $x \\in \\mathcal{X}$, where $\\mathcal{X}$ represents the set of all possible papers. The reviewer agent, modeled by an LLM, generates a review $r_t \\in \\mathcal{R}$ at iteration $t$, where $\\mathcal{R}$ denotes the set of all possible reviews. The critic agent, also modeled by an LLM, evaluates the review $r_t$ and provides feedback $f_t \\in \\mathcal{F}$, where $\\mathcal{F}$ represents the set of all possible feedback. The reviewer agent then refines the review based on the feedback, producing an updated review $r_{t+1}$. This iterative process can be described by the following update rule:\n",
      "    $r_{t+1} = \\texttt{Reviewer}(r_t, f_t)$.\n",
      "The objective is to optimize the usefulness $U(r)$ of the final review $r_T$ produced after $T$ iterations. We can formally define this optimization problem as:\n",
      "$$\n",
      "\\max_{r_1, r_2, \\ldots, r_T} U(r_T),\n",
      "$$\n",
      "where $U: \\mathcal{R} \\to \\mathbb{R}$ is a usefulness function that quantifies the quality of a review.\n",
      "\n",
      "To enhance the effectiveness of the reviewer agent in utilizing the feedback, we propose two training strategies. The first approach involves fine-tuning the reviewer agent on a dataset of high-quality reviews. This dataset is curated from reviews published in the Computer Physics Communications (CPC) journal, which have undergone a meticulous selection process to ensure their excellence. The dataset is transformed into comparison pairs to optimize for preferences, providing a rich training resource for the reviewer agent. The second approach integrates reinforcement learning (RL) with the reviewer agent, enabling it to learn from its interactions with the critic agent. The RL framework is designed to refine the reviewer agent's ability to incorporate feedback effectively, thereby improving the quality of its reviews over multiple iterations.\n",
      "\n",
      "\\subsection{Critic Agent}\n",
      "\n",
      "The critic agent plays a crucial role in PAIR by discerning and providing feedback on the quality of reviews. We explore three feedback modes: \\textit{ranking} feedback, \\textit{rejection} feedback, and \\textit{construction} feedback. Ranking feedback involves the critic agent evaluating and ranking reviews based on specific criteria, but it lacks detailed explanations, resulting in only marginal improvements in review quality. Rejection feedback sees the critic agent assessing whether a review meets certain standards and providing feedback for revisions if necessary. This mode significantly improves review quality by ensuring that reviews adhere to high standards but is limited by its inability to differentiate between positive and negative feedback, often focusing only on areas for improvement. Construction feedback is introduced to address these limitations by recognizing positive aspects of a review and providing constructive feedback for further improvements. Initially, the critic agent is directly trained using human reviews transformed into feedback triplets, but this approach does not enhance review quality. Subsequently, reinforcement learning is employed to refine the critic agent, incorporating both rejection and construction feedback. This integrated approach significantly enhances review quality, leading to more detailed and helpful feedback. The critic agent is then combined with the reviewer agent in an iterative review refinement process, which further improves review quality.\n",
      "\n",
      "\\paragraph{Ranking Feedback.}\n",
      "In the ranking feedback mode, the critic agent evaluates and ranks reviews based on specific criteria. This approach, however, only provides the final ranking without detailed reasons, which makes it challenging for the agent to understand the specific aspects that contribute to a higher-quality review. As a result, the improvement in review quality is marginal. An example of the ranking feedback mode is shown in Figure \\ref{fig:critic-ranking}.\n",
      "\\begin{figure}[t]\n",
      "    \\centering\n",
      "    \\includegraphics[width=0.95\\linewidth]{figs/critic-ranking.pdf}\n",
      "    \\caption{Example of the ranking feedback mode.}\n",
      "    \\label{fig:critic-ranking}\n",
      "\\end{figure}\n",
      "\n",
      "\\paragraph{Rejection Feedback.}\n",
      "In the rejection feedback mode, the critic agent assesses whether a review meets the required standards and provides feedback for revisions if necessary. This mode significantly improves the quality of reviews by ensuring that they adhere to high standards. However, it is limited by its inability to differentiate between positive and negative feedback, often leading to an overemphasis on areas for improvement. An example of the rejection feedback mode is shown in Figure \\ref{fig:critic-rejection}.\n",
      "\\begin{figure}[t]\n",
      "    \\centering\n",
      "    \\includegraphics[width=0.95\\linewidth]{figs/critic-rejection.pdf}\n",
      "    \\caption{Example of the rejection feedback mode.}\n",
      "    \\label{fig:critic-rejection}\n",
      "\\end{figure}\n",
      "\n",
      "\\paragraph{Construction Feedback.}\n",
      "To address the limitations of ranking and rejection feedback, we introduce construction feedback, which aims to recognize positive aspects of a review and provide constructive feedback for further improvements. Initially, the critic agent is directly trained using human reviews transformed into feedback triplets, but this approach does not result in any improvement in review quality. To overcome this, we employ reinforcement learning to refine the critic agent, incorporating both rejection and construction feedback. This integrated approach significantly enhances the quality of reviews, leading to more detailed and helpful feedback. The critic agent is then combined with the reviewer agent in an iterative review refinement process, which further improves review quality. An example of the construction feedback mode is shown in Figure \\ref{fig:critic-construction}.\n",
      "\\begin{figure}[t]\n",
      "    \\centering\n",
      "    \\includegraphics[width=0.95\\linewidth]{figs/critic-construction.pdf}\n",
      "    \\caption{Example of the construction feedback mode.}\n",
      "    \\label{fig:critic-construction}\n",
      "\\end{figure}\n",
      "\n",
      "\\subsection{Reviewer Agent}\n",
      "The reviewer agent in PAIR is designed to generate and refine scientific peer reviews based on feedback from the critic agent. The goal is to enhance the usefulness $U(r)$ of the review $r$ through iterative updates. The reviewer agent can be trained using two approaches: preference tuning and reinforcement learning\\footnote{This is based on a misunderstanding of the original paper, where \"preference tuning\" is not explicitly defined as fine-tuning on human reviews. The method involves transforming reviews into comparison pairs, which is not the same as directly fine-tuning on the reviews themselves.} on human reviews and reinforcement learning on feedback from the critic agent.\n",
      "\n",
      "\\paragraph{Preference Tuning.}\n",
      "To enhance the reviewer agent's ability to generate high-quality reviews, we employ preference tuning using a dataset of human reviews. We use the Computer Physics Communications (CPC) dataset, which includes reviews marked as ``outstanding'' or ``very good'' with a usefulness score of 5. We randomly select 2,000 reviews for training and 200 for validation. To optimize for preferences, the reviews are transformed into comparison pairs. This process involves selecting two reviews for the same paper and assigning them as positive or negative examples based on a comparison score. The comparison score $\\text{C}(r_1, r_2)$ between two reviews $r_1$ and $r_2$ is defined as:\n",
      "$$\n",
      "\\text{C}(r_1, r_2) = \\frac{\\text{CommonWords}(r_1, r_2)}{\\text{TotalWords}(r_1, r_2)}\n",
      "$$\n",
      "where $\\text{CommonWords}(r_1, r_2)$ represents the number of words common to both reviews, and $\\text{TotalWords}(r_1, r_2)$ represents the total number of words in both reviews. The review with the higher comparison score is selected as the positive example. This method ensures that the reviewer agent is fine-tuned on pairs of reviews that provide a clear preference signal. The tuning process is based on Direct Preference Optimization (DPO) \\citep{rafailov2023direct}, which optimizes the model to align with human preferences. DPO works by adjusting the model's parameters to increase the likelihood of preferred outcomes and decrease the likelihood of less preferred ones. The loss function for DPO is given by:\n",
      "$$\n",
      "L_{\\text{DPO}}(\\theta) = -\\mathbb{E}_{(x, r_w, r_l) \\sim \\mathcal{D}} \\left[ \\log \\sigma \\left( \\beta (\\log P_{\\theta}(r_w \\mid x) - \\log P_{\\theta}(r_l \\mid x)) \\right) \\right]\n",
      "$$\n",
      "where $(x, r_w, r_l)$ are data points from the dataset $\\mathcal{D}$, with $r_w$ being the preferred (winning) review and $r_l$ being the less preferred (losing) review, $\\sigma$ is the sigmoid function, and $\\beta$ is a parameter that controls the sharpness of the preference distribution. By minimizing this loss function, the model learns to generate reviews that are more aligned with the preferences indicated in the dataset.\n",
      "\\paragraph{Reinforcement Learning.}\n",
      "We employ reinforcement learning (RL) to further enhance the reviewer agent's performance. The RL framework is designed to refine the agent's ability to incorporate feedback effectively, thereby improving the quality of its reviews over multiple iterations. In this setting, the reviewer agent acts as the actor, learning to generate and refine reviews based on the feedback from the critic agent, which serves as the critic. The process can be described as a two-player zero-sum game, where the reviewer agent aims to maximize the usefulness of its reviews, while the critic agent provides feedback to guide this improvement. The feedback from the critic can be either rejection feedback \\citep{wei2022chain} or construction feedback, which is obtained through iterative refinement. The RL framework is based on self-verifying reinforcement learning \\citep{weng-etal-2023-large}, which enhances the agent's ability to learn from its own predictions and feedback. The loss function for the reviewer agent is defined as:\n",
      "$$\n",
      "L_{\\text{RL}}(\\theta) = -\\mathbb{E}_{x \\sim \\mathcal{X}, r \\sim \\texttt{Reviewer}(\\theta)} \\left[ r(x, r) \\right]\n",
      "$$\n",
      "where $r(x, r)$ is the reward function that evaluates the usefulness of the review $r$ for the paper $x$. The reviewer agent updates its parameters $\\theta$ to maximize the expected reward, which is aligned with the usefulness $U(r)$ of the final review. The critic agent, on the other hand, is trained to provide feedback that helps the reviewer agent improve its reviews. The training process involves generating multiple reviews for the same paper and selecting the best review based on the reward function. This approach ensures that the reviewer agent learns to generate high-quality reviews that are useful for the peer review process. The RL framework is implemented using a multi-agent system, where the reviewer agent can interact with multiple critic agents to receive diverse feedback. The loss function for the reviewer agent, when interacting with multiple critics, is given by:\n",
      "$$\n",
      "L_{\\text{MA}}(\\theta) = -\\frac{1}{M} \\sum_{m=1}^M \\mathbb{E}_{x \\sim \\mathcal{X}, r \\sim \\texttt{Reviewer}(\\theta)} \\left[ r_m(x, r) \\right]\n",
      "$$\n",
      "where $M$ is the number of critic agents, and $r_m(x, r)$ is the reward function provided by the $m$-th critic agent. By minimizing this loss function, the reviewer agent learns to generate reviews that are consistently rated as high-quality by multiple critics, thereby improving its overall performance. The use of multiple critic agents introduces a richer set of feedback, which can help the reviewer agent to better understand the nuances of high-quality reviews and to avoid generating low-quality ones.\n",
      "\n",
      "\\subsection{Algorithm}\n",
      "The PAIR algorithm is designed to enhance the quality of scientific peer reviews through iterative refinement between a critic agent and a reviewer agent. The process begins with the critic agent providing feedback on an initial review generated by the reviewer agent. This feedback is then used to refine the review, and the process is repeated for a fixed number of iterations. The algorithm can be outlined as follows:\n",
      "\n",
      "\\begin{enumerate}\n",
      "    \\item \\textbf{Initialization:} \n",
      "    Initialize the reviewer agent and the critic agent. The reviewer agent generates an initial review $r_0$ for the input paper $x$. The critic agent is set to the construction feedback mode.\n",
      "    \\item \\textbf{Iterative Refinement:} \n",
      "    For each iteration $t = 1$ to $T$:\n",
      "    \\begin{enumerate}\n",
      "        \\item The critic agent evaluates the current review $r_{t-1}$ and provides feedback $f_{t-1}$. The feedback $f_{t-1}$ includes a detailed assessment of the review's strengths and weaknesses, as well as specific suggestions for improvement.\n",
      "        \\item The reviewer agent refines the review based on the feedback, generating an updated review $r_t = \\texttt{Reviewer}(r_{t-1}, f_{t-1})$.\n",
      "    \\end{enumerate}\n",
      "\\end{enumerate}\n",
      "\n",
      "The best review $r_{\\text{best}}$ is selected either from the final iteration or from the iteration with the highest usefulness score $U(r_t)$. The usefulness $U(r)$ of a review $r$ is defined as the average score across several metrics, including evaluation, motivation, and clarity. This iterative process aims to maximize the usefulness $U(r)$ of the final review.\n",
      "\n",
      "\\paragraph{Evaluation.}\n",
      "The usefulness $U(r)$ of a review $r$ is defined as the average score across several metrics, including evaluation, motivation, and clarity. The evaluation metric assesses the thoroughness and accuracy of the review, ensuring that it provides a comprehensive assessment of the paper's strengths and weaknesses. The motivation metric evaluates how well the review explains the reasoning behind its assessments, providing context and justification for its feedback. The clarity metric measures the ease of understanding and the logical flow of the review, ensuring that it is accessible and informative. The usefulness $U(r)$ is defined as:\n",
      "$$\n",
      "U(r) = \\frac{1}{N} \\sum_{i=1}^N s_i(r)\n",
      "$$\n",
      "where $N$ is the number of metrics, and $s_i(r)$ is the score of the review $r$ on the $i$-th metric. The best review $r_{\\text{best}}$ is selected either from the final iteration or from the iteration with the highest usefulness score $U(r_t)$. This approach ensures that the review is not only thorough and accurate but also well-motivated and easy to understand, thereby enhancing its overall usefulness to the authors and the scientific community.\n",
      "\n",
      "\n",
      "\n",
      "<|im_start|>assistant<|im_sep|>\n",
      "\\section{Experiments}\n",
      "\n",
      "\\label{sec:experiments}\n",
      "\n",
      "\\subsection{Settings}\n",
      "\\label{sec:settings}\n",
      "\\paragraph{Dataset.}\n",
      "We use a total of 20 papers from The AI Index 2023 Annual Report \\citep{bello2023index} to conduct our experiments, each accompanied by human written reviews. To protect the privacy of human reviews, we transform the reviews into comparison pairs using the following method. Let $R = \\{r_1, r_2, \\ldots, r_n\\}$ be the set of reviews for a given paper, where each review $r_i$ is a textual assessment. We construct comparison pairs from these reviews as follows:\n",
      "\\begin{enumerate}\n",
      "    \\item \\textbf{Selection:} Select two reviews $r_i$ and $r_j$ from $R$.\n",
      "    \\item \\textbf{Comparison Score:} Define a comparison score $\\text{C}(r_i, r_j)$ that quantifies the quality difference between the two reviews. A higher score indicates a more preferred review. The comparison score is based on the number of common words between the two reviews, defined as:\n",
      "    $$\n",
      "    \\text{C}(r_i, r_j) = \\frac{\\text{CommonWords}(r_i, r_j)}{\\text{TotalWords}(r_i, r_j)}\n",
      "    $$\n",
      "    where $\\text{CommonWords}(r_i, r_j)$ represents the number of words common to both reviews, and $\\text{TotalWords}(r_i, r_j)$ represents the total number of words in both reviews.\n",
      "    \\item \\textbf{Pair Formation:} Based on the comparison score $\\text{C}(r_i, r_j)$, we assign the reviews to different categories. The review with the higher comparison score is selected as the positive example, while the other is selected as the negative example. These pairs are then used to fine-tune the critic and reviewer agents.\n",
      "\\end{enumerate}\n",
      "By transforming the reviews into comparison pairs, we create a dataset that can be used to train and evaluate the PAIR framework. The comparison pairs are generated for each paper, ensuring that the training data is diverse and representative of different review qualities. The comparison score $\\text{C}(r_i, r_j)$ serves as a proxy for the quality difference between reviews, providing a clear preference signal for training.\n",
      "\n",
      "\\paragraph{Environment.}\n",
      "Our experiments are conducted in the OpenAI environment using the ViT variant of GPT-4L. The Qwen 2.5 and Mistral models are deployed using their official APIs. All experiments are run on a single A100 GPU, and the LoRA method \\citep{hu2022lora} is adopted for all fine-tuning. The LoRA method is integrated with the DeepSpeed library \\citep{10.1145/3394486.3406703,rajbhandari2020zeromemoryoptimizationstracking} to optimize the fine-tuning process \\citep{wang2024loragalowrankadaptationgradient}. The batch size is set to 5 to accommodate a total of 2,000 reviews. The learning rate is set to $1 \\times 10^{-5}$, and the LoRA dimension is set to 16. The training process is conducted over 5,000 steps, with the first 1,000 steps using full precision and the remaining steps using half precision (BFloat16). The training time for each model is as follows: CPC agent - 1 hour, GPT-4T - 1 hour, GPT-4L - 2 hours, Qwen 2.5 - 1.5 hours, Mistral - 12 hours.\n",
      "\n",
      "\\paragraph{Evaluation.}\n",
      "We evaluate the usefulness $U$ of LLM-generated reviews based on criteria from ReviewGPT \\citep{liang2024can}, including evaluation, motivation, and clarity. The evaluation metric assesses the thoroughness and accuracy of the review, ensuring that it provides a comprehensive assessment of the paper's strengths and weaknesses. The motivation metric evaluates how well the review explains the reasoning behind its assessments, providing context and justification for its feedback. The clarity metric measures the ease of understanding and the logical flow of the review, ensuring that it is accessible and informative. The usefulness $U$ is defined as the average score across these three dimensions. The best review $r_{\\text{best}}$ is selected either from the final iteration or from the iteration with the highest usefulness score $U(r_t)$. This approach ensures that the review is not only thorough and accurate but also well-motivated and easy to understand, thereby enhancing its overall usefulness to the authors and the scientific community.\n",
      "\\subsection{Baselines}\n",
      "We employ state-of-the-art commercial LLMs, including GPT-4T and GPT-4L, as well as open-source models, such as Qwen 2.5 and Mistral, to conduct our experiments.\n",
      "\\paragraph{Prompting.}\n",
      "For the state-of-the-art commercial models, we select the gpt-4-turbo-instruct-012000 version of GPT-4T and the gpt-4-0125-preview version of GPT-4L. We employ three prompting templates: simple, extractive \\citep{liang2024can}, and abstractive. The simple template provides basic instructions for writing reviews, while the extractive template includes specific aspects to evaluate. The abstractive template offers a more detailed description of the components of a review, including summary, evaluation, motivation, and clarity. We also apply the ReviewGPT template \\citep{liang2024can} and self-rewarding \\citep{shinn2023reflexion} to all baseline models.\n",
      "\\paragraph{Pretraining.}\n",
      "We utilize the ReviewGPT dataset \\citep{liang2024can} for pretraining. This dataset comprises 1,520 reviews, each accompanied by an overall usefulness score. The dataset is available on the HuggingFace platform. To ensure an unbiased and diverse training set, we filter the reviews by setting the overall usefulness score to null and sampling every other entry. This process results in a training dataset of 700 reviews, each with a non-null usefulness score. The reviews are distributed across 33 different journals, introducing a variety of topics and writing styles. We then apply the pretraining templates from ReviewGPT to these reviews. The refined dataset is used to train the CPC agent.\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the baseline experiments. {\\color{blue}Base} represents the baseline experiments. {\\color{blue}Index} and {\\color{blue}Reward} correspond to ReviewGPT and self-rewarding, respectively. {\\color{violet}Eval}, {\\color{violet}Motivation}, and {\\color{violet}Clarity} are additional evaluation metrics. {\\color{orange}Usefulness} is the average score of three evaluation metrics.}\n",
      "    \\label{tab:baseline}\n",
      "    \\begin{tabular}{llll|llll|llll|llll|llll}\n",
      "        \\toprule\n",
      "        \\multirow{3}{*}{\\textbf{Model}} & \\multicolumn{4}{c}{\\textbf{{\\color{blue}Base}}} & \\multicolumn{4}{c}{\\textbf{{\\color{blue}Extractive}}} & \\multicolumn{4}{c}{\\textbf{{\\color{blue}Abstractive}}} & \\multicolumn{4}{c}{\\textbf{{\\color{blue}Index}}} & \\multicolumn{4}{c}{\\textbf{{\\color{blue}Reward}}}  \\\\\n",
      "         & {\\color{violet}Eval} & {\\color{violet}Motivation} & {\\color{violet}Clarity} & {\\color{orange}Usefulness} & {\\color{violet}Eval} & {\\color{violet}Motivation} & {\\color{violet}Clarity} & {\\color{orange}Usefulness} & {\\color{violet}Eval} & {\\color{violet}Motivation} & {\\color{violet}Clarity} & {\\color{orange}Usefulness} & {\\color{violet}Eval} & {\\color{violet}Motivation} & {\\color{violet}Clarity} & {\\color{orange}Usefulness} & {\\color{violet}Eval} & {\\color{violet}Motivation} & {\\color{violet}Clarity} & {\\color{orange}Usefulness} \\\\\n",
      "        \\midrule\n",
      "        GPT-4T & 4.71 & 4.40 & 4.39 & 4.49 & 4.26 & 4.19 & 4.18 & 4.21 & 4.52 & 4.27 & 4.50 & 4.46 & 4.69 & 4.38 & 4.33 & 4.47 & 4.53 & 4.23 & 4.20 & 4.35 \\\\\n",
      "        GPT-4L & 4.65 & 4.44 & 4.71 & 4.64 & 4.26 & 4.21 & 4.32 & 4.26 & 4.67 & 4.57 & 4.62 & 4.56 & 4.69 & 4.53 & 4.62 & 4.61 & 4.54 & 4.27 & 4.38 & 4.40 \\\\\n",
      "        Qwen 2.5 & 4.06 & 4.06 & 4.14 & 4.16 & 4.01 & 4.01 & 4.12 & 4.11 & 4.15 & 4.18 & 4.31 & 4.21 & 4.30 & 4.18 & 4.47 & 4.31 & 4.01 & 3.98 & 4.11 & 4.03 \\\\\n",
      "        Mistral & 3.86 & 3.90 & 3.83 & 3.85 & 3.83 & 3.94 & 3.92 & 3.90 & 3.97 & 3.91 & 3.94 & 3.94 & 3.94 & 3.99 & 3.89 & 3.94 & 3.83 & 3.81 & 3.76 & 3.80 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\n",
      "\\paragraph{Preference Tuning.}\n",
      "We fine-tune the GPT-3.5 model using the Computer Physics Communications (CPC) dataset. This dataset includes reviews marked as ``outstanding'' or ``very good'' with a usefulness score of 5. We randomly select 2,000 reviews for training and 200 for validation. The reviews are transformed into comparison pairs based on the comparison score $\\text{C}(r_1, r_2)$. The agent fine-tuned on this dataset is referred to as the CPC agent. We also fine-tune the GPT-4T model using the arXiv dataset, which contains reviews with the decision ``Minorrevision''. The fine-tuned model is called the ArXiv agent.\n",
      "\\subsection{Results}\n",
      "\\label{sec:results}\n",
      "\\paragraph{Overall Results.}\n",
      "The results of the baseline experiments are shown in Table \\ref{tab:baseline}. GPT-4L achieves the highest usefulness score of 4.64, while Mistral achieves the lowest score of 3.85. The abstractive template leads to the most significant improvement in usefulness for GPT-4T and GPT-4L. The ReviewGPT template and self-rewarding do not significantly improve the performance of GPT-4L, while the Index template achieves the best result. The results of the critic-RR and critic-CLZ experiments are shown in Table \\ref{tab:critic}. The rejection feedback significantly enhances the quality of reviews, with the largest improvement observed for Mistral. The construction feedback also improves the usefulness of reviews, with the largest increase seen for GPT-4L. The results of the reviewer preference and the integration of critic-CLZ with preference tuning are shown in Table \\ref{tab:reviewer}. The CPC agent and the GPT-4L preference agent significantly improve the usefulness of the review, while the CPC agent alone has limited effect. The combination of preference tuning and reinforcement learning (PAIR-CLZ) achieves the highest usefulness score of 4.84 for GPT-4L. The results of the critic-agent specific tuning are shown in Table \\ref{tab:cpc-critic}. Tuning the critic agent on the CPC dataset improves its performance as a critic, leading to a 0.36-point increase in usefulness for GPT-4L. The tuned critic agent provides more detailed and helpful feedback, enhancing the overall quality of the reviews.\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the critic-RR and critic-CLZ experiments. {\\color{blue}Usefulness} represents the gap between the baseline and the multi-agent result.}\n",
      "    \\label{tab:critic}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Model}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & {\\color{blue}Critic-RR} & {\\color{blue}Critic-CLZ} \\\\\n",
      "        \\midrule\n",
      "        GPT-4T & 4.88 & 0.39 & 4.72 & 0.23 \\\\\n",
      "        GPT-4L & 5.00 & 0.36 & 4.84 & 0.52 \\\\\n",
      "        Qwen 2.5 & 4.59 & 0.43 & 4.40 & 0.24 \\\\\n",
      "        Mistral & 4.35 & 0.50 & 4.14 & 0.29 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the reviewer preference and the integration of critic-CLZ with preference tuning. {\\color{blue}Usefulness} represents the gap between the baseline and the multi-agent result.}\n",
      "    \\label{tab:reviewer}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Model}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & {\\color{blue}Preference} & {\\color{blue}Critic-CLZ} \\\\\n",
      "        \\midrule\n",
      "        GPT-4T & 4.62 & 0.13 & 4.72 & 0.23 \\\\\n",
      "        GPT-4L & 4.81 & 0.17 & 4.84 & 0.52 \\\\\n",
      "        Qwen 2.5 & 4.30 & 0.14 & 4.40 & 0.24 \\\\\n",
      "        Mistral & 3.93 & 0.08 & 4.14 & 0.29 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the critic-agent specific tuning. {\\color{blue}Usefulness} represents the gap between the base critic and the cpc critic result.}\n",
      "    \\label{tab:cpc-critic}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Model}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & {\\color{blue}Critic-RR} & {\\color{blue}Critic-CLZ} \\\\\n",
      "        \\midrule\n",
      "        GPT-4L & 5.00 & 0.36 & 4.84 & 0.52 \\\\\n",
      "        Qwen 2.5 & 4.59 & 0.43 & 4.40 & 0.24 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\n",
      "\n",
      "\\section{Analysis}\n",
      "\n",
      "\\label{sec:analysis}\n",
      "\\paragraph{Critic Agent.}\n",
      "We initially attempted to train the critic agent using human reviews transformed into feedback triplets comprising a positive and a negative review, along with the critic’s feedback. However, this approach did not result in any improvement in review quality. To address this, we employed reinforcement learning (RL) with the critic agent in the rejection feedback mode, which significantly enhanced its performance. Subsequently, we incorporated construction feedback into the RL framework, which further improved the quality of the reviews. This demonstrates the effectiveness of using diverse feedback mechanisms and iterative refinement in enhancing the performance of the critic agent. Our experiments show that using a single utility function enhances the clarity of the ranking process, leading to more effective feedback. However, this approach diminishes the improvement observed in evaluation. In contrast, the rejection feedback mode significantly enhances the quality of reviews, with the largest improvement observed for Mistral. The rejection feedback primarily provides detailed reasons for its preferences, which helps the reviewer agent understand and incorporate the feedback more effectively. The results of the critic agent in the ranking feedback mode are shown in Table \\ref{tab:critic-ranking}.\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the critic agent in the ranking feedback mode. {\\color{blue}Usefulness} represents the gap between the baseline and the multi-agent result.}\n",
      "    \\label{tab:critic-ranking}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Metrics}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & Single & Multiple \\\\\n",
      "        \\midrule\n",
      "        Evaluation & 4.49 & 4.66 & 0.17 \\\\\n",
      "        Motivation & 4.49 & 4.47 & -0.02 \\\\\n",
      "        Clarity & 4.49 & 4.49 & 0.00 \\\\\n",
      "        Utility & 4.49 & 4.71 & 0.22 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\\paragraph{Reviewer Agent.}\n",
      "We fine-tuned a GPT-3.5 model using the CPC dataset to create the CPC agent. Our results show that the CPC agent significantly improves review quality when used as the reviewer agent. However, it shows limited effectiveness as the critic agent, primarily providing rejection feedback. To address this, we integrated the CPC agent with the PAIR framework, which enhances its ability to provide constructive feedback. Our experiments demonstrate that reinforcement learning (RL) enhances the reviewer's ability to self-evaluate, leading to significant improvements in the quality and detail of the reviews. The RL framework is designed to refine the reviewer agent's ability to incorporate feedback effectively, thereby improving the quality of its reviews over multiple iterations. The results of the reviewer agent with preference tuning are shown in Table \\ref{tab:reviewer-preference}.\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the reviewer agent with preference tuning. {\\color{blue}Usefulness} represents the gap between the baseline and the multi-agent result.}\n",
      "    \\label{tab:reviewer-preference}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Model}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & CPC & Preference \\\\\n",
      "        \\midrule\n",
      "        GPT-4T & 4.49 & 4.50 & 4.62 & 0.13 \\\\\n",
      "        GPT-4L & 4.64 & 4.61 & 4.81 & 0.17 \\\\\n",
      "        Qwen 2.5 & 4.16 & 4.18 & 4.30 & 0.14 \\\\\n",
      "        Mistral & 3.85 & 3.87 & 3.93 & 0.08 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\\paragraph{Critic-Agent Specific Tuning.}\n",
      "We further explored the potential of enhancing the critic agent's performance through specific tuning. We fine-tuned the base model, GPT-3.5, on human reviews using DPO. Our results show that this approach improves the critic agent's performance as a critic, leading to a 0.36-point increase in usefulness for GPT-4L. This demonstrates the effectiveness of tuning the critic agent on the CPC dataset to improve its ability to provide detailed and helpful feedback. The results of the critic-agent specific tuning are shown in Table \\ref{tab:cpc-critic}.\n",
      "\\begin{table}[htp]\n",
      "    \\centering\n",
      "    \\small\n",
      "    \\caption{Results of the critic-agent specific tuning. {\\color{blue}Usefulness} represents the gap between the base critic and the cpc critic result.}\n",
      "    \\label{tab:cpc-critic}\n",
      "    \\begin{tabular}{ll|ll}\n",
      "        \\toprule\n",
      "        \\multicolumn{2}{c|}{\\multirow{2}{*}{\\textbf{Model}}} & \\multicolumn{2}{c}{\\textbf{{\\color{blue}Usefulness}}} \\\\\n",
      "         & & {\\color{blue}Critic-RR} & {\\color{blue}Critic-CLZ} \\\\\n",
      "        \\midrule\n",
      "        GPT-4L & 5.00 & 0.36 & 4.84 & 0.52 \\\\\n",
      "        Qwen 2.5 & 4.59 & 0.43 & 4.40 & 0.24 \\\\\n",
      "        \\bottomrule\n",
      "    \\end{tabular}\n",
      "\\end{table}\n",
      "\n",
      "\n",
      "\\section{Conclusion}\n",
      "\n",
      "\\label{sec:conclusion}\n",
      "In this paper, we present PAIR, a novel framework for simulating scientific peer review through interactive learning. PAIR employs a two-player zero-sum game between a critic agent and a reviewer agent, where the critic agent provides constructive feedback to enhance the quality of reviews. We introduce a new feedback mode, construction feedback, which combines reinforcement learning with rejection and construction feedback. We curate a high-quality dataset of human reviews from the Computer Physics Communications (CPC) journal and transform them into comparison pairs for training. Our experiments show that PAIR can generate feedback as useful as that of human reviewers, with GPT-4L achieving an average usefulness score of 4.84, an improvement of 0.52 points over its previous score of 4.32. This makes it the first LLM-based review system to achieve such a high level of performance. Our work has several limitations: the dataset size is relatively small, the scalability of the framework to other scientific fields requires further exploration, and the computational efficiency of the iterative refinement process can be improved. Despite these limitations, our work demonstrates the potential of AI systems to assist in scientific peer review, potentially increasing the acceptance rate of AI papers and reducing the workload of human reviewers.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\\section{Disclosure}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print the latex from referenced paper\n",
    "print(paper['latex'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a3fd43b6-f59c-42f3-8c72-7f7e28b1b5ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Save generated papers\n",
    "import json\n",
    "with open('generated_paper.json','w') as f:\n",
    "    json.dump(referenced_paper,f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be82dea6b616d846",
   "metadata": {},
   "source": [
    "## AI Detection\n",
    "\n",
    "Verify the authenticity of generated research papers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ba31d147b67464a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading model /zhuminjun/model/LLM-Research/Meta-Llama-3___1-8B...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "293eb15564d04dd5b25f330f9f05c18b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Moving model to GPU...DONE (0.01s)\n",
      "Loaded reference data: 200 samples\n",
      "Detection Results:\n",
      "Probability of AI generation: 92.00%\n",
      "Confidence Level: Very high likelihood of AI generation\n"
     ]
    }
   ],
   "source": [
    "from ai_researcher import AIDetector\n",
    "\n",
    "# Initialize AI detector\n",
    "detector = AIDetector(device='cpu')\n",
    "\n",
    "# Analyze the generated paper\n",
    "detection_result = detector.analyze_paper(paper)\n",
    "\n",
    "print(\"Detection Results:\")\n",
    "print(f\"Probability of AI generation: {detection_result['probability'] * 100:.2f}%\")\n",
    "print(f\"Confidence Level: {detection_result['confidence_level']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e112a9aa2978b36",
   "metadata": {},
   "source": [
    "## Key Research Impact Metrics\n",
    "\n",
    "Understanding the performance of AI-generated research:\n",
    "\n",
    "- Average Overall Score: 5.36 (Approaching Preprint Level)\n",
    "- Acceptance Rate: 35.13%\n",
    "- Soundness Score: 2.71 / 4.0\n",
    "- Contribution Score: 2.60 / 4.0\n",
    "\n",
    "## Ethical Considerations\n",
    "\n",
    "- Always disclose AI assistance in research\n",
    "- Use as a research assistant, not a replacement for human creativity\n",
    "- Verify and critically analyze AI-generated content\n",
    "- Respect academic integrity guidelines\n",
    "\n",
    "## Future Directions\n",
    "- Enhanced multimodal reasoning\n",
    "- Broader domain adaptation\n",
    "- More sophisticated AI detection mechanisms\n",
    "- Integration with experimental execution platforms"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
