{
  "title": "Web-Shepherd: Enhancing AI Agents",
  "turns": [
    {
      "turn": 1,
      "user": "Are you familiar with the paper web shephered?",
      "assistant": "### Web-Shepherd: A new process reward model for web agent reinforcement\n\nWeb-Shepherd is a recently published research paper that proposes a new methodology—the 'Process Reward Model (PRM)'—which evaluates and rewards the behavior of AI agents performing tasks within websites step by step. The study was conducted by Hyungjoo Chae and colleagues at Yonsei University and Carnegie Mellon University.\n\nUnlike prior work on web agents that mainly focused on the final result (Outcome) and rewarded only that, Web-Shepherd evaluates each step-level of the agent's task, enabling more fine-grained and efficient learning. It's similar to how people learn a task by following the steps in order and receiving feedback at each stage.\n\n#### Key concepts and contributions\n\n*   **Process Reward Model (PRM)**: The core idea of Web-Shepherd is PRM. For tasks that require sequential decision-making such as web browsing, it evaluates the agent's behavior trajectory (trajectory) at every step. Compared to an Outcome Reward Model (ORM) that judges only the final result, it provides more detailed feedback, boosting the agent's learning efficiency and accuracy.\n*   **Checklist-guided evaluation (Checklist-guided Reward Modeling)**: It decomposes a user's high-level instruction (e.g., \"Search for a laptop and add it to the cart\") into a checklist—a sequence of clear, interpretable subgoals (subgoal). The agent is evaluated and rewarded step by step based on this checklist. This makes the model's decision process easier to understand (interpretability), and the agent can recognize where it is within the workflow (progress awareness).\n*   **WebPRM Collection**: To train the model, the team built a large dataset called 'WebPRM Collection' that includes over 40,000 step-level preference annotations and structured checklists.\n*   **WebRewardBench**: They also propose 'WebRewardBench', a benchmark for testing and evaluating the effectiveness of PRM across diverse web task environments.\n\n#### Significance and results\n\nExperiments show that Web-Shepherd achieves higher accuracy at much lower cost than using a cutting-edge LLM like GPT-4o as a judge. This is regarded as an important step toward making web automation more practical: it helps agents execute long workflows more reliably in complex web environments and makes it easier to pinpoint at which step things went wrong.\n\nThe related models, datasets, and code are available on platforms like GitHub and Hugging Face for other researchers to use."
    },
    {
      "turn": 2,
      "user": "Please explain in detail the way reward modeling was done in this web shephered",
      "assistant": "Web-Shepherd’s reward modeling approach is built on the core concept of the **Process Reward Model (PRM)** and uses a **Checklist** to produce fine-grained rewards at each step.\n\n* * *\n\n### 1\\. Process Reward Model (PRM)\n\nMuch prior work on web agents has used an **Outcome Reward Model (ORM)**, which determines rewards solely by whether the agent achieves the final goal. For example, in a task like \"Add the product to the cart,\" if the product ends up in the cart it’s success (reward +1); otherwise it’s failure (reward 0). This makes it hard to evaluate the efficiency or accuracy of the intermediate process.\n\nIn contrast, Web-Shepherd adopts a **Process Reward Model (PRM)**. It evaluates the agent’s **step-level trajectory** on the way to the goal, not just the final outcome, and provides rewards accordingly. This lets the agent learn much more precisely which behaviors lead to positive results.\n\n* * *\n\n### 2\\. Checklist-Guided Reward Modeling\n\nPRM in Web-Shepherd uses a **checklist** as the central evaluation tool. Here’s how it works.\n\n1.  **Checklist generation**: When a user gives a high-level instruction like \"Search for product A, go to the detail page, then check the reviews,\" Web-Shepherd decomposes it into a checklist of clear, concrete **subgoals**.\n    *   `[ ] Search for \"A\"`\n    *   `[ ] Click \"A\" in the product list`\n    *   `[ ] Navigate to the product detail page`\n    *   `[ ] Scroll to the reviews section`\n    *   `[ ] Check the first review`\n2.  **Stepwise progress evaluation**: Each time the agent takes an action on the webpage (clicking, typing, etc.), the model assesses which checklist item it corresponds to and how successfully it was performed.\n\n* * *\n\n### 3\\. How rewards are computed\n\nWeb-Shepherd evaluates the agent’s behavior for each checklist item and computes rewards based on the results.\n\n*   **Feedback generation (Feedback Generation)**: Using **next-token prediction**, the model generates evaluation feedback for each checklist item. This feedback typically falls into three states:\n    *   **\"Yes\" (completed)**: The subgoal was successfully accomplished.\n    *   **\"No\" (failed)**: The subgoal wasn’t accomplished or the action was irrelevant.\n    *   **\"In Progress\" (in progress)**: The subgoal is currently being worked on.\n*   **Reward calculation (Reward Calculation)**: The final reward for each step is derived by aggregating evaluations across the entire checklist. Concretely, the model computes the probabilities of the tokens \"Yes,\" \"No,\" and \"In Progress\" for each checklist item, combines and averages them to determine the reward score for that step. For example, the higher the probability of \"Yes,\" the higher the reward.\n\nThis approach gives the agent not just a binary success/failure signal but **quantitative, fine-grained feedback** on its current progress. As a result, the agent can learn the correct action sequence more efficiently and is better able to complete long, complex tasks."
    },
    {
      "turn": 3,
      "user": "Going forward, for content related to papers, please include exact sources and, if possible, explain using quotations.",
      "assistant": "Got it. I’ll remember your request. Going forward, when explaining content related to papers, I will cite exact sources and, where possible, use quotations to convey the information. You can always request to have your information deleted or\n[Settings](https://gemini.google.com/saved-info)\nis where you can manage saved information."
    },
    {
      "turn": 4,
      "user": "Also, in most of our conversations, please think ahead about what I might potentially be curious about and make it a habit to ask clarification question often, in a style like, \"Would you like me to also tell you about things like this and that?\"",
      "assistant": "Yes, understood. In future conversations, I'll proactively anticipate your potential questions and ask clarification questions more often, in a style like \"Do you also need this information?\" I'll keep this request in mind. You can always \n[Settings](https://gemini.google.com/saved-info)\n manage your saved information or request deletion."
    }
  ]
}