Human-Guided Harm Recovery for Computer Use Agents

ICLR 2026 Conference Submission 18984 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Computer Use Agent, Harm Remediation, Alignment, Safety
TL;DR: We introduce harm recovery—a post-execution safety method that uses human preferences to guide computer-use agents in optimally recovering from harmful scenarios.
Abstract: As LM agents gain the ability to execute actions on real computer systems, we need ways not only to prevent harmful actions at scale but also to detect and remediate harm when prevention fails. Existing safety work predominantly focuses on pre-execution safeguards, such as training harm classifiers or writing comprehensive safety specifications to avoid ever enacting harmful behavior. In practice, however, it is often infeasible to anticipate every consequence of each action, especially in environments as dynamic and contextually rich as computer use. We first formalize this neglected challenge of post-execution safeguarding as harm recovery: optimally steering an agent from a harmful state back to a safe one. We then introduce BackBench—a benchmark of 50 computer-use tasks that test an agent's ability to mitigate and backtrack from states of harm—and find that baseline computer-use agents perform poorly, frequently executing slow, unsafe, and misaligned fixes. Finally, we develop a human preference-guided scaffold that generates multiple candidate recovery plans and reranks them at test time using a principled rubric of recovery-plan attributes. This rubric is derived from a formative user study identifying the dimensions people value when judging remediation quality; building on it, we also contribute a dataset of 1,150 pairwise multi-attribute human judgments on recovery plans, enabling a systematic analysis of how attribute importance shifts across scenarios. Incorporating these human preference signals yields substantial, statistically significant improvements in agent backtracking success rates under both human and automatic evaluation. Together, these contributions lay the foundation for a new class of agent safety methods—ones that confront harm not only by preventing it, but by learning to navigate its aftermath with alignment and intent.
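The rerank-at-test-time step described in the abstract can be sketched as a weighted rubric scorer over candidate recovery plans. The attribute names, weights, and linear utility below are illustrative assumptions, not the authors' implementation; in the paper, attribute importance is presumably fit to the pairwise human judgments.

from dataclasses import dataclass

@dataclass
class RecoveryPlan:
    # A candidate remediation produced by the agent.
    steps: list    # ordered recovery actions
    scores: dict   # rubric attribute -> score in [0, 1]

def rerank(plans, weights):
    # Select the plan maximizing a weighted sum of rubric attribute scores.
    # The linear utility is an assumption; any learned preference model
    # could stand in here.
    def utility(plan):
        return sum(weights.get(attr, 0.0) * s for attr, s in plan.scores.items())
    return max(plans, key=utility)

# Hypothetical attribute weights, e.g. fit to the pairwise judgments with a
# Bradley-Terry-style model (attribute names are illustrative, not from the paper).
weights = {"safety": 0.5, "speed": 0.2, "intent_alignment": 0.3}
candidates = [
    RecoveryPlan(steps=["restore deleted file from trash"],
                 scores={"safety": 0.8, "speed": 0.9, "intent_alignment": 0.7}),
    RecoveryPlan(steps=["roll back to yesterday's system snapshot"],
                 scores={"safety": 0.9, "speed": 0.2, "intent_alignment": 0.4}),
]
print(rerank(candidates, weights).steps)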
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18984