Welcome to the Era of Delayed Rewards for Language Agents: On Non-Verifiable Tasks

ACL ARR 2026 January Submission 6285 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: language agents, non-verifiable tasks, delayed rewards, social simulation, self-reflection, multi-agent systems, reinforcement learning
Abstract: Agent tasks divide into verifiable tasks (e.g., math, code), which carry immediate ground-truth rewards, and non-verifiable tasks (e.g., marketing, policy, research communication), where rewards appear instant but are fundamentally delayed when considering users' long-term goals. A user asking an LLM to write marketing copy may receive immediate output, but their true objective—readership, engagement, influence—unfolds over time through social propagation. Current approaches to non-verifiable tasks—self-refine, LLM-as-judge, multi-agent debate—rely on instant feedback that cannot capture this delayed, emergent value. We argue for a paradigm shift: from instant feedback to delayed reward derivation via task-appropriate simulation environments. Different tasks require different simulations: social media simulators for viral content, academic platforms for research communication, or policy debate forums for proposals. We validate this using OASIS, a scalable social media simulator with LLM-powered agents, comparing self-refine with our method MARFE (Multi-Agent Reward-Free Evolution). Across four tasks evaluated by three frontier LLM judges, MARFE achieves a 58.3% win rate versus the baseline's 41.7%, demonstrating that delayed social feedback provides a superior signal for non-verifiable tasks.
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: AI / LLM Agents
Contribution Types: Position papers
Languages Studied: English
Submission Number: 6285