Welcome to the Era of Delayed Rewards for Language Agents: On Non-Verifiable Tasks

ACL ARR 2026 January Submission 6285 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: language agents, non-verifiable tasks, delayed rewards, social simulation, self-reflection, multi-agent systems, reinforcement learning
Abstract: Agent tasks divide into verifiable tasks (e.g., math, code), which carry immediate ground-truth rewards, and non-verifiable tasks (e.g., marketing, policy, research communication), where rewards appear instant but are fundamentally delayed when considering users' long-term goals. A user asking an LLM to write marketing copy may receive immediate output, but their true objective—readership, engagement, influence—unfolds over time through social propagation. Current approaches to non-verifiable tasks—self-refine, LLM-as-judge, multi-agent debate—rely on instant feedback that cannot capture this delayed, emergent value. We argue for a paradigm shift: from instant feedback to delayed reward derivation via task-appropriate simulation environments. Different tasks require different simulations: social media simulators for viral content, academic platforms for research communication, or policy debate forums for proposals. We validate this using OASIS, a scalable social media simulator with LLM-powered agents, comparing self-refine with our method MARFE (Multi-Agent Reward-Free Evolution). Across four tasks evaluated by three frontier LLM judges, MARFE achieves a 58.3% win rate versus the baseline's 41.7%, demonstrating that delayed social feedback provides a superior signal for non-verifiable tasks.
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: AI / LLM Agents
Contribution Types: Position papers
Languages Studied: English
Submission Number: 6285