Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: ai agents, real world, agentic environment, agent evaluations, llm
TL;DR: We introduce a benchmark dataset constructed from diverse, real-world tasks and grounded in economic value, to advance LLM capabilities in the knowledge-work domain.
Abstract: Evaluating large language models (LLMs) on complex, end-to-end digital work remains an open challenge. Many existing benchmarks are synthetic, static, or single-domain, limiting their real-world applicability and economic relevance. We present LaborMarketplaceBenchmark, a dataset and evaluation pipeline derived from real tasks on LaborMarketplace. Starting from the marketplace corpus, we construct LaborMarketplaceBenchmark Qualified via heuristics-based filtering of fixed-price, single-milestone tasks and an automated feasibility assessment (Qualification Agent). We then derive LaborMarketplaceBenchmark Verified, a manually validated, PII-safe subset suitable for research use by the community. LaborMarketplaceBenchmark spans nine work categories and 572 unique task types; every task resulted in an accepted deliverable, with average payouts ranging from \$35 to \$250 per job, enabling economically grounded and dynamically refreshable evaluation. We show initial results for several leading LLMs on real-world Writing tasks, including human-in-the-loop experiments where agents iterate on their work based on human feedback. LaborMarketplaceBenchmark provides a practical, reproducible path to measuring real-world progress while illuminating where current systems fall short.
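For concreteness, the sketch below illustrates what the heuristics-based filtering step behind LaborMarketplaceBenchmark Qualified might look like. The Task fields (is_fixed_price, num_milestones, deliverable_accepted, payout_usd) and the payout thresholds are illustrative assumptions drawn from the abstract, not the actual LaborMarketplace schema or the paper's exact criteria.

```python
# Minimal sketch of a heuristic filtering pass over marketplace tasks.
# All field names and thresholds are hypothetical; the paper does not
# specify the underlying schema.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    task_id: str
    category: str
    is_fixed_price: bool          # assumed flag for fixed-price jobs
    num_milestones: int           # assumed milestone count
    deliverable_accepted: bool    # assumed acceptance status
    payout_usd: float             # assumed payout field


def heuristic_filter(tasks: List[Task],
                     min_payout: float = 35.0,
                     max_payout: float = 250.0) -> List[Task]:
    """Keep fixed-price, single-milestone tasks whose deliverable was
    accepted and whose payout falls in an illustrative target range."""
    return [
        t for t in tasks
        if t.is_fixed_price
        and t.num_milestones == 1
        and t.deliverable_accepted
        and min_payout <= t.payout_usd <= max_payout
    ]


# Tasks passing this filter would then be handed to the Qualification
# Agent for automated feasibility assessment (not shown here).
```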
Submission Number: 35