Abstract: On crowdsourcing platforms, the quality of collected data depends on the clarity of instructions, but requesters struggle to write instructions that capture their own implicit criteria. To address this issue, we propose a novel framework that uses two Large Language Models (LLMs) – a Creator and an Evaluator – to automatically explore the space of possible instructions. In this iterative process, the Creator LLM generates diverse instruction candidates, and the Evaluator LLM, acting as a proxy for human workers, assesses how each candidate performs on the task and returns a fitness score. Our experiments show that this exploratory approach is effective for discovering high-quality instructions, even though the process does not improve monotonically. Using the best-performing instruction created by our method with gemma3, we achieved 5.4% higher accuracy and 0.035 lower RMSE than when gemma3 used an instruction created by a requester.
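The Creator/Evaluator loop described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`explore`, `toy_creator`, `toy_evaluator`), the use of accuracy as the fitness score, the fixed number of rounds, and the toy data. The two toy functions are deterministic stand-ins for LLM calls, so the sketch runs without any model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[str, float]  # (instruction text, fitness score)


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of labels that match the gold labels exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def rmse(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Root mean squared error between predicted and gold labels."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))


def explore(
    creator: Callable[[List[Scored]], List[str]],
    evaluator: Callable[[str, Sequence[str]], List[int]],
    seed: str,
    items: Sequence[str],
    gold: Sequence[int],
    rounds: int = 3,
) -> Tuple[Scored, List[Scored]]:
    """Score the seed, then repeatedly generate and score new candidates.

    Assumption: fitness is plain accuracy against gold labels. The best
    instruction seen so far is returned, so a weak round costs nothing.
    """
    history: List[Scored] = [(seed, accuracy(evaluator(seed, items), gold))]
    for _ in range(rounds):
        for candidate in creator(history):
            score = accuracy(evaluator(candidate, items), gold)
            history.append((candidate, score))
    return max(history, key=lambda pair: pair[1]), history


def toy_creator(history: List[Scored]) -> List[str]:
    """Hypothetical stand-in for the Creator LLM: emits two fixed variants."""
    n = len(history)
    return [
        f"Rate the text from 1 to 5. Variant {n}",
        f"Rate the text from 1 to 5; treat complaints as low. Variant {n}",
    ]


def toy_evaluator(instruction: str, items: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for the Evaluator LLM acting as a worker."""
    aware = "complaints" in instruction
    return [1 if (aware and "refund" in text) else 3 for text in items]


items = ["I want a refund", "It works fine", "refund please", "okay product"]
gold = [1, 3, 1, 3]
best, history = explore(toy_creator, toy_evaluator, "Rate the text.", items, gold, rounds=2)
print(best)
print([score for _, score in history])
```

In this toy run the scores rise and fall from one candidate to the next, which is why the sketch keeps the full history and selects the best instruction seen rather than the last one generated.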
External IDs: dblp:conf/iiwas/TanakaS25