Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Xiyang Wu; Zongxia Li; Guangyao Shi; Alexander Duffy; Tyler Marques; Matthew Lyle Olson; Tianyi Zhou; Dinesh Manocha

Published: 15 May 2026, Last Modified: 22 May 2026, AgentSkills 2026 Poster, CC BY 4.0

Keywords: LLM agents, Agent skills, Skill bank, Long-horizon Decision Making, Co-evolution, Reinforcement Learning, GRPO, Game Playing

TL;DR: COS-PLAY is a co-evolution framework that enables LLM agents to discover, retain, and reuse structured skills for long-horizon game decision-making, achieving strong gains with an 8B model across diverse interactive environments.

Abstract: Long-horizon interactive environments provide a natural testbed for evaluating agents’ ability to use skills, as they require multi-step reasoning, skill chaining over many timesteps, and robust decision-making under delayed rewards and partial observability. Games offer a particularly diverse and reproducible class of such environments, making them a controllable setting for studying multi-skill long-horizon behavior. While Large Language Models (LLMs) are increasingly used as game-playing agents, they often struggle with consistent long-horizon decision-making because they lack mechanisms to discover, retain, and reuse structured skills across episodes. We introduce COS-PLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action generation, while an agent-managed skill pipeline discovers reusable skills from the agent’s unlabeled rollouts. The two components improve jointly: the decision agent learns more effective skill retrieval and action-taking policies, while the skill bank agent continually extracts, refines, and updates skills together with their effect contracts. Across six game environments, COS-PLAY with an 8B base model achieves a 25.1% average reward improvement over four frontier LLM baselines on single-player game benchmarks, while remaining competitive on multi-player social reasoning games.
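The retrieve-then-update loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Skill`, `SkillBank`, `retrieve`, `update_from_rollout`, and `record` are hypothetical, states are reduced to sets of string tags, retrieval is plain tag overlap, and the learned components (the LLM decision agent and its GRPO training) are omitted entirely. It only shows one plausible reading of a skill stored together with an "effect contract", meaning the set of state changes the skill is expected to produce.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    steps: tuple               # action sequence the skill stands for
    preconditions: frozenset   # state tags under which it was observed
    effects: frozenset         # assumed "effect contract": tags expected afterwards
    uses: int = 0
    successes: int = 0


@dataclass
class SkillBank:
    skills: dict = field(default_factory=dict)

    def retrieve(self, state_tags, k=1):
        """Return up to k skills whose preconditions overlap the current state."""
        tags = frozenset(state_tags)
        scored = [(len(s.preconditions & tags), s.name) for s in self.skills.values()]
        scored = [(score, name) for score, name in scored if score > 0]
        scored.sort(key=lambda t: (-t[0], t[1]))  # best overlap first, ties by name
        return [self.skills[name] for _, name in scored[:k]]

    def update_from_rollout(self, rollout, window=2):
        """Turn each length-`window` action segment of an unlabeled rollout
        into a candidate skill.

        rollout: list of (tags_before, action, tags_after) triples.
        """
        for i in range(len(rollout) - window + 1):
            segment = rollout[i:i + window]
            steps = tuple(action for _, action, _ in segment)
            before = frozenset(segment[0][0])
            after = frozenset(segment[-1][2])
            effects = after - before  # tags the segment newly established
            if not effects:
                continue  # a segment that changes nothing is not stored
            name = "+".join(steps)
            if name in self.skills:
                # Refine: keep only what held on every observation so far.
                known = self.skills[name]
                known.preconditions &= before
                known.effects &= effects
            else:
                self.skills[name] = Skill(name, steps, before, effects)

    def record(self, skill, observed_tags):
        """Check the effect contract after execution and update usage counts."""
        held = skill.effects <= frozenset(observed_tags)
        skill.uses += 1
        skill.successes += int(held)
        return held
```

In a full system the tag-overlap ranking would presumably be replaced by the decision agent's learned retrieval policy, and the per-skill success counts are one simple signal a bank manager could use when deciding which skills to refine or drop.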

Submission Number: 27
