Abstract: Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO's ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.
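A minimal sketch of the construct-and-prune loop described above, assuming a simple episodic interface; helper names such as `run_episode`, `extract_high_reward_subtrajectories`, `generate_skill`, and `recent_reward` are illustrative placeholders rather than the authors' implementation:

```python
def skill_set_optimization(env, llm_act, num_episodes, reward_threshold):
    """Hypothetical outline of SSO as summarized in the abstract."""
    skills = []  # each skill: a subgoal plus instructions distilled from past trajectories

    for _ in range(num_episodes):
        # The LLM actor receives the current skill set in-context while acting.
        trajectory = run_episode(env, llm_act, in_context_skills=skills)

        # Construct: extract common high-reward subtrajectories and have the LLM
        # summarize each as a skill (subgoal + instructions).
        for sub in extract_high_reward_subtrajectories(trajectory, reward_threshold):
            skills.append(generate_skill(sub))

        # Refine: prune skills that no longer lead to high reward when followed.
        skills = [s for s in skills if recent_reward(s, trajectory) >= reward_threshold]

    return skills
```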