Keywords: Agent Skills, LLM Agents, Procedural Knowledge, Skill Benchmarking, Financial AI, Investment Management, Tool Use, Domain-Specific Agents, SkillsBench, AI Evaluation
TL;DR: FinSkillBench benchmarks how curated procedural skills improve LLM agent performance in high-stakes investment management tasks, showing large gains over no-skill and self-generated skill approaches.
Abstract: Investment management is a high-stakes domain in which agentic AI systems must do more than generate plausible text. They must retrieve point-in-time data, assemble correct computational inputs, invoke specialized methods, and produce auditable structured outputs. We introduce FinSkillBench, an evaluation suite designed to measure whether language model agents can effectively use financial domain skills to solve investment management tasks. The benchmark spans three domains (portfolio construction, risk management, and fundamental analysis) and includes 12 subtasks with 2,603 task episodes. Each episode provides point-in-time inputs, hidden ground truth, and a task-specific verifier. We compare three conditions: no skill, curated skill packages consisting of procedural documents and executable components, and self-generated skills in which the agent writes and reuses its own procedures within an episode. In a large-scale evaluation across 9 models, curated skills consistently improve performance, raising mean scores from 0.366 to 0.528, with the largest gains in portfolio construction and risk management. In contrast, self-generated skills provide little benefit despite higher computational cost. An independent evaluation using a separate agent framework reproduces the directional pattern across all three domains, with skill effects varying by subtask and harness. These findings show that, for investment-management agents, reliable procedural skills can be as important as model choice, while naive self-generation is often ineffective. FinSkillBench provides a rigorous benchmark for evaluating domain-specific procedural knowledge as a first-class component of agent design.
Presentation Mode: Yes, at least one author will attend and present in person.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59