COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu; Meng Cao; Xinyuan Shi; Xiaodan Liang

COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu, Meng Cao, Xinyuan Shi, Xiaodan Liang

Published: 13 Jan 2026, Last Modified: 13 Jan 2026Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tool), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use finetuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering "catastrophic forgetting" of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then, relevant tools are dynamically selected based on the similarity between user instructions and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.

Submission Type: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=NT9tHHTlXn

Changes Since Last Submission: We have made the following revisions in response to the editor’s and reviewers’ comments: 1. Incorporation of rebuttal experiments and clarifications. We have incorporated the additional experiments and clarifications introduced during the rebuttal into the main paper and the appendix at appropriate locations, ensuring that the final manuscript is self-contained and complete. 2. Refinement of scalability claims. We have refined and re-scoped the claims regarding scalability to more accurately reflect the empirical evidence presented in the paper. The revised discussion avoids overgeneralization and clearly states the conditions under which the observed scalability holds. 3. Discussion of VideoToolBench limitations. We have added a more prominent discussion of the limitations of VideoToolBench as a synthetic benchmark in the main text. In particular, these limitations are explicitly addressed in Section 6 (Limitations).

Assigned Action Editor: ~Yuhang_Zang1

Submission Number: 6392

Loading