Abstract: Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs and often lack the structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions from existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial relation extraction, and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly improves inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4 points, with large boosts in the causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured, CoT-style supervision framework for improving compositional video reasoning.
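For illustration only, below is a minimal sketch of how a single NeXT-QA question might be decomposed into the four CoTasks and serialized ahead of the final question in the model input. All field names, example values, and the prompt layout are assumptions for exposition, not the paper's actual schema.

# Hypothetical CoTasks-style sample; field names and prompt layout are
# illustrative assumptions, not the paper's actual data format.

def build_cot_prompt(sample: dict) -> str:
    """Serialize the four entity-level CoT steps ahead of the final question."""
    steps = sample["cot_steps"]
    lines = [
        f"[Frame localization] relevant frames: {steps['frame_localization']}",
        f"[Entity tracking] per-frame boxes: {steps['entity_tracking']}",
        f"[Spatial relations] {'; '.join(steps['spatial_relations'])}",
        f"[Temporal relations] {'; '.join(steps['temporal_relations'])}",
        f"Question: {sample['question']}",
    ]
    return "\n".join(lines)

sample = {
    "video_id": "nextqa_0001",  # hypothetical NeXT-QA video id
    "question": "Why does the boy pick up the ball?",
    "cot_steps": {
        "frame_localization": [12, 13, 14],
        "entity_tracking": {
            # entity -> {frame index: [x1, y1, x2, y2] bounding box}
            "boy": {12: [40, 60, 120, 200], 13: [44, 62, 124, 202]},
            "ball": {12: [150, 180, 170, 200], 13: [150, 180, 170, 200]},
        },
        "spatial_relations": ["boy is next to the ball (frame 12)"],
        "temporal_relations": ["boy bends down before grasping the ball"],
    },
}

print(build_cot_prompt(sample))

The key design idea this sketch reflects is that the intermediate object-level evidence is made explicit in the input, so the model answers the final question conditioned on grounded spatiotemporal steps rather than on the raw video-question pair alone.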
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: LLM, MLLM, instruction tuning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 5.2 Dataset
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 5.2 Dataset
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4.2.2 (CoTasks construction) and Section 5.3 (Performance)
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5 Experiments
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5.3 Performance
C4 Parameters For Packages: Yes
C4 Elaboration: We use GPT-4 as the evaluator, as detailed in Section 5.1 (LLM-based Evaluator)
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: No
D3 Data Consent: No
D4 Ethics Review Board Approval: No
D5 Characteristics Of Annotators: No
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: ChatGPT
Author Submission Checklist: Yes
Submission Number: 107