# FrameThinker

This is the official repository for the core code of the paper: **FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting**.

## 📖 About The Project

FrameThinker is a framework for long-video reasoning that replaces the passive, fixed-sampling approach of traditional models. Instead of processing a fixed set of pre-sampled frames, FrameThinker **actively** interrogates video content through a **multi-turn, iterative process**. It spotlights relevant frame sequences to gather evidence, guided by a Cognitive Consistency Verification (CCV) module that keeps its reasoning logical and interpretable. Across six challenging benchmarks, FrameThinker achieves an **average +10.4% accuracy improvement** over the baseline. As a highlight, it surpasses the strong LongVILA-R1 to set a new state of the art on the LongVideo-Reason benchmark, using just **20.6 frames on average compared to 512**.

## 💻 Core Code Overview

This repository contains the core code components of FrameThinker. Our full implementation is built on the verl framework for reinforcement learning. The files provided here are the key modules we developed to enable FrameThinker's capabilities.

The main components are:

-   `reward.py`: This file implements our custom reward functions used during the RL phase, defining the logic for final accuracy rewards and conditional action bonuses. It also includes the Cognitive Consistency Verification (CCV) module.

-   `agent/tool_envs.py` & `agent/parallel_env.py`: These files define the environment for multi-turn interaction. They manage the stateful exchange between the model and the video data, handling action calls from the model and returning the corresponding observations (e.g., a new set of frames).

-   `agent/envs/visual_agent/frame_thinker.py`: This is the heart of FrameThinker's action module. It implements the core logic for the defined actions, such as `choose frames` and `get frame number`. This script parses the model's generated text, identifies the chosen action and its parameters, and translates them into executable operations on the video data.
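
To give a rough feel for the parse-then-execute flow described for `frame_thinker.py`, here is a minimal, self-contained sketch. It is not the repository's implementation. The `<action>...</action>` tag format, the JSON payload, the `start`/`end` parameter names, and the helpers `parse_action` and `select_frame_indices` are all assumptions made for illustration; only the action names `choose frames` and `get frame number` come from the description above.

```python
import json
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tag format: the delimiters actually used by the repository may differ.
ACTION_PATTERN = re.compile(r"<action>(.*?)</action>", re.DOTALL)

# Action names come from the README; everything else here is an assumption.
KNOWN_ACTIONS = {"choose frames", "get frame number"}


@dataclass
class ParsedAction:
    name: str
    params: dict = field(default_factory=dict)


def parse_action(model_output: str) -> Optional[ParsedAction]:
    """Return the last well-formed action found in the text, or None."""
    matches = ACTION_PATTERN.findall(model_output)
    if not matches:
        return None
    try:
        payload = json.loads(matches[-1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    name = payload.get("name")
    if not isinstance(name, str) or name not in KNOWN_ACTIONS:
        return None
    # Everything except the action name is treated as a parameter.
    params = {key: value for key, value in payload.items() if key != "name"}
    return ParsedAction(name=name, params=params)


def select_frame_indices(action: ParsedAction, total_frames: int) -> List[int]:
    """Turn a 'choose frames' action into frame indices clamped to the video length."""
    if action.name != "choose frames":
        raise ValueError(f"unsupported action: {action.name}")
    start = max(0, int(action.params["start"]))
    end = min(total_frames - 1, int(action.params["end"]))
    return list(range(start, end + 1))


if __name__ == "__main__":
    text = (
        "The answer is probably near the middle of the video. "
        '<action>{"name": "choose frames", "start": 5, "end": 8}</action>'
    )
    action = parse_action(text)
    if action is not None:
        print(action.name, select_frame_indices(action, total_frames=100))
```

In this sketch, malformed or unknown actions yield `None` instead of raising, so a surrounding environment could respond with an error observation and let the model retry on the next turn. That is a design choice of the example, not a statement about how the repository handles invalid actions.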

