Revision History for Frame-Voyager: Learning to Query Frames for Video Large Language Models

Camera Ready Revision Edit by Authors

  • 28 Mar 2025, 03:14 Coordinated Universal Time
  • Title: Frame-Voyager: Learning to Query Frames for Video Large Language Models
  • Authors: Sicheng Yu, CHENGKAI JIN, Huanyu Wang, Zhenghao Chen, Sheng Jin, ZHONGRONG ZUO, XU XIAOLEI, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun
  • Keywords: Video-LLM, Adaptive Frame Sampling
  • TLDR: Frame-Voyager queries optimal frame combinations for Video-LLMs based on task-specific queries, outperforming existing SOTA methods and achieving best results in Video Question Answering benchmarks as a plug-and-play solution.
  • Abstract:

    Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the variations in information density across videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager, which learns to query informative frame combinations based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline that ranks frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on the Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.

  • PDF: pdf
  • Primary Area: applications to computer vision, audio, language, and other modalities

    Edit Info


    Readers: Everyone
    Writers: ICLR 2025 Conference, ICLR 2025 Conference Submission13344 Authors
    Signatures: ICLR 2025 Conference Submission13344 Authors
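
    The labeling pipeline described in the abstract (traverse the T-frame combinations of an M-frame video, score each with a Video-LLM's prediction loss, and rank them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `loss_fn` here is a toy stand-in for the Video-LLM prediction loss, and all names are hypothetical.

    ```python
    from itertools import combinations

    def rank_frame_combinations(num_frames, t, loss_fn):
        """Enumerate all T-frame combinations of an M-frame video and
        rank them by a scoring function (a stand-in for the Video-LLM
        prediction loss used as supervision in the paper)."""
        combos = list(combinations(range(num_frames), t))
        # Lower loss = more informative combination for the given query.
        return sorted(combos, key=loss_fn)

    # Toy stand-in loss: pretend frames later in the video are more
    # informative, so combinations with larger indices score lower.
    def toy_loss(combo):
        return -sum(combo)

    # Example: a video of M = 5 frames, selecting T = 2 frames.
    ranking = rank_frame_combinations(5, 2, toy_loss)
    best = ranking[0]  # combination with the lowest loss, here (3, 4)
    ```

    Note that exhaustive traversal costs C(M, T) forward passes through the Video-LLM, which is why this ranking is computed offline to produce training labels rather than at inference time; the trained Frame-Voyager then predicts good combinations directly.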

Camera Ready Revision Edit by Authors

  • 28 Mar 2025, 03:13 Coordinated Universal Time

Camera Ready Revision Edit by Authors

  • 27 Feb 2025, 07:14 Coordinated Universal Time

Rebuttal Revision Edit by Authors

  • 28 Nov 2024, 10:17 Coordinated Universal Time

Rebuttal Revision Edit by Authors

  • 27 Nov 2024, 14:44 Coordinated Universal Time