Abstract: This paper addresses the challenge of understanding and interpreting human activities in procedural video content. Unlike previous approaches that rely solely on human annotations, we propose to leverage the vast amount of information stored in large language models (LLMs) to improve the prediction capabilities of vision models for procedural videos. Our framework uses LLMs to extract detailed procedural instructions and contextually relevant information to enhance the training process of video models. We demonstrate that this methodology not only refines the model’s ability to predict complex hierarchical human activities but also extends its zero-shot capabilities, allowing it to generalize to unseen activities, as well as across hierarchical levels. Our simple approach outperforms baselines across different activity recognition tasks and datasets, demonstrating the benefits of exploiting the structured knowledge within LLMs.
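As a rough illustration of the kind of knowledge extraction the abstract describes, the sketch below shows how one might prompt an LLM to decompose a high-level activity into ordered procedural steps that could then serve as auxiliary text supervision for a video model. This is a minimal sketch under assumptions: the `query_llm` helper, the prompt wording, and the canned response are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Illustrative sketch (not the authors' released code): decompose a high-level
# activity label into ordered procedural steps with an LLM. The resulting step
# list could be used as auxiliary text targets during video-model training.

from typing import List


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with whatever LLM interface is available."""
    # Canned response so the sketch runs without external services.
    return (
        "1. Crack eggs into a bowl\n"
        "2. Whisk the eggs\n"
        "3. Heat a pan\n"
        "4. Pour in the eggs and stir until set"
    )


def extract_procedural_steps(activity: str) -> List[str]:
    """Ask the LLM for the ordered low-level steps that make up an activity."""
    prompt = (
        f"List, as a numbered sequence, the low-level steps needed to '{activity}'. "
        "Write one short step per line."
    )
    response = query_llm(prompt)
    # Strip the leading numbering and keep one clean step string per line.
    return [
        line.split(".", 1)[-1].strip()
        for line in response.splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    for step in extract_procedural_steps("make scrambled eggs"):
        print(step)
```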
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressed reviewers' suggestions, mostly involving:
- Clarifications in the text, providing more details or correcting errors.
- Reordering the figures and tables to better match the location where they are mentioned in the text.
- Adding further citations.
Assigned Action Editor: ~Efstratios_Gavves1
Submission Number: 3079