EAGLE: Egocentric AGgregated Language-vision Engine


Abstract

Recent advancements in egocentric video analysis have been significantly propelled by the expansion of datasets and a growing array of diverse understanding tasks. Despite this progress, the field faces challenges arising from fragmented model development and inconsistent annotation standards, which hinder a holistic interpretation of video content. In response, we present the Egocentric AGgregated Language Engine (EAGLE) model, accompanied by the comprehensive EAGLE-400K dataset. EAGLE, an innovative video-based multimodal large language model (MLLM), has been specifically fine-tuned for egocentric video content using the extensive data in EAGLE-400K. This dataset comprises 400K visual instruction-tuning samples from diverse sources and enhances the learning of activities and procedural knowledge. Moreover, we introduce a novel evaluation metric that facilitates a thorough and nuanced assessment of MLLMs in the realm of egocentric video understanding. Our extensive experiments demonstrate EAGLE's superior performance over existing MLLMs, underscoring its effectiveness. This work presents an advance in balancing task specialization with comprehensive egocentric video understanding.

The qualitative results showcased in this report provide intriguing insights into the unique strengths and weaknesses of each model. We encourage reviewers to delve into these detailed analyses, as doing so will offer a comprehensive understanding of how each model performs in the realm of egocentric video understanding.

Dataset Annotation

Below are examples of the annotations used in creating the EAGLE-400K dataset. In contrast to traditional image captioning, egocentric object annotation does not necessarily cover every object within the field of view; it primarily labels the most essential or relevant objects. Moreover, because fine-grained annotation is costly, the EPIC-Kitchens dataset contains object-level annotations for only 119 of its 700 videos.

All videos in the PTG subset are processed, and object trajectories are annotated as illustrated below. We apply filtering criteria, removing bounding boxes that cover more than 0.4 or less than 0.01 of the image area, and retaining only objects whose trajectories last longer than 2 seconds. We keep only each object's positional information, discarding the class label produced by the detector, since these labels are often unclear.
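As an illustration, this filtering can be expressed in a few lines of code. The sketch below is a minimal, hypothetical implementation that assumes trajectories are stored as lists of normalized boxes with timestamps; the field names and data layout are assumptions, and only the thresholds (0.01, 0.4, and 2 seconds) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    t: float   # timestamp in seconds
    x: float   # normalized left coordinate
    y: float   # normalized top coordinate
    w: float   # normalized width
    h: float   # normalized height


def filter_trajectories(trajectories, min_area=0.01, max_area=0.4, min_duration=2.0):
    """Keep trajectories whose boxes fall within the area thresholds and
    whose duration exceeds `min_duration` seconds; drop class labels."""
    kept = {}
    for obj_id, boxes in trajectories.items():
        # Drop individual boxes that are too large or too small.
        boxes = [b for b in boxes if min_area <= b.w * b.h <= max_area]
        if not boxes:
            continue
        duration = max(b.t for b in boxes) - min(b.t for b in boxes)
        if duration > min_duration:
            # Keep positional information only (no detector class label).
            kept[obj_id] = [(b.t, b.x, b.y, b.w, b.h) for b in boxes]
    return kept
```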

EPIC w/o obj: A person is holding a spoon and spreading peanut butter on a plate.

EPIC w/ obj: A person is pouring a liquid into a bowl, which appears to be a sauce or dip.

Ego4D: A person is holding a knife and a peanut butter jar, preparing to spread the peanut butter on a piece of bread.

PTG: A red cutting board is the key element in the scene, and it appears to be used for cutting food.

Statistics

Instructions

As indicated by the word frequency list, the instruction phrases reveal a detailed focus on describing and understanding kitchen activities. Key terms such as "video", "describe", and "current step" suggest an analysis of the sequential and procedural aspects of the tasks in the videos. There is a significant emphasis on visual and temporal detail, with words like "frame", "second", and "visual evidence" highlighting attention to both spatial and temporal dimensions. Additionally, terms like "confirm if", "suggest that", and "evidence support" point to a process of confirmation and inference, indicating a comprehensive approach to interpreting the video content.

[Figures: word-frequency statistics of the instructions]

Responses

We observe that phrases such as "consistent with", "as indicated", "suggests that", and "aligns with" are used to align observations with known information and context. The responses also attend to temporal relations, evident in phrases like "current step", "future action", and "making process", indicating a focus on the sequence and timing of cooking actions. Furthermore, terms like "visual evidence", "object trajectories", and "video content" imply a reliance on visual analysis of the cooking process. This indicates that the model's responses are expected to be not only detailed and specific but also analytical and instructional, providing a comprehensive understanding of the video content.

[Figures: word-frequency statistics of the responses]
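For readers who wish to reproduce phrase statistics of this kind, the sketch below counts the most frequent word pairs in a list of instruction or response strings. It is a minimal recipe using only the Python standard library; the example strings are hypothetical stand-ins rather than actual EAGLE-400K data, and this is not necessarily the exact tooling used to produce the figures above.

```python
from collections import Counter
from itertools import islice
import re


def ngram_counts(texts, n=2, top_k=20):
    """Count the most frequent lowercase n-grams across a list of strings."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Build n-grams by zipping n staggered views of the token list.
        counts.update(zip(*(islice(tokens, i, None) for i in range(n))))
    return counts.most_common(top_k)


# Hypothetical strings standing in for EAGLE-400K instructions.
instructions = [
    "Describe the current step shown in the video.",
    "Confirm if the visual evidence supports the current step.",
]
print(ngram_counts(instructions, n=2, top_k=5))
```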

Qualitative Results

We present the qualitative results of each model across all subsets. Since Video-LLaMA samples eight frames from arbitrarily long videos, we provide two sets of results: one from the even-indexed and one from the odd-indexed frames. LaViLa processes four consecutive frames, yielding four distinct outcomes. We tasked the remaining models with describing each frame individually, using their official prompts.
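To make this sampling protocol concrete, the sketch below shows one reading of it: the clip's frames are split into even- and odd-indexed halves for Video-LLaMA, and into consecutive four-frame windows for LaViLa. The 16-frame clip length is an assumption (the clips shown below span 16 seconds); only the eight-frame and four-frame figures come from the text.

```python
def split_even_odd(frames):
    """Split a frame sequence into even- and odd-indexed halves,
    e.g. 16 frames -> two 8-frame inputs for Video-LLaMA."""
    return frames[0::2], frames[1::2]


def consecutive_windows(frames, size=4):
    """Yield non-overlapping windows of `size` consecutive frames,
    e.g. 16 frames -> four 4-frame inputs for LaViLa."""
    for start in range(0, len(frames) - size + 1, size):
        yield frames[start:start + size]


clip = list(range(16))                      # hypothetical 16-frame clip
even_frames, odd_frames = split_even_odd(clip)
lavila_inputs = list(consecutive_windows(clip))
```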

Our observations indicate that BLIP-1 tends to generate single-word object descriptions, which fail to provide insight into the key activities in the video. Both BLIP-2 and InstructBLIP provide more extensive and accurate descriptions; however, their responses include some discrepancies and hallucinations about object-associated actions. LaViLa shows proficiency in capturing movement, yet it struggles with object identification. LLaVA, expected to excel at comprehensive frame description, often misses key objects and tends to roam over the image without focus. A notable issue is its occasional failure to generate any description at all.

Shikra stands out in object identification, as it is specifically trained for object grounding. However, its responses often lack depth, failing to adequately describe action transitions. ImageBind-LLM also demonstrates effectiveness in scene description and object identification, including associated actions. However, many of its frame descriptions include actions hallucinated by the model, which significantly reduces its accuracy. Video-LLaMA, though trained to generate lengthy descriptions, surprisingly often produces text unrelated to the actual video. Although its descriptions are long, their irrelevance to the video context frequently results in lower scores.

Compared with other models, EAGLE accurately reflects the sequence of events and provides a comprehensive and specific description, detailing each action along with the probable setting. The response is succinct, conveying the necessary information without unnecessary elaboration. The narrative remains largely consistent, maintaining a clear and logical sequence. However, there are areas for further enhancement. While the model excels in capturing event sequences and setting descriptions, it can be improved to better capture the objects shown in the view and their potential usage. Given that obtaining fine-grained annotations is an expensive endeavor, the model should be designed to leverage an object detector, complemented by the imperfect annotations present in the PTG subset, to enhance its understanding of the scene.

In summary, each model shows unique strengths and weaknesses, offering insights into their capabilities and areas for improvement.


EPIC P28_25-176-192

[Figure: per-model frame descriptions for this clip]


EPIC P32_06-256-272

[Figure: per-model frame descriptions for this clip]


PTG A_pin_mevo_20_video-0000-0-16

[Figure: per-model frame descriptions for this clip]


PTG B_coffee_HL2_14_video-0017-16-32

[Figure: per-model frame descriptions for this clip]


E4D d36f586a-7b7c-43fd-b18d-a90b3e32535f_128_144

[Figure: per-model frame descriptions for this clip]


E4D f281dde6-b964-4bc2-90a2-49ca2830e05d_320_336

[Figure: per-model frame descriptions for this clip]