LLAVIDAL: Benchmarking \underline{L}arge \underline{LA}nguage \underline{VI}sion Models for \underline{D}aily \underline{A}ctivities of \underline{L}iving

29 May 2024 (modified: 13 Nov 2024) · Submitted to NeurIPS 2024 Datasets and Benchmarks Track · License: CC BY 4.0
Keywords: Activities of Daily Living
Abstract: Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing internet videos, yet they struggle with the visually perplexing dynamics of Activities of Daily Living (ADL) due to the scarcity of pertinent datasets and of models tailored to the relevant cues. To this end, we propose a framework for curating multiview ADL datasets to fine-tune LLVMs, resulting in \textbf{ADL-X}, comprising 100K RGB video-instruction pairs, language descriptions, 3D skeletons, and action-conditioned object trajectories. We introduce \textbf{LLAVIDAL}, an LLVM capable of incorporating 3D poses and relevant object trajectories to understand the intricate spatiotemporal relationships within ADL. Furthermore, we present a novel benchmark, \textbf{ADLMCQ}, for quantifying LLVM effectiveness in ADL scenarios. When trained on ADL-X, LLAVIDAL consistently achieves state-of-the-art performance across all ADL evaluation metrics. Qualitative analysis reveals LLAVIDAL's temporal reasoning capabilities in understanding ADL. The dataset is available at \href{https://adl-x.github.io/}{https://adl-x.github.io/}.
Supplementary Material: zip
Submission Number: 1877
