Track: Short Paper Track (up to 3 pages)
Keywords: Large Language Vision Model, ADL
Abstract: With the increasing pervasiveness of video content throughout society, the demand for robust video-language models has become urgent. In this work, we introduce LLAVIDAL, a Large Language Vision Model tailored for Activities of Daily Living (ADL). Unlike existing models trained primarily on curated web videos, LLAVIDAL leverages a novel multiview RGB-D dataset, ADL-X, which comprises 100K untrimmed video-instruction pairs enriched with 3D skeletons and object trajectories to mimic real-world complexities. The model integrates these features to effectively understand the intricate human behaviors and spatiotemporal dynamics typical of daily activities. We also introduce ADLMCQ, a new benchmark designed to evaluate the proficiency of video-language models in interpreting ADL content. Our evaluations demonstrate that LLAVIDAL significantly outperforms existing models, showcasing a superior ability to process and reason about real-life video scenarios. These insights underscore the need for advanced processing techniques that handle the scale and multimodality of video data, as well as for comprehensive benchmarks that more accurately reflect real-world use cases. The instruction-tuning data is available at https://adl-x.github.io.
Submission Number: 39