Prediction of actions and places by the time series recognition from images with Multimodal LLM

Published: 01 Jan 2024, Last Modified: 31 Oct 2024, Venue: ICSC 2024, License: CC BY-SA 4.0
Abstract: The risk of household accidents among older adults has grown as society ages, and the problem needs to be addressed. We took up the challenge of utilising explainable AI techniques to identify accident risks at home and suggest safer alternatives. This study combines knowledge graphs and large language models to solve real-world problems. Specifically, we address question answering over a multimodal dataset consisting of videos of daily activities and a knowledge graph. The dataset represents daily living activities in a virtual space and provides environmental information. The challenge is divided into two tasks. Task 1 covers direct questions, which are answered by querying the knowledge graph with SPARQL. Task 2 covers more complex questions that cannot be answered by search alone. In Task 1, the system answered all questions using information retrieved from the knowledge graph with SPARQL. In Task 2, a certain degree of success was achieved on complex questions by having a multimodal LLM reason over images created by concatenating time-series images. The source code used in the experiment is available at https://github.com/tomo1115tomo/kg_reasoning_challenge.
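The abstract does not say how frames are chosen or laid out before they are passed to the multimodal LLM. As a rough illustration only, and not the authors' implementation, the sketch below assumes that frames are sampled evenly from a clip and joined left to right in time order. The `Frame` type (a grayscale image as nested lists) and the functions `sample_indices` and `concat_horizontal` are hypothetical names introduced here; a real pipeline would more likely use an image library.

```python
from typing import List

# Hypothetical frame format: a grayscale image as rows of pixel values.
Frame = List[List[int]]


def sample_indices(n_frames: int, n_samples: int) -> List[int]:
    """Pick up to n_samples frame indices spread evenly over the clip."""
    if n_samples <= 0 or n_frames <= 0:
        return []
    if n_samples >= n_frames:
        return list(range(n_frames))
    if n_samples == 1:
        return [0]
    step = (n_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


def concat_horizontal(frames: List[Frame]) -> Frame:
    """Join frames of equal height left to right, preserving time order."""
    if not frames:
        return []
    height = len(frames[0])
    if any(len(f) != height for f in frames):
        raise ValueError("all frames must have the same height")
    # Row r of the result is row r of every frame, laid end to end.
    return [[px for f in frames for px in f[r]] for r in range(height)]


# Toy clip (assumption): 10 frames of 2x2 pixels, each filled with its index.
clip = [[[t, t], [t, t]] for t in range(10)]
picked = sample_indices(len(clip), 4)
strip = concat_horizontal([clip[i] for i in picked])
```

Placing the sampled frames side by side keeps their temporal order visible in a single image, which is one way a model that accepts still images could be asked about a sequence of actions.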