Lightweight Multi-modal Emergency Detection and Translation for Extremely Low-Resource Contexts

22 Oct 2025 (modified: 23 Dec 2025) · Submitted to MMLoSo 2025 · CC BY 4.0
Keywords: Multi-modal systems, Disaster Response, Federated Learning
TL;DR: A lightweight model to identify disasters, accidents, and emergencies on low-resource hardware
Abstract: Emergency reporting in remote regions is often delayed by infrastructural challenges and language barriers. While multimodal AI offers a solution, its deployment is hindered by scarce localized data and computational constraints. This paper addresses extreme data scarcity by proposing and evaluating a lightweight Vision-to-Telugu emergency classification pipeline. To simulate this data constraint, we use a novel, imbalanced 70-image dataset spanning six categories (e.g., fire, snake bite). We benchmark 15 vision encoders and pair the classifier with a zero-overhead dictionary lookup for 100% accurate Telugu translation. To validate results on such a small set, we conduct a Bootstrap-Wilcoxon statistical analysis. Our findings show that DINOv2-Base (82.45% mean accuracy) significantly outperforms a CLIP-ViT-B32 baseline (53.91%) with a large effect size ($p < 0.001$, $\delta = +0.820$). This work provides a blueprint and a robust validation methodology for effective multi-modal systems in severely data-constrained, social-impact settings.
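
The abstract names a Bootstrap-Wilcoxon analysis but does not spell out the procedure; the sketch below is one plausible reading, assuming paired per-image correctness vectors for the two models (e.g., DINOv2-Base vs. CLIP-ViT-B32), bootstrap resampling of the test images, a Wilcoxon signed-rank test on the paired bootstrap accuracies, and Cliff's delta as the effect size. All function names and parameters are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of a Bootstrap-Wilcoxon comparison for a small test set.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

def cliffs_delta(x, y):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs."""
    return np.sign(x[:, None] - y[None, :]).mean()

def bootstrap_wilcoxon(correct_a, correct_b, n_boot=1000):
    """Compare two models from their per-image correctness vectors (0/1 arrays)."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    acc_a = np.empty(n_boot)
    acc_b = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample test images with replacement
        acc_a[i] = a[idx].mean()           # bootstrap accuracy, model A
        acc_b[i] = b[idx].mean()           # bootstrap accuracy, model B
    stat, p = wilcoxon(acc_a, acc_b)       # paired, non-parametric significance test
    return acc_a.mean(), acc_b.mean(), cliffs_delta(acc_a, acc_b), p

# Usage with stand-in data for the 70-image setting (not the paper's data):
# mean_a, mean_b, delta, p = bootstrap_wilcoxon(dino_correct, clip_correct)
```

A paired non-parametric test of this kind avoids normality assumptions that would be hard to justify with only 70 images.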
Submission Number: 18