StableSynthNet: Disentangled HyperNetworks for Enhanced On-device Multi-modal Model Generalization

ACL ARR 2024 June Submission 9 Authors

04 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The proliferation of smart devices continuously generates large volumes of varied personal multi-modal data, which calls for sophisticated, personalized, and device-aware services. Traditional, largely cloud-based AI systems struggle to adapt to the dynamic data flow between cloud services and devices. HyperNetworks improve performance and real-time processing over conventional fine-tuning approaches, but they tend to be over-parameterized because they underutilize the consistent structure shared across data. We propose StableSynthNet, a system built from three components: Driver Contrastive Training, Template-Driver Extraction, and Offset-Driver Separation. This design separates the template parameter driver, which captures characteristics common across data, from the offset parameter driver, which stores instance-specific details; the combined driver balances consistency with adaptability. Extensive experiments on video question answering and video retrieval demonstrate the efficiency and effectiveness of StableSynthNet.
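The abstract gives no implementation details, so the snippet below is only a minimal, hypothetical sketch of the template/offset decomposition it describes: a shared template weight plus a context-conditioned, low-rank offset produced by a small hypernetwork. All names (TemplateOffsetDriver, ctx_dim, rank) and the low-rank parameterization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemplateOffsetDriver(nn.Module):
    """Hypothetical sketch: synthesize a weight matrix as a shared template
    (consistency) plus a context-conditioned low-rank offset (adaptability)."""

    def __init__(self, ctx_dim: int, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        # Template parameters: shared across devices / data instances.
        self.template = nn.Parameter(torch.zeros(out_dim, in_dim))
        # Offset driver: maps a context embedding to low-rank factors.
        self.to_a = nn.Linear(ctx_dim, out_dim * rank)
        self.to_b = nn.Linear(ctx_dim, rank * in_dim)
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (ctx_dim,) embedding summarizing the device or instance data.
        a = self.to_a(ctx).view(self.out_dim, self.rank)
        b = self.to_b(ctx).view(self.rank, self.in_dim)
        offset = a @ b                   # instance-specific, low-rank part
        return self.template + offset    # combined driver output


# Usage: generate weights for one (hypothetical) device context and apply them.
driver = TemplateOffsetDriver(ctx_dim=32, in_dim=64, out_dim=64)
ctx = torch.randn(32)                    # context embedding (placeholder)
weight = driver(ctx)                     # (64, 64) synthesized weight
features = torch.randn(5, 64)
out = features @ weight.t()
```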
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering, cross-modal application, image text matching, multimodality
Languages Studied: English
Submission Number: 9