StableSynthNet: Disentangled HyperNetworks for Enhanced On-device Multi-modal Model Generalization

ACL ARR 2024 June Submission 9 Authors

04 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The proliferation of smart devices continuously generates large volumes of varied personal multi-modal data, which calls for sophisticated, personalized, and device-aware services. Traditional, largely cloud-based AI systems struggle to adapt to the dynamic data flow between cloud services and devices. HyperNetworks improve performance and real-time processing over conventional fine-tuning approaches, but they tend to be over-parameterized because they underutilize the consistent structure shared across data. We propose StableSynthNet, a system built from three components: Driver Contrastive Training, Template-Driver Extraction, and Offset-Driver Separation. This design separates the template parameter driver, which captures characteristics common across data, from the offset parameter driver, which stores instance-specific details; the combined driver balances consistency with adaptability. Extensive experiments on video question answering and video retrieval demonstrate the efficiency and effectiveness of StableSynthNet.
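The abstract gives no implementation details, so the snippet below is only a minimal, hypothetical sketch of the template/offset decomposition it describes: a shared template weight plus a context-conditioned, low-rank offset produced by a small hypernetwork. All names (TemplateOffsetDriver, ctx_dim, rank) and the low-rank parameterization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemplateOffsetDriver(nn.Module):
    """Hypothetical sketch: synthesize a weight matrix as a shared template
    (consistency) plus a context-conditioned low-rank offset (adaptability)."""

    def __init__(self, ctx_dim: int, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        # Template parameters: shared across devices / data instances.
        self.template = nn.Parameter(torch.zeros(out_dim, in_dim))
        # Offset driver: maps a context embedding to low-rank factors.
        self.to_a = nn.Linear(ctx_dim, out_dim * rank)
        self.to_b = nn.Linear(ctx_dim, rank * in_dim)
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (ctx_dim,) embedding summarizing the device or instance data.
        a = self.to_a(ctx).view(self.out_dim, self.rank)
        b = self.to_b(ctx).view(self.rank, self.in_dim)
        offset = a @ b                   # instance-specific, low-rank part
        return self.template + offset    # combined driver output


# Usage: generate weights for one (hypothetical) device context and apply them.
driver = TemplateOffsetDriver(ctx_dim=32, in_dim=64, out_dim=64)
ctx = torch.randn(32)                    # context embedding (placeholder)
weight = driver(ctx)                     # (64, 64) synthesized weight
features = torch.randn(5, 64)
out = features @ weight.t()
```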
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering, cross-modal application, image text matching, multimodality
Languages Studied: English
Submission Number: 9