Tri-Modal Streaming Fusion: Real-Time Vision Integration for LLaMA-Omni2-0.5B via Sparse Cross-Attention Networks
Keywords: Multimodal AI, Edge Computing, Streaming Fusion, Sparse Attention, Real-time Vision-Language Models
Presentation Preference: Yes
Abstract: We propose TriStream-Omni, a novel architecture that extends LLaMA-Omni2-0.5B's speech-language capabilities to include vision processing while maintaining sub-600ms latency. Our approach introduces three key innovations:
First, we implement Sparse Temporal Vision Encoding (STVE), which processes visual inputs through a lightweight MobileViT backbone with temporal pooling, reducing computational overhead by 73% compared to traditional vision transformers. STVE extracts only salient visual tokens using learned importance masks, dynamically adjusting token density based on image complexity.
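The sketch below illustrates the STVE idea in PyTorch, under stated assumptions: a stand-in patch-embedding layer replaces the MobileViT backbone, and the module name, dimensions, and entropy-based token budget are illustrative choices rather than the paper's implementation.

```python
# Minimal sketch of Sparse Temporal Vision Encoding (STVE): temporal pooling over
# per-frame patch tokens, then selection of salient tokens via a learned importance
# mask with a complexity-dependent token budget. All names/sizes are assumptions.
import torch
import torch.nn as nn


class SparseTemporalVisionEncoder(nn.Module):
    def __init__(self, dim=256, max_tokens=64):
        super().__init__()
        # Stand-in for a MobileViT-style lightweight backbone: frames -> patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Learned importance mask: scores each token's salience.
        self.importance = nn.Linear(dim, 1)
        self.max_tokens = max_tokens

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t, _, _, _ = frames.shape
        feats = self.patch_embed(frames.flatten(0, 1))        # (b*t, dim, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)             # (b*t, n_patches, dim)
        tokens = tokens.reshape(b, t, -1, tokens.size(-1)).mean(dim=1)  # temporal pooling -> (b, n, dim)

        scores = self.importance(tokens).squeeze(-1)          # (b, n) salience per token
        probs = scores.softmax(dim=-1)
        # Dynamic token budget: keep more tokens when salience is spread out
        # (high entropy ~ complex image), fewer when it is concentrated.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)               # (b,)
        budget = (entropy / entropy.max().clamp_min(1e-6) * self.max_tokens)
        budget = budget.long().clamp(min=8)
        # For simplicity a single batch-wide budget is used here.
        k = int(budget.max().item())

        topk = scores.topk(k, dim=-1).indices                 # (b, k)
        sparse = torch.gather(tokens, 1, topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return sparse                                         # (b, k, dim) salient visual tokens


if __name__ == "__main__":
    enc = SparseTemporalVisionEncoder()
    out = enc(torch.randn(2, 4, 3, 224, 224))                 # 2 clips, 4 frames each
    print(out.shape)                                          # (2, k, 256)
```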
Second, our Asynchronous Tri-Modal Fusion (ATF) mechanism enables parallel processing of speech, text, and vision streams through independent encoding pathways that converge via learned routing weights. Unlike conventional sequential processing, ATF employs a novel "fusion-on-demand" strategy where modalities are combined only when cross-modal reasoning is required, preserving the model's original 583ms speech latency for audio-only queries.
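A minimal sketch of the fusion-on-demand routing follows, assuming each modality arrives as a token sequence from its own independently run encoder; the single-modality bypass rule, routing head, and fusion layer are illustrative stand-ins, not the paper's code.

```python
# Sketch of Asynchronous Tri-Modal Fusion (ATF): independent streams converge via
# learned routing weights, and cross-modal fusion is engaged only when more than
# one modality is present, so audio-only queries skip fusion entirely.
import torch
import torch.nn as nn


class AsyncTriModalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.dim = dim
        # Learned routing weights: one weight per modality slot (speech, text, vision).
        self.route = nn.Linear(3 * dim, 3)
        # Cross-modal fusion block, engaged only on demand.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, speech=None, text=None, vision=None):
        # Each input: (batch, seq, dim) from its own encoder, or None if absent.
        streams = [s for s in (speech, text, vision) if s is not None]
        if len(streams) == 1:
            # Fusion on demand: single-modality queries (e.g. audio-only) bypass
            # fusion, preserving the base model's speech-only latency path.
            return streams[0]

        batch = streams[0].size(0)
        zero = streams[0].new_zeros(batch, self.dim)
        pooled = [s.mean(dim=1) if s is not None else zero for s in (speech, text, vision)]
        weights = self.route(torch.cat(pooled, dim=-1)).softmax(dim=-1)   # (batch, 3)

        weighted = [s * weights[:, i, None, None]
                    for i, s in enumerate((speech, text, vision)) if s is not None]
        fused_input = torch.cat(weighted, dim=1)       # concatenate along the sequence axis
        return self.fusion(fused_input)                # cross-modal attention over all tokens


if __name__ == "__main__":
    atf = AsyncTriModalFusion()
    speech = torch.randn(2, 50, 256)
    vision = torch.randn(2, 32, 256)
    print(atf(speech=speech).shape)                    # audio-only: fusion skipped
    print(atf(speech=speech, vision=vision).shape)     # speech+vision: fused tokens
```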
Third, we introduce Cascaded Mixture-of-Experts (CMoE) routing, where specialized expert networks handle different modal combinations. Each expert (speech-only, vision-only, speech-vision, full tri-modal) is activated based on input characteristics, allowing the 0.5B model to achieve performance comparable to 3B parameter models. The cascade design processes simple queries through lightweight experts first, engaging complex tri-modal experts only when necessary, reducing average compute by 67%.
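The sketch below shows one way such a cascade could be wired, with small feed-forward stubs as experts and a hypothetical escalation gate; the expert granularity, gate, and 0.5 stopping threshold are assumptions for illustration only.

```python
# Sketch of Cascaded Mixture-of-Experts (CMoE) routing: a cheap expert handles
# simple queries first, and heavier multi-modal experts are engaged only when a
# learned gate judges escalation necessary.
import torch
import torch.nn as nn


def make_expert(dim, hidden):
    # Each expert is a small feed-forward block; real experts would differ in capacity.
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class CascadedMoE(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.experts = nn.ModuleDict({
            "speech_only": make_expert(dim, 256),      # lightweight expert, tried first
            "vision_only": make_expert(dim, 256),
            "speech_vision": make_expert(dim, 512),
            "tri_modal": make_expert(dim, 1024),       # heaviest expert, engaged last
        })
        # Hypothetical gate estimating whether the current expert's output suffices.
        self.escalate_gate = nn.Linear(dim, 1)

    def forward(self, hidden, modalities):
        # hidden: (batch, dim) pooled fused representation.
        # modalities: subset of {"speech", "text", "vision"} present in the query.
        if modalities == {"speech"}:
            order = ["speech_only"]
        elif modalities == {"vision"}:
            order = ["vision_only"]
        else:
            # Cascade: cheaper bi-modal expert first, full tri-modal expert on demand.
            order = ["speech_vision", "tri_modal"]

        out = hidden
        for name in order:
            out = self.experts[name](out)
            # Stop the cascade if the gate judges the query already well handled.
            if torch.sigmoid(self.escalate_gate(out)).mean() < 0.5:
                break
        return out


if __name__ == "__main__":
    cmoe = CascadedMoE()
    h = torch.randn(2, 256)
    print(cmoe(h, {"speech"}).shape)                   # routed through the speech-only expert
    print(cmoe(h, {"speech", "vision"}).shape)         # cascaded bi-modal -> tri-modal path
```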
Submission Number: 13