MTIVE: Multi-Task Image Verification Engine Using Vision-Language Models for E-commerce

Published: 18 Apr 2026, Last Modified: 24 Apr 2026. ACL 2026 Industry Track Poster. License: CC BY 4.0
Keywords: VLM, LLM, multi-task learning, e-commerce
TL;DR: We present MTIVE, a comprehensive framework for adapting VLMs to multi-task e-commerce scenarios.
Abstract: Vision-language models show promise for e-commerce automation but struggle with noisy real-world images and multi-task requirements. We introduce MTIVE, a curriculum learning framework that progressively adapts base models through three stages: continued pre-training on large-scale e-commerce datasets with contrastive learning and diverse dialogue templates, instruction tuning on synthetic data, and modular task-specific expert training. Our architecture uses frozen base weights with stacked LoRA adapters—shared modules for domain knowledge and lightweight task-specific experts—enabling continual learning without catastrophic forgetting. MTIVE outperforms open-source and proprietary baselines in both standard and continual learning settings.
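The abstract describes frozen base weights with stacked LoRA adapters: a shared domain module plus lightweight task-specific experts that can be added without disturbing earlier tasks. A minimal sketch of that layering, with all names and shapes hypothetical (the paper's actual implementation is not shown here):

```python
import numpy as np


class StackedLoRALinear:
    """Frozen base linear layer with stacked low-rank adapters:
    one shared domain adapter plus swappable task-specific experts.
    Illustrative only; not the MTIVE authors' code."""

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out))  # frozen base weights
        # Shared adapter: would be trained once on e-commerce domain data.
        self.A_shared = rng.standard_normal((d_in, rank)) * 0.01
        self.B_shared = np.zeros((rank, d_out))  # zero-init: starts as a no-op delta
        # Task-specific experts, keyed by task name. New experts are added
        # without modifying existing ones, so earlier tasks are unaffected
        # (the continual-learning property the abstract claims).
        self.experts = {}

    def add_expert(self, task, rank=2, seed=1):
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((self.W.shape[0], rank)) * 0.01
        B = np.zeros((rank, self.W.shape[1]))
        self.experts[task] = (A, B)

    def forward(self, x, task=None):
        # Effective weight = frozen base + shared low-rank delta
        # + (optionally) the selected task expert's low-rank delta.
        W_eff = self.W + self.A_shared @ self.B_shared
        if task is not None:
            A, B = self.experts[task]
            W_eff = W_eff + A @ B
        return x @ W_eff
```

Because each expert is a separate low-rank delta on top of a frozen base, training a new task only touches that task's `(A, B)` pair, which is one way a design like this avoids catastrophic forgetting.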
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 512