This research addresses intent recognition in multimodal dialogue systems with an approach built on large language models (LLMs). We fine-tune a state-of-the-art multimodal model with LoRA (Low-Rank Adaptation), substantially improving performance. To overcome the limitations of traditional methods, we apply a comprehensive set of augmentation techniques: OCR extraction, image cropping, rotation, and color adjustment, alongside text-based methods such as synonym replacement and syntactic reordering. Drawing on knowledge distillation and Retrieval-Augmented Generation (RAG), we combine LLMs such as Qwen2-VL with external knowledge bases for further gains. Through rigorous ablation studies and careful parameter tuning, the resulting model outperforms the baseline by 6 percentage points, demonstrating the substantial advances achievable with large language models in multimodal intent recognition.
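The LoRA fine-tuning mentioned above rests on a simple idea: freeze the pretrained weight matrix and train only a low-rank additive update. The sketch below is a hypothetical, minimal illustration of that mechanism (not the authors' implementation, which targets a large model such as Qwen2-VL); the class name, ranks, and scaling follow common LoRA conventions.

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of a LoRA-adapted linear layer (illustrative only):
    the frozen weight W is augmented with a trainable low-rank update
    (alpha / r) * B @ A, so only r * (d_in + d_out) extra parameters are
    trained instead of the full d_out * d_in."""

    def __init__(self, weight, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight                  # frozen pretrained weight, shape (d_out, d_in)
        self.r, self.alpha = r, alpha
        d_out, d_in = weight.shape
        # A is randomly initialized; B starts at zero, so the adapter
        # is a no-op at initialization and training starts from the
        # pretrained model's behavior.
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))

    def forward(self, x):
        # y = x (W + (alpha/r) * B A)^T  -- only A and B would receive gradients
        delta = (self.alpha / self.r) * (self.B @ self.A)
        return x @ (self.W + delta).T
```

Because `B` is zero-initialized, the adapted layer initially reproduces the frozen layer exactly; training then moves only `A` and `B`, which is what makes LoRA fine-tuning of large models affordable.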
Keywords: Large Language Models, Intent Recognition, Multimodal Dialogue Systems, Knowledge Distillation, Retrieval-Augmented Generation, Data Augmentation, Model Fusion
Submission Number: 9