Abstract: Multimodal named entity recognition (MNER) enhances text-based NER by incorporating images. Traditional MNER methods struggle to align text and images, leading to suboptimal performance. Recently, large vision-language models (LVLMs) have achieved success in various multimodal applications owing to their powerful text-image alignment capabilities, providing a new generative paradigm for the MNER task. However, due to the limited reasoning capabilities of small-parameter LVLMs on complex tasks, directly fine-tuning these models for the MNER task often yields unsatisfactory results. To address these challenges, we decompose the complex MNER task into two simpler subtasks, namely multimodal span detection and multimodal span classification, and propose a novel Multi-Expert Collaborative (MEC) framework, which employs four LVLM-based experts that work together on the task. The output of each expert serves as the input to the next, culminating in an accurate and cohesive entity recognition result. Experiments conducted on two widely used MNER datasets demonstrate the effectiveness of our method.