Abstract: Multimodal Large Language Models (MLLMs) have shown tremendous potential in Multimodal Entity Linking (MEL). However, they still fall short of the effectiveness expected in practical applications, which may stem from limitations in the MEL datasets used for training. Existing MEL datasets focus primarily on simple tasks, considering only the direct matching of mentions to labeled entities within a multimodal context while ignoring mentions with no matching entity. Factors such as whether the ground-truth entity appears in the candidate set, and its position within that set, directly affect the performance of MLLMs on MEL tasks. To tackle these obstacles, we constructed DiffMEL, the first large-scale difficulty-graded MEL dataset for MLLMs. DiffMEL contains 79,625 instances and 318.5K instance-related high-resolution images, covering 3 linking tasks of graded difficulty and 5 different entity themes. We use DiffMEL to train several open-source MLLMs. Experimental results demonstrate that DiffMEL endows MLLMs with substantially stronger MEL capabilities, improving performance by a large margin (5%-56.1%). Our dataset is available at https://github.com/ww-ffff/DiffMEL.