G-Cap: A Game Character Caption Generator

ACL ARR 2026 January Submission 1749

31 Dec 2025 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Vision-Language Models; image captioning
Abstract: While Large Vision-Language Models (LVLMs) have demonstrated remarkable proficiency in image captioning, existing research primarily focuses on real-world scenarios, leaving surreal, highly stylized, and semantically hybrid virtual-world scenarios significantly underexplored. In this work, we introduce \textbf{Game Character Captioning}, a novel task designed to evaluate LVLMs' capability to perceive and describe game characters from virtual worlds. To facilitate evaluation, we establish \textbf{GC-Bench}, a manually annotated benchmark, and propose \textbf{Graph-F1} to effectively assess performance on this task. Our evaluation reveals that: (1) current state-of-the-art LVLMs, including closed-source giants such as \texttt{Gemini 3 Pro} and \texttt{GPT-5.1}, struggle to maintain the high performance seen in real-world scenarios; and (2) a notable gap exists between open-source and closed-source models. To bridge this gap, we construct \textbf{GC-148K}, a large-scale dataset generated via a specialized data pipeline, and develop the \textbf{G-Cap} series. Experiments demonstrate that the G-Cap series rivals the performance of advanced closed-source models at a lower cost, offering an efficient solution for industrial-grade production environments.
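The abstract names Graph-F1 without defining it. As a rough illustration only, the sketch below computes a set-based F1 over graph edges represented as (subject, relation, object) triples; this is an assumed formulation for intuition, not the paper's actual metric, whose definition may differ.

```python
def graph_f1(pred_triples, gold_triples):
    """Illustrative set-based F1 over graph edges encoded as hashable
    (subject, relation, object) triples. Assumed formulation, not the
    paper's official Graph-F1 definition."""
    pred, gold = set(pred_triples), set(gold_triples)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # edges present in both graphs
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical character-attribute graphs for illustration.
gold = {("character", "hair_color", "silver"),
        ("character", "outfit", "armor"),
        ("character", "weapon", "sword")}
pred = {("character", "hair_color", "silver"),
        ("character", "weapon", "bow")}
print(round(graph_f1(pred, gold), 3))  # precision 0.5, recall 1/3 -> 0.4
```

A triple-set F1 of this kind rewards captions that recover the right attributes while penalizing both hallucinated and missing ones, which is the behavior a caption-evaluation metric for structured character descriptions would plausibly need.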
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality; image text matching
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1749