Abstract: We propose OMNICAPTIONER, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains.
Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural
images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically
rich textual representations, our framework bridges the gap between visual and
textual modalities. Our results highlight three key advantages: (i) Enhanced Visual
Reasoning with LLMs, where long-context captions of visual modalities empower
LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like
text-to-image generation and image transformation; and (iii) Efficient Supervised
Fine-Tuning (SFT), which enables faster convergence with less data. We believe the
versatility and adaptability of OMNICAPTIONER can offer a new perspective for
bridging the gap between language and visual modalities.