Subobject-level Image Tokenization

Delong Chen; Samuel Cahyawijaya; Jianfeng Liu; Baoyuan Wang; Pascale Fung

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0

TL;DR: A subword-style image tokenizer for VLMs

Abstract: Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection--a simple task that can be handled well by a compact model--with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
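The pairing of boundary detection with watershed segmentation described in the abstract can be illustrated with a toy example. The sketch below is not the authors' EPOC implementation: the boundary map is hand-written rather than predicted by a model, the seed threshold of 0.5 is an assumption, and the helper names (`neighbors`, `label_seeds`, `watershed`) are hypothetical. It only shows why a seeded flood over a boundary map assigns every pixel to some segment.

```python
import heapq


def neighbors(y, x, h, w):
    """Yield the 4-connected neighbours of (y, x) inside an h x w grid."""
    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
        if 0 <= ny < h and 0 <= nx < w:
            yield ny, nx


def label_seeds(boundary, threshold):
    """Label connected regions of low boundary probability; 0 means unlabeled."""
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] or boundary[y][x] >= threshold:
                continue
            count += 1
            labels[y][x] = count
            stack = [(y, x)]
            while stack:  # flood-fill one seed region
                cy, cx = stack.pop()
                for ny, nx in neighbors(cy, cx, h, w):
                    if not labels[ny][nx] and boundary[ny][nx] < threshold:
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count


def watershed(boundary, threshold=0.5):
    """Grow seed regions over the boundary map, lowest boundary values first."""
    h, w = len(boundary), len(boundary[0])
    labels, count = label_seeds(boundary, threshold)
    if count == 0:
        # Assumption: with no seed at all, treat the whole image as one segment.
        return [[1] * w for _ in range(h)], 1
    heap = [(boundary[y][x], y, x)
            for y in range(h) for x in range(w) if labels[y][x]]
    heapq.heapify(heap)
    while heap:
        _, y, x = heapq.heappop(heap)
        for ny, nx in neighbors(y, x, h, w):
            if labels[ny][nx] == 0:
                # An unlabeled pixel joins the segment that reaches it first.
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (boundary[ny][nx], ny, nx))
    return labels, count


# Toy boundary map: a strong vertical boundary in the middle column.
demo_boundary = [[0.1, 0.1, 0.9, 0.1, 0.1] for _ in range(3)]
segments, num_segments = watershed(demo_boundary)
```

Because the pixel grid is connected and the flood continues until the queue is empty, every pixel ends up with a segment label as long as at least one seed exists, which is the "no pixels left unsegmented" property the abstract refers to. In this toy map the two low-boundary regions on either side of the middle column become two separate segments.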

Lay Summary: Understanding images with AI typically involves breaking pictures into fixed grids of squares, which often misses the natural boundaries and meaningful parts of objects. Our research addresses this limitation by proposing a new way to divide images, called subobject-level tokenization, that dynamically adapts to the actual shapes and structures within the image. Instead of uniform squares, our method identifies and segments the image into visually coherent parts, much like how words in language are composed of smaller meaningful units (subwords). We integrated this adaptive segmentation with vision-language models, which combine visual perception with text understanding. We found that this more intuitive division of images significantly improved the model’s ability to accurately interpret and reason about pictures. This innovation makes AI systems more aligned with human visual understanding, enhancing their usefulness in real-world applications such as detailed image captioning, object recognition, and visual question answering, ultimately making AI vision models both smarter and more intuitive.

Link To Code: https://github.com/ChenDelong1999/subobjects

Primary Area: Applications->Computer Vision

Keywords: image tokenization, vision-language models, image captioning, boundary detection, watershed segmentation

Submission Number: 10200
