FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present FlexTok, an image tokenizer capable of resampling images into 1D token sequences of flexible length.
Abstract: We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256×256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens needed depends on the complexity of the generation task.
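
The nested-dropout mechanism mentioned in the abstract can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: it samples a random prefix length during training and masks every register token after it, which is what encourages earlier tokens to carry coarse information and later tokens progressively finer detail. The power-of-two prefix lengths are an assumption motivated by the 1-to-256 token range described above.

```python
# Minimal sketch of nested dropout over 1D register tokens (illustrative names,
# not the FlexTok codebase). During training, only a random prefix of the token
# sequence is kept, so the decoder learns to reconstruct from any prefix length.
import torch

def nested_dropout(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, num_tokens, dim) ordered 1D register tokens.
    Returns the tensor with everything after a random prefix masked to zero."""
    b, n, d = tokens.shape
    # Sample a prefix length per batch element, e.g. uniformly over powers of
    # two (1, 2, 4, ..., 256) -- an assumed schedule, not the paper's exact one.
    choices = torch.tensor([2 ** i for i in range(9)])      # 1 .. 256
    k = choices[torch.randint(len(choices), (b,))]          # (b,)
    # Keep positions < k, drop the rest.
    positions = torch.arange(n).unsqueeze(0)                # (1, n)
    keep = (positions < k.unsqueeze(1)).to(tokens.dtype)    # (b, n)
    return tokens * keep.unsqueeze(-1)
```

At inference time the prefix length becomes a free parameter: because the rectified-flow decoder has been trained against every prefix length, it can produce a plausible reconstruction from however many tokens are kept.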
Lay Summary: Generating high-quality images with artificial intelligence typically requires representing images in simplified forms that computers can handle more efficiently. Traditionally, images are split into grids of small patches, but this method doesn't adapt well to images of varying complexity. We developed FlexTok, a new method that converts images into a flexible number of simple building blocks, or "tokens," based on how detailed or complex an image is. For instance, a complex image can be broken into more tokens for greater detail, while a simpler image needs fewer tokens. FlexTok effectively compresses information by using fewer tokens without losing crucial visual details. By training image generation models with FlexTok, we achieved quality comparable to leading methods but with significantly fewer tokens, making the process faster and more efficient. Additionally, FlexTok can describe images from coarse to detailed features, similar to how humans gradually perceive visual information. This flexible and efficient method could greatly enhance applications where rapid, high-quality image generation is important, such as long-horizon video generation, as well as visual understanding and reasoning.
Link To Code: https://github.com/apple/ml-flextok
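
As a rough illustration of the tokenize, truncate, and detokenize workflow the paper describes, a usage sketch might look like the following. All class, method, and checkpoint names here are hypothetical and may not match the actual API in the linked repository; consult its README for the real entry points.

```python
# Hypothetical usage sketch -- names are assumptions, not the repo's real API.
import torch

imgs = torch.randn(1, 3, 256, 256)        # dummy 256x256 RGB batch

# flextok = load_pretrained_flextok(...)  # hypothetical loader from the repo
# tokens = flextok.tokenize(imgs)         # ordered 1D token sequence
# tokens = [t[:, :16] for t in tokens]    # keep only the first 16 tokens
# recon = flextok.detokenize(tokens)      # coarse reconstruction from the prefix
```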
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: tokenization, variable rate compression, image generation, computer vision
Submission Number: 3792