FALCON++: Enabling Elastic Efficiency and Robust Perception for High-Resolution Multimodal Large Language Model via Visual Registers

Weili Guan, Renshan Zhang, Gongwei Chen, Liqiang Nie, Rui Shao

Published: 13 Jan 2026, Last Modified: 27 Jan 2026 · Crossref · CC BY-SA 4.0
Abstract: The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception. However, most existing high-resolution MLLMs rely on cropping-based approaches, which lead to fragmented visual encoding and a massive number of redundant tokens. To tackle these issues, we propose FALCON, which introduces a novel visual register to simultaneously: 1) Eliminate redundancy during visual encoding. To directly address visual redundancy, we propose a Register-based Representation Compacting (ReCompact) mechanism. By using learnable registers to adaptively aggregate essential information, it enables the encoder to directly produce compact visual features with minimal tokens. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates efficient information exchange across sub-images by enabling interactions between visual registers, ensuring the continuity of visual semantics throughout encoding. To further enhance the efficiency and perceptual capability of FALCON, we advance this framework to FALCON++. It incorporates two novel techniques to simultaneously: 3) Enable elastic visual encoding. To realize flexible inference efficiency, we introduce the Elastic Register Generator (ReGen). By utilizing Multi-Scale Fourier Feature Mapping to synthesize registers, it allows for training-free adjustment of register counts to meet varying computational budgets. 4) Facilitate text-guided visual perception. To empower goal-oriented feature extraction, we propose the Instruction-aware Register Modulation (ReMod). By injecting textual semantics via low-rank weight modulation, it enables registers to aggregate query-relevant information for robust perception.
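The ReGen idea of synthesizing an arbitrary number of registers from Fourier features can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the per-scale frequency count, and the fixed random projection are all illustrative assumptions; the key property shown is that the register count is a free argument at inference time.

```python
import numpy as np

def generate_registers(num_registers, dim, scales=(1.0, 10.0, 100.0), seed=0):
    """Illustrative ReGen-style synthesis: map normalized register positions
    through multi-scale Fourier features, then project to the embedding
    dimension. The register count can be changed freely (training-free)."""
    rng = np.random.default_rng(seed)
    # Normalized 1-D positions, one per requested register.
    t = np.linspace(0.0, 1.0, num_registers)[:, None]        # (N, 1)
    feats = []
    for s in scales:
        # Random Fourier frequencies drawn at this scale (8 per scale, assumed).
        B = rng.normal(0.0, s, size=(1, 8))
        proj = 2.0 * np.pi * t @ B                           # (N, 8)
        feats.extend([np.sin(proj), np.cos(proj)])
    phi = np.concatenate(feats, axis=-1)                     # (N, 2*8*len(scales))
    # Fixed linear projection into the register embedding space.
    W = rng.normal(0.0, phi.shape[1] ** -0.5, size=(phi.shape[1], dim))
    return phi @ W                                           # (N, dim)
```

Because the frequencies and projection are fixed by the seed, calling the function with 16 or 64 registers yields consistent embeddings of different lengths, mirroring the "varying computational budgets" behavior described above.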
Comprehensive experiments across diverse benchmarks demonstrate that FALCON and FALCON++ achieve superior performance with a 9-fold reduction in visual tokens and 6.1× fewer FLOPs.
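The cross-sub-image exchange performed by ReAtten can be sketched as self-attention over the union of all sub-images' registers, so each register can read information from every other sub-image. This is a bare-bones illustration with no learned query/key/value projections or multi-head structure, which the actual module would have.

```python
import numpy as np

def register_interactive_attention(registers):
    """Illustrative ReAtten-style interaction: flatten the registers of all
    sub-images into one sequence and apply scaled dot-product self-attention,
    letting information flow across sub-image boundaries."""
    S, N, D = registers.shape                  # sub-images, registers, dim
    x = registers.reshape(S * N, D)            # joint register sequence
    scores = x @ x.T / np.sqrt(D)              # attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over ALL registers
    out = attn @ x                             # aggregated cross-sub-image info
    return out.reshape(S, N, D)
```

The essential point is that the softmax spans registers from every sub-image, which is how continuity of visual semantics across fragmented crops is maintained during encoding.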