Cameras as Relative Positional Encoding

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: multi-view transformer; positional encoding; projective transformation
TL;DR: Camera conditioning as a relative projective transformation in multi-view transformers
Abstract: Transformers are increasingly prevalent in multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose, Projective Positional Encoding (PRoPE), which captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments first show that relative conditioning methods improve performance in feedforward novel view synthesis, with further gains from PRoPE. These benefits hold across settings: for scenes with both shared and varying intrinsics, when token- and attention-level conditioning are combined, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that the benefits persist across tasks, stereo depth estimation and discriminative spatial cognition, and at larger model sizes.
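
To make the attention-level idea concrete, here is a minimal PyTorch sketch of relative projective conditioning in the spirit of PRoPE. The function names (`make_projection`, `prope_attention`), the OpenGL-style depth parameterization, and the block-diagonal channel layout are illustrative assumptions, not the paper's released implementation: per-view 4x4 projection matrices act on 4-channel blocks of the queries, keys, and values so that only the *relative* transform between two frustums enters the attention.

```python
# Hedged sketch of PRoPE-style attention-level camera conditioning.
# Assumptions: pinhole intrinsics, world-to-camera extrinsics, and an
# OpenGL-style perspective depth mapping; names are hypothetical.
import torch


def make_projection(K: torch.Tensor, E: torch.Tensor,
                    near: float = 0.1, far: float = 100.0) -> torch.Tensor:
    """Build a 4x4 projective matrix P = K4 @ E per view.

    K: (V, 3, 3) pinhole intrinsics; E: (V, 4, 4) world-to-camera extrinsics.
    K4 embeds the intrinsics in a perspective projection, so P encodes the
    full camera frustum (intrinsics and extrinsics together).
    """
    V = K.shape[0]
    K4 = torch.zeros(V, 4, 4, dtype=K.dtype, device=K.device)
    K4[:, :2, :3] = K[:, :2, :3]                # focals + principal point
    K4[:, 2, 2] = (far + near) / (far - near)   # depth terms of the frustum
    K4[:, 2, 3] = -2.0 * far * near / (far - near)
    K4[:, 3, 2] = 1.0                           # perspective-divide row
    return K4 @ E                               # (V, 4, 4)


def prope_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    P: torch.Tensor) -> torch.Tensor:
    """Multi-view attention whose logits depend only on relative transforms.

    q, k, v: (V, N, D) tokens per view, with D divisible by 4.
    P: (V, 4, 4) per-view projection matrices.
    Queries are hit with P_i^T and keys/values with P_j^{-1}, so each logit
    equals q_i^T (P_i P_j^{-1}) k_j: only the relative projective transform
    between the two camera frustums enters the attention.
    """
    V, N, D = q.shape
    P_inv = torch.linalg.inv(P)

    def blk(x: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
        # Apply a per-view 4x4 transform block-diagonally over channels.
        x = x.reshape(V, N, D // 4, 4)
        return torch.einsum('vab,vncb->vnca', M, x).reshape(V, N, D)

    q_t = blk(q, P.transpose(-1, -2))
    k_t = blk(k, P_inv)
    v_t = blk(v, P_inv)

    # Flatten views so every token attends across all viewpoints.
    q_f, k_f, v_f = (x.reshape(1, V * N, D) for x in (q_t, k_t, v_t))
    attn = torch.softmax(q_f @ k_f.transpose(-1, -2) / D ** 0.5, dim=-1)
    o = (attn @ v_f).reshape(V, N, D)
    return blk(o, P)  # map aggregated values back into each query's frame
```

A usage sketch: `P = make_projection(K, E)` followed by `out = prope_attention(q, k, v, P)` for `q, k, v` of shape `(V, N, D)`. Because only the products P_i P_j^{-1} appear in the logits and value aggregation, a global change of world coordinates (E_i -> E_i G for all views) cancels out, which is precisely the invariance that relative conditioning offers over absolute token-level encodings such as raymaps.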
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 20136