Light of Normals: Unified Feature Representation for Universal Photometric Stereo

ICLR
Submission #18051

(left) Given multi-light images from a fixed viewpoint, LINO UniPS recovers sharper, more faithful normals than UniPS and SDM-UniPS, and the result visually rivals a 3D scan. (right) On the DiLiGenT benchmark, the consistency of encoder features (CSIM/SSIM) correlates clearly with final reconstruction accuracy (1/MAE).
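
To make the accuracy axis concrete, here is a minimal sketch of the metric, assuming MAE denotes the mean angular error in degrees between predicted and ground-truth normals, as is conventional on photometric stereo benchmarks. The function names and the optional `mask` argument are illustrative and do not come from the authors' code.

```python
import math

def angular_error_deg(n_pred, n_true):
    """Angle in degrees between two 3-vectors (both are normalized internally)."""
    dot = sum(p * t for p, t in zip(n_pred, n_true))
    norm_p = math.sqrt(sum(p * p for p in n_pred))
    norm_t = math.sqrt(sum(t * t for t in n_true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos))

def mean_angular_error(pred, true, mask=None):
    """Mean angular error over a flat list of pixels; mask selects valid pixels."""
    if mask is None:
        mask = [True] * len(pred)
    errors = [angular_error_deg(p, t) for p, t, m in zip(pred, true, mask) if m]
    return sum(errors) / len(errors)

# Toy example: the second pixel is off by a right angle, the others are exact.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
true = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
mae = mean_angular_error(pred, true)
masked_mae = mean_angular_error(pred, true, mask=[True, False, True])
```

Under this convention lower MAE is better, which is why the teaser plots its reciprocal as the accuracy measure.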

Abstract

Universal photometric stereo (PS) is defined by two requirements: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM-UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key components: (i) Light Register Tokens, trained with light alignment supervision, that aggregate point, directional, and environment lights; and (ii) an Interleaved Attention Block featuring global cross-image attention that processes all lighting conditions together, so the encoder can factor out lighting while retaining normal-related evidence. Second, high-frequency geometric details are easily lost. We address this with (i) a Wavelet-based Dual-branch Architecture and (ii) a Normal-gradient Perception Loss. Together, these techniques yield a unified feature space in which lighting is explicitly represented by the register tokens, while normal details are preserved via the wavelet branch. We further introduce PS-Verse, a large-scale synthetic dataset graded by geometric complexity and lighting diversity, and adopt curriculum training from simple to complex scenes. Extensive experiments show new state-of-the-art results on public benchmarks (e.g., DiLiGenT, Luces), stronger generalization to real materials, and improved efficiency. Ablations confirm that the Light Register Tokens and the Interleaved Attention Block drive better feature decoupling, while the Wavelet-based Dual-branch Architecture and the Normal-gradient Perception Loss recover finer details.
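
The abstract does not give the form of the Normal-gradient Perception Loss, so the following is only an illustrative sketch of the general idea of emphasizing errors where the ground-truth normal map changes quickly. The specific choices here (forward differences, a `1 + lam * gradient` weight, and a cosine error term) are assumptions, as are all function names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 3-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def gradient_magnitude(normals):
    """Forward-difference gradient magnitude of an H x W grid of 3-vectors."""
    h, w = len(normals), len(normals[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            here = normals[y][x]
            right = normals[y][min(x + 1, w - 1)]  # clamped at the border
            down = normals[min(y + 1, h - 1)][x]   # clamped at the border
            dx = sum((r - c) ** 2 for r, c in zip(right, here))
            dy = sum((d - c) ** 2 for d, c in zip(down, here))
            mag[y][x] = math.sqrt(dx + dy)
    return mag

def gradient_weighted_loss(pred, true, lam=1.0):
    """Weighted mean of (1 - cosine); the weights grow with the GT normal gradient.

    Hypothetical formulation, not the loss defined by the authors.
    """
    grad = gradient_magnitude(true)
    num = den = 0.0
    for y in range(len(true)):
        for x in range(len(true[0])):
            weight = 1.0 + lam * grad[y][x]
            num += weight * (1.0 - cosine(pred[y][x], true[y][x]))
            den += weight
    return num / den

# 1 x 3 ground truth with a sharp normal change between the last two pixels.
flat, side = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
true = [[flat, flat, side]]
error_at_edge = [[flat, side, side]]  # mistake where the gradient is large
error_in_flat = [[side, flat, side]]  # same-size mistake in a flat region
loss_edge = gradient_weighted_loss(error_at_edge, true)
loss_flat = gradient_weighted_loss(error_in_flat, true)
```

In this toy case the same angular mistake costs more at the edge pixel than in the flat region, which is the behavior a detail-preserving loss is meant to encourage.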

Method

Overview of the LiNo-UniPS architecture, featuring a Light-Normal Contextual Encoder, Decoder, and loss computation.

LiNo-UniPS performs significantly better on data rich in high-frequency information.
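
As a rough illustration of what "high-frequency information" means for a wavelet branch, the sketch below performs one level of a 2D Haar decomposition, splitting an image into a low-frequency approximation and three detail bands. The page does not say which wavelet the Wavelet-based Dual-branch Architecture uses, so the choice of Haar, the orthonormal scaling, and the band names are assumptions.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform of an image with even height and width.

    Returns (approx, horiz, vert, diag), each of size H/2 x W/2, where
    approx holds the low-frequency content and the other three hold detail.
    """
    h, w = len(image), len(image[0])
    approx, horiz, vert, diag = [], [], [], []
    for y in range(0, h, 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for x in range(0, w, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            row_a.append((a + b + c + d) / 2.0)  # local average (low-pass)
            row_h.append((a - b + c - d) / 2.0)  # change along x
            row_v.append((a + b - c - d) / 2.0)  # change along y
            row_d.append((a - b - c + d) / 2.0)  # diagonal change
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag

# A smooth patch has no detail; a patch containing an edge does.
smooth = [[3.0, 3.0], [3.0, 3.0]]
edge = [[1.0, 0.0], [1.0, 0.0]]
smooth_bands = haar_dwt2(smooth)
edge_bands = haar_dwt2(edge)
```

A dual-branch design can route the approximation band and the detail bands through separate paths, so that fine structure is not averaged away by the low-frequency path.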

Attention maps of the light register tokens at the encoder's final layer. Different tokens attend to different lighting information arriving from multiple directions.

The features extracted by our LiNo-UniPS encoder effectively disentangle lighting from surface normal information and are also more consistent.
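
One plausible way to quantify this consistency is the mean pairwise cosine similarity between the features an encoder produces for the same surface point under different lights. The sketch below assumes that reading of CSIM; the page does not define the metric, and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def feature_consistency(features):
    """Mean pairwise cosine similarity over features of one location.

    `features` holds one vector per lighting condition (at least two).
    """
    pairs = list(combinations(features, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Features that only change in scale across lights are fully consistent.
light_invariant = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]
# Features that rotate with the light are not.
light_dependent = [(1.0, 0.0), (0.0, 1.0)]
score_invariant = feature_consistency(light_invariant)
score_dependent = feature_consistency(light_dependent)
```

A score near 1 means the feature at that location barely changes with illumination, which is what one would expect if lighting has been factored out of the normal-related features.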

Some Visual Results

Each example pairs one of the multi-light input images with the corresponding surface normals reconstructed by LiNo-UniPS.

PS-Verse Dataset

Level 1

Level 2

Level 3

Level 4

Level 5
