TL;DR: We show that embedding arithmetic in remote-sensing CLIP models fails mainly due to concept entanglement, and propose a text-only similarity metric that predicts failure (AUC up to 0.818) and indicates when arithmetic will work.
Abstract: Embedding arithmetic promises flexible compositional queries over remote sensing imagery---transforming a harbor into an airport by subtracting "water" and adding "runway"---yet the conditions under which it actually works remain poorly understood. We systematically evaluate four CLIP-based models across five RS datasets and identify concept entanglement as the dominant failure mode (40--60\% of failures): semantically related concepts occupy overlapping embedding subspaces that confound arithmetic. We propose a pre-hoc entanglement metric---requiring only text embeddings---that predicts failure with AUC up to 0.818, with GeoRSCLIP showing the most consistent predictions (mean AUC=0.675). Notably, embedding geometry does not reliably predict compositional capability ($r$=0.30, $p$=0.20), suggesting that discriminative and compositional reasoning require different representational properties. We provide practical guidelines: arithmetic succeeds for well-separated concepts (88\%) but fails predictably for structurally similar classes (42\%).
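The query pattern and the text-only metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: random unit vectors stand in for real CLIP text embeddings (in practice these would come from a model such as GeoRSCLIP), and the `entanglement` function shown here is just a cosine similarity between class text embeddings, an assumed simplification of the proposed metric.

```python
import numpy as np

def normalize(v):
    # Project a vector onto the unit sphere, as CLIP embeddings typically are.
    return v / np.linalg.norm(v)

# Hypothetical stand-ins for CLIP text embeddings of RS scene classes.
rng = np.random.default_rng(0)
dim = 512
concepts = ["harbor", "water", "runway", "airport"]
emb = {c: normalize(rng.normal(size=dim)) for c in concepts}

# Compositional query from the abstract: harbor - water + runway ≈ airport?
query = normalize(emb["harbor"] - emb["water"] + emb["runway"])

def entanglement(c1, c2):
    # Illustrative pre-hoc score: cosine similarity between the two classes'
    # text embeddings. A high score flags overlapping subspaces, where
    # embedding arithmetic is predicted to fail.
    return float(emb[c1] @ emb[c2])

score = entanglement("harbor", "airport")
```

With real CLIP embeddings, structurally similar classes (e.g. two residential densities) would yield high entanglement scores, while well-separated concepts would score low, matching the guideline that arithmetic succeeds only for the latter.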
Submission Number: 15