TL;DR: We show that embedding arithmetic in remote-sensing CLIP models fails mainly due to concept entanglement, and propose a text-only similarity metric that predicts failure (AUC up to 0.818) and indicates when arithmetic will work.
Abstract: Embedding arithmetic promises flexible compositional queries over remote sensing imagery---transforming a harbor into an airport by subtracting "water" and adding "runway"---yet the conditions under which it actually works remain poorly understood. We systematically evaluate four CLIP-based models across five RS datasets and identify concept entanglement as the dominant failure mode (40--60\% of failures): semantically related concepts occupy overlapping embedding subspaces that confound arithmetic. We propose a pre-hoc entanglement metric---requiring only text embeddings---that predicts failure with AUC up to 0.818, with GeoRSCLIP showing the most consistent predictions (mean AUC=0.675). Notably, embedding geometry does not reliably predict compositional capability ($r$=0.30, $p$=0.20), suggesting that discriminative and compositional reasoning require different representational properties. We provide practical guidelines: arithmetic succeeds for well-separated concepts (88\%) but fails predictably for structurally similar classes (42\%).
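The query pattern and the text-only metric described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: random unit vectors stand in for real CLIP text embeddings (in practice these would come from a model such as GeoRSCLIP), and the `entanglement` function shown here is just a cosine similarity between class text embeddings, an assumed simplification of the proposed metric.

```python
import numpy as np

def normalize(v):
    # Project a vector onto the unit sphere, as CLIP embeddings typically are.
    return v / np.linalg.norm(v)

# Hypothetical stand-ins for CLIP text embeddings of RS scene classes.
rng = np.random.default_rng(0)
dim = 512
concepts = ["harbor", "water", "runway", "airport"]
emb = {c: normalize(rng.normal(size=dim)) for c in concepts}

# Compositional query from the abstract: harbor - water + runway ≈ airport?
query = normalize(emb["harbor"] - emb["water"] + emb["runway"])

def entanglement(c1, c2):
    # Illustrative pre-hoc score: cosine similarity between the two classes'
    # text embeddings. A high score flags overlapping subspaces, where
    # embedding arithmetic is predicted to fail.
    return float(emb[c1] @ emb[c2])

score = entanglement("harbor", "airport")
```

With real CLIP embeddings, structurally similar classes (e.g. two residential densities) would yield high entanglement scores, while well-separated concepts would score low, matching the guideline that arithmetic succeeds only for the latter.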
Submission Number: 15