Visible to everyone since 04 Oct 2024. License: CC BY 4.0
Similarity is a key construct in psychology, neuroscience, linguistics, and computer vision. It can manifest in various forms, including visual, semantic, and contextual similarity. Among these, semantic similarity is particularly important. Not only does it approximate how humans categorize objects, capturing connections and hierarchies based on shared functionality, evolutionary traits, and contextual meaning, but it also offers practical advantages in computational modeling via lexical structures such as WordNet. Unlike human polls, WordNet-defined similarity is constant and interpretable, making it an important baseline for evaluation. Because the emergence of similarity perception in deep vision models is still not clearly understood, we introduce Deep Similarity Inspector (DSI), a systematic framework to inspect and visualize how deep vision networks develop their similarity perception during training and how it aligns with semantic similarity. Our experiments show that both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) develop a rich similarity perception during learning in three phases (initial similarity surge, refinement, stabilization), although their dynamics differ clearly. Besides gradually eliminating mistakes, both CNNs and ViTs improve the quality of the mistakes they make (the mistake refinement phenomenon).
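To make the notion of a fixed, interpretable taxonomy-based similarity concrete, the following is a minimal sketch in Python. It is not the DSI implementation: the toy hypernym tree `PARENT`, the helper names, and the use of a path-based score (1 divided by one plus the number of edges on the shortest path, the same form as WordNet path similarity) are all assumptions made for illustration. The function `mean_mistake_similarity` shows one hypothetical way the "quality" of mistakes could be scored, namely the average semantic similarity between the true and predicted class over misclassified samples.

```python
# Toy hypernym tree: child -> parent. Hypothetical stand-in for WordNet.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """Path-based similarity: 1 / (1 + edges on the shortest path from a to b)."""
    depth_from_b = {n: i for i, n in enumerate(ancestors(b))}
    for steps_from_a, n in enumerate(ancestors(a)):
        if n in depth_from_b:  # first hit is the lowest common ancestor
            return 1.0 / (1 + steps_from_a + depth_from_b[n])
    raise ValueError("nodes share no common ancestor")


def mean_mistake_similarity(true_labels, predicted_labels):
    """Average similarity between true and predicted class over mistakes only.

    Returns None when there are no mistakes. Assumed metric, for illustration.
    """
    sims = [
        path_similarity(t, p)
        for t, p in zip(true_labels, predicted_labels)
        if t != p
    ]
    return sum(sims) / len(sims) if sims else None


if __name__ == "__main__":
    truth = ["dog", "car"]
    early = ["truck", "cat"]   # semantically distant mistakes
    late = ["cat", "truck"]    # semantically close mistakes
    print(mean_mistake_similarity(truth, early))
    print(mean_mistake_similarity(truth, late))
```

Under this assumed metric, a model whose errors move from confusing "dog" with "truck" to confusing "dog" with "cat" would score higher even if its error count stayed the same, which is the intuition behind describing mistakes as being refined rather than only eliminated.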