Abstract: Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) trained on supervised tasks are the leading networks in practical computer vision. Although they rely on different mechanisms, both are optimized for object recognition, and in this race overall accuracy is what matters most. But is it enough? Should we not also care whether models correctly perceive inter-class similarities? We believe we should: similarity is a fundamental aspect of categorization, and the structure of the visual world is highly correlated. Models should assess similarities reasonably for more nuanced perception, and we should examine this capacity for greater transparency and trust. We therefore analyzed what state-of-the-art object recognition networks perceive as similar. We proposed a framework to visually and numerically examine and compare the perception of different trained models, and we used it to answer a series of similarity-related questions through experiments on a large population of 42 models.
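As a minimal sketch of how "perceived inter-class similarity" might be quantified for a trained model: one common, model-agnostic approach is to average each class's penultimate-layer embeddings into a prototype and compare prototypes by cosine similarity. The function `class_similarity_matrix` and the toy data below are our own illustration under that assumption, not the paper's actual framework.

```python
import numpy as np

def class_similarity_matrix(embeddings, labels):
    """Cosine similarity between per-class mean embeddings.

    `embeddings` holds one feature vector per image (e.g. from a
    model's penultimate layer); each class's vectors are averaged
    into a prototype, and prototypes are compared pairwise.
    Illustrative sketch only, not the paper's exact method.
    """
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return protos @ protos.T  # shape (n_classes, n_classes), entries in [-1, 1]

# Toy random features standing in for real model embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 64))
lab = rng.integers(0, 5, size=300)
S = class_similarity_matrix(emb, lab)
```

Similarity matrices of this kind can then be compared across models, visualized as heatmaps, or correlated with a reference similarity structure.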
External IDs: dblp:conf/pkdd/FilusD25