TL;DR: We present a novel approach in which one AI model predicts the errors made by another AI model.
Abstract: AI models make mistakes when recognizing images—whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a "mentor" model—a deep neural network trained to predict another "mentee" model's errors. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict the mentee's in-domain and out-of-domain errors. Additionally, transformer-based mentors excel at predicting errors across various mentee architectures. Drawing on these observations, we develop an "oracle" mentor model, dubbed SuperMentor, that outperforms baseline mentors in predicting errors of different types on the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems. Our data and code are available [here](https://github.com/ZhangLab-DeepNeuroCogLab/UnveilAIBlindSpot).
Lay Summary: As AI becomes increasingly embedded in our daily lives and in high-stakes fields such as healthcare, finance, and autonomous driving, ensuring its safety and reliability is critical. Our research focuses on predicting the mistakes AI makes in image recognition tasks. Specifically, we explore the idea of using one AI model, referred to as the "mentor", to predict the errors of another AI model, called the "mentee". We investigate which types of the mentee's mistakes serve as the most effective training sources for the mentor to learn the mentee's error patterns. This framework offers a promising direction for anticipating and correcting AI behavior, ultimately helping to build more trustworthy AI systems.
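The mentor–mentee setup described above can be sketched in miniature. The sketch below is purely illustrative and is not the paper's actual method: the "mentee" is a hypothetical fixed classifier with a systematic blind spot, and the "mentor" is a simple logistic regression (standing in for a deep network) trained on binary correct/incorrect labels derived from the mentee's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D inputs standing in for image features (illustrative only).
X = rng.normal(size=(2000, 2))
y_true = (X[:, 0] > 0).astype(int)

# Toy mentee: a fixed classifier with a systematic blind spot — it flips
# its answer whenever the second feature exceeds 1.
def mentee_predict(X):
    base = (X[:, 0] > 0).astype(int)
    flip = (X[:, 1] > 1.0).astype(int)  # the mentee's "blind spot" region
    return base ^ flip

# Mentor training targets: 1 where the mentee errs, 0 where it is correct.
err = (mentee_predict(X) != y_true).astype(int)

# Toy mentor: logistic regression fit by gradient descent to predict the
# mentee's errors from the same inputs.
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
w = np.zeros(3)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    w -= 0.5 * Xb.T @ (p - err) / len(X)

# The mentor should flag the mentee's mistakes far better than chance.
pred_err = (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)
acc = (pred_err == err).mean()
print(f"mentor error-prediction accuracy: {acc:.2f}")
```

Because the toy mentee's errors here fall in a linearly separable region, even a linear mentor recovers the error pattern; the paper's point is that with deep mentors this idea extends to the far messier error patterns of real image classifiers.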
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Robustness
Keywords: error prediction
Submission Number: 5948