# Generalized Out-of-Distribution Detection: A Survey

## 1 Introduction to Out-of-Distribution Detection

### 1.1 Conceptual Foundation of OOD Detection

Out-of-distribution (OOD) detection plays a crucial role in ensuring the reliability and safety of machine learning systems, particularly in high-stakes applications such as autonomous driving. The core objective of OOD detection is to recognize input data points that fall outside the distribution of the training dataset, thereby identifying scenarios where the predictive model's assumptions no longer hold [1]. Models trained on finite sets of in-distribution (ID) data often fail or produce misleading predictions when confronted with OOD samples, a challenge amplified by the complexity and variability of real-world environments.

Consider the case of an autonomous vehicle navigating through urban traffic. Such a vehicle depends on machine learning algorithms to detect various road signs, obstacles, and pedestrian behaviors. However, the system may encounter situations not present in its training dataset, such as unexpected road closures, construction zones, or unusual weather conditions [1]. In these scenarios, the model might generate erroneous predictions, potentially leading to unsafe driving decisions. For example, an unfamiliar road sign could be misclassified as a common one, causing the autonomous vehicle to take incorrect actions. Thus, an effective OOD detection mechanism is essential to ensure safe operation and to facilitate the transition to human control when faced with unfamiliar or unpredictable conditions.

Beyond safety, OOD detection also enhances the robustness and trustworthiness of machine learning systems across various domains. In healthcare, diagnostic models trained on patient data must be able to identify when they are presented with data from patients whose conditions or diseases are not covered by the training data. Similarly, in financial fraud detection, models need to recognize anomalies in transaction patterns that deviate significantly from known fraudulent behavior. OOD detection supports these scenarios by flagging suspicious or anomalous instances that fall outside the model's learned distribution, thereby enabling timely intervention and decision-making.

At its foundation, OOD detection involves assessing the likelihood of input data belonging to the same distribution as the training data. This assessment is commonly performed through the computation of OOD scores, which indicate how well the input data conforms to the learned distribution. Various methods exist for calculating these scores, including density-based approaches that estimate the probability of the input under the ID distribution and distance-based methods that measure the distance of the input from the ID data manifold [1]. These methods utilize different characteristics of the input data and the trained model to infer the presence of OOD samples.

Density estimation techniques represent a popular approach to OOD detection. They aim to model the distribution of the training data and evaluate the probability of new inputs under this model [2]. By quantifying the probability density of an input, these methods can identify samples with low densities as potential OOD instances. For example, variational autoencoders (VAEs) can be used to model the ID distribution, and the likelihood of new inputs under this model can serve as an OOD score [1]. Additionally, density ratio estimation methods compute the ratio of probabilities between the ID and OOD distributions, allowing for a more nuanced distinction between ID and OOD samples [1].
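The density-based scoring idea above can be sketched with a deliberately simple stand-in. The example below fits a diagonal Gaussian to ID training features instead of a VAE, which is an assumption made for brevity and not the method of any cited work. The function names and the toy data are hypothetical.

```python
import math

def fit_diagonal_gaussian(samples):
    """Estimate per-feature mean and variance from ID training samples."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    variances = [
        # A small floor keeps the variance strictly positive.
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, 1e-6)
        for d in range(dim)
    ]
    return means, variances

def log_density(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )

def ood_score(x, means, variances):
    """Negative log-density: a higher score means the input looks less like ID data."""
    return -log_density(x, means, variances)

# Toy 2-D ID features (hypothetical values).
id_train = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2], [-0.2, -0.2]]
means, variances = fit_diagonal_gaussian(id_train)

near = ood_score([0.05, 0.0], means, variances)  # close to the training data
far = ood_score([5.0, 5.0], means, variances)    # far from the training data
print(near < far)
```

In practice the threshold on such a score would be chosen on held-out ID data, and a richer density model would replace the Gaussian.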

Distance-based methods, another prominent approach, measure the distance of an input from the ID data manifold [1]. These methods typically involve training a model to capture the structure of the ID data and then using this model to compute distances from new inputs to the ID manifold. If the distance surpasses a predefined threshold, the input is flagged as OOD. A closely related family scores inputs with a learned scalar rather than a geometric distance: energy-based models (EBMs) define a scalar-valued energy function that assigns lower energies to ID data points and higher energies to OOD data points, thereby enabling OOD detection through energy thresholding [1]. Such methods are particularly beneficial when the ID distribution has complex geometrical structures that are challenging to capture with simple density-based approaches.
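As a minimal sketch of energy thresholding, one widely used instantiation derives the energy from a classifier's logits as the negative log-sum-exp. The example assumes such logits are available. The logit values, the threshold, and the helper names are hypothetical.

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * logsumexp(logits / T); lower energy suggests ID."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse

def is_ood(logits, threshold, temperature=1.0):
    """Flag the input as OOD when its energy exceeds the threshold."""
    return energy_score(logits, temperature) > threshold

confident_logits = [9.0, 0.5, -1.0]  # one class dominates: low energy
flat_logits = [0.2, 0.1, 0.0]        # no class stands out: higher energy

threshold = -5.0  # hypothetical; normally tuned on held-out ID data
print(is_ood(confident_logits, threshold), is_ood(flat_logits, threshold))
```

The threshold is usually set so that a fixed fraction of held-out ID samples, for example 95%, falls below it.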

Recent advancements in OOD detection have focused on leveraging the intrinsic properties of deep learning models to enhance detection capabilities. For instance, the 'Unleashing Mask' technique highlights that models trained on ID data inherently possess OOD detection capabilities that diminish as training progresses [2]. This insight has led to the development of methods aimed at restoring this initial OOD detection ability by manipulating the model's internal representations. By applying a mask to identify memorized atypical samples and refining the model with this information, researchers have achieved significant improvements in OOD detection performance.

Furthermore, the rise of large language models (LLMs) has introduced new challenges and opportunities for OOD detection in natural language processing (NLP) applications. Due to their extensive capacity and multimodal capabilities, these models are particularly vulnerable to generating coherent but misleading outputs when exposed to OOD inputs [3]. Therefore, OOD detection in NLP not only aims to identify syntactically or semantically anomalous text but also seeks to mitigate the risk of generating harmful or inappropriate responses. Novel methods, such as the use of peer classes generated by LLMs, have shown promise in enhancing OOD detection by leveraging the rich contextual information that LLMs provide [4].

In summary, the conceptual basis of OOD detection centers on identifying input data points that do not conform to the training distribution, thereby safeguarding the reliability and safety of machine learning systems. Through a variety of methods, including density estimation and distance-based approaches, and by harnessing the intrinsic properties of models, OOD detection significantly contributes to the robustness and trustworthiness of AI systems in critical applications. As the field continues to advance, ongoing research is vital for developing more sophisticated and adaptable OOD detection techniques capable of addressing the complexities of real-world scenarios.

### 1.2 Historical Context and Evolution of OOD Research

The emergence of out-of-distribution (OOD) detection as a distinct area of research marks a significant evolution in the field of machine learning, driven by the growing need to ensure the safety and reliability of AI systems in real-world applications. Since its inception in 2017, OOD detection has evolved from a niche concern to a robust and increasingly sophisticated field, characterized by numerous milestones and transitions.

Early research in OOD detection focused on identifying instances where the input data significantly deviated from the training distribution, often referred to as "semantic shift" [5]. This shift involves changes in the underlying semantics of the data, leading to misleading predictions when using traditional classification models trained on fixed datasets. Initial studies typically evaluated the performance of classifiers on unseen categories or synthetic out-of-distribution samples, providing foundational insights into how models respond to unexpected input conditions.

A pivotal moment in the evolution of OOD detection came in 2018 when researchers recognized the importance of distinguishing between semantic and covariate shifts [5]. Covariate shifts refer to changes in the statistical properties of the input data without altering the underlying class labels, posing unique challenges for models trained on a specific distribution. This realization spurred the development of new techniques aimed at capturing subtle distributional changes, marking the beginning of a more nuanced approach to OOD detection that encompassed a broader spectrum of distributional alterations.

As OOD detection gained traction, the community witnessed a surge in methodologies designed to address various types of distributional shifts. The introduction of the OpenOOD benchmark framework [6] was a notable advancement, providing a comprehensive and standardized evaluation platform for OOD detection methods. This initiative facilitated rigorous comparisons among different techniques and highlighted the need for robust evaluation metrics that accurately reflect performance under diverse conditions.

Another critical development was the recognition of the importance of continuous adaptation in OOD detection [7]. Real-world systems often encounter evolving distribution shifts, necessitating models that can dynamically adjust their detection mechanisms. This led to the proposal of continuously adaptive OOD detection (CAOOD) methods, which leverage meta-learning to enable rapid adaptation to new distributions with minimal data. Such approaches underscore the growing emphasis on creating flexible and resilient OOD detection systems suitable for dynamic environments.

Integration of OOD detection with related fields, such as anomaly detection (AD) and open set recognition (OSR), further enriched the research landscape. These intersections expanded theoretical foundations and enabled the development of hybrid methodologies tackling broader distributional challenges. The introduction of a generalized OOD detection framework, which unifies OOD detection, AD, novelty detection (ND), OSR, and outlier detection (OD), signifies a move towards a more holistic and comprehensive approach [1].

The advent of large language models (LLMs) and their application in NLP has also profoundly impacted OOD detection research [3]. These models, processing vast amounts of text data, raise questions about the robustness and reliability of language processing systems in the face of out-of-distribution text. Researchers have thus explored specialized techniques for detecting OOD text, contributing to more robust and versatile OOD detection methods.

In recent years, the focus has shifted towards developing OOD detection methods that are not only effective but also scalable and efficient. Techniques like SUOD, which accelerates large-scale unsupervised outlier detection by optimizing for speed and accuracy [8], highlight efforts to make OOD detection more practical for real-world applications, where computational resources and efficiency are critical.

Specialized applications, such as medical imaging, have further emphasized the need for domain-specific approaches. For instance, dual-conditioned diffusion models for medical imaging OOD detection illustrate the importance of tailoring methods to meet the unique challenges of each application domain, enhancing effectiveness and relevance.

The evolution of OOD detection research reflects a continuous cycle of innovation and refinement, driven by the recognition of its critical role in ensuring AI system reliability and safety. As the field matures, it will continue to play a significant role in shaping machine learning's future, especially in high-stakes applications where misclassification can have severe consequences. This journey from early conceptualizations to a robust and multifaceted field underscores the ongoing commitment to advancing OOD detection methods, aiming for greater accuracy, adaptability, and efficiency.

### 1.3 Limitations of Traditional Approaches

Traditional approaches to out-of-distribution (OOD) detection have primarily centered on identifying instances that belong to classes unseen during the training phase, commonly referred to as "new-class detection." This approach seeks to recognize instances falling outside the semantic categories present in the in-distribution (ID) dataset. However, such methods often overlook the broader spectrum of distributional shifts, including covariate shifts and other subtle variations that do not directly correspond to new classes. Covariate shift refers to a change in the input distribution P(X) while the relationship between inputs and labels, P(Y|X), and the label set itself stay the same. Semantic shift, by contrast, changes the labels that can occur, typically because test inputs belong to categories unseen during training. The narrow focus of traditional OOD detection methods on new-class detection has led to several significant limitations that impede their effectiveness in real-world applications.
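The two kinds of shift can be written compactly. The notation below is one common formalization and is an assumption of this sketch, not a definition taken from the cited works: $X$ denotes inputs, $Y$ labels, and $\mathcal{Y}_{\text{train}}$, $\mathcal{Y}_{\text{test}}$ the label sets seen at training and test time.

```latex
\begin{aligned}
\text{Covariate shift:}\quad
  & P_{\text{train}}(X) \neq P_{\text{test}}(X),
  \qquad P_{\text{train}}(Y \mid X) = P_{\text{test}}(Y \mid X), \\
\text{Semantic shift:}\quad
  & \mathcal{Y}_{\text{test}} \not\subseteq \mathcal{Y}_{\text{train}}
  \quad \text{(some test inputs carry labels outside the training label set)}.
\end{aligned}
```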

Firstly, traditional OOD detection methods often assume that the out-of-distribution data originates from distinct classes or categories, which is rarely the case in practical scenarios. Real-world environments are inherently dynamic and unpredictable, meaning that distributional shifts can arise from various sources such as changes in sensor conditions, lighting, or contextual variations, rather than solely introducing new classes. For instance, a traffic surveillance system trained to recognize different vehicle types might encounter varying weather conditions, lighting changes, or partial occlusions, leading to covariate shifts that traditional OOD detection methods may fail to detect accurately. These methods generally lack the flexibility needed to identify subtle distributional shifts that do not strictly align with new class labels, thereby limiting their applicability in real-world settings.

Secondly, traditional OOD detection approaches often rely heavily on the assumption that the in-distribution data is well-defined and cleanly separated from out-of-distribution data. This assumption is frequently unrealistic given the complexity and diversity of real-world data distributions. In many practical applications, the boundary between in-distribution and out-of-distribution data can be blurred, making it difficult for traditional methods to draw clear distinctions. For example, consider a medical imaging system trained to identify different types of tumors. Variations in tumor size, shape, and texture can create a continuum rather than clear-cut categories, complicating the identification of out-of-distribution cases. Such scenarios highlight the limitations of methods that depend on crisp class boundaries for OOD detection, underscoring the need for more nuanced approaches that can manage continuous distributional changes.

Moreover, traditional OOD detection methods often struggle to generalize well across different datasets and scenarios due to their reliance on specific characteristics of the in-distribution data. Many methods assume that the in-distribution data exhibits certain statistical properties that can be leveraged for OOD detection, such as the presence of high confidence predictions for known classes or adherence to certain feature distributions. However, these assumptions may not hold true in diverse or heterogeneous environments, leading to degraded performance. For instance, methods based on the maximum softmax probability baseline have been shown to perform poorly in detecting covariate shifts, as they tend to assign high probabilities to familiar patterns rather than recognizing subtle deviations indicative of OOD instances. This underscores the need for methods that can adapt to varying data distributions and are less dependent on specific in-distribution characteristics.
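The maximum softmax probability baseline mentioned above is simple enough to sketch in a few lines. The logits and the threshold below are hypothetical. The second example illustrates the failure mode described in the text: an input that still produces one dominant logit is accepted regardless of whether it is actually in-distribution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability; low values suggest the input may be OOD."""
    return max(softmax(logits))

def flag_ood(logits, threshold=0.5):
    """Flag the input as OOD when the top-class probability is below the threshold."""
    return msp_score(logits) < threshold

uncertain_logits = [1.0, 0.9, 0.8]      # near-uniform prediction: flagged
overconfident_logits = [6.0, 1.0, 0.0]  # dominant logit: accepted, even if the input is shifted

print(flag_ood(uncertain_logits), flag_ood(overconfident_logits))
```

Because the score depends only on the classifier's confidence, any shift that leaves the top logit dominant is invisible to it.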

Additionally, traditional OOD detection approaches often face challenges in scaling to larger and more complex datasets, particularly when inputs are high-dimensional and the number of classes is large. Many methods are developed and tuned on small benchmarks with few classes, such as CIFAR-10, and struggle to generalize to larger semantic spaces such as ImageNet-scale label sets. For example, methods designed for OOD detection in small-scale image classification tasks may not perform well when applied to large-scale datasets with hundreds or thousands of classes. This limitation becomes particularly evident when dealing with applications like medical imaging, where the semantic space is vast and continuously expanding. Thus, the scalability of traditional OOD detection methods poses a significant barrier to their widespread adoption in real-world applications.

Furthermore, traditional OOD detection methods often lack robustness against spurious correlations present in the training data, which can lead to misleading OOD detections. Spurious correlations occur when the model learns to associate certain features with the target class, even though these associations do not hold true in out-of-distribution scenarios. For instance, a model trained to recognize different species of birds might learn to associate a specific background color with a particular bird type, leading to incorrect classifications when presented with out-of-distribution images featuring the same background color but different bird species. Such spurious correlations can significantly impair the reliability of OOD detection methods, necessitating approaches that can mitigate their effects and ensure more robust performance.

In summary, traditional OOD detection methods exhibit several notable limitations that restrict their effectiveness in real-world applications. Their focus on identifying new classes overlooks the broader spectrum of distributional shifts, leading to inadequate coverage of practical scenarios. Additionally, the assumptions underlying these methods often do not hold in diverse or heterogeneous environments, resulting in poor generalization and scalability. Addressing these limitations requires the development of more flexible and adaptable methods capable of handling continuous distributional changes and mitigating spurious correlations. By acknowledging and overcoming these challenges, researchers can pave the way for more reliable and robust OOD detection systems that are better suited for real-world deployments.

### 1.4 Unified Perspective on OOD Detection

To adopt a unified perspective on out-of-distribution (OOD) detection, it is crucial to acknowledge the breadth and diversity of distributional shifts encountered in real-world scenarios. Traditionally, OOD detection has often been constrained by a narrow focus on identifying unknown classes (that is, semantic shift), thereby overlooking other forms of distributional shift such as covariate shift. However, recent advancements, exemplified by the Unleashing Mask technique [2], offer a broader and more inclusive framework that recognizes the multifaceted nature of distributional shifts.

Recognizing the diversity of distributional shifts necessitates a paradigm shift in how we conceptualize and implement OOD detection mechanisms, moving beyond the confines of merely identifying unknown classes. At the heart of this unified perspective is the understanding that distributional shifts can manifest in various ways and affect models differently. These shifts encompass a wide spectrum, ranging from changes in data distribution due to environmental variations [9] to subtle alterations in the underlying data generation processes. Acknowledging this diversity is essential for developing more robust and adaptable OOD detection strategies.

One of the pivotal insights from recent advancements, particularly highlighted by the Unleashing Mask technique, is the realization that the intrinsic OOD detection capability of a model can be harnessed and enhanced through strategic modifications. This approach contrasts with conventional methods that rely heavily on external mechanisms, such as additional training data or explicit outlier exposure [10]. Instead, it advocates for refining the model itself, leveraging its internal structure and training dynamics to improve OOD detection performance. This shift underscores the model-specific nature of OOD detection, suggesting that different models might benefit from tailored approaches that align with their unique characteristics and training histories.

Furthermore, the Unleashing Mask technique introduces a novel method to restore the OOD discriminative capabilities of well-trained models by utilizing a mask to identify and eliminate memorized atypical samples. This innovation demonstrates the feasibility of enhancing OOD detection performance without requiring additional data or labels, aligning with broader trends in OOD detection that emphasize the importance of model-internal mechanisms and training dynamics [11].

In addition to refining model-based approaches, the unified perspective also incorporates advancements in handling the complexity and variability inherent in real-world data distributions. Emerging techniques, such as the Meta OOD Learning framework [7], address this challenge by integrating elements of domain adaptation and continual learning. This framework offers a novel approach to handling dynamic and evolving data distributions, underscoring the importance of adaptability in OOD detection systems.

Moreover, the integration of domain-specific knowledge and contextual cues has emerged as a critical factor in enhancing OOD detection performance across diverse applications. For instance, medical imaging presents unique challenges due to the presence of subtle abnormalities and the need for precise localization [12]. By leveraging specialized models and techniques tailored to the nuances of medical data, such as dual-conditioned diffusion models [13], researchers can achieve more reliable and accurate OOD detection in healthcare settings.

Advancements in multi-modal OOD detection have also brought forth sophisticated frameworks designed to handle the complexities of multi-source data streams. The WOOD framework [14] exemplifies this trend by combining binary classifiers with contrastive learning components to detect OOD samples in a weakly-supervised manner. Such approaches enhance the robustness of OOD detection and pave the way for more versatile and adaptable systems capable of accommodating a variety of data modalities.

Beyond technical innovations, the unified perspective on OOD detection also highlights the importance of standardized evaluation protocols and benchmarks. Comprehensive benchmarking platforms, such as the OpenOOD framework [6], facilitate more rigorous and fair comparisons among different OOD detection methods. This consolidation of evaluation standards fosters a more coherent and systematic approach to assessing the performance and reliability of OOD detection systems.

In summary, the unified perspective on OOD detection emphasizes the necessity of adopting a holistic and inclusive approach that encompasses various types of distributional shifts and acknowledges the model-specific nature of OOD detection. By integrating insights from recent advancements, such as the Unleashing Mask technique and other cutting-edge methodologies, the field is poised to evolve towards more robust, adaptable, and effective OOD detection strategies. As research continues to advance, the unified perspective serves as a foundational framework for fostering innovation and addressing the evolving challenges of OOD detection in real-world applications.

## 2 Related Problems and Terminology Clarification

### 2.1 Definitions and Motivations

Out-of-distribution (OOD) detection, anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD) represent distinct yet interconnected areas within the broader field of machine learning. Each term refers to specific scenarios where the goal is to identify data points that deviate from the expected patterns within a given dataset, though the underlying mechanisms, goals, and applications can vary significantly. Understanding these distinctions is crucial for selecting the most appropriate methodologies tailored to the specific needs of various machine learning applications.

Anomaly detection (AD) focuses on identifying rare events or observations that are considered abnormal or outliers relative to the majority of the data. The primary motivation for AD lies in uncovering patterns or behaviors that do not conform to the usual or expected characteristics of the dataset, indicating potential errors, fraud, or critical changes in behavior. AD finds extensive applications in diverse fields, including finance for fraud detection [1], healthcare for disease identification [3], and cybersecurity for threat identification [15]. The central challenge in AD is distinguishing true anomalies from noise or irrelevant variations within the dataset, which often requires sophisticated statistical or machine learning techniques to filter and interpret the data accurately.

Novelty detection (ND) shares some similarities with AD but focuses specifically on recognizing new types of patterns or events that have not been encountered during the training phase of a model. Unlike AD, which targets aberrant instances within a known distribution, ND aims to detect entirely new classes or phenomena that might arise due to evolving conditions or changes in the underlying environment. ND is particularly relevant in dynamic environments where the data distribution can shift over time, necessitating continuous monitoring and adaptation. Applications of ND range from scientific research to industrial settings, where new materials or processes may emerge, requiring systems to adapt to these novelties without retraining [1].

Open set recognition (OSR) addresses a scenario where a model is trained on a finite set of classes but must operate in an environment where it could encounter examples belonging to previously unseen classes or even entirely new categories. The primary motivation here is to enable robust and flexible systems that can handle unknown or unseen data gracefully without making incorrect classifications or failing to recognize the out-of-distribution nature of such data. OSR is particularly pertinent in areas like autonomous driving, where vehicles must be prepared to deal with unexpected obstacles or scenarios that were not explicitly covered in the training dataset [6]. Ensuring that a model can reliably identify and reject unknown classes is critical for maintaining the integrity and safety of the system's outputs.

Outlier detection (OD) involves identifying data points that lie far from the rest of the dataset, potentially indicating unusual occurrences or errors. While OD is sometimes used interchangeably with AD, it generally focuses more on identifying isolated data points rather than entire groups of anomalies. OD techniques are valuable in scenarios where individual extreme values can have significant impacts, such as in financial market analysis, where sudden price spikes or drops may indicate market anomalies or fraudulent activities [3]. The challenge in OD is similar to that in AD, where distinguishing true outliers from noise or natural variations in the data is essential for accurate decision-making.

Out-of-distribution (OOD) detection, as defined in this survey, encompasses the broader challenge of identifying data points that come from a distribution different from the one used for training the model. This includes situations covered by AD, ND, and OSR, but extends to any scenario where the model encounters data outside its training experience. The primary motivation for OOD detection is to prevent models from making unreliable or unsafe decisions when presented with data that differs from their training conditions. This is especially critical in safety-critical applications such as autonomous driving, where a model's failure to recognize out-of-distribution inputs could lead to hazardous outcomes [16]. The unifying framework proposed for generalized OOD detection aims to integrate the nuances of AD, ND, OSR, and OD, providing a more holistic view of the challenges and solutions in detecting out-of-distribution data [1].

In summary, while AD, ND, OSR, and OD each focus on specific aspects of detecting unusual or outlying data, OOD detection offers a more inclusive perspective that captures the essence of these different problems. By adopting a unified framework for generalized OOD detection, researchers and practitioners can better address the complexities of real-world data and enhance the reliability and robustness of machine learning systems across a wide range of applications.

### 2.2 Methodological Approaches

Comparing and contrasting the methodologies employed in anomaly detection (AD), novelty detection (ND), open set recognition (OSR), outlier detection (OD), and out-of-distribution (OOD) detection reveals a nuanced landscape of algorithmic strategies, evaluation metrics, and performance benchmarks. Each approach aims to address specific challenges in identifying data points or patterns that deviate from expected behavior or training distributions, yet they vary significantly in their implementation and underlying assumptions.

**Algorithmic Strategies**

Anomaly detection (AD) methodologies primarily focus on identifying rare events or observations that markedly deviate from the majority of the data. These strategies often involve threshold-based approaches, clustering, or statistical modeling [3]. Clustering algorithms like K-means or DBSCAN group similar instances together, with anomalies identified as outliers far from clusters. Threshold-based approaches set a boundary around normal data points, flagging any data outside this boundary as anomalous [17]. Statistical models, such as Gaussian Mixture Models (GMMs), estimate the distribution of the data and flag points with low likelihood as anomalies [8].

Novelty detection (ND), closely related to AD, aims to identify novel patterns not seen during training. ND methods often rely on density estimation, where the density of the training data is estimated and used to identify regions of lower density as potential novelties [5]. Unlike AD, ND typically assumes that the training data are free of anomalies, making it suitable for scenarios where the training set contains only normal observations and anything sufficiently different at test time counts as novel.

Open set recognition (OSR) extends closed-set multi-class classification to accommodate unknown categories or classes not present during training. OSR methodologies include confidence-score-based methods, which adjust the confidence scores of the model to better reflect uncertainty when encountering out-of-distribution data [6]. Another popular approach is the use of open-set classifiers, which learn to classify known classes while simultaneously learning to reject unknown classes [17].

Outlier detection (OD) shares similarities with AD but emphasizes robustness to extreme values or influential observations. Methods like Local Outlier Factor (LOF) and Isolation Forests are commonly used for OD [8]. LOF flags a point whose local density is much lower than that of its nearest neighbors, while Isolation Forest relies on the observation that outliers can be isolated with fewer random splits than inliers.
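To make the LOF idea concrete, the following is a brute-force sketch for small datasets. It is a simplification: each point uses exactly `k` neighbors even when distances tie, and all pairwise distances are computed directly. The toy points and the function name are hypothetical.

```python
import math

def local_outlier_factors(points, k=2):
    """Compute a Local Outlier Factor per point (brute force, O(n^2))."""
    n = len(points)
    # For each point: its k nearest neighbours as (distance, index), nearest first.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    k_distance = [nb[-1][0] for nb in neighbours]  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(1.0 / max(sum(reach) / k, 1e-12))  # floor guards against duplicates

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Five clustered points and one distant point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = local_outlier_factors(points, k=2)
print(max(range(len(points)), key=lambda i: scores[i]))  # index of the most outlying point
```

Scores close to 1 indicate a point about as dense as its neighbors, while values well above 1 indicate an outlier.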

Out-of-distribution (OOD) detection methodologies are designed to detect data points from distributions not seen during training. These methods can be broadly categorized into those that require access to OOD data during training and those that do not. Methods requiring OOD data often use these data to calibrate the decision boundaries or thresholds of the model, enabling it to better distinguish between in-distribution (ID) and OOD data [5]. Methods without access to OOD data, such as density-based methods, rely on the inherent properties of the training data to identify OOD samples. For instance, likelihood-based methods estimate the probability density of the data under the model and flag samples with low densities as OOD [6].

**Evaluation Metrics and Performance Benchmarks**

The evaluation of AD, ND, OSR, OD, and OOD detection methods relies on a range of metrics that assess both precision and recall in identifying the respective anomalies or out-of-distribution data points. Common metrics include the Area Under the Receiver Operating Characteristic Curve (AUROC), Precision-Recall (PR) curves, and F1-scores [3]. For instance, AUROC measures the trade-off between true positive rate and false positive rate, providing a comprehensive assessment of a model's ability to discriminate between ID and OOD data [5].

In the context of OOD detection, the False Positive Rate at 95% True Positive Rate (FPR95) has gained prominence alongside AUROC [5]. Under the common convention that treats ID samples as the positive class, FPR95 is the fraction of OOD samples mistakenly accepted as ID at the threshold where 95% of ID samples are correctly retained. A lower FPR95 therefore means that a method rejects more OOD samples without sacrificing sensitivity to true ID data.
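Both metrics can be computed directly from detector scores. The sketch below assumes that higher scores mean "more ID-like" and that ID is the positive class; the function names and the example scores are hypothetical.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when at least `tpr` of ID is kept."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted (small guard against float error).
    k = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)

# Hypothetical confidence scores for 10 ID and 5 OOD samples.
id_scores = [0.99, 0.97, 0.95, 0.93, 0.90, 0.88, 0.85, 0.80, 0.75, 0.60]
ood_scores = [0.92, 0.70, 0.55, 0.40, 0.30]

area = auroc(id_scores, ood_scores)        # 0.86
fpr95 = fpr_at_tpr(id_scores, ood_scores)  # 0.4
```

The pairwise AUROC loop is quadratic in the number of samples; it is used here only because it makes the definition explicit.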

For AD and OD, metrics like the Matthews Correlation Coefficient (MCC) and Cohen's kappa are frequently used because they account for all four cells of the confusion matrix (true and false positives and negatives), giving a fairer assessment than accuracy when anomalies are rare [8].
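The following sketch computes both scores from confusion-matrix counts using their standard definitions. The counts are hypothetical; the second example shows a detector that never flags anything, which reaches 90% accuracy on this data yet scores zero on both metrics.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohens_kappa(tp, fp, fn, tn):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_expected == 1:
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical detector on 100 samples with 10 true anomalies.
good_mcc = mcc(tp=8, fp=2, fn=2, tn=88)             # 7/9, about 0.778
good_kappa = cohens_kappa(tp=8, fp=2, fn=2, tn=88)  # 7/9, about 0.778

# Trivial detector that labels everything as normal (accuracy 0.90).
trivial_mcc = mcc(tp=0, fp=0, fn=10, tn=90)
trivial_kappa = cohens_kappa(tp=0, fp=0, fn=10, tn=90)
```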

OSR methods often report metrics that focus on the rejection accuracy for unknown classes, such as the open-set recognition accuracy (OSRA), which measures the proportion of unknown-class samples that are correctly rejected [6]. Additionally, the closed-set accuracy (CSA) metric evaluates the model's performance on known classes, ensuring that the emphasis on rejecting unknown classes does not come at the cost of decreased performance on known categories [17].
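A minimal sketch of the two quantities as described above. It assumes unknown samples carry the label `-1`, and that a known sample predicted as unknown counts as a closed-set error; both are conventions chosen here for illustration and may differ from those in [6] and [17].

```python
UNKNOWN = -1  # assumed label for samples from classes unseen during training

def open_set_metrics(y_true, y_pred):
    """Return (OSRA, CSA) for parallel lists of true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    known = [(t, p) for t, p in pairs if t != UNKNOWN]
    unknown = [(t, p) for t, p in pairs if t == UNKNOWN]
    # CSA: accuracy restricted to samples from known classes.
    csa = sum(1 for t, p in known if t == p) / len(known)
    # OSRA: share of unknown-class samples that were rejected.
    osra = sum(1 for _, p in unknown if p == UNKNOWN) / len(unknown)
    return osra, csa

# Hypothetical predictions: 5 known-class samples, 4 unknown-class samples.
y_true = [0, 1, 2, 0, 1, UNKNOWN, UNKNOWN, UNKNOWN, UNKNOWN]
y_pred = [0, 1, 2, 0, UNKNOWN, UNKNOWN, UNKNOWN, 2, UNKNOWN]

osra, csa = open_set_metrics(y_true, y_pred)  # 0.75, 0.8
```

Reporting the two numbers together makes the trade-off visible: an aggressive rejection rule raises OSRA but lowers CSA, as the fifth sample shows.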

**Performance Benchmarks**

Performance benchmarks for these methodologies are often established through comprehensive evaluations on standardized datasets designed to represent real-world challenges. For AD and OD, datasets like the MNIST dataset for digit recognition or the CIFAR-10 dataset for image classification provide a baseline for evaluating the performance of outlier and anomaly detection methods [8]. In a common protocol, one class is treated as the inlier class and the remaining classes as outliers, which yields well-defined inliers and outliers and allows a clear assessment of a method's ability to identify anomalies.

In the realm of OOD detection, benchmarks like the ImageNet dataset have been pivotal in evaluating the effectiveness of OOD detection methods. Specifically, datasets such as ImageNet-Vid, ImageNet-A, and ImageNet-Sketch provide a diverse array of OOD samples, enabling researchers to assess the robustness of their methods against various distribution shifts [5]. The ImageNet-O dataset, for example, serves as a clean semantic shift dataset that minimizes the interference of covariate shift, providing valuable insights into the behavior of OOD detection algorithms [5].

For OSR, benchmarks like the Open Images dataset and the CIFAR-10-C dataset offer a rich set of challenges, including the presence of unknown classes and corrupted versions of known classes, respectively [6]. These datasets enable researchers to evaluate the performance of OSR methods in handling the complexity of real-world data, where unknown categories and corruptions can significantly impact model performance [17].

Overall, the methodologies, metrics, and benchmarks used in AD, ND, OSR, OD, and OOD detection reflect the diverse challenges and nuances of detecting deviations from expected behavior or training distributions. While each approach has its unique strengths and limitations, the development of unified frameworks and cross-disciplinary collaborations holds promise for advancing the state-of-the-art in generalized OOD detection.

### 2.3 Problem Settings and Challenges

Anomalies, novelties, outliers, and out-of-distribution (OOD) data are all distinct yet interconnected concepts that challenge the robustness and reliability of machine learning systems. Each type of detection addresses specific aspects of distributional shifts and anomalies, though they share common ground in their pursuit of identifying deviant patterns. These challenges arise due to the inherent difficulties in defining, capturing, and addressing distributional shifts and anomalies in high-dimensional spaces.

**Anomaly Detection (AD):**
Anomaly detection primarily focuses on identifying rare patterns or events that deviate significantly from the majority of the data in a dataset. This process is often driven by the need to flag suspicious activities, fraudulent transactions, or abnormal health indicators. One of the core challenges in AD is the scarcity of labeled anomalies: because anomalies are rare and often unique, training sets seldom contain enough representative samples for a model to learn to separate normal from anomalous data points. Anomalies can also exhibit contextual and temporal dependencies; the same observation may be anomalous at one time or in one context but normal in another, which calls for models that adapt to these nuances. Finally, AD methods often struggle with imbalanced datasets, where normal instances vastly outnumber anomalies, which makes performance metrics harder to optimize and to interpret.

**Novelty Detection (ND):**
Novelty detection shares some similarities with AD but focuses on identifying new or unseen classes rather than anomalies within known classes. A key challenge in ND is the assumption that the training data comprehensively covers all known classes, which is seldom true in practical scenarios. This necessitates methods that can adapt to the addition of new classes without extensive retraining. Moreover, ND must distinguish true novelties from false positives, that is, data points that appear novel but are merely variations of known classes. Balancing sensitivity and specificity is essential to avoid excessive false alarms, which can undermine the system's reliability.

**Outlier Detection (OD):**
Outlier detection is akin to AD and ND but centers on identifying data points that lie outside the typical range of values observed in a dataset. OD typically does not require labeled anomalies; instead, it relies on statistical measures or distance metrics. Determining an appropriate threshold for identifying outliers is a significant challenge, as different datasets may require different thresholds, necessitating domain expertise or iterative tuning. OD methods can also be sensitive to the choice of distance metric or statistical model, leading to inconsistent results, and to noise or contamination in the data used to fit them, which can skew outcomes. Detecting multivariate outliers is a further challenge: an outlier in one dimension may mask an outlier in another, and some points are unusual only when several dimensions are considered jointly.
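The multivariate difficulty can be made concrete with a small sketch. It uses the Mahalanobis distance, a standard technique that the text above does not name, on hypothetical two-dimensional data in which the two coordinates are strongly correlated. The query point looks unremarkable on each axis separately but clearly violates the joint pattern.

```python
import math

def mean(values):
    return sum(values) / len(values)

def fit_2d(points):
    """Return means, variances, and covariance (population estimates)."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    vx = mean([(x - mx) ** 2 for x in xs])
    vy = mean([(y - my) ** 2 for y in ys])
    cxy = mean([(x - mx) * (y - my) for x, y in points])
    return mx, my, vx, vy, cxy

def z_scores(point, stats):
    """Per-dimension z-scores, ignoring correlation between dimensions."""
    mx, my, vx, vy, _ = stats
    return (point[0] - mx) / math.sqrt(vx), (point[1] - my) / math.sqrt(vy)

def mahalanobis(point, stats):
    """Distance from the mean that accounts for the 2x2 covariance matrix."""
    mx, my, vx, vy, cxy = stats
    dx, dy = point[0] - mx, point[1] - my
    det = vx * vy - cxy ** 2  # determinant of the covariance matrix
    return math.sqrt((vy * dx * dx - 2 * cxy * dx * dy + vx * dy * dy) / det)

# Hypothetical data in which y closely tracks x.
train = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8),
         (5, 5.1), (6, 6.0), (7, 6.9), (8, 8.2)]
stats = fit_2d(train)

query = (2.0, 7.0)                # both coordinates lie within the observed ranges
zx, zy = z_scores(query, stats)   # each is roughly one standard deviation out
dist = mahalanobis(query, stats)  # large, because the point breaks the correlation
```

A per-dimension threshold of, say, three standard deviations would accept this point, whereas the same threshold applied to the Mahalanobis distance would reject it. The threshold value itself remains a dataset-dependent choice.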

**Out-of-Distribution Detection (OOD):**
OOD detection is a broader category encompassing the identification of data points that do not conform to the training distribution, irrespective of whether they are novelties or anomalies. A key challenge for OOD detectors is generalizing beyond the training data, which is particularly difficult when the training and test distributions differ significantly. Unlike AD and ND, which have clear objectives, OOD detection lacks a definitive criterion for determining out-of-distribution status, leading to inconsistencies in evaluation metrics and methods. OOD detection methods must also handle multi-modal data, where different sensory inputs or modalities may undergo different distributional shifts, necessitating multi-modal OOD detection frameworks. Finally, large language models (LLMs) can remain highly confident in their predictions even on OOD data, which underscores the need for methods that detect OOD samples without relying on low model confidence.

**Open Set Recognition (OSR):**
OSR is a specific case of OOD detection focusing on recognizing objects or classes not present in the training set while maintaining high accuracy for known classes. A primary challenge in OSR is balancing false rejection rates (FRR) and false acceptance rates (FAR), given the dual objective of rejecting unknown classes confidently and classifying known classes accurately. OSR methods must also be robust to incomplete training sets and handle intra-class variability, where known classes may exhibit significant variation due to factors like pose, lighting, or occlusion. Ensuring that models can generalize across these variations while rejecting unknown classes requires sophisticated algorithms and feature representations.

These detection problems, while distinct, share fundamental challenges related to defining and addressing distributional shifts and anomalies. The need for flexible and adaptive methods that can handle real-world complexities is underscored by ongoing research advancements, such as the introduction of contrastive learning methods for more robust OOD detection solutions. However, continued research is vital to further refine these methods and broaden their applicability across diverse domains and scenarios.

### 2.4 Unified Framework for Generalized OOD Detection

To effectively address the challenges posed by out-of-distribution (OOD) data in machine learning systems, it is crucial to establish a unified framework that encompasses related problems such as anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). This integrated approach enhances the adaptability of OOD detection systems and leads to the development of more robust and versatile models capable of handling diverse distributional shifts.

Building upon the foundations laid by AD, ND, OD, and OSR, the unified framework for generalized OOD detection aims to harmonize methodologies and perspectives from various machine learning disciplines. It recognizes that the concept of OOD is intrinsically connected to detecting unexpected or novel patterns in data, which are central to AD, ND, OSR, and OD. By consolidating these areas, the framework fosters a more nuanced understanding of OOD detection, enabling the identification and mitigation of distributional shifts in complex and dynamic environments.

A key component of the unified framework is the integration of AD and ND. These methodologies focus on identifying patterns that deviate significantly from expected behaviors. Typically relying on statistical models or machine learning algorithms, AD detects anomalies within a dataset, while ND identifies novel patterns not previously encountered. Both approaches contribute to the unified framework by offering mechanisms to detect and classify outliers based on learned statistical properties.

Similarly, OSR and OD are integral to the unified framework. OSR tackles the challenge of recognizing new categories during testing that were absent in the training data, which is especially relevant in scenarios where new classes emerge over time. Applications such as image classification often face this issue, where the model must distinguish between known and unknown classes [5]. OD, on the other hand, targets instances distinctly different from the in-distribution (ID) data, regardless of class membership. This is particularly useful in filtering out data points that do not conform to the expected distribution, such as in sensor data validation or cybersecurity applications.

By merging these diverse methodologies, the unified framework for generalized OOD detection offers a holistic perspective on the problem space. This allows researchers and practitioners to adopt a flexible and adaptive approach to OOD detection. For example, in medical imaging, the framework could integrate AD to identify unusual patterns indicative of disease, ND to recognize novel disease manifestations, OSR to manage emerging diseases not present in the training data, and OD to exclude data from patients with health conditions diverging from expected profiles. Such an integrated strategy ensures robustness and reliability even in the presence of unexpected distributional shifts.

Additionally, the unified framework supports the development of adaptable OOD detection systems through transferable models and meta-learning techniques. Transferable models, as demonstrated in 'Learning by Erasing' [10], enable deploying pre-trained models across different ID datasets without extensive retraining. This is particularly beneficial in real-world settings where data distributions change dynamically. Meta-learning techniques, such as those in 'Meta OOD Learning for Continuously Adaptive OOD Detection', allow rapid adaptation of OOD detection models to new distributions with minimal retraining, enhancing their flexibility and responsiveness to varying conditions.

Furthermore, the unified framework emphasizes evaluating and validating OOD detection systems across different datasets and scenarios. It encourages adopting standardized evaluation protocols and benchmarks, such as those provided by the OpenOOD framework [18], to ensure fair and meaningful comparisons among OOD detection methods. By promoting community-wide efforts to establish rigorous evaluation standards, the unified framework advances transparency and reproducibility in research, leading to more robust and reliable OOD detection solutions.

Moreover, the unified framework facilitates exploring innovative techniques and algorithms tailored to specific application domains. For instance, in medical imaging, specialized models like dual-conditioned diffusion models can leverage in-distribution class information and latent features of input images to enhance OOD detection performance. In multi-modal scenarios, advanced frameworks like WOOD combining binary classifiers and contrastive learning components can handle complex distributional shifts.

In conclusion, the unified framework for generalized OOD detection provides a comprehensive and adaptable approach to the multifaceted challenges associated with distributional shifts in machine learning systems. By integrating AD, ND, OSR, OD, and OOD detection methodologies, the framework enables the creation of robust and versatile models adept at handling diverse and dynamic environments. Additionally, it supports advancements in transferable models, meta-learning techniques, standardized evaluation protocols, and specialized algorithms, paving the way for more effective and reliable OOD detection solutions in real-world applications.

## 3 Theoretical Foundations and Analytical Frameworks

### 3.1 PAC-Based Guarantees and VAEs

In machine learning, and particularly in out-of-distribution (OOD) detection, probably approximately correct (PAC) guarantees have emerged as a pivotal theoretical framework for quantifying the performance of OOD detection methods. Initially developed to establish bounds on the sample complexity required for learning algorithms to achieve good generalization, PAC guarantees have found renewed relevance in evaluating deep learning models, especially within variational autoencoder (VAE) frameworks [6]. VAEs, as generative models that learn a latent representation of the data, provide a natural setting for applying PAC-based guarantees to OOD detection because of their capacity to model complex data distributions and generate new samples from the learned distribution.

One of the key advantages of employing PAC-based guarantees within VAEs for OOD detection lies in their ability to offer theoretical underpinnings for the performance of these methods, particularly in high-dimensional data scenarios. High-dimensional data, commonly encountered in applications such as image and video processing, present significant challenges for traditional OOD detection approaches due to the curse of dimensionality. In such scenarios, the volume of the data space increases exponentially with the number of dimensions, making it difficult for models to accurately capture the underlying distribution and distinguish between in-distribution (ID) and out-of-distribution (OOD) samples. PAC-based guarantees provide a principled way to assess the sample complexity required for VAEs to achieve reliable OOD detection, thus offering a framework for evaluating the robustness of these models in high-dimensional spaces.

A foundational aspect of applying PAC-based guarantees to VAEs for OOD detection involves the derivation of bounds on the discrepancy between the true data distribution and the learned distribution captured by the VAE. These bounds are typically expressed in terms of the Kullback-Leibler (KL) divergence, a measure of the difference between two probability distributions. Leveraging PAC-based guarantees, researchers can derive theoretical bounds on the KL divergence between the true data distribution and the distribution approximated by the VAE, providing insights into the model's capacity to generalize to unseen data. Specifically, PAC-based guarantees allow for the quantification of the sample complexity required for the VAE to achieve a desired level of accuracy in approximating the true data distribution, which is crucial for effective OOD detection.
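For orientation, the quantities discussed above can be written out in generic form. The notation is assumed here rather than taken from [6]: $p$ denotes the true data distribution, $p_{\theta}$ the distribution modeled by the VAE, $q_{\phi}(z \mid x)$ its encoder, $p(z)$ the latent prior, and $\hat{\theta}_n$ the parameters learned from $n$ training samples. The second line shows only the general shape of a PAC statement; the concrete sample-complexity function $n(\varepsilon, \delta)$ depends on assumptions that are not reproduced here.

$$
\begin{aligned}
D_{\mathrm{KL}}\left(p \,\|\, p_{\theta}\right) &= \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{p_{\theta}(x)}\right], \\
\Pr\left[ D_{\mathrm{KL}}\left(p \,\|\, p_{\hat{\theta}_n}\right) \le \varepsilon \right] &\ge 1 - \delta \quad \text{whenever } n \ge n(\varepsilon, \delta), \\
\log p_{\theta}(x) &\ge \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] - D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right).
\end{aligned}
$$

The last line is the standard evidence lower bound that a VAE maximizes during training. It is included because the log-likelihood it bounds is what likelihood-based OOD scores are derived from.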

Empirical evidence supports the effectiveness of PAC-based guarantees in enhancing the performance of OOD detection methods within VAE frameworks. Studies have demonstrated that VAEs equipped with PAC-based guarantees exhibit superior OOD detection capabilities compared to traditional density-based approaches, particularly in high-dimensional data scenarios [1]. For instance, in the context of image data, where high-dimensional representations are commonplace, VAEs augmented with PAC-based guarantees have shown significant improvements in distinguishing between ID and OOD samples, outperforming baseline methods that rely solely on likelihood-based criteria. These results underscore the potential of PAC-based guarantees to provide a robust theoretical foundation for OOD detection within VAE frameworks, facilitating the development of more reliable and accurate OOD detection methods.

Furthermore, the application of PAC-based guarantees to VAEs for OOD detection extends beyond merely quantifying the performance of these methods. It also offers valuable insights into the design and training of VAEs for optimal OOD detection. For example, PAC-based guarantees can guide the selection of appropriate hyperparameters and architectural choices for VAEs, ensuring that the models are sufficiently expressive to capture the underlying data distribution while avoiding overfitting to the training data. This balance is crucial for effective OOD detection, as overly complex models may fail to generalize to unseen data, while overly simplistic models may lack the representational power to accurately model the data distribution. By leveraging PAC-based guarantees, researchers can systematically evaluate the trade-offs involved in designing VAEs for OOD detection, leading to the development of more robust and generalizable models.

Moreover, PAC-based guarantees within VAE frameworks enable the assessment of OOD detection performance across different levels of distributional shifts, providing a nuanced understanding of the model's behavior in various scenarios. This is particularly relevant in applications where distributional shifts are inevitable, such as in autonomous driving, where the operating environment can vary significantly over time. Through PAC-based analysis, researchers can derive theoretical bounds on the model's ability to detect OOD samples under varying degrees of distributional shift, thereby informing the design of OOD detection methods that are resilient to real-world complexities. Such insights are invaluable for ensuring the safety and reliability of machine learning systems deployed in dynamic and uncertain environments.

In addition to their theoretical benefits, PAC-based guarantees within VAE frameworks also facilitate the integration of OOD detection with other tasks, such as anomaly detection and open set recognition. By providing a principled framework for evaluating the performance of OOD detection methods, PAC-based guarantees can serve as a basis for developing unified approaches that address the challenges inherent in these related problems. For instance, in anomaly detection, where the goal is to identify rare events or anomalies within a dataset, PAC-based guarantees can help quantify the sample complexity required for reliably detecting anomalies, thereby informing the design of anomaly detection systems that are robust to noise and variations in the data. Similarly, in open set recognition, where the challenge is to recognize known classes while rejecting unknown classes, PAC-based guarantees can provide a theoretical foundation for evaluating the performance of open set recognition methods, enabling the development of more effective and reliable approaches.

However, despite their potential, the application of PAC-based guarantees within VAE frameworks for OOD detection also faces certain challenges. One notable challenge is the computational complexity associated with deriving PAC-based guarantees, particularly in high-dimensional data scenarios. Deriving tight bounds on the sample complexity required for reliable OOD detection can be computationally intensive, necessitating the development of efficient algorithms and approximations to make PAC-based guarantees feasible in practice. Another challenge lies in the translation of theoretical guarantees into practical OOD detection methods, as the derivation of PAC-based guarantees often relies on assumptions that may not hold in real-world scenarios. Therefore, while PAC-based guarantees offer a powerful theoretical framework for evaluating OOD detection methods, their practical implementation requires careful consideration of these challenges to ensure the reliability and effectiveness of the resulting methods.

In conclusion, the application of PAC-based guarantees within VAE frameworks for OOD detection represents a promising direction for enhancing the theoretical foundations and practical performance of OOD detection methods. By providing a principled way to quantify the performance of VAEs in high-dimensional data scenarios, PAC-based guarantees offer valuable insights into the design and training of VAEs for OOD detection, enabling the development of more reliable and accurate methods. Furthermore, the integration of PAC-based guarantees with VAEs facilitates the evaluation and comparison of OOD detection methods, fostering the development of unified approaches that address the challenges inherent in related problems such as anomaly detection and open set recognition. As the field of OOD detection continues to evolve, the application of PAC-based guarantees within VAE frameworks is likely to play a crucial role in advancing the theoretical and practical aspects of OOD detection, ultimately contributing to the development of safer and more reliable machine learning systems.

### 3.2 Non-Parametric Test-Time Adaptation

Non-parametric test-time adaptation is a recent approach to out-of-distribution (OOD) detection that addresses the challenges posed by shifting data distributions. Unlike traditional methods that rely on fixed parameters or assumptions about the underlying data, non-parametric test-time adaptation uses flexible, data-driven techniques to adapt models at inference time. This adaptability is crucial in real-world applications where data distributions can change rapidly due to factors such as environmental shifts or technological advancements. The primary advantage of non-parametric test-time adaptation lies in its ability to reduce false positive rates, thereby enhancing the reliability of OOD detection systems.

At the core of non-parametric test-time adaptation is the principle of adapting to incoming data without making strong parametric assumptions. This flexibility enables the method to handle new data points effectively, even when they originate from unexpected or previously unseen distributions. Notably, non-parametric methods excel in learning from small datasets, making them particularly valuable in scenarios where labeled data is limited. By dynamically adjusting parameters based on observed data, non-parametric test-time adaptation can significantly reduce false positives, a common issue in OOD detection due to the difficulty of distinguishing between in-distribution (ID) and out-of-distribution (OOD) samples.

The theoretical foundation of non-parametric test-time adaptation is deeply rooted in non-parametric statistics and machine learning principles. These principles advocate for the use of flexible models that can adapt to new data without extensive retraining or parameter tuning. Kernel methods, extensively utilized in non-parametric statistics, exemplify this approach by estimating probabilities and densities directly from data, making them well-suited for OOD detection tasks. Kernel methods can be updated at test time to reflect current data distributions, thereby enhancing their sensitivity to OOD samples.
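As a sketch of how a kernel estimate can be updated at test time, the class below keeps a memory of observed points and evaluates a Gaussian kernel density estimate over them. The class name, the fixed bandwidth, and the unconditional update rule are illustrative assumptions; a real system would decide which test samples are trustworthy enough to add.

```python
import math

class GaussianKDE:
    """One-dimensional kernel density estimate with a growing sample memory."""

    def __init__(self, bandwidth):
        self.bandwidth = bandwidth
        self.points = []

    def update(self, x):
        """Add an observation; no retraining is required."""
        self.points.append(x)

    def density(self, x):
        """Average of Gaussian kernels centered on the stored points."""
        h = self.bandwidth
        norm = len(self.points) * h * math.sqrt(2 * math.pi)
        total = sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in self.points)
        return total / norm

kde = GaussianKDE(bandwidth=0.5)
for x in [0.0, 0.1, -0.1, 0.2]:   # hypothetical training-time observations
    kde.update(x)

before = kde.density(3.0)          # very low: 3.0 is far from everything seen so far

for x in [2.9, 3.0, 3.1]:          # hypothetical test-time observations after a shift
    kde.update(x)

after = kde.density(3.0)           # much higher once the estimate reflects the shift
```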

Inspired by recent advances in generative models, non-parametric test-time adaptation incorporates non-parametric elements to improve adaptability. For instance, Variational Autoencoders (VAEs) can be enhanced with non-parametric priors like Gaussian Processes to better capture distribution shifts. This augmentation enables VAEs to adjust their data representation dynamically based on new observations, leading to more accurate OOD detection.

A key challenge in OOD detection is balancing false positives and false negatives. Non-parametric test-time adaptation tackles this issue by implementing adaptive thresholds and decision rules that adjust according to the current data distribution. If a model detects a surge in OOD samples, the threshold for identifying such samples can be lowered to minimize false negatives. Conversely, when the data distribution stabilizes, the threshold can be raised to reduce false positives.
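One way to realize such an adaptive decision rule is sketched below. It assumes that higher scores mean "more OOD-like", tracks the share of recent samples that were flagged, and switches to a lower (more sensitive) threshold while that share is high. The class, its parameter values, and the switching rule are hypothetical and serve only to illustrate the idea.

```python
from collections import deque

class AdaptiveThreshold:
    """Switch between two thresholds based on the recent rate of OOD flags."""

    def __init__(self, base=0.8, sensitive=0.6, window=5, surge_rate=0.4):
        self.base = base              # threshold used under stable conditions
        self.sensitive = sensitive    # lower threshold used during a surge
        self.surge_rate = surge_rate  # flagged share that counts as a surge
        self.recent = deque(maxlen=window)

    def is_ood(self, score):
        rate = sum(self.recent) / len(self.recent) if self.recent else 0.0
        threshold = self.sensitive if rate >= self.surge_rate else self.base
        flagged = score > threshold
        self.recent.append(flagged)
        return flagged

detector = AdaptiveThreshold()
# Hypothetical OOD scores: a burst of high scores, then a return to normal.
scores = [0.1, 0.9, 0.85, 0.7, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7]
flags = [detector.is_ood(s) for s in scores]
```

In this trace the score 0.7 is flagged while the surge is ongoing (fourth and fifth samples) but accepted once the window has filled with unflagged samples (last sample), which matches the lower-then-raise behavior described above.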

Non-parametric test-time adaptation stands out for its seamless integration with existing machine learning pipelines. Unlike methods requiring extensive retraining or architectural changes, non-parametric test-time adaptation operates as a post-processing step, making it highly accessible and versatile. This is particularly beneficial in industrial settings where deploying new models is resource-intensive. Real-time adaptation facilitated by non-parametric methods enhances model robustness and reliability over time through continuous monitoring and updating.

Furthermore, non-parametric test-time adaptation addresses the limitations of traditional OOD detection methods, such as those based on likelihood ratios or density estimates. These traditional methods can struggle with complex, high-dimensional data distributions and are susceptible to overfitting. Non-parametric approaches, however, are more flexible and can accommodate a broader range of data distributions. For instance, nearest neighbor methods have outperformed likelihood-based methods in scenarios with complex, multimodal data, as they do not assume a specific data distribution shape.
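The nearest-neighbor idea can be sketched in a few lines. The score below is the distance from a query to its k-th nearest stored ID feature; the feature bank, the value of k, and the two-dimensional features are hypothetical. The bank is deliberately bimodal: the midpoint between the two clusters would sit at the mean of a single Gaussian fitted to this data, yet it is far from every stored sample.

```python
import math

def knn_score(query, bank, k=3):
    """Distance to the k-th nearest neighbor; larger means more OOD-like."""
    distances = sorted(math.dist(query, point) for point in bank)
    return distances[k - 1]

# Hypothetical ID features forming two well-separated clusters.
bank = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
        (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]

score_id = knn_score((0.2, 0.2), bank)   # inside the first cluster
score_gap = knn_score((5.0, 5.0), bank)  # between the clusters
```

Because the score depends only on distances to stored samples, adding new ID features to `bank` at test time adapts the detector without any retraining.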

Practical applications of non-parametric test-time adaptation span diverse fields, including computer vision, natural language processing (NLP), and medical imaging. In computer vision, non-parametric methods have improved the robustness of object detection models against distribution shifts caused by changes in lighting or object orientation. In NLP, non-parametric techniques have enhanced OOD text sample detection, accounting for stylistic variations or topic shifts. Similarly, in medical imaging, non-parametric methods have proven effective in detecting OOD regions indicative of diseases or anomalies.

While non-parametric test-time adaptation offers significant advantages, it faces challenges such as high computational costs and the need for robust calibration. Continuous updates to internal parameters or decision rules can be computationally intensive, necessitating optimizations like approximate nearest neighbor methods or dimensionality reduction. Moreover, careful validation is essential to prevent overfitting to noise or transient data patterns. Future research should focus on refining theoretical properties and practical implementations to fully realize the potential of non-parametric test-time adaptation in real-world applications.

### 3.3 Set-Based Safety Verification

Set-based methods for safety verification offer a rigorous mathematical foundation for assessing the reliability and safety of machine learning models in scenarios with limited sensor ranges and occlusions. This approach is particularly critical in automated driving contexts, where the consequences of misclassification or failure to detect out-of-distribution (OOD) data can be severe. By leveraging set-based techniques, researchers can establish formal guarantees that ensure the safety of autonomous vehicles in dynamic and uncertain environments.

In OOD detection, set-based safety verification aims to create a robust framework for verifying the correctness of predictions made by machine learning models. This framework involves defining sets of possible states or behaviors that represent valid operational scenarios and then ensuring that the model's predictions fall within these predefined sets. Because the check is a set-membership test rather than a statistical estimate, this approach provides formal assurances regarding the model's behavior, which is crucial for safety-critical applications such as automated driving.

A key aspect of set-based safety verification is the establishment of a safe region in the state space where the model’s predictions are considered reliable and safe. This region is typically delineated by constraints reflecting the operational requirements of the autonomous vehicle. For example, in automated driving, a safe region could include all states where the vehicle maintains a safe distance from obstacles and navigates without collision, despite limitations like sensor occlusions or limited ranges.

The roots of set-based methods in OOD detection can be found in early formal verification techniques for control systems. Recent advancements in machine learning have extended the application of these methods to complex models such as neural networks and large language models (LLMs). For instance, LLMs [19] have prompted the development of verification techniques capable of handling the high-dimensional, non-linear decision boundaries of these models. The growing reliance on machine learning in safety-critical domains like autonomous driving has fueled interest in set-based safety verification.

In automated driving, set-based safety verification helps mitigate the challenges posed by limited sensor ranges and occlusions. Sensors like LiDAR and cameras often have restricted ranges and can be affected by occlusions from other vehicles, pedestrians, or environmental elements. In these situations, incomplete or ambiguous sensor data can influence model predictions, leading to potential misclassifications or errors. Set-based methods can mitigate these risks by ensuring that predictions remain within a safe region, even with partial or noisy sensor data.

For example, set-membership filters are used to ensure safety in scenarios with limited sensor ranges. These filters define a set of possible states consistent with available sensor measurements and verify that the model’s predictions align with these sets. If a prediction falls outside the defined set, it is flagged as unsafe, potentially triggering a warning or emergency braking response.
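The mechanics can be sketched with one-dimensional intervals. The example assumes a position that evolves with bounded velocity and a range sensor with bounded noise; the class, the function names, and all numbers are hypothetical, and a real filter would use richer set representations for multi-dimensional states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def intersect(self, other):
        """Set of values consistent with both intervals."""
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        if lo > hi:
            raise ValueError("measurement inconsistent with predicted set")
        return Interval(lo, hi)

    def contains(self, x):
        return self.lo <= x <= self.hi

def predict(position, velocity, dt):
    """Propagate the position set under bounded velocity (requires dt >= 0)."""
    return Interval(position.lo + velocity.lo * dt, position.hi + velocity.hi * dt)

def measurement_set(z, noise_bound):
    """All true positions compatible with a reading z and bounded noise."""
    return Interval(z - noise_bound, z + noise_bound)

# Hypothetical obstacle: position in [10, 12] m, velocity in [4, 6] m/s.
predicted = predict(Interval(10.0, 12.0), Interval(4.0, 6.0), dt=1.0)  # [14, 18]
feasible = predicted.intersect(measurement_set(15.0, noise_bound=1.5))  # [14, 16.5]

# Check two hypothetical model outputs against the feasible set.
consistent = feasible.contains(15.2)  # inside the set: accepted
unsafe = not feasible.contains(19.0)  # outside the set: flagged as unsafe
```

An empty intersection is treated as an error here because it means the sensor reading contradicts the assumed motion bounds, which is itself a signal that the operating assumptions no longer hold.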

Similarly, set-based methods address occlusion detection challenges. Occlusions can severely impact model reliability, especially when decisions rely heavily on visual cues. In scenarios where pedestrians or obstacles are partially occluded, models may struggle with accurate classification or trajectory prediction. Set-based methods define sets of possible configurations for occluded objects and verify that predictions are consistent with these sets, ensuring reliable outcomes even under occlusion.

Additionally, set-based safety verification can be integrated with robust optimization techniques to enhance model robustness under worst-case perturbations. This is particularly relevant in adversarial attack scenarios, where providing formal guarantees that predictions remain safe under worst-case conditions strengthens the resilience of machine learning systems in dynamic environments.

In summary, set-based safety verification offers a powerful framework for ensuring the reliability and safety of machine learning models in scenarios with limited sensor ranges and occlusions. By employing rigorous mathematical techniques, these methods provide formal assurances that model predictions remain within safe regions, even in challenging operational contexts. This is essential for safe and reliable autonomous system deployment in safety-critical applications.

### 3.4 Divergence-Based Indicators and Fine-Tuning

Divergence-based Out-of-Distribution (OOD) indicators, derived from deep generative models, offer an alternative perspective on assessing the distributional discrepancy between in-distribution (ID) and out-of-distribution (OOD) samples. These indicators fundamentally differ from traditional likelihood-based approaches by leveraging the statistical divergence between the data distribution learned by a model and the observed data points during testing. By focusing on measuring the distance or dissimilarity between the data distributions, divergence-based methods aim to quantify how much the observed data deviates from the expected distribution, providing a robust measure for detecting OOD samples.

In contrast to likelihood-based approaches, which rely heavily on the probability density estimates of the data under a trained model, divergence-based methods focus on quantifying the distance between distributions. One prominent divergence-based indicator is the Jensen-Shannon divergence (JSD), which measures the similarity between two probability distributions. By applying JSD to the distributions learned by a deep generative model, such as a Variational Autoencoder (VAE) or a Generative Adversarial Network (GAN), researchers can evaluate how well the model captures the true data distribution and identify discrepancies indicative of OOD samples.
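
To make the JSD concrete, the following sketch computes it for two discrete distributions on a shared support. How such distributions are obtained from a VAE or GAN is not specified here; the histograms in the example are hypothetical placeholders.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions on the same support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence; symmetric, and within [0, 1] when using log base 2."""
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms over four bins.
reference = [0.4, 0.3, 0.2, 0.1]
observed = [0.1, 0.2, 0.3, 0.4]
print(js_divergence(reference, observed))
```

Unlike the KL divergence, the JSD stays finite when the two distributions have different supports, which is convenient when test data occupy regions the model never saw.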

The Single-shot Fine-tune algorithm, introduced in the context of leveraging intrinsic OOD detection capabilities, operates by introducing a mask that identifies atypical samples in the training data and fine-tunes the model to forget these memorized samples. This process helps to restore the OOD discriminative capabilities of the model by ensuring that it focuses on learning the intrinsic characteristics of the ID data rather than memorizing specific data points. By fine-tuning the model with the identified mask, the algorithm ensures that the model remains robust to distributional shifts while maintaining high accuracy in recognizing ID samples [2].

This approach is particularly beneficial in scenarios where the model needs to operate in environments with varying degrees of distributional shifts, such as in autonomous driving or medical imaging applications. For example, in automated driving, divergence-based OOD detection can help in identifying situations where the sensor data deviates significantly from typical driving conditions, indicating potential hazards or anomalies that the model was not trained to handle. Similarly, in medical imaging, divergence-based methods can detect unusual patterns that may indicate rare diseases or anomalies not covered in the training data.

Furthermore, the Single-shot Fine-tune algorithm addresses one of the critical challenges in OOD detection: likelihood-based scores can be misled by spurious correlations in the data and are sensitive to the distributional assumptions of the underlying model. In contrast, divergence-based methods, such as those proposed in the DOI framework, measure distributional shift by directly quantifying the distance between distributions. This not only enhances the model's ability to detect OOD samples but also yields a more interpretable measure of the extent of the shift [2].

In summary, divergence-based OOD indicators, particularly those derived from deep generative models, offer a promising direction for improving the robustness and accuracy of OOD detection methods. The Single-shot Fine-tune algorithm, which leverages the intrinsic properties of the data distribution and fine-tunes the model to better reflect these properties, represents a significant step forward in addressing the challenges associated with likelihood-based approaches. By focusing on the statistical divergence between distributions, this method provides a more robust and interpretable approach to OOD detection, enhancing the reliability of machine learning models in real-world applications [2].

### 3.5 Likelihood-Ratio-Based Methods and Falsehoods

Likelihood-based OOD detection methods have long been considered a cornerstone approach due to their theoretical elegance and computational efficiency. These methods typically rely on the probability density estimated by the model for a given input, assuming that OOD samples are characterized by lower densities than in-distribution (ID) samples. However, despite their widespread adoption, likelihood-based OOD detection methods are often plagued by several inherent limitations and misconceptions that undermine their practical efficacy.

A common misconception is that likelihood-based methods are universally effective because they directly estimate the density of the input space. This belief stems from the intuition that, by design, likelihood-based methods should naturally favor samples that fit the learned distribution, thereby identifying outliers as samples with low likelihoods. Empirical evaluations, however, have shown that likelihood-based methods often fail to accurately discriminate between ID and OOD samples, especially when the distribution shift is subtle or the model has not seen sufficient diversity in its training data. This limitation is exacerbated in high-dimensional spaces where the curse of dimensionality can lead to unreliable density estimates, making likelihood-based methods less dependable.

Another misconception is that likelihood-based OOD detection is inherently robust to overfitting, given that these methods are based on the model’s internal probability density functions. However, in practice, likelihood-based methods are highly susceptible to overfitting, particularly when the model is trained on a dataset with complex and varied structures. Overfitting can cause the model to assign very low likelihoods to even in-distribution samples, leading to an increase in false positive rates and a degradation of detection performance. For instance, models trained on large-scale datasets such as ImageNet may exhibit this behavior when deployed on smaller or more specialized datasets, where the model’s density estimates become unreliable due to insufficient coverage of the ID distribution.

To address these limitations, researchers have proposed the OOD proxy framework, which leverages likelihood ratios to improve the reliability and robustness of OOD detection. The OOD proxy framework operates by comparing the likelihood of a given sample under the ID model to a proxy model representing the OOD distribution. This approach mitigates the reliance on direct likelihood estimates and instead focuses on the relative difference in likelihoods between the ID and OOD models. By framing OOD detection as a likelihood ratio comparison, the OOD proxy framework can better capture the nuanced differences between ID and OOD samples, leading to improved detection accuracy.

At the core of likelihood-ratio-based methods lies the principle of comparing the likelihood of a sample under two competing models: the ID model and the OOD model. The likelihood ratio \(p_{\text{ID}}(x) / p_{\text{OOD}}(x)\) quantifies how much more likely a sample is under one model than under the other. Specifically, a ratio well below one (a negative log-ratio) indicates that the sample is likely OOD, a ratio well above one suggests that the sample is ID, and a ratio near one means that the two models explain the sample about equally well. This principle allows likelihood-ratio-based methods to circumvent some of the pitfalls associated with direct likelihood estimation, such as the impact of overfitting and the sensitivity to distribution shifts.
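
A minimal sketch of this comparison, working in log space for numerical stability. Both models are assumed to be univariate Gaussians purely for illustration: a narrow one for the ID data and a broad one as the proxy. The function names, the parameters, and the threshold of zero are hypothetical choices.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)


def log_likelihood_ratio(x, id_model, proxy_model):
    """log p_ID(x) - log p_proxy(x); positive values favor the ID model."""
    return gaussian_log_pdf(x, *id_model) - gaussian_log_pdf(x, *proxy_model)


def is_ood(x, id_model, proxy_model, threshold=0.0):
    """Flag x as OOD when the log-ratio falls below the threshold."""
    return log_likelihood_ratio(x, id_model, proxy_model) < threshold


id_model = (0.0, 1.0)     # assumed (mean, std) of the ID model
proxy_model = (0.0, 5.0)  # assumed broad proxy for everything else
print(is_ood(0.5, id_model, proxy_model))
print(is_ood(4.0, id_model, proxy_model))
```

In practice both terms would come from learned density models, and the threshold would be chosen on held-out ID data to meet a target false positive rate.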

Moreover, likelihood-ratio-based methods can address the shortcomings of traditional density-based approaches by providing a more nuanced view of the likelihood landscape. Unlike density-based methods that focus solely on the absolute likelihood of a sample, likelihood-ratio-based methods offer a comparative assessment that is less prone to errors arising from overfitting or distributional drift. This comparative approach is particularly advantageous in scenarios where the OOD distribution is unknown or poorly understood, because a generic proxy model can stand in for it and an exact model of the OOD distribution is not required.

The effectiveness of likelihood-ratio-based methods has been demonstrated through various empirical studies, which highlight their robustness in diverse application domains. For instance, in the context of medical imaging, likelihood-ratio-based methods have shown promise in detecting subtle abnormalities that might be missed by traditional density-based approaches. Similarly, in natural language processing, these methods have proven useful in identifying anomalous text samples that deviate from the expected linguistic patterns. These successes underscore the potential of likelihood-ratio-based methods to enhance the reliability and generalizability of OOD detection systems across different domains.

Despite these advancements, the adoption of likelihood-ratio-based methods is not without challenges. One key challenge is the selection of appropriate proxy models for estimating the OOD distribution. The choice of proxy model can significantly influence the performance of likelihood-ratio-based methods, necessitating careful consideration of the underlying assumptions and potential biases. Additionally, the computational overhead associated with likelihood ratio computation can be substantial, especially for high-dimensional data, which may limit the scalability of these methods in certain applications.

In summary, while likelihood-based OOD detection methods have traditionally been favored for their simplicity and interpretability, their effectiveness is often overstated due to inherent limitations and misconceptions. The introduction of likelihood-ratio-based methods, particularly within the OOD proxy framework, offers a promising avenue for addressing these shortcomings by leveraging comparative assessments of likelihoods. By adopting likelihood-ratio-based approaches, researchers and practitioners can enhance the robustness and reliability of OOD detection systems, paving the way for more effective and versatile applications in safety-critical machine learning domains.

### 3.6 Aleatoric Uncertainty and Bayesian OOD Detection

Aleatoric uncertainty, a form of variability inherent in the stochasticity of observed data, plays a critical role in enhancing the robustness and reliability of Bayesian models for out-of-distribution (OOD) detection. Unlike epistemic uncertainty, which stems from model limitations and can be reduced by increasing the amount of training data or improving model capacity, aleatoric uncertainty captures the noise and variability intrinsic to the data itself. Integrating both aleatoric and epistemic uncertainties into OOD detection models provides a more comprehensive understanding of model confidence and performance across different data distributions. This section explores the utilization of aleatoric uncertainty in Bayesian OOD detection models, alongside strategies for incorporating outlier exposure, emphasizing the benefits of combining these uncertainties for improved OOD detection.

Bayesian models naturally accommodate both types of uncertainty by quantifying them through probabilistic frameworks. Aleatoric uncertainty is often modeled through likelihood functions that incorporate noise models reflecting the data's inherent variability. For example, in a Gaussian noise model, the likelihood function assumes that the observed data \(y\) is generated from the true value \(f(x)\) plus Gaussian noise \(\epsilon \sim \mathcal{N}(0, \sigma^2)\), represented as \(y = f(x) + \epsilon\). Here, \(\sigma^2\) denotes the variance of the aleatoric noise; in the heteroscedastic case it depends on the input and is written \(\sigma^2(x)\).

Conversely, epistemic uncertainty arises from the finite sample size and model capacity constraints. It is captured through posterior distributions over model parameters, reflecting the range of plausible values given the observed data. In Bayesian OOD detection, the posterior predictive distribution integrates both aleatoric and epistemic uncertainties, yielding a probabilistic output that accounts for the model's knowledge and data variability.
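
One common way to separate the two contributions in the posterior predictive distribution is the law of total variance. The sketch below assumes that the posterior is approximated by a handful of sampled models (for example, ensemble members or Monte Carlo samples), each returning a predictive mean and a noise variance for the same input; the function name and the example numbers are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_predictions):
    """Split predictive variance into aleatoric and epistemic parts.

    member_predictions: list of (mean, variance) pairs, one per sampled model.
    """
    means = [m for m, _ in member_predictions]
    variances = [v for _, v in member_predictions]
    aleatoric = fmean(variances)   # average noise variance predicted by the models
    epistemic = pvariance(means)   # disagreement between the models' means
    return aleatoric, epistemic, aleatoric + epistemic


# Members agree on the mean but predict noisy data: mostly aleatoric.
print(decompose_uncertainty([(1.0, 0.2), (1.0, 0.4)]))
# Members disagree on the mean: mostly epistemic.
print(decompose_uncertainty([(0.0, 0.1), (2.0, 0.1)]))
```

Under this decomposition, inputs far from the training data tend to show up as disagreement between sampled models, which is why the epistemic term is often the more useful OOD signal.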

A notable strategy for incorporating aleatoric uncertainty in Bayesian OOD detection is through outlier exposure. This involves enriching the training dataset with known OOD samples, allowing the model to learn the distinctive features of such data. Outlier exposure has proven effective in improving the model's capability to identify OOD samples, even when the OOD distribution differs significantly from the training data distribution.

In the context of Bayesian models, outlier exposure can be implemented via a mixture model that includes components for both in-distribution (ID) and OOD data. During training, the model learns to distinguish between these components, thereby capturing the aleatoric uncertainty associated with OOD data. This distinction can be facilitated through separate likelihood functions for ID and OOD data or a gating mechanism that weighs the contributions of each component to the overall likelihood.

Furthermore, incorporating aleatoric uncertainty enables the application of probabilistic calibration techniques, crucial for ensuring that predicted probabilities accurately reflect the true likelihood of outcomes. Techniques like temperature scaling, which adjusts the model's output logits to improve calibration, can be applied to the posterior predictive distribution of Bayesian models. This calibration ensures that the model's predictions are well-calibrated across different data distributions, thereby enhancing its ability to accurately detect OOD samples.
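
Temperature scaling itself is a one-parameter transformation: the logits are divided by a scalar \(T > 0\) before the softmax. The sketch below shows only the scaling step; in practice \(T\) is usually fitted on a held-out validation set by minimizing the negative log-likelihood, which is omitted here. The logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T; T > 1 softens and T < 1 sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # hypothetical classifier outputs
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))
```

Because dividing by a positive constant preserves the ordering of the logits, the predicted class is unchanged; only the confidence attached to it moves.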

Combining aleatoric and epistemic uncertainties also leads to a more nuanced understanding of model performance in OOD detection tasks. High aleatoric uncertainty might signal that the model encounters data points with significant intrinsic variability, while high epistemic uncertainty could indicate the model's uncertainty due to limited data or model capacity. By integrating these uncertainties, the model provides a richer and more interpretable assessment of its confidence, aiding in informed decisions regarding OOD detection.

Recent advancements underscore the importance of addressing both types of uncertainties in OOD detection. For instance, the SupEuclid method [20] shows that simple approaches, such as supervised contrastive learning combined with Euclidean distance, can achieve high-quality OOD detection when aleatoric and epistemic uncertainties are well-calibrated. Similarly, the Generalized Out-of-Distribution Detection survey [1] advocates for a unified framework that considers various uncertainties and distributional shifts, reinforcing the significance of integrating aleatoric and epistemic uncertainties in OOD detection models.

In conclusion, leveraging aleatoric uncertainty in Bayesian OOD detection models, coupled with outlier exposure strategies, enhances the robustness and reliability of OOD detection. By accounting for both aleatoric and epistemic uncertainties, these models offer a more comprehensive and interpretable assessment of model confidence, ultimately improving OOD detection performance across diverse data distributions. Future research could delve deeper into advanced methods for modeling aleatoric uncertainty and techniques for dynamically adjusting the model's sensitivity to these uncertainties based on data characteristics.

## 4 Methodologies for Detecting Out-of-Distribution Data

### 4.1 Contrastive Learning Methods

Contrastive learning methods have emerged as a powerful approach in the field of out-of-distribution (OOD) detection due to their ability to capture meaningful representations from data. These methods can be broadly categorized into two types: instance discrimination and supervised contrastive learning. Both approaches aim to distinguish in-distribution (ID) samples from out-of-distribution (OOD) samples by leveraging the inherent structure of the data, but they differ in how they define positive and negative samples.

Instance discrimination involves training a model to differentiate between the same and different instances of data points. Specifically, the model is trained on pairs of samples, where one is a transformed version of the same instance (positive) and the other is a different instance drawn randomly from the dataset (negative). This setup encourages the model to learn representations that capture the unique characteristics of each data point, thus facilitating the differentiation between similar and dissimilar instances within the ID data. In OOD detection, this method can be extended by treating OOD samples as negative samples. However, this extension relies on the availability of OOD data during training, which is often impractical due to the scarcity or unavailability of such data.

Supervised contrastive learning, in contrast, operates on labeled data and explicitly defines positive and negative pairs based on class labels. Positive pairs consist of samples from the same class, while negative pairs comprise samples from different classes. This method ensures that the learned representations not only capture the local structure within individual classes but also maintain a strong separation between different classes. Consequently, in OOD detection, supervised contrastive learning can identify OOD samples by how poorly their embeddings match any of the learned class clusters, thereby avoiding the necessity for OOD data during training. This makes supervised contrastive learning a more versatile approach compared to instance discrimination, especially in scenarios where OOD data is limited or inaccessible.
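
Once class-structured embeddings are available, a simple scoring rule is the Euclidean distance from a test embedding to the nearest class centroid. The sketch below assumes that embeddings have already been produced by a trained encoder; the encoder, the two-dimensional vectors, and the function names are all hypothetical, and this is not presented as the procedure of any specific paper.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Mean embedding per class, computed from labeled ID data."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def ood_score(embedding, centroids):
    """Distance to the nearest class centroid; larger means more likely OOD."""
    return min(math.dist(embedding, c) for c in centroids.values())


# Hypothetical 2-D embeddings for two ID classes.
train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
centroids = class_centroids(train, labels)
print(ood_score([0.0, 1.5], centroids))  # close to class 0
print(ood_score([5.0, 6.0], centroids))  # far from both classes
```

A threshold on this score, chosen on held-out ID data, turns it into a detector without requiring any OOD examples.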

The effectiveness of contrastive learning methods in OOD detection varies based on the availability of OOD data and the specific characteristics of the ID data. When OOD data is abundant, instance discrimination can leverage this resource to train models that are robust to OOD samples. Conversely, in situations where OOD data is scarce, supervised contrastive learning can still achieve satisfactory results by utilizing the structural information within the ID data. A study [18] illustrates that contrastive learning methods can perform competitively in OOD detection even without direct exposure to OOD data during training.

Fine-tuning a pre-trained model on a small amount of ID data is a common practice to enhance the effectiveness of contrastive learning methods in OOD detection. This process allows the model to adapt to the specific nuances of the target ID data, thereby improving its ability to discern OOD samples accurately. This is particularly beneficial in transfer learning scenarios, where a model pretrained on a large dataset is fine-tuned on a smaller, domain-specific dataset. For example, in medical imaging applications, fine-tuning a pre-trained contrastive learning model on a limited set of ID samples has been shown to significantly improve OOD detection performance. The work by [21] demonstrates that fine-tuning on atypical ID samples can enhance the model’s sensitivity to OOD samples, highlighting the importance of fine-tuning in refining the robustness of contrastive learning methods.

Furthermore, the choice between instance discrimination and supervised contrastive learning can influence the performance of OOD detection. Instance discrimination tends to focus on capturing local structural patterns within the ID data, which may not always generalize well to OOD samples that lie far from the ID data manifold. In contrast, supervised contrastive learning aims to capture broader semantic relationships between classes, providing a more robust basis for distinguishing OOD samples. This distinction is particularly relevant in scenarios where OOD samples exhibit significant variability from the ID data, posing challenges for purely local methods to detect them accurately.

In summary, contrastive learning methods, including both instance discrimination and supervised contrastive learning, present promising avenues for OOD detection. Instance discrimination leverages abundant OOD data to learn robust representations, whereas supervised contrastive learning relies on the internal structure of ID data to effectively distinguish OOD samples even in the absence of explicit OOD training data. Fine-tuning plays a critical role in enhancing the effectiveness of these methods by allowing models to better adapt to the characteristics of the target ID data, ultimately improving OOD detection performance. Future research should continue to explore the nuances between these methods and their optimal application in various OOD detection scenarios, with the goal of developing more generalized and adaptable solutions for ensuring the reliability and safety of machine learning systems in real-world applications.

### 4.2 Unsupervised Methods Based on Model Statistics

Density of states estimation (DoSE) represents a class of unsupervised OOD detection methods that function without requiring either labeled data or examples of out-of-distribution data [18]. Whereas the contrastive learning methods discussed previously shape the learned representation, DoSE offers a complementary approach: it leverages statistics of an already trained model to measure the typicality of inputs relative to the in-distribution data. This method is particularly useful in scenarios where obtaining labeled data or out-of-distribution samples is challenging or impossible.

The DoSE method operates on the principle that model statistics, such as the output of intermediate layers or the activations of neurons, form a distribution when fed with in-distribution data. These statistics are expected to exhibit a certain level of regularity and structure that is characteristic of the underlying data distribution. However, when faced with out-of-distribution data, the statistics often deviate from this regular pattern, indicating a higher level of unpredictability or abnormality.

To operationalize this idea, DoSE utilizes nonparametric density estimators to assess the typicality of model statistics. Nonparametric density estimation methods, unlike parametric methods, do not assume a specific form for the underlying distribution and instead rely on the empirical distribution derived from the data. Common nonparametric methods include kernel density estimation (KDE), histograms, and nearest neighbor methods. KDE, in particular, is favored due to its flexibility and smoothness, allowing it to capture complex structures in the data without overfitting.

In the context of OOD detection, the DoSE method involves training a neural network on in-distribution data and collecting statistics from intermediate layers. These statistics are then used to build a nonparametric density estimate that serves as a reference for typical in-distribution behavior. During the testing phase, new samples are processed through the same network, and their statistics are compared against the reference density. Samples whose statistics fall into regions of low density are flagged as potential out-of-distribution data, while those fitting well within the high-density regions are classified as in-distribution.
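
The pipeline above can be sketched for a single scalar statistic. This is a simplified illustration, not the full DoSE procedure: it assumes one statistic per input, a Gaussian kernel with a hand-picked bandwidth, and a threshold set so that roughly 5% of the ID statistics fall below it. All names and numbers are hypothetical.

```python
import math


def kde_log_density(x, samples, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    # Guard against log(0) when x is far from every reference statistic.
    return math.log(max(total, 1e-300)) - math.log(norm)


def fit_threshold(id_statistics, bandwidth, quantile=0.05):
    """Log-density below which about `quantile` of the ID statistics fall.

    Scoring the same samples used to build the estimate is optimistic;
    a held-out ID split would be preferable in practice.
    """
    scores = sorted(kde_log_density(s, id_statistics, bandwidth) for s in id_statistics)
    return scores[int(quantile * len(scores))]


def flag_low_density(statistic, id_statistics, bandwidth, threshold):
    """Return True when the statistic lies in a low-density region."""
    return kde_log_density(statistic, id_statistics, bandwidth) < threshold


# Hypothetical statistics collected from ID data.
id_stats = [i * 0.1 for i in range(-20, 21)]
threshold = fit_threshold(id_stats, bandwidth=0.3)
print(flag_low_density(0.0, id_stats, 0.3, threshold))
print(flag_low_density(10.0, id_stats, 0.3, threshold))
```

With several statistics, one option is to fit one estimator per statistic and combine the resulting log-densities into a single typicality score.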

One of the key advantages of the DoSE method is its model-agnostic nature, meaning it can be applied to any differentiable model regardless of its architecture or training procedure. This flexibility allows researchers and practitioners to incorporate DoSE into their existing workflows without significant modifications to their models or training regimes. Additionally, because the method relies on model statistics rather than raw input data, it can be applied under various types of distribution shift, including covariate shifts, where the input distribution changes while the conditional label distribution remains constant, and semantic shifts, where test inputs belong to classes that were not present during training.

However, despite its merits, the DoSE method also faces certain challenges. One notable challenge is the computational cost associated with estimating densities in high-dimensional spaces. As the dimensionality of the model statistics increases, the curse of dimensionality can lead to a significant increase in computational requirements and a decrease in the accuracy of density estimates. To mitigate this issue, dimensionality reduction techniques such as principal component analysis (PCA) can be employed to project the high-dimensional statistics into lower-dimensional spaces before density estimation. t-distributed stochastic neighbor embedding (t-SNE) is sometimes mentioned for the same purpose, but standard t-SNE does not provide a mapping for unseen test points and is therefore better suited to visualization than to test-time scoring. Another challenge lies in the selection of appropriate statistics for density estimation. While some statistics may provide clear separation between in-distribution and out-of-distribution data, others may be less discriminative or even misleading.

Moreover, the effectiveness of DoSE heavily depends on the choice of nonparametric density estimator. Different estimators may yield varying levels of performance, and the optimal choice often depends on the specific characteristics of the data and the underlying distribution. Kernel density estimation, for example, requires careful selection of the bandwidth parameter, which controls the degree of smoothing in the density estimate. Too much smoothing can result in a loss of important details, while too little smoothing can lead to overfitting and noisy density estimates.
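
As a starting point for the bandwidth, a widely used heuristic for one-dimensional Gaussian kernels is Silverman's rule of thumb, \(h = 0.9 \min(\hat{\sigma}, \mathrm{IQR}/1.34)\, n^{-1/5}\). The sketch below implements it with the standard library; it is offered as a reasonable default under a roughly unimodal assumption, not as the choice made by any method discussed here, and the sample values are hypothetical.

```python
from statistics import quantiles, stdev


def silverman_bandwidth(samples):
    """Silverman's rule of thumb for a 1-D Gaussian kernel."""
    n = len(samples)
    q1, _, q3 = quantiles(samples, n=4)  # quartiles of the sample
    # Use the smaller of the two spread estimates so that heavy tails
    # do not inflate the bandwidth.
    spread = min(stdev(samples), (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)


stats = [i * 0.1 for i in range(-20, 21)]  # hypothetical ID statistics
print(silverman_bandwidth(stats))
```

The bandwidth shrinks as the number of samples grows, reflecting that more data supports a finer-grained estimate. For multimodal statistics the rule tends to oversmooth, and cross-validated likelihood is a common alternative.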

In recent years, researchers have explored various enhancements to the basic DoSE framework to improve its performance and robustness. For instance, some approaches integrate uncertainty quantification techniques to provide confidence measures for OOD predictions, thereby enabling more informed decision-making. Others leverage adversarial training to generate more diverse and representative in-distribution statistics, which can enhance the robustness of density estimates against out-of-distribution samples.

Despite these challenges and ongoing efforts to improve the method, DoSE remains a promising approach for unsupervised OOD detection. Its model-agnostic nature, combined with its ability to utilize model statistics for density estimation, makes it a valuable tool in a variety of application domains. The likelihood-ratio-based approaches discussed next take a more structured route, comparing likelihoods under two competing models rather than relying on a single density estimate.

### 4.3 Likelihood-Ratio-Based Approaches

Likelihood-ratio-based approaches represent a significant advancement in the realm of out-of-distribution (OOD) detection, offering a solution to the limitations inherent in traditional density-based methods. These approaches are fundamentally rooted in the principle of comparing the likelihood of an observation under two competing hypotheses: the in-distribution (ID) hypothesis and the out-of-distribution (OOD) hypothesis. By leveraging the ratio of these likelihoods, likelihood-ratio-based methods provide a quantitative assessment of whether an observation originates from the ID distribution or deviates significantly enough to be classified as OOD.

Building upon the principles of density estimation methods like DoSE, which measure the typicality of an observation through density estimates, likelihood-ratio-based approaches offer a more nuanced and robust mechanism for detecting OOD samples. Unlike density-based methods, which often suffer from sensitivity to model calibration and the assumption that the OOD distribution is completely unknown, likelihood-ratio-based methods operate under the premise that even partial knowledge of the OOD distribution can be effectively utilized. This makes them particularly advantageous in scenarios where the OOD distribution is partially understood but not fully characterized.

A broad reference point for likelihood-ratio-based OOD detection is "Generalized Out-of-Distribution Detection: A Survey" [1]. That work provides a comprehensive overview of the current landscape of OOD detection methodologies, highlighting the advantages of likelihood-ratio-based approaches in capturing subtle distributional differences that are otherwise missed by simpler density-based approaches.

Another notable aspect of likelihood-ratio-based approaches is their flexibility and adaptability to different scenarios. Unlike density-based methods, which often require careful calibration and can be sensitive to model parameters, likelihood-ratio-based methods can be adapted to various settings by adjusting the form of the likelihood functions used. This adaptability is crucial in real-world applications where the nature of the OOD data can vary widely. For instance, in the context of autonomous driving, likelihood-ratio-based methods could be tailored to detect specific types of OOD inputs, such as unusual road conditions or unexpected obstacles, by incorporating domain-specific likelihood models.

The performance of likelihood-ratio-based methods in OOD detection has been extensively evaluated across multiple benchmarks and datasets. One of the key findings from these evaluations is that likelihood-ratio-based approaches tend to maintain robust performance even when the OOD distribution exhibits significant covariate shifts. This resilience is attributed to the inherent nature of likelihood ratios in capturing the relative probabilities of observations under different hypotheses, thereby providing a more stable and reliable metric for OOD detection. Additionally, likelihood-ratio-based methods have shown promise in scenarios where the OOD data is characterized by subtle semantic shifts, as highlighted in the paper "Towards Effective Semantic OOD Detection in Unseen Domains: A Domain Generalization Perspective". The authors demonstrate that by incorporating domain generalization regularization alongside OOD detection regularization, likelihood-ratio-based methods can effectively address both covariate and semantic shifts, leading to improved OOD detection accuracy.

Moreover, likelihood-ratio-based approaches have been found to be particularly effective in settings where the underlying models are pre-trained on large datasets and fine-tuned on smaller ID datasets. In such scenarios, the likelihood ratios can serve as a robust indicator of OOD samples, even when the fine-tuned model is overconfident due to its exposure to a large number of ID examples. This phenomenon is well-documented in the "Large Class Separation is not what you need for Relational Reasoning-based OOD Detection" paper, which highlights the limitations of relying solely on inter-class feature distances for OOD detection. By leveraging likelihood ratios, these methods can provide a more reliable assessment of OOD samples, thereby enhancing the overall reliability of machine learning systems in safety-critical applications.

In summary, likelihood-ratio-based approaches offer a compelling alternative to traditional density-based methods for OOD detection. Their ability to capture nuanced distributional differences, coupled with their flexibility and robustness, positions them as a promising direction for advancing OOD detection research. As the field continues to evolve, likelihood-ratio-based methods are likely to play a pivotal role in addressing the ongoing challenges in OOD detection, particularly in scenarios characterized by complex and dynamic distributional shifts.

### 4.4 Energy-Based Models (EBMs)

Energy-based models (EBMs) represent a distinct paradigm for out-of-distribution (OOD) detection, characterized by their ability to define a probability distribution implicitly through an energy function rather than explicitly specifying a probability density. This approach enables EBMs to capture complex data distributions that are challenging to model with traditional density estimation methods [9]. The energy function, \(E(x)\), is typically designed such that lower values indicate higher likelihoods for in-distribution (ID) samples, while higher values suggest the presence of out-of-distribution (OOD) samples. The primary goal in OOD detection using EBMs is to identify data points with energy levels that exceed those typically observed for ID data, indicating potential OOD instances.
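To make the scoring rule concrete, the sketch below uses one common instantiation that the text above does not prescribe: the energy is read off classifier logits as \(E(x) = -T \log \sum_k \exp(f_k(x)/T)\). The function names, the temperature, the threshold, and the example logits are illustrative assumptions rather than part of any specific method surveyed here.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * logsumexp(logits / T), computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def flag_ood(logits, threshold, temperature=1.0):
    """Flag samples whose energy exceeds the threshold (higher energy = more OOD-like)."""
    return energy_score(logits, temperature) > threshold

# Hypothetical logits: one confident prediction and one flat, uncertain prediction.
confident = np.array([[10.0, 0.0, 0.0]])
flat = np.array([[0.1, 0.0, -0.1]])

print(energy_score(confident), energy_score(flat))
print(flag_ood(confident, threshold=-5.0), flag_ood(flat, threshold=-5.0))
```

In practice the threshold would be chosen on held-out ID data, for example so that a fixed fraction of ID samples falls below it.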

One significant advantage of EBMs is their flexibility in integrating supervision and leveraging architectural design choices to enhance OOD detection performance. Unlike traditional density-based methods, which rely on explicit likelihood calculations, EBMs can be trained in various paradigms, including unsupervised, semi-supervised, and fully supervised. Unsupervised training optimizes the energy function using only unlabeled data, making it especially useful when labeled OOD data is scarce [9].

In contrast, supervised training leverages labeled data to guide the optimization of the energy function directly, leading to more precise energy landscapes capable of distinguishing between ID and OOD samples. Semi-supervised methods use a combination of labeled and unlabeled data, striking a balance between capturing the data distribution and refining the model’s decision boundaries. By incorporating supervision, EBMs can achieve higher precision in detecting OOD samples, particularly in scenarios with complex and multi-modal distributions [9].

Moreover, the architecture of EBMs plays a critical role in their OOD detection performance. Choices such as the network depth, activation functions, and regularization techniques can significantly influence the model’s generalization and representation of the data distribution. Deeper architectures may capture more detailed patterns but risk overfitting without proper regularization, while shallower architectures might generalize better but miss finer distinctions crucial for OOD detection [9].

The type of supervision and the nature of the training data also heavily influence EBMs' performance. In fully supervised settings, labeled OOD data enhances the model’s capability to distinguish ID from OOD samples. However, in real-world applications where acquiring labeled OOD data is impractical due to the diversity of possible OOD scenarios, semi-supervised or unsupervised methods become more feasible, though at the cost of potentially reduced precision. The size and diversity of the training data further impact the model's generalization, with larger and more varied datasets generally yielding better performance [9].

The choice of the energy function is equally crucial. Options range from simple linear combinations of feature activations to complex non-linear functions that incorporate feature interactions. The complexity of the energy function affects the model’s capacity to capture nuanced differences between ID and OOD samples. Simpler functions might struggle with structured data, while more complex ones risk overfitting to noise and failing to generalize to unseen OOD samples. Striking a balance between complexity and generalization is essential for effective OOD detection [9].

Compared to traditional density-based methods, EBMs offer several advantages in OOD detection. Density-based methods often assume specific parametric forms for the data distribution, which may not hold for complex, high-dimensional data. EBMs, however, do not impose such constraints and can model a wide range of distributions, including multimodal and heavy-tailed ones. Additionally, EBMs can utilize contrastive learning or other unsupervised methods for training, making them more adaptable to scenarios with limited or no labeled data [9].

Despite these advantages, EBMs also encounter certain limitations. Training EBMs, and evaluating normalized likelihoods under them, can be computationally intensive because the normalizing constant is intractable and is typically handled with sampling-based procedures such as Markov chain Monte Carlo. Furthermore, EBMs may be less interpretable than density-based models, complicating the understanding of their decision-making processes [11].

In conclusion, energy-based models present a promising avenue for OOD detection, especially in contexts where traditional density-based methods prove insufficient. Through flexible energy functions and strategic integration of supervision and architectural design, EBMs can deliver robust OOD detection performance. However, careful consideration of model complexity, training data, and computational requirements is essential for maximizing their effectiveness. Future research should focus on developing innovative architectures and training strategies for EBMs to enhance their generalizability and efficiency in detecting out-of-distribution samples [6].

### 4.5 Density Ratio Estimation Methods

Density ratio estimation methods represent a significant departure from traditional density-based approaches in the realm of out-of-distribution (OOD) detection. Unlike density-based methods, which estimate the probability density function (PDF) of the data directly, density ratio estimation methods focus on estimating the ratio between the densities of in-distribution (ID) and out-of-distribution (OOD) data points. This alternative approach offers a powerful framework for unifying various density ratio-based methods under a single, coherent structure, enhancing the robustness and effectiveness of OOD detection.

At the core of density ratio estimation lies the idea that direct estimation of densities can be challenging and computationally expensive, especially in high-dimensional spaces. Instead, estimating the ratio between densities of ID and OOD data provides a more tractable approach that can be achieved using various techniques such as logistic regression, k-nearest neighbor (k-NN), and neural networks [22]. These methods offer a way to discriminate between ID and OOD data without explicitly modeling the underlying PDFs, making them more scalable and adaptable to a variety of datasets.
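A minimal sketch of the classifier-based route, assuming a reference (proxy OOD) sample is available, which the discussion above does not guarantee: a logistic regression trained to separate ID from reference points yields log-odds that, after correcting for the class sizes, estimate \(\log p_{\text{ID}}(x) / p_{\text{ref}}(x)\). The synthetic Gaussians, function names, and hyperparameters are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def log_density_ratio(X, w, b, n_id, n_ref):
    """Estimate log p_id(x) / p_ref(x) as classifier log-odds minus the log class prior ratio."""
    return X @ w + b - np.log(n_id / n_ref)

rng = np.random.default_rng(0)
x_id = rng.normal(0.0, 1.0, size=(500, 2))    # synthetic ID sample
x_ref = rng.normal(3.0, 1.0, size=(500, 2))   # synthetic reference sample
X = np.vstack([x_id, x_ref])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = ID, 0 = reference

w, b = fit_logistic(X, y)
queries = np.array([[0.0, 0.0], [3.0, 3.0]])
scores = log_density_ratio(queries, w, b, n_id=500, n_ref=500)
print(scores)  # positive values favor ID, negative values favor the reference sample
```

The quality of the estimate depends heavily on how well the reference sample represents the OOD inputs encountered at test time.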

One of the key benefits of density ratio estimation methods is their flexibility. They can be combined with a wide range of OOD detection approaches, including classification-based, distance-based, and hybrid methods. For instance, in a classification-based approach, the density ratio can serve as a discriminative signal to differentiate between known and unknown classes, enhancing the decision-making process of a classifier. Similarly, density ratio estimation can be combined with distance-based methods, where a threshold on the estimated ratio separates ID and OOD samples, leveraging the strengths of both methods to achieve more robust OOD detection.

Moreover, density ratio estimation methods can be seamlessly integrated into existing machine learning pipelines, making them highly compatible with a wide range of applications. For example, in natural language processing (NLP), density ratio estimation can be used to detect OOD text inputs, improving the reliability of language models in real-world scenarios [3]. In computer vision, these methods can identify OOD images or video frames, contributing to safer and more reliable visual recognition systems. This versatility extends to other domains, including medical imaging, autonomous driving, and cybersecurity, highlighting the broad applicability of density ratio estimation.

Another significant advantage of density ratio estimation is its ability to handle complex distributions and high-dimensional data. Traditional density-based methods often struggle with high-dimensional data due to the curse of dimensionality, where the volume of the space increases so fast that the available data become sparse. This sparsity can lead to inaccurate density estimates and unreliable OOD detection results. In contrast, density ratio estimation methods are less affected by the curse of dimensionality as they do not require explicit density estimation. This characteristic makes them particularly suitable for modern datasets characterized by high dimensionality and complex structures.

Additionally, because these methods share a common quantity, the density ratio, they can be compared and integrated within a single framework, promoting the development of more comprehensive and robust OOD detection systems. The OpenOOD benchmark framework [6], for instance, includes numerous methods relying on density ratio estimation, fostering collaboration and innovation within the OOD detection community.

However, density ratio estimation methods also face certain challenges. Selecting appropriate methods for estimating density ratios is critical, as different techniques like logistic regression, k-NN, and neural networks can yield varying levels of accuracy based on data characteristics. Therefore, careful consideration and experimentation are needed to choose the most suitable method. Furthermore, the performance of density ratio estimation methods can be sensitive to hyperparameter choices, such as the number of nearest neighbors in k-NN or the architecture of the neural network, necessitating thorough tuning for optimal results.

Another challenge is the computational cost associated with density ratio estimation, particularly in large-scale and high-dimensional datasets. Estimating density ratios can be computationally intensive, requiring significant resources in terms of time and storage. However, ongoing research focuses on developing more efficient algorithms and parallel computing strategies to address these computational burdens.

In summary, density ratio estimation methods provide a valuable alternative to traditional density-based approaches in OOD detection. By focusing on density ratios rather than absolute densities, these methods offer a scalable and adaptable solution to the challenges posed by high-dimensional and complex data, contributing to enhanced robustness and effectiveness in OOD detection systems.

### 4.6 Contrastive Learning for Multi-Modal OOD Detection

Contrastive learning has emerged as a powerful technique in machine learning, enabling models to learn representations that capture the essential characteristics of in-distribution (ID) data while distinguishing it from out-of-distribution (OOD) data. Building on its success in unimodal datasets, such as images, contrastive learning now offers unique opportunities in multi-modal OOD detection. Specifically, WOOD, a weakly-supervised framework combining a binary classifier with a contrastive learning component, provides a robust solution for detecting OOD data across multiple modalities. This subsection explores the application of contrastive learning in multi-modal OOD detection, detailing the architecture and performance of the WOOD framework.

Firstly, understanding the foundational concepts behind contrastive learning is crucial. Contrastive learning maximizes the agreement between positive pairs of representations while minimizing the agreement between negative pairs. In the context of OOD detection, this means learning representations where ID data is closely aligned and distinctly separated from OOD data. The WOOD framework leverages this principle by constructing a multi-modal embedding space where the representations of ID data are tightly clustered, making it easier to identify outliers that do not conform to these clusters.

The WOOD framework comprises two main components: a binary classifier and a contrastive learning component. The binary classifier distinguishes between ID and OOD data, while the contrastive learning component enhances the model’s discriminative power by ensuring that ID representations are close together and OOD representations are far apart. The binary classifier is trained using a combination of labeled ID data and unlabeled OOD data, benefiting from the weak supervision provided by the contrastive learning component. This dual-component design effectively utilizes the strengths of both supervised and unsupervised learning paradigms.

Designed to handle the complexity of multi-modal data, the WOOD framework integrates information from various modalities to create a unified representation. This is accomplished through cross-modal interactions that align representations from different modalities, such as images and text, in a shared embedding space. The contrastive learning component ensures consistency across modalities, thereby enhancing the model's ability to detect OOD data that do not adhere to the learned in-distribution patterns.

One of the key advantages of the WOOD framework is its capacity to handle diverse types of multi-modal data. Unlike traditional methods that often rely on single-modal representations, WOOD captures the interdependencies between different modalities, leading to more accurate and robust OOD detection. For instance, in autonomous driving, where multi-modal data from cameras, lidar, and radar are combined, WOOD can effectively identify anomalies arising from sensor failures or unexpected environmental changes. By integrating information from multiple sources, WOOD provides a comprehensive view of the data, improving the reliability of OOD detection.

Furthermore, the WOOD framework employs a hinge loss mechanism to enforce the separation between ID and OOD samples, enhancing its ability to distinguish between the two categories. This constraint ensures that the embeddings of OOD data remain distant from those of ID data, even in the presence of complex and varied distributions. The hinge loss acts as a regularizer, promoting the formation of a clear boundary between ID and OOD data in the embedding space. This not only improves the performance of the binary classifier but also enhances the interpretability of the OOD detection process.
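The exact loss used by WOOD is not given in this section, so the following is only a generic margin-based sketch of the idea: paired embeddings from two modalities are pulled together for ID samples, while OOD pairs are penalized only when their cosine similarity rises above a margin. The function names, the `margin` value, and the toy embeddings are assumptions made for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def hinge_separation_loss(emb_a, emb_b, is_ood, margin=0.5):
    """ID pairs: 1 - similarity. OOD pairs: hinge on similarity above the margin."""
    sim = cosine_sim(emb_a, emb_b)
    id_loss = 1.0 - sim
    ood_loss = np.maximum(0.0, sim - margin)
    return np.where(is_ood, ood_loss, id_loss).mean()

# Toy paired embeddings (e.g., image and text): first pair aligned, second orthogonal.
img = np.array([[1.0, 0.0], [0.0, 1.0]])
txt = np.array([[1.0, 0.0], [1.0, 0.0]])

loss_separated = hinge_separation_loss(img, txt, np.array([False, True]))
loss_violated = hinge_separation_loss(img, txt, np.array([True, True]))
print(loss_separated, loss_violated)
```

The hinge leaves already well-separated OOD pairs untouched, so gradient signal concentrates on the pairs that violate the margin.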

To evaluate the performance of the WOOD framework, extensive experiments were conducted on various multi-modal datasets, including sensor data from autonomous vehicles. Results indicated that WOOD outperformed existing methods in terms of both accuracy and robustness, achieving higher AUROC scores and lower false positive rates. These findings underscore WOOD’s superior ability to accurately identify OOD data and its resilience against distribution shifts, making it suitable for real-world applications with variable data distributions.

In summary, the application of contrastive learning in multi-modal OOD detection, exemplified by the WOOD framework, marks a significant advancement in the field. By integrating the strengths of binary classifiers and contrastive learning components, WOOD offers a robust and flexible solution for detecting OOD data across multiple modalities. Its ability to handle diverse types of data and incorporate intermodal dependencies positions it well for complex real-world scenarios, enhancing the robustness and reliability of OOD detection systems.

### 4.7 Reconstruction-Based Methods

Reconstruction-based methods represent a significant approach to out-of-distribution (OOD) detection, leveraging the principle that out-of-distribution data typically fails to reconstruct accurately when processed through a generative model trained on in-distribution data. This failure mode can be harnessed to identify OOD samples, as their reconstruction quality is expected to be significantly worse compared to in-distribution samples [1].
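The principle can be illustrated with a deliberately simple stand-in for a generative model: a linear autoencoder obtained from PCA, fitted on ID data only. This is an assumption chosen for brevity, not one of the models discussed in this survey; the synthetic data, the number of components, and the 95% threshold are likewise hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the training mean and the top principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Squared error between each input and its projection onto the learned subspace."""
    codes = (X - mean) @ components.T
    X_hat = codes @ components + mean
    return ((X - X_hat) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 10))  # ID data lies near a 2-D subspace of R^10
x_train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))
mean, comps = fit_pca(x_train, n_components=2)

x_id = rng.normal(size=(100, 2)) @ basis   # new samples from the same subspace
x_ood = 3.0 * rng.normal(size=(100, 10))   # samples that ignore the subspace

# Threshold chosen as the 95th percentile of training reconstruction errors.
threshold = np.quantile(reconstruction_error(x_train, mean, comps), 0.95)
ood_flags = reconstruction_error(x_ood, mean, comps) > threshold
print(ood_flags.mean())
```

Deep autoencoders and masked-reconstruction models replace the linear projection with a learned non-linear one, but the scoring step (compare reconstruction error to a threshold set on ID data) has the same shape.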

One notable variant of reconstruction-based methods involves the use of masked image modeling (MIM) as a pretext task for learning comprehensive in-distribution (ID) representations. Originally developed for pre-training vision transformers [23; 24], MIM has been adapted to enhance the robustness of OOD detection models. By masking certain parts of the input image and training the model to reconstruct these missing regions, MIM encourages the model to learn a rich representation of the input data, capturing both local and global features.

The essence of MIM lies in its ability to train a model on a large corpus of unlabeled data, facilitating the acquisition of generalizable features. This is particularly advantageous in OOD detection, as it ensures that the learned representations are not overly specialized to the specific characteristics of the training data. Instead, the model focuses on essential structural components of the input data, which remain consistent across different distributions. Consequently, when presented with OOD data, the model struggles to reconstruct these samples effectively, indicating a mismatch with the learned ID representations.

Several studies have investigated the application of MIM in OOD detection. For instance, [25] examines the utility of MIM in conjunction with cross-modal anomaly detection. Although the primary focus is on detecting anomalies across different modalities, this work highlights MIM's potential in enhancing the robustness of models against distributional shifts. Similarly, [26] employs MIM to improve the generalization capabilities of face detection models, illustrating how this technique mitigates the impact of distributional shifts on model performance.

In the context of OOD detection, MIM can be seamlessly integrated into various models, including convolutional neural networks (CNNs) and transformers. The adaptability of MIM allows it to fit different model architectures, enabling a smooth incorporation into existing OOD detection pipelines. For example, a CNN can be pre-trained using MIM by masking random patches of the input image and training the network to reconstruct these masked regions. This process enriches the feature extraction capabilities of the CNN and equips it with the ability to discern subtle variations in input data, which is crucial for accurate OOD detection.
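The masking step described above can be sketched as follows. The patch size, mask ratio, and function names are hypothetical, and the "reconstruction" here is just the masked input itself, standing in for the output of a network that is not shown.

```python
import numpy as np

def random_patch_mask(height, width, patch, mask_ratio, rng):
    """Boolean pixel mask (True = hidden) built from randomly chosen square patches."""
    grid_h, grid_w = height // patch, width // patch
    n_mask = int(round(mask_ratio * grid_h * grid_w))
    flat = np.zeros(grid_h * grid_w, dtype=bool)
    flat[rng.choice(grid_h * grid_w, size=n_mask, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)
    # Expand each grid cell to a patch x patch block of pixels.
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)

def masked_mse(image, reconstruction, mask):
    """Mean squared error over masked pixels only (assumes at least one masked pixel)."""
    return ((image - reconstruction) ** 2)[mask].mean()

rng = np.random.default_rng(0)
image = rng.random((32, 32))
mask = random_patch_mask(32, 32, patch=8, mask_ratio=0.75, rng=rng)
masked_input = np.where(mask, 0.0, image)  # what the network would receive

# With no network, the masked input serves as a trivial "reconstruction".
print(mask.sum(), masked_mse(image, masked_input, mask))
```

Restricting the loss to masked pixels forces the model to infer hidden content from visible context rather than copy its input.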

Furthermore, MIM enhances the model's generalization to unseen data. By learning from a broad range of data through MIM, the model becomes better equipped to handle unexpected inputs during inference. This is particularly relevant in real-world applications where models must operate in dynamic environments characterized by continuous distributional shifts. The robustness imparted by MIM can be evaluated using metrics such as reconstruction error, which quantifies the discrepancy between original and reconstructed inputs.

Recent advancements in OOD detection have led to the development of sophisticated techniques that build upon the principles of MIM. For example, [27] proposes a method that combines MIM with localized detection tasks, enhancing the efficiency of detecting grouped instances in low-resource scenarios. This approach not only improves computational efficiency but also boosts the model's capability to manage large and heterogeneous datasets. Additionally, MIM's application in multi-modal OOD detection has been explored [1], underscoring its versatility in handling diverse data types and distributional shifts.

Despite these advantages, several challenges need addressing. Training large models using MIM incurs substantial computational costs, requiring significant resources that may be prohibitive for many applications. Moreover, the selection of a masking strategy critically affects model performance, necessitating careful experimentation to determine optimal configurations. Another challenge is the interpretability of learned representations; while MIM facilitates rich and generalizable feature acquisition, understanding these representations remains complex. This lack of interpretability can impede the deployment of MIM-based OOD detection models in safety-critical applications where transparency and explainability are essential. Overcoming these challenges requires developing more efficient training methods and enhancing the interpretability of learned representations.

In summary, reconstruction-based methods, especially those incorporating MIM as a pretext task, provide a potent approach to OOD detection. By fostering the learning of comprehensive ID representations, MIM equips models with the necessary tools to accurately identify and manage distributional shifts. As the field advances, further exploration of MIM and its integration into OOD detection pipelines promises to enhance the robustness and reliability of machine learning systems in dynamic and challenging environments.

## 5 Advanced Techniques and Statistical Foundations

### 5.1 Model-Agnostic Methods Based on Statistical Tests

Model-agnostic methods for OOD detection that leverage statistical tests represent a powerful and versatile approach to enhancing the accuracy of OOD detection across various differentiable generative models. These methods focus on the underlying statistical properties of the data and the model’s output, rather than specific model architectures, making them broadly applicable and adaptable. By integrating classical parametric tests with typicality tests, researchers aim to create a robust framework for OOD detection that effectively distinguishes between in-distribution (ID) and out-of-distribution (OOD) data.

Hypothesis testing, a cornerstone of statistical analysis, forms the basis of many of these approaches. Parametric tests assume a specific form of the underlying distribution, typically requiring strong assumptions about the data. In contrast, typicality tests assess whether a given data sample aligns with the behavior of the training data, without presupposing a particular distributional form. Combining these approaches helps mitigate the limitations of individual methods, thus enhancing the overall effectiveness of OOD detection.

A significant contribution in this area is outlined in "Towards Rigorous Design of OoD Detectors" [15], which emphasizes the necessity of a rigorous design methodology for OOD detectors that goes beyond mere performance metrics like expected calibration error. The authors advocate for a more systematic and scientifically grounded approach to ensure safety claims, highlighting the importance of theoretical foundations in developing robust OOD detection methods.

Typicality tests, such as the Kolmogorov-Smirnov (KS) test and the Anderson-Darling (AD) test, play a central role in evaluating the conformity of a sample with the training data distribution. The KS test compares the empirical cumulative distribution function (CDF) of the test sample against that of the training data through the largest gap between the two, while the AD test gives more weight to the tails of the distribution and is therefore more sensitive to differences there. These tests can be complemented by tests against an assumed parametric form, such as the chi-square goodness-of-fit test, which compares observed and expected frequencies under the null hypothesis.
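As an illustration of the KS comparison, the sketch below computes the two-sample KS statistic on one-dimensional scores (for example, per-sample log-likelihoods from a generative model) and compares it with the standard large-sample critical value. Applying the test to a batch of scalar scores, the synthetic data, and the function names are assumptions of this example.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = np.sort(sample_a)
    b = np.sort(sample_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def ks_reject(sample_a, sample_b, alpha=0.05):
    """Reject 'same distribution' using the asymptotic two-sample critical value."""
    n, m = len(sample_a), len(sample_b)
    critical = np.sqrt(-0.5 * np.log(alpha / 2.0)) * np.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 1000)    # scores of training (ID) data
shifted_scores = rng.normal(2.0, 1.0, 200)   # scores of a shifted test batch

print(ks_statistic(train_scores, shifted_scores), ks_reject(train_scores, shifted_scores))
```

Because the test operates on batches, it answers whether a group of inputs is atypical; scoring a single input requires a per-sample criterion instead.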

This integrated approach allows for a layered assessment of a data sample’s likelihood of belonging to the training distribution. First, a parametric test checks whether the sample is consistent with a specific distributional form; a typicality test then confirms overall consistency with the training data. If either test indicates a discrepancy, the sample is flagged as OOD. This multi-faceted evaluation ensures that both the assumed distributional form and the empirical behavior of the sample are considered.

Additionally, these statistical tests extend beyond generative models to classification models. In this context, typicality can be evaluated based on the model’s decision boundary and the distances of the sample to known classes. For instance, the Mahalanobis distance, which accounts for the covariance structure of the data, can serve as a typicality measure. Applying a typicality test to Mahalanobis distances enables determination of whether a sample falls within the expected range for ID data.
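A minimal sketch of a Mahalanobis-based typicality check, assuming class-conditional Gaussians with a shared covariance fitted on ID features and a threshold taken from an empirical quantile of the training distances. The synthetic features, the 99% quantile, and the function names are hypothetical choices.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Per-class means and the pseudo-inverse of a shared (tied) covariance matrix."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)

def min_mahalanobis_sq(x, means, precision):
    """Squared Mahalanobis distance from each input to its closest class mean."""
    diff = x[:, None, :] - means[None, :, :]  # shape (n, classes, dim)
    d2 = np.einsum("nkd,de,nke->nk", diff, precision, diff)
    return d2.min(axis=1)

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
labels = np.repeat([0, 1], 200)
means, precision = fit_class_gaussians(feats, labels)

# Typicality threshold: 99th percentile of distances observed on the ID training set.
threshold = np.quantile(min_mahalanobis_sq(feats, means, precision), 0.99)
queries = np.array([[0.0, 0.0], [20.0, -20.0]])
is_ood = min_mahalanobis_sq(queries, means, precision) > threshold
print(is_ood)
```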

Non-parametric resampling techniques, which do not require specifying the distributional form, offer another valuable approach. Permutation tests and bootstrap methods generate reference distributions for a test statistic directly from the data, providing a framework for OOD detection that is less reliant on distributional assumptions. Integrating these methods with typicality tests further enhances the reliability of OOD detection.
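The permutation idea can be sketched for a simple statistic, the absolute difference in mean score between a reference set and a test batch. The choice of statistic, the number of permutations, and the synthetic scores are assumptions of this example.

```python
import numpy as np

def permutation_p_value(reference, batch, n_perm=2000, seed=0):
    """Permutation p-value for the absolute difference in means between two samples."""
    rng = np.random.default_rng(seed)
    observed = abs(reference.mean() - batch.mean())
    pooled = np.concatenate([reference, batch])
    n = len(reference)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(perm[:n].mean() - perm[n:].mean()) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
reference_scores = rng.normal(0.0, 1.0, 300)  # scores of held-out ID data
shifted_batch = rng.normal(1.5, 1.0, 50)      # scores of a shifted test batch

p_shifted = permutation_p_value(reference_scores, shifted_batch)
print(p_shifted)  # a small p-value suggests the batch differs from the reference
```

The reference distribution is built by reshuffling the pooled scores, so no parametric form for the scores has to be assumed.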

Moreover, insights from the Unleashing Mask technique [21] can refine the application of statistical tests. This technique identifies an intermediate stage of a model trained on ID data that exhibits superior OOD detection performance compared to the final stage. Leveraging these insights, one can better capture OOD characteristics by isolating and analyzing atypical samples contributing to poor OOD detection, allowing for targeted adjustments to the statistical tests.

In summary, model-agnostic methods for OOD detection using statistical tests provide a promising path for improving the accuracy and reliability of OOD detection across various models. By combining parametric and typicality tests, these methods offer a comprehensive evaluation of a data sample’s conformity with the training distribution, without being constrained by specific model architectures. This approach not only addresses individual method limitations but also provides a theoretically grounded framework for designing robust OOD detectors. As the field progresses, continued research is essential to fully exploit the potential of these methods and integrate them into advanced, adaptive OOD detection systems.

### 5.2 Cosine Similarity-Based Detection

Class Typical Matching (CTM) represents a notable advancement in the realm of post hoc out-of-distribution (OOD) detection methodologies, leveraging cosine similarity to gauge whether a given test feature aligns with the typical representation of in-distribution (ID) classes. CTM enhances the robustness and accuracy of OOD detection systems by significantly reducing false positive rates, thereby improving the overall performance and reliability of machine learning models in real-world applications [1].

At the core of CTM is the principle of measuring the cosine similarity between the feature representation of a test sample and the typical features of ID classes. Cosine similarity, the cosine of the angle between two non-zero vectors, captures the directional relationship between vectors independent of their magnitudes. In OOD detection, this property enables CTM to determine whether a test sample falls within the typical span of ID data based on the direction of its feature vector, rather than relying on magnitude-based measures [1]. By focusing on the alignment of test features with typical ID features, CTM provides a nuanced and reliable criterion for OOD detection.

Implementation of CTM involves training a machine learning model on the ID dataset to obtain a set of learned feature representations for each class. During the testing phase, for each incoming test sample, CTM calculates the cosine similarity between the test sample's feature vector and the typical feature vectors of each ID class. If the cosine similarity between the test sample and a specific ID class exceeds a predefined threshold, the sample is classified as belonging to that ID class. Otherwise, the sample is flagged as OOD. This mechanism ensures that only samples closely resembling typical ID feature patterns are classified as ID, while others are recognized as potential OOD candidates [1].
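The matching rule described above can be sketched as follows. How CTM defines a class's typical feature is not specified in this section, so the class-mean feature vector is used here as an assumption; the threshold value, the label `-1` for OOD, and the toy features are also hypothetical.

```python
import numpy as np

def class_prototypes(features, labels):
    """Assumed 'typical' feature per class: the mean feature vector of that class."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def match_by_cosine(x, classes, protos, threshold):
    """Assign the best-aligned class if its cosine similarity exceeds the threshold, else -1."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = xn @ pn.T
    best = sims.argmax(axis=1)
    score = sims.max(axis=1)
    return np.where(score > threshold, classes[best], -1), score

rng = np.random.default_rng(0)
feats = np.vstack([
    np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(100, 3)),  # class 0
    np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(100, 3)),  # class 1
])
labels = np.repeat([0, 1], 100)
classes, protos = class_prototypes(feats, labels)

# First query points along class 0 (with a larger norm); second points elsewhere.
queries = np.array([[5.0, 0.1, 0.0], [0.0, 0.0, 3.0]])
pred, score = match_by_cosine(queries, classes, protos, threshold=0.8)
print(pred, score)
```

The first query is accepted despite its larger norm, which illustrates the scale invariance of the cosine criterion. Zero-norm feature vectors are not handled in this sketch.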

One of CTM's key strengths is its ability to drastically lower false positive rates, a crucial aspect of effective OOD detection. Traditional methods often misclassify low-confidence ID samples as OOD, leading to high false positive rates. CTM addresses this issue by focusing on alignment rather than magnitude or likelihood scores. Under this alignment-based approach, an uncertain sample is flagged as OOD only when it fails to align closely with typical ID patterns [1]. Thus, CTM improves the precision of OOD detection by minimizing false positives and enhancing the overall reliability of the system.

Compared to existing OOD detection methods, particularly those relying on confidence-based or density-based scores, CTM is reported to perform better. Methods like maximum softmax probability, which depend on magnitude-based assessments, often yield higher false positive rates. In contrast, CTM's use of cosine similarity for alignment-based detection ensures that only aligned samples are classified as ID, reducing the chance of false positives and increasing OOD detection accuracy. Experimental evaluations across various benchmarks and datasets are reported to support this advantage, highlighting CTM's robustness and generalizability [1].

The effectiveness of CTM can be attributed to several factors. Firstly, its reliance on cosine similarity ensures that the decision-making process is invariant to the scale of feature vectors, focusing instead on their directional relationships. This is especially beneficial when feature vector magnitudes vary due to normalization or preprocessing. Secondly, CTM's post hoc nature enables seamless integration into existing machine learning pipelines without significant changes or retraining of the underlying models. This flexibility makes CTM suitable for a wide range of applications, from computer vision to natural language processing, where robust OOD detection is vital for model reliability and safety [1].

Furthermore, CTM's performance extends to dynamic and evolving environments where continuous monitoring and adaptation are necessary. Incremental updates or periodic retraining allow CTM to adapt to new data distributions, maintaining high accuracy and reliability even as data distributions change over time. By leveraging alignment-based detection, CTM ensures that machine learning systems remain accurate and reliable in varying conditions, addressing one of the main challenges in OOD detection [1].

In conclusion, Class Typical Matching (CTM) emerges as a pioneering method in post hoc OOD detection, offering a robust and reliable solution for reducing false positive rates and enhancing overall detection accuracy. Its reliance on cosine similarity for alignment-based detection, coupled with its post hoc nature and adaptability, positions CTM as a valuable tool for improving the reliability and safety of machine learning systems across diverse applications. As the field of OOD detection advances, CTM represents a significant stride toward overcoming the challenges of OOD identification and ensuring the robust operation of machine learning models in real-world settings [1].

### 5.3 Divergence-Based Indicators Using Deep Generative Models

The advent of deep generative models (DGMs) marks a significant milestone in the evolution of out-of-distribution (OOD) detection methodologies, offering a robust framework for quantifying and identifying discrepancies between in-distribution (ID) and OOD samples. One notable contribution in this area is the DOI (Divergence-based Indicators) framework, which introduces a novel approach to OOD detection by leveraging divergence-based indicators derived from DGMs. This framework signifies a shift away from traditional likelihood-based criteria towards a focus on the divergences between the data distribution and the model's learned representation. The DOI framework not only provides a principled way to evaluate OOD detection but also offers a method for fine-tuning generative models to enhance their OOD detection capabilities.

Central to the DOI framework is the utilization of deep generative models to capture the intrinsic structure of the data distribution. Unlike density-based approaches that rely on estimating the likelihood of samples under the learned distribution, the DOI framework employs divergence measures to quantify the dissimilarity between the true data distribution and the model's output. This divergence-based approach is advantageous in high-dimensional spaces where likelihood estimation can be challenging and computationally prohibitive. It enables a more nuanced understanding of the data distribution, thereby enhancing the model's ability to detect subtle distributional shifts.

A pivotal component of the DOI framework is the Single-shot Fine-tune (SSF) algorithm, which demonstrates substantial improvements in OOD detection performance. The SSF algorithm refines the generative model by fine-tuning it on a small subset of ID samples, aligning the model's parameters more closely with the true data distribution. This fine-tuning minimizes the divergence between the model's output and the actual data distribution, which in turn sharpens the model's ability to distinguish ID from OOD samples. By leveraging the representational power of DGMs, the SSF algorithm encourages the model to learn features that are sensitive to distributional shifts.

The DOI framework's departure from likelihood-based criteria underscores the importance of capturing the underlying structure of the data rather than solely focusing on probability estimates. Likelihood-based methods frequently encounter difficulties due to the curse of dimensionality, where the estimation of probabilities becomes increasingly complex with growing dimensions. The divergence-based approach of the DOI framework addresses these limitations, providing a more reliable and scalable alternative for OOD detection. Moreover, this shift emphasizes the model's generalization capability, a critical aspect of robust OOD detection.

Beyond traditional OOD detection, the DOI framework integrates divergence measures into a comprehensive evaluation framework. This integration enables systematic assessment of various OOD detection methods by quantifying the divergence between model predictions and the true data distribution. This standardized evaluation facilitates comparison among different OOD detection techniques and supports the development of enhanced methods that can better identify distributional shifts.

The reliance on divergence measures within the DOI framework opens new research avenues. The selection of divergence measures significantly impacts OOD detection performance, as different measures capture various aspects of the data distribution. Research into the effects of different divergence measures could uncover more robust and efficient OOD detection techniques. Additionally, integrating the DOI framework with other advanced techniques, such as model-agnostic statistical tests, could offer a more holistic evaluation of OOD detection performance. This combined approach would consider both the intrinsic data structure and the robustness of model predictions, fostering the creation of more reliable and adaptable OOD detection systems.

In conclusion, the DOI framework represents a significant advancement in OOD detection, particularly through its divergence-based indicators derived from deep generative models. The SSF algorithm within this framework exemplifies the potential of fine-tuning DGMs to improve OOD detection, marking a departure from likelihood-based criteria. By focusing on the intrinsic data structure and the representational power of DGMs, the DOI framework presents a more reliable and scalable approach to OOD detection. As the field progresses, the DOI framework stands as a valuable tool for developing robust and adaptable OOD detection methods.

### 5.4 Layer-Wise Score Aggregation for Textual OOD Detection

Layer-wise score aggregation for textual out-of-distribution (OOD) detection is a sophisticated approach that enhances overall detection accuracy by analyzing anomaly scores generated from different layers of a deep neural network. Recognizing that each layer within a neural network captures distinct characteristics of the input data, this method provides unique perspectives on the deviation from the in-distribution (ID) data. By strategically aggregating these scores, researchers can identify the most informative layers that best distinguish between ID and OOD samples, leading to more accurate detection outcomes.

A key challenge in textual OOD detection is identifying indicators that reliably signal deviation from the expected distribution. Traditional approaches often rely on global scores, which aggregate anomaly signals from all layers without distinguishing their relative contributions. This can reduce accuracy because layers differ in how informative they are. Early layers typically capture generic features shared across many inputs, while later layers capture more abstract, task-specific features that are often more useful for separating ID from OOD samples. Layer-wise score aggregation addresses this by weighting each layer's score according to its discriminative power.

Recent advancements in this area include a data-driven, unsupervised method for optimizing layer-wise score aggregation. Proposed in [18], this method does not require explicit labeling of OOD samples during training, making it more adaptable to scenarios where ground truth labels are scarce or unavailable. Instead, it leverages the inherent structure of the training data to infer which layers are most indicative of OOD samples. This process involves two steps: first, extracting anomaly scores from each layer of the model; second, employing a learning mechanism to determine the optimal weighting scheme for these scores.

The learning mechanism typically employs optimization strategies that maximize the separation between ID and OOD samples based on the aggregated scores. Strategies may involve maximizing the margin between the scores of ID and OOD samples or minimizing the overlap between their distributions. The optimization process iteratively adjusts the weights assigned to each layer's score until the best possible separation is achieved. This method allows for the discovery of layers that contain the most salient features for distinguishing between ID and OOD samples, thereby improving the robustness and generalization of the OOD detection system.
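The two steps (per-layer scoring, then weighted aggregation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [18]. The per-layer score is assumed to be the Euclidean distance to the layer's ID mean, the features are synthetic, and the weights are supplied by hand rather than learned. The names `layer_scores`, `aggregate`, and `gap` are hypothetical.

```python
import numpy as np

def layer_scores(features_per_layer, id_means):
    """Per-layer anomaly score: Euclidean distance to that layer's ID mean.
    Returns an array of shape (n_samples, n_layers)."""
    return np.stack(
        [np.linalg.norm(f - m, axis=1) for f, m in zip(features_per_layer, id_means)],
        axis=1,
    )

def aggregate(scores, ref_scores, weights):
    """Standardise each layer with ID reference statistics, then take a weighted sum."""
    z = (scores - ref_scores.mean(axis=0)) / (ref_scores.std(axis=0) + 1e-8)
    w = np.asarray(weights, dtype=float)
    return z @ (w / w.sum())

rng = np.random.default_rng(0)
dims = [8, 16, 4]  # hypothetical feature widths of three layers
id_train = [rng.normal(size=(500, d)) for d in dims]
id_test = [rng.normal(size=(200, d)) for d in dims]
# Synthetic OOD batch: only the last layer's features are shifted.
ood_test = [
    rng.normal(loc=3.0 if i == 2 else 0.0, size=(200, d)) for i, d in enumerate(dims)
]

id_means = [f.mean(axis=0) for f in id_train]
ref = layer_scores(id_train, id_means)

def gap(weights):
    """Difference between the mean aggregated OOD score and the mean ID score."""
    s_id = aggregate(layer_scores(id_test, id_means), ref, weights)
    s_ood = aggregate(layer_scores(ood_test, id_means), ref, weights)
    return s_ood.mean() - s_id.mean()

gap_uniform = gap([1, 1, 1])
gap_weighted = gap([0.1, 0.1, 0.8])
print(f"uniform weights gap:  {gap_uniform:.2f}")
print(f"weighted layers gap:  {gap_weighted:.2f}")
```

In this toy setting only the last layer carries the OOD signal, so placing more weight on it widens the gap between ID and OOD scores. A learned weighting scheme would aim to discover such weights from data instead of receiving them by hand.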

Layer-wise score aggregation offers significant advantages in its flexibility and adaptability to different model architectures and datasets. By allowing the model to learn optimal weights for each layer, this method accommodates the unique characteristics of various textual datasets and neural network designs. For instance, in natural language processing (NLP) tasks, different layers of a pre-trained language model might capture syntactic, semantic, or contextual information. Layer-wise score aggregation can dynamically assign greater weight to the layers most relevant for detecting anomalies in the given context.

Furthermore, the data-driven nature of this approach enables adaptation to distribution shifts in the data over time, a common challenge in real-world applications. As the distribution of ID data evolves, the relative importance of different layers in capturing anomalous patterns can change. The unsupervised optimization method can continually adjust the weights assigned to each layer's score to reflect these changes, ensuring the OOD detection system remains effective even as the underlying data distribution shifts.

Practical demonstrations of layer-wise score aggregation in various NLP tasks, such as text classification, sentiment analysis, and document clustering, highlight its effectiveness. For example, in a scenario where a model is trained on a corpus of news articles, but the OOD data consists of technical reports or academic papers, layer-wise score aggregation helps the model identify the most salient features that differentiate these document types. By focusing on the layers that capture the most distinctive linguistic patterns, the model achieves higher accuracy in detecting OOD samples.

This method also offers several benefits over traditional global scoring approaches. First, it reduces the reliance on manually curated datasets of OOD samples, which are often difficult to obtain and may not fully represent the diversity of potential OOD scenarios. Second, by leveraging the intrinsic structure of the data and the model, it provides a more principled way of aggregating anomaly scores reflecting the true distribution of the data. Lastly, it facilitates the interpretation of the OOD detection process, as the weights assigned to each layer reveal which features are most indicative of anomalies.

Despite its advantages, implementing layer-wise score aggregation poses some challenges. The primary challenges include the computational cost associated with extracting anomaly scores from multiple layers and performing the optimization process, and the complexity of determining the optimal configuration of the optimization algorithm. Careful tuning of hyperparameters may be required to achieve the desired performance. However, ongoing research is addressing these challenges, with advances in optimization techniques and hardware acceleration potentially mitigating these issues.

In conclusion, layer-wise score aggregation for textual OOD detection represents a promising avenue for enhancing the accuracy and robustness of OOD detection systems. By enabling the model to learn optimal weights for each layer's anomaly score, this method adapts to the unique characteristics of the data and the model architecture, leading to more effective and reliable OOD detection. As research continues to refine this approach, it has the potential to become a standard tool in the arsenal of methods for handling distribution shifts in NLP and other text-centric domains.

### 5.5 Simple Yet Effective Approaches Using Euclidean Distance

SupEuclid is a relatively simple yet highly effective approach for out-of-distribution (OOD) detection that combines Supervised Contrastive Learning (SCL) with Euclidean distance measurements. In contrast to the layer-wise score aggregation techniques discussed previously, SupEuclid offers a streamlined and intuitive alternative that is reported to achieve state-of-the-art results with less complexity. This section covers the theoretical underpinnings and practical implementation of SupEuclid, with an emphasis on how it detects OOD samples efficiently and accurately.

Supervised Contrastive Learning (SCL) is a variant of contrastive learning, a popular method in representation learning that learns discriminative representations by comparing positive and negative pairs of samples. The core idea of SCL is to maximize the agreement between positive pairs and minimize the agreement between negative pairs during training. Positive pairs consist of samples from the same class, whereas negative pairs comprise samples from different classes. In the context of OOD detection, SCL is used to shape the embedding space of in-distribution (ID) data: because each ID class forms a tight cluster, OOD samples, which belong to none of the classes, are expected to fall far from every cluster. In this way, SCL encourages representations in which ID and OOD samples are more separable.
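The pull-together and push-apart objective can be illustrated with a small NumPy sketch of a supervised contrastive loss in its commonly used form (same-class samples as positives, all other samples in the batch as the contrast set). The function name `supcon_loss`, the temperature value, and the toy embeddings are assumptions for illustration; the exact loss variant used by SupEuclid is not specified in the text above.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (requires at least 2 samples)."""
    labels = np.asarray(labels)
    n = len(labels)
    # L2-normalise so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # an anchor is never its own contrast
    # Row-wise log-softmax over all other samples (numerically stabilised).
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Positives: same label, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0  # anchors without any positive are skipped
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())

# Two tight clusters: rows 0-1 near the x-axis, rows 2-3 near the y-axis.
emb = np.array([[1.0, 0.05], [1.0, -0.05], [0.05, 1.0], [-0.05, 1.0]])
loss_matched = supcon_loss(emb, [0, 0, 1, 1])     # labels agree with the clusters
loss_mismatched = supcon_loss(emb, [0, 1, 0, 1])  # labels cut across the clusters
print(f"matched labels:    {loss_matched:.4f}")
print(f"mismatched labels: {loss_mismatched:.4f}")
```

The loss is low when same-class embeddings already sit close together and high when they do not, which is the pressure that produces compact class clusters during training.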

The SupEuclid method builds upon this foundation by integrating Euclidean distance as a metric for measuring the dissimilarity between samples. In the realm of machine learning, Euclidean distance is a commonly used measure for quantifying the similarity or difference between two points in a multi-dimensional space. Its simplicity and interpretability make it a compelling choice for OOD detection, especially when combined with SCL. The combination of SCL and Euclidean distance allows SupEuclid to capture both the global structure of the feature space and the local neighborhood relationships between samples. This dual perspective enhances the model's capacity to discriminate between ID and OOD samples, leading to improved detection performance.

A key advantage of SupEuclid lies in its simplicity and ease of implementation. Unlike many OOD detection methods that require complex neural network architectures or elaborate hyperparameter tuning, SupEuclid involves only two steps: training a classifier using SCL, then using Euclidean distance to assess how close test samples lie to the learned representations of ID samples. This simplicity facilitates faster training and inference and, by avoiding extensive tuning, reduces the risk of overfitting and improves generalization.
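The scoring step can be sketched in a few lines. The text does not say which ID representations the distance is measured against, so this sketch assumes per-class mean embeddings and takes the distance to the nearest one as the OOD score. The embeddings are synthetic stand-ins for the output of an SCL-trained encoder, and `class_means` and `euclid_ood_score` are hypothetical names.

```python
import numpy as np

def class_means(embeddings, labels):
    """Mean embedding of each ID class, stacked as (n_classes, dim)."""
    labels = np.asarray(labels)
    return np.stack([embeddings[labels == c].mean(axis=0) for c in np.unique(labels)])

def euclid_ood_score(test_embeddings, means):
    """Euclidean distance to the nearest class mean (higher = more likely OOD)."""
    d = np.linalg.norm(test_embeddings[:, None, :] - means[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])  # three toy ID classes
train_labels = np.repeat(np.arange(3), 100)
train_emb = centers[train_labels] + 0.5 * rng.normal(size=(300, 2))
means = class_means(train_emb, train_labels)

test_labels = rng.integers(0, 3, size=50)
id_emb = centers[test_labels] + 0.5 * rng.normal(size=(50, 2))
ood_emb = 0.5 * rng.normal(size=(50, 2))  # near the origin, far from every class

s_id = euclid_ood_score(id_emb, means)
s_ood = euclid_ood_score(ood_emb, means)
print(f"mean ID score:  {s_id.mean():.2f}")
print(f"mean OOD score: {s_ood.mean():.2f}")
```

No parameters are fitted beyond the class means, which reflects the low tuning burden described above.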

Another notable aspect of SupEuclid is its robustness to distribution shifts. In the context of OOD detection, distribution shifts refer to changes in the data distribution between the training and testing phases, which can severely impact the performance of machine learning models. SupEuclid demonstrates resilience to such shifts by leveraging the contrastive learning paradigm and Euclidean distance measurements. Contrastive learning inherently focuses on learning representations that are invariant to certain transformations and variations in the data, thereby facilitating better generalization to unseen distributions. Additionally, the use of Euclidean distance enables SupEuclid to quantify the dissimilarity between test samples and the learned representations of ID samples, allowing for effective detection of OOD samples even when the data distribution undergoes significant changes.

Experimental evaluations have consistently demonstrated the effectiveness of SupEuclid in achieving state-of-the-art results across various datasets and scenarios. These experiments have been conducted using a range of benchmark datasets commonly used in OOD detection research, including CIFAR-10, CIFAR-100, and ImageNet. The results indicate that SupEuclid outperforms many existing methods in terms of detection accuracy and robustness to distribution shifts. Furthermore, SupEuclid has been shown to be particularly advantageous in handling complex and high-dimensional data, where traditional methods may struggle due to the curse of dimensionality and the difficulty of learning meaningful representations.

In conclusion, SupEuclid represents a significant advancement in the field of OOD detection by providing a simple yet powerful solution that combines the strengths of Supervised Contrastive Learning and Euclidean distance measurements. Its ability to achieve state-of-the-art results without the need for complex models or extensive hyperparameter tuning makes it an attractive option for practitioners seeking to deploy robust OOD detection systems. As the demand for reliable and safe machine learning models continues to grow, SupEuclid offers a promising avenue for enhancing the performance and applicability of OOD detection techniques. Future research could further explore the potential of SupEuclid in handling more challenging scenarios and expanding its application domains, contributing to the broader goal of advancing OOD detection methodologies.

## 6 Evaluating OOD Detection Performance

### 6.1 Challenges in Evaluating OOD Detection Methods

Evaluating out-of-distribution (OOD) detection methods presents a unique set of challenges due to the inherently complex nature of OOD scenarios and the limitations in obtaining comprehensive ground truth labels. A primary hurdle is the scarcity of annotated OOD data, which makes it difficult to assess the performance of OOD detection algorithms in real-world settings [6]. The absence of explicit labels for OOD samples often leads researchers to simulate OOD conditions artificially, which may not fully capture the diversity and unpredictability of actual out-of-distribution scenarios.

Additionally, the variability in experimental setups across different studies adds another layer of complexity to the evaluation process. Research groups may define OOD detection tasks differently, using diverse datasets, model architectures, and evaluation metrics, which leads to inconsistent results and limits comparability [3]. For example, some studies use a fixed set of OOD classes that are distinct from the in-distribution (ID) classes but drawn from the same source dataset, while others consider a broader spectrum of OOD scenarios that could arise in practice, making direct comparison challenging [2].

The lack of standardized evaluation protocols further complicates these issues, as it is common for different research efforts to tailor their evaluation procedures to fit their specific methodologies and research questions [6]. This variability can obscure the true effectiveness of OOD detection techniques, as performance metrics may be optimized for specific configurations rather than capturing the general robustness of the models. Thus, it becomes challenging to establish a consistent baseline against which new methods can be reliably measured and compared.

Another challenge lies in the inherent subjectivity involved in defining what constitutes an OOD sample. Depending on the application domain, the boundary between ID and OOD data can be ambiguous and context-dependent. For instance, in autonomous driving, the distinction between an in-distribution pedestrian and an out-of-distribution cyclist might depend on factors such as the angle of observation or the environment in which the object is encountered [16]. This subjectivity complicates the task of constructing a universally accepted set of OOD samples and evaluating the performance of OOD detection algorithms across different environments and use cases.

Furthermore, the dynamic nature of distributional shifts adds another dimension of complexity. Real-world scenarios often involve continuous changes in the underlying data distribution, making it challenging to design evaluation frameworks that adequately simulate these shifts [1]. Static datasets and controlled experimental conditions used in many OOD detection studies may not accurately reflect the evolving nature of distributional changes in practical deployments. This mismatch can lead to overly optimistic evaluations of OOD detection methods, as the performance observed in controlled settings may not translate to real-world effectiveness.

The reliance on synthetic OOD data is another significant limitation in evaluating OOD detection methods. While synthetic data can help in systematically exploring different types of OOD scenarios, it often lacks the richness and variability found in real-world data, resulting in a bias towards certain types of OOD conditions and a reduced ability to generalize to unforeseen situations [1]. Moreover, generating synthetic OOD data requires careful consideration of the characteristics of both ID and OOD samples, which may not always be straightforward or fully representative of the full spectrum of potential OOD instances.

Assessing the robustness of OOD detection methods to spurious correlations is also critical. Many existing methods rely on certain assumptions about the underlying data distribution, which can break down in the presence of spurious correlations that may not be captured during training [1]. Such correlations can lead to models performing well on synthetic OOD data but failing to generalize to more complex and nuanced real-world scenarios. Therefore, evaluating OOD detection methods requires not only testing their performance on well-defined OOD samples but also assessing their resilience to unexpected and potentially misleading correlations in the data.

Lastly, the rapid evolution of machine learning models and their applications necessitates continuous refinement of evaluation paradigms. As new models and techniques emerge, the landscape of OOD detection challenges shifts, requiring updated evaluation frameworks that can capture the latest advancements and potential pitfalls [1]. The ongoing development of more sophisticated and diverse machine learning models means that the types of distributional shifts encountered in real-world applications are becoming increasingly complex and multifaceted, demanding a more nuanced and comprehensive approach to evaluation.

In summary, the evaluation of OOD detection methods is beset by challenges such as the scarcity of annotated OOD data, variability in experimental setups, subjectivity in defining OOD samples, and the dynamic and unpredictable nature of distributional shifts. Addressing these challenges requires concerted efforts to develop more robust and standardized evaluation protocols, incorporate real-world data into evaluation processes, and continually adapt to the evolving landscape of machine learning applications.

### 6.2 Unsupervised Evaluation Metrics

Unsupervised evaluation metrics play a crucial role in assessing the performance of out-of-distribution (OOD) detection methods, particularly in scenarios where labeled OOD data is scarce or unavailable. Among these metrics, Gscore stands out as a versatile and effective tool that does not rely on labeled OOD data, making it particularly valuable for real-world applications where obtaining such labels can be challenging or impractical. The rationale behind Gscore is grounded in the principle of distinguishing the typicality of in-distribution (ID) data from the atypicality of OOD data, which is achieved through a non-parametric statistical test. This metric evaluates the degree to which a given test sample belongs to the ID distribution, thereby providing a quantitative measure of the OOD detection performance.

Introduced in the context of benchmarking OOD detection algorithms, Gscore addresses the need for standardized and unbiased evaluation protocols [18]. Unlike traditional evaluation methods that require labeled OOD data, Gscore leverages the intrinsic properties of the data to establish a decision boundary between ID and OOD data, making it suitable for situations where acquiring labeled OOD data is difficult or impossible.

To compute Gscore, one must first construct a representative reference set from the ID data. This reference set serves as a baseline for determining the typicality of new test samples. The size of this reference set is a critical parameter influencing both the accuracy and computational cost of the Gscore computation. In practice, the reference set can be generated using random sampling or more advanced techniques like k-means clustering to ensure its representativeness of the ID data distribution.

The computation of Gscore involves several key steps:

1. **Reference Set Construction**: Create a representative subset of the ID data that reflects the underlying distribution. This subset is used to estimate the typicality of new test samples. Careful selection of the reference set size and composition is essential for accurate Gscore calculations.

2. **Score Calculation**: For each test sample, calculate a typicality score based on its distance from the reference set. Different distance metrics such as Euclidean, Mahalanobis, or kernel-based distances can be employed. The choice of distance metric is crucial, as it can affect the performance of Gscore depending on the data type and structure.

3. **Threshold Determination**: Establish a threshold to differentiate between ID and OOD samples based on the typicality scores calculated for the reference set. Thresholds can be determined using methods like percentile ranking or empirical risk minimization, aiming to maximize the separation between ID and OOD samples.

4. **Evaluation Metrics**: Evaluate OOD detection performance using standard metrics such as FPR95, AUROC, and AUPR, based on the typicality scores and thresholds. These metrics offer a quantitative assessment of OOD detection performance, facilitating direct comparisons between different methods.
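The four steps above can be sketched end to end as follows. This is not the Gscore implementation; it is a minimal illustration that assumes random sampling for the reference set, a k-nearest-neighbor Euclidean distance as the typicality score, and a 95th-percentile threshold computed on held-out ID data. All function names are hypothetical. The step 4 metrics need OOD labels, which are available here only because the OOD test set is synthetic.

```python
import numpy as np

def build_reference_set(id_data, size, rng):
    """Step 1: random subset of ID data used as the reference set."""
    idx = rng.choice(len(id_data), size=size, replace=False)
    return id_data[idx]

def typicality_score(x, reference, k=5):
    """Step 2: mean Euclidean distance to the k nearest reference points.
    Higher means less typical of the ID data."""
    d = np.linalg.norm(x[:, None, :] - reference[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def percentile_threshold(id_scores, q=95.0):
    """Step 3: threshold at the q-th percentile of held-out ID scores."""
    return float(np.percentile(id_scores, q))

def auroc(id_scores, ood_scores):
    """Step 4: probability that a random OOD sample outscores a random ID sample."""
    diff = ood_scores[:, None] - id_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fpr_at_95_tpr(id_scores, ood_scores):
    """Step 4: with ID as the positive class, the share of OOD samples still
    accepted as ID when the threshold accepts 95% of ID samples."""
    thr = np.percentile(id_scores, 95.0)
    return float((ood_scores <= thr).mean())

rng = np.random.default_rng(0)
id_pool = rng.normal(size=(1000, 8))
id_calib = rng.normal(size=(300, 8))           # held out, disjoint from the reference set
test_id = rng.normal(size=(200, 8))
test_ood = rng.normal(loc=4.0, size=(200, 8))  # synthetic, labeled OOD for the demo

reference = build_reference_set(id_pool, size=200, rng=rng)
threshold = percentile_threshold(typicality_score(id_calib, reference))

s_id = typicality_score(test_id, reference)
s_ood = typicality_score(test_ood, reference)
flag_id = s_id > threshold    # ID samples wrongly flagged as OOD
flag_ood = s_ood > threshold  # OOD samples correctly flagged
auc = auroc(s_id, s_ood)
fpr95 = fpr_at_95_tpr(s_id, s_ood)
print(f"ID flagged: {flag_id.mean():.2%}, OOD flagged: {flag_ood.mean():.2%}")
print(f"AUROC: {auc:.3f}, FPR95: {fpr95:.3f}")
```

The threshold is calibrated on ID samples that are not in the reference set, since a reference point's distance to itself is zero and would bias the threshold downward. FPR95 conventions vary across papers; the one used here treats ID as the positive class.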

Gscore's strength lies in its unsupervised nature, eliminating the need for labeled OOD data. This makes it particularly appealing for real-world applications where labeled OOD data is scarce or too costly to obtain. Additionally, its reliance on non-parametric statistical tests enables it to be applied to various data types, enhancing its versatility in benchmarking OOD detection methods.

However, the effectiveness of Gscore hinges on the quality and representativeness of the reference set. An inadequately reflective reference set can lead to biased typicality scores, resulting in inaccurate evaluations of OOD detection performance. Therefore, meticulous selection and validation of the reference set are imperative for ensuring the reliability of Gscore.

Other unsupervised evaluation metrics have been proposed to complement Gscore, addressing its limitations and offering additional insights. For example, some methods use entropy-based measures to gauge the confidence of OOD detection models, while others derive anomaly scores from reconstruction errors in autoencoders. Despite their methodological differences, these metrics share the common objective of evaluating OOD detection performance without labeled OOD data.
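Both families of scores mentioned here can be sketched briefly. The sketch below is illustrative only. Predictive entropy is computed from softmax outputs, and reconstruction error uses a PCA projection as a linear stand-in for an autoencoder, which is an assumption made to keep the example self-contained. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def predictive_entropy(logits):
    """Shannon entropy of the predicted class distribution (higher = less confident)."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def fit_pca(x, n_components):
    """Mean and top principal directions of the ID data."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(x, mean, components):
    """Norm of the residual after projecting onto the principal subspace."""
    centered = x - mean
    recon = centered @ components.T @ components
    return np.linalg.norm(centered - recon, axis=1)

# Entropy: a confident prediction versus a uniform one.
logits = np.array([[10.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
entropy = predictive_entropy(logits)

# Reconstruction error: ID data lies near a 2-D subspace of a 10-D space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 10))
id_x = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))
ood_x = rng.normal(size=(100, 10))  # isotropic, does not follow the subspace
mean, comps = fit_pca(id_x, n_components=2)
err_id = reconstruction_error(id_x, mean, comps)
err_ood = reconstruction_error(ood_x, mean, comps)
print(f"entropy (confident, uniform): {entropy[0]:.4f}, {entropy[1]:.4f}")
print(f"mean reconstruction error ID / OOD: {err_id.mean():.3f} / {err_ood.mean():.3f}")
```

Neither score requires OOD labels: entropy uses only the model's own predictions, and reconstruction error uses only a model fitted to ID data.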

While unsupervised evaluation metrics like Gscore offer promising solutions, their real-world application can be challenging when the shift between ID and OOD data is subtle, since the boundary between the two becomes ambiguous and evaluation results become harder to interpret. Consequently, it is vital to consider the context and specific requirements of the application when employing and interpreting unsupervised evaluation metrics.

In summary, unsupervised evaluation metrics such as Gscore provide a valuable approach to benchmarking OOD detection methods in the absence of labeled OOD data. By evaluating performance based on the intrinsic properties of the data, these metrics are particularly useful for real-world applications. Nevertheless, their successful application demands careful attention to the construction and validation of the reference set, as well as an understanding of their limitations and potential biases. As OOD detection continues to advance, refining unsupervised evaluation metrics will remain crucial for ensuring the reliability of machine learning systems in diverse and dynamic environments.

### 6.3 Comparative Analysis of Evaluation Protocols

The evaluation of out-of-distribution (OOD) detection methods is a critical aspect of assessing their reliability and effectiveness in real-world scenarios. Various evaluation protocols have been proposed in the literature, each with distinct advantages and limitations in terms of realism and generalizability. In this subsection, we provide a comparative analysis of different evaluation protocols, highlighting their strengths and weaknesses.

**Cross-Entropy-Based Methods**

Cross-entropy-based methods, commonly used in classification tasks, evaluate a model through the softmax probabilities of its predicted classes. These methods assume that the highest probability corresponds to the most likely class and are often used to assess confidence in predictions. However, as noted in 'ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms', such methods are prone to overconfidence and can fail to detect out-of-distribution samples accurately. Specifically, they often perform similarly to or even worse than the maximum softmax probability (MSP) baseline under certain conditions, indicating a limited capacity to differentiate between in-distribution and out-of-distribution samples. Despite their simplicity, cross-entropy-based methods may not provide a comprehensive evaluation because they assume well-separated class boundaries, an assumption often violated in real-world applications.
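For reference, the maximum softmax probability baseline mentioned above fits in a few lines. The sketch below uses hand-written logits and a hypothetical function name `msp_score`. The sign is flipped so that a higher score means "more likely OOD", which is a convention chosen here for consistency with distance-based scores.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def msp_score(logits):
    """Negative maximum softmax probability (higher = more likely OOD)."""
    return -softmax(logits).max(axis=1)

logits = np.array([
    [8.0, 0.5, 0.1],  # one class dominates: confident prediction
    [1.0, 1.0, 1.0],  # no class dominates: uncertain prediction
])
scores = msp_score(logits)
print(scores)
```

The overconfidence problem described above appears when an OOD input still produces one dominant logit: its MSP score then looks like that of a confident ID prediction.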

**Likelihood-Based Methods**

Likelihood-based methods assess the probability of a sample belonging to the training distribution using likelihood functions. These methods typically involve training a generative model and measuring the likelihood of test samples under this model. While likelihood-based approaches offer a principled way to evaluate the fit of data to a given distribution, they suffer from issues such as model misspecification and the curse of dimensionality, especially in high-dimensional spaces. Furthermore, the computational cost of estimating likelihoods can be prohibitive, making these methods less scalable for large datasets. For instance, 'Understanding the properties and limitations of contrastive learning for Out-of-Distribution detection' highlights that likelihood-based methods may struggle with capturing the true underlying distribution, particularly in the presence of complex distributions that deviate significantly from the training data. These limitations underscore the need for alternative evaluation protocols that are more robust and computationally efficient.
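The scoring logic can be shown with the simplest possible density model. The sketch below fits a single multivariate Gaussian to ID features and uses the log-likelihood as the ID-ness score. This is an assumption made for brevity: the methods discussed here typically rely on deep generative models, and the Gaussian is only a stand-in. The names `fit_gaussian` and `log_likelihood` are hypothetical.

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood style fit: sample mean and (regularised) covariance."""
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return mean, cov

def log_likelihood(x, mean, cov):
    """Log-density of each row of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of each sample.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
id_train = rng.normal(size=(1000, 3))
mean, cov = fit_gaussian(id_train)

ll_id = log_likelihood(rng.normal(size=(200, 3)), mean, cov)
ll_ood = log_likelihood(rng.normal(loc=4.0, size=(200, 3)), mean, cov)
print(f"mean log-likelihood ID:  {ll_id.mean():.2f}")
print(f"mean log-likelihood OOD: {ll_ood.mean():.2f}")
```

Samples with low log-likelihood are treated as OOD. The covariance estimate needs many more samples as the dimension grows, and a single Gaussian is easily misspecified, which is one way to see the dimensionality and misspecification issues raised above.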

**Contrastive Learning Evaluation**

Contrastive learning, a form of self-supervised learning, has gained popularity in recent years for its ability to learn discriminative representations of data. In the context of OOD detection, contrastive learning methods aim to differentiate between in-distribution and out-of-distribution samples by leveraging the learned embeddings. These methods often involve training a binary classifier to distinguish between the two classes. While contrastive learning provides a promising framework for OOD detection, its evaluation poses unique challenges. The primary advantage of contrastive learning lies in its ability to learn meaningful representations without requiring explicit labeling of OOD samples, thereby enabling more flexible and generalizable evaluations. However, the performance of contrastive learning methods can be highly dependent on the quality and diversity of the training data, as well as the choice of contrastive loss function. Moreover, the effectiveness of these methods in real-world scenarios may be limited by the presence of spurious correlations, as highlighted in 'Large Class Separation is not what you need for Relational Reasoning-based OOD Detection'. Thus, while contrastive learning offers a powerful tool for OOD detection, its evaluation requires careful consideration of these factors to ensure reliable performance.

**Domain Generalization Protocols**

Domain generalization protocols focus on evaluating OOD detection methods across multiple domains, recognizing that real-world applications often involve distributional shifts across different environments. These protocols aim to assess the ability of a model to generalize to unseen domains while maintaining robust performance. Domain generalization is particularly challenging due to the need to capture the essence of the data distribution without overfitting to specific characteristics of the training domains. Recent works, such as 'Towards Effective Semantic OOD Detection in Unseen Domains: A Domain Generalization Perspective', have introduced regularization strategies to improve OOD detection performance in domain-generalization settings. These strategies include domain generalization regularization, which ensures semantic invariance across domains, and OOD detection regularization, which enhances the ability to detect semantic shifts. While domain generalization protocols provide a more realistic evaluation of OOD detection methods, they require careful design of the evaluation setup to ensure that the domains are representative of real-world conditions and that the evaluation metrics are appropriately chosen to reflect the desired properties of the OOD detection system.

**Meta-Learning Evaluation Protocols**

Meta-learning, or learning to learn, represents a paradigm shift in machine learning where the goal is to develop algorithms that can adapt quickly to new tasks or domains. In the context of OOD detection, meta-learning evaluation protocols aim to assess the ability of a model to adapt to new distributions rapidly, a critical requirement for real-world applications. Meta-learning approaches often involve training a base model on a diverse set of tasks or domains and then fine-tuning it on new tasks with limited data. This setup is particularly relevant for OOD detection in scenarios where the distribution shifts are unpredictable and rapid adaptation is necessary. Works such as 'Meta OOD Learning for Continuously Adaptive OOD Detection' highlight the importance of developing OOD detection methods that can adapt dynamically to evolving distributions. However, the evaluation of meta-learning protocols poses unique challenges, including the need to balance the trade-off between generalization and specialization, and the difficulty in designing evaluation metrics that accurately capture the performance of the adapted model. Additionally, the computational cost of training and fine-tuning models in a meta-learning framework can be substantial, posing a barrier to widespread adoption.

**Unsupervised Evaluation Metrics**

Unsupervised evaluation metrics, such as Gscore, offer a promising avenue for assessing OOD detection performance in scenarios where labeled OOD data is scarce or unavailable. These metrics typically leverage the structure of the data or the learned representations to evaluate the ability of a model to detect out-of-distribution samples. Unsupervised metrics are advantageous because they do not require explicit labeling of OOD samples, making them more applicable in real-world settings where obtaining labeled OOD data is often challenging. Gscore, for example, computes a score based on the distance between the in-distribution and out-of-distribution samples in the learned representation space. While unsupervised metrics provide a valuable tool for evaluating OOD detection methods, they can be sensitive to the choice of representation and the specific characteristics of the dataset. Moreover, the lack of labeled OOD data can limit the interpretability of the evaluation results, making it difficult to assess the true performance of the OOD detection system. Nonetheless, unsupervised metrics represent a significant step forward in developing evaluation protocols that are more aligned with the practical challenges of OOD detection.

**Realism and Generalizability**

When evaluating OOD detection methods, it is crucial to consider the realism and generalizability of the evaluation protocols. Realism refers to the extent to which the evaluation setup mirrors the conditions under which the OOD detection system will be deployed. Generalizability, on the other hand, pertains to the ability of the evaluation protocol to produce consistent and reliable results across different datasets and scenarios. Each of the evaluation protocols discussed above has its own strengths and weaknesses in terms of realism and generalizability. Cross-entropy-based and likelihood-based methods are relatively simple and widely applicable but may not fully capture the complexities of real-world distributions. Contrastive learning and domain generalization protocols offer more nuanced evaluations but require careful design and potentially high computational costs. Meta-learning evaluation protocols hold promise for dynamic adaptation but pose significant challenges in terms of evaluation metrics and computational requirements. Unsupervised evaluation metrics provide a valuable tool for scenarios where labeled OOD data is scarce but may be limited in their interpretability and robustness.

In conclusion, the choice of evaluation protocol for OOD detection methods is a critical decision that can significantly impact the perceived performance and applicability of the methods. Each protocol has its unique strengths and limitations, and the selection should be guided by the specific needs and constraints of the application domain. Future research in OOD detection should continue to explore and refine evaluation protocols that strike a balance between realism and generalizability, ultimately leading to more reliable and robust OOD detection systems.

### 6.4 Case Studies and Real-World Applications

Case studies and real-world applications of out-of-distribution (OOD) detection evaluation metrics highlight the practical significance of these metrics in diverse and challenging scenarios. These metrics not only provide a standardized means of assessing OOD detection performance but also enable researchers and practitioners to make informed decisions regarding model selection, tuning, and deployment. By employing evaluation metrics such as Gscore, we can gain valuable insights into the robustness and reliability of OOD detection systems in real-world settings, thereby facilitating their integration into critical applications.

A prominent case study involves the application of OOD detection in autonomous vehicle systems [9]. Autonomous vehicles rely heavily on machine learning models to perceive and interpret their environment, including distinguishing between normal driving conditions and unusual or unexpected scenarios. Accurately identifying OOD inputs is imperative for preventing potential hazards, such as recognizing pedestrians in adverse weather conditions where visibility is severely reduced. The Gscore metric, due to its unsupervised nature, allows for the evaluation of these systems without the need for labeled OOD data, which is often difficult or impractical to obtain in the automotive industry. This makes Gscore a valuable tool for validating the performance of OOD detection mechanisms in autonomous vehicles, ensuring they can reliably operate in a wide range of real-world conditions.

Another notable example is in medical imaging [11], where OOD detection can play a crucial role in identifying anomalies and ensuring patient safety. Medical imaging applications frequently encounter rare and complex pathologies not represented in the training dataset. Detecting OOD data in such scenarios is essential for early diagnosis and intervention. For instance, in radiology, a system might need to recognize subtle abnormalities in X-ray images that deviate significantly from the norm but are indicative of serious health conditions. Here, evaluation metrics like Gscore help assess the model’s capability to generalize beyond the training distribution and detect anomalies outside the expected range of variation. Ensuring diagnostic tools are both accurate and robust against unforeseen cases contributes to better clinical outcomes.

In cybersecurity [11], OOD detection serves as a critical mechanism for identifying novel threats from sophisticated adversaries. Traditional security systems often struggle with zero-day vulnerabilities or malware variants unseen before. Distinguishing between benign network traffic and potentially malicious activities exhibiting unusual patterns is a challenge. Evaluation metrics like Gscore provide a quantitative assessment of a system's ability to identify and respond to such threats, enabling security professionals to refine detection strategies and enhance system resilience. Continuous monitoring and evaluation of OOD detection systems help organizations stay ahead of evolving cyber threats, ensuring the integrity and security of digital assets.

OOD detection also plays a vital role in natural language processing (NLP) tasks [18]. Handling out-of-vocabulary (OOV) words and unusual text patterns is essential for building robust and reliable language models. In conversational AI systems, such as chatbots and virtual assistants, the ability to manage OOD inputs is paramount for providing accurate and helpful responses. Evaluation metrics facilitate the assessment of these systems by quantifying their performance in recognizing and managing inputs that deviate from expected vocabulary and syntax, ensuring coherent and meaningful interactions with users, even when encountering unfamiliar linguistic phenomena.

The evaluation of OOD detection systems is critical in industrial automation and robotics [5], where the reliability of machine learning models is essential for safe operation. In manufacturing environments, robots and automated systems perform repetitive tasks under controlled conditions but must adapt to unforeseen circumstances, such as changes in production line setups or the appearance of foreign objects. OOD detection helps in identifying and responding to these anomalies, preventing malfunctions and ensuring operational continuity. Metrics like Gscore offer a systematic approach to evaluating the performance of these systems in detecting and handling OOD events, contributing to overall safety and efficiency.

Furthermore, in financial services [11], OOD detection aids in fraud detection and risk management. Financial institutions handle vast transactional data and must be vigilant in identifying suspicious activities indicating fraudulent behavior. OOD detection models help recognize patterns deviating from normal transactions, signaling potential fraud. Evaluation metrics allow for a thorough assessment of the model’s ability to identify these anomalies, enhancing the institution’s capability to mitigate risks and protect assets, underscoring the importance of robust OOD detection in safeguarding financial systems.

In climate science [5], OOD detection assists in environmental monitoring by identifying outliers in large volumes of climate data. These anomalies may indicate sudden changes or extreme events, providing valuable insights for decision-making and policy formulation. Using evaluation metrics, researchers gauge the effectiveness of these models in recognizing unusual patterns and ensuring the accuracy of climate predictions. This supports the development of resilient strategies for coping with climate change and protecting vulnerable communities.

In summary, the evaluation of OOD detection systems through metrics like Gscore is essential for ensuring the reliability and safety of machine learning applications across various domains. Whether in autonomous vehicles, medical diagnostics, cybersecurity, NLP, industrial automation, financial services, or climate science, consistent and rigorous assessment of OOD detection capabilities helps identify areas for improvement and promotes the adoption of robust and trustworthy AI systems. As the field evolves, the ongoing refinement and validation of these evaluation metrics remain crucial for advancing OOD detection and driving innovation in real-world applications.

### 6.5 Future Directions in Evaluation Research

The evaluation of out-of-distribution (OOD) detection methods is a rapidly evolving area of research, driven by the increasing complexity and diversity of real-world applications. As machine learning models become more integrated into critical decision-making processes, the need for robust and reliable OOD detection evaluation protocols becomes paramount. To ensure that evaluations remain relevant, rigorous, and reflective of real-world conditions, future research should focus on several key areas.

Firstly, obtaining ground truth labels for OOD samples remains a significant challenge. Existing unsupervised evaluation metrics, such as Gscore, provide valuable tools for assessing the performance of OOD detectors in the absence of labeled data [6]. However, these metrics may not fully capture the nuances of different OOD scenarios, particularly in dynamic and multimodal environments. Future research should therefore explore more sophisticated unsupervised evaluation metrics capable of accounting for the varying characteristics of OOD data across different modalities and contexts. For instance, developing metrics that can differentiate between covariate shifts, concept drifts, and label noise could provide a more nuanced understanding of OOD detection performance.

Secondly, integrating human-in-the-loop evaluation methodologies can help bridge the gap between automated evaluation metrics and real-world applicability. By incorporating human judgment into the evaluation process, such methodologies can refine the evaluation metrics and improve the alignment of model outputs with human perception. This could involve designing interactive user interfaces that allow users to provide feedback on the reliability of OOD detections. Human-in-the-loop approaches would be particularly beneficial in high-stakes applications, such as autonomous driving and medical imaging, where the consequences of misclassifications can be severe.

Thirdly, the emergence of large language models (LLMs) and multimodal systems highlights the need for evaluation frameworks that can accommodate a wide range of data types and interactions. Multimodal OOD detection, exemplified by the WOOD framework, represents progress in addressing the complexities of multimodal data. However, these frameworks often rely on assumptions about the underlying data distribution that may not hold in real-world scenarios. Future research should investigate more flexible and adaptive evaluation methodologies that can handle the dynamic and uncertain nature of multimodal data. For example, developing methods that can adaptively adjust evaluation metrics based on the observed data characteristics could enhance the robustness and generalizability of OOD detection evaluations.

Fourthly, distribution shifts due to spurious correlations pose significant hurdles for the evaluation of OOD detection methods. Many existing anomaly detection and OOD generalization methods falter when faced with distribution shifts caused by spurious correlations in the data. Future research should develop evaluation protocols that better assess the robustness of OOD detectors to such shifts. This could involve designing datasets and benchmarks that explicitly simulate spurious correlation scenarios, allowing researchers to evaluate the extent to which different OOD detection methods can generalize across varying levels of distribution shift. Additionally, incorporating methods for mitigating spurious correlations, such as counterfactual reasoning and causal inference, into the evaluation process could provide deeper insights into the limitations and capabilities of different OOD detection approaches.

Fifthly, the integration of OOD detection with related tasks, such as anomaly detection and open set recognition, offers another fertile ground for future research. Although these tasks share common goals and methodologies, they often evolve in isolation. Developing unified evaluation frameworks that encompass a range of related tasks could foster cross-fertilization of ideas and methodologies, leading to more comprehensive and robust OOD detection systems. Incorporating metrics from anomaly detection and open set recognition into OOD detection evaluations could provide a more holistic assessment of a model's ability to handle unseen data and novel patterns.

Lastly, the advancement of computational efficiency in OOD detection evaluation is critical as the size and complexity of datasets continue to grow. Traditional evaluation methods may become computationally prohibitive, necessitating novel strategies. Recent efforts, such as the SUOD system for accelerating large-scale unsupervised heterogeneous outlier detection, highlight the importance of scalable evaluation methodologies. Future research should explore computational strategies like distributed computing and parallel processing to enhance the efficiency of OOD detection evaluations. Leveraging advancements in deep learning and machine learning algorithms to develop more lightweight and efficient evaluation models could significantly reduce the computational overhead associated with OOD detection evaluations.

In conclusion, the evaluation of OOD detection methods is a multifaceted challenge that requires a concerted effort to address the growing complexity and diversity of real-world applications. By focusing on more nuanced, adaptive, and computationally efficient evaluation methodologies, the field can ensure the safety and reliability of machine learning systems in critical applications.

## 7 Novel Techniques and Algorithms

### 7.1 Overview of Recent Innovations in OOD Detection

Recent advancements in out-of-distribution (OOD) detection have introduced a variety of innovative techniques aimed at improving the reliability and robustness of machine learning models in real-world applications. Notably, these advancements include leveraging gradient information, feature maps, and textual inputs. These techniques not only address the limitations of traditional OOD detection methods but also enhance the interpretability and effectiveness of models in identifying anomalies and outliers.

Gradient information has become a powerful tool in OOD detection because it reflects the internal behavior of deep neural networks. Techniques such as GradOrth [2] use gradient-based signals to identify OOD samples more accurately. By analyzing the gradients obtained from a backward pass, these methods can discern regions of the input space that are less familiar to the model. Gradients act as a form of internal feedback, revealing the extent to which a sample deviates from typical training patterns. This enables a finer-grained assessment of OODness, facilitating the identification of subtle variations that might otherwise go undetected.

Similarly, the use of feature maps represents another significant advancement in OOD detection. Feature maps capture the hierarchical representation of data as it traverses through network layers, offering valuable insights into the decision-making process. The GradOrth technique [2] leverages this hierarchical structure by projecting gradients onto specific subspaces deemed critical for in-distribution data. This projection helps isolate the influence of different layers on the overall OOD score, thereby enhancing the model's discriminative power. Another example is the DOODLER method [28], which employs Variational Autoencoders (VAEs) to reconstruct input images and segment them based on their OOD likelihood. By exploiting the reconstruction failure mode of VAEs, DOODLER effectively distinguishes between in-distribution (ID) and OOD inputs, making it a versatile tool for various modalities.

Textual inputs have also played a crucial role in recent OOD detection innovations. The advent of large language models (LLMs) [4] has enabled methods like NegLabel, which integrate textual information to improve OOD sample detection. By conditioning on textual descriptions or prompts, these models can generate peer classes that are semantically similar yet visually distinct from ID samples. This additional layer of information enriches the data representation and aids in distinguishing close but different classes, thus bolstering the robustness of the OOD detection system.

These advancements collectively address several challenges in OOD detection, including handling distributional shifts, identifying subtle anomalies, and enhancing interpretability. Traditional OOD detection methods often falter in scenarios involving subtle and gradual distributional shifts, making it challenging to pinpoint specific deviations from the training distribution. Gradient-based and feature map-driven approaches offer a more nuanced perspective by capturing fine-grained changes in the model’s response to varied inputs. Additionally, incorporating textual inputs allows models to consider semantic relationships between classes, leading to a richer understanding of the underlying data structure.

Beyond improving OOD detection accuracy, these innovations also enhance model interpretability. Understanding the rationale behind OOD predictions is vital for building trust in AI systems and diagnosing potential failures. Techniques like GradOrth and NegLabel not only detect OOD samples but also provide insights into why a particular sample is deemed OOD. For instance, gradient analysis can reveal which input parts cause model uncertainty, while feature map analysis can highlight hierarchical features inconsistent with learned patterns. This transparency is especially critical in safety-sensitive applications such as autonomous driving, where justifying decisions is essential.

Moreover, integrating gradient information, feature maps, and textual inputs has facilitated the development of more robust and adaptive OOD detection frameworks. Methods like AUTO [29] demonstrate real-time adaptation by mining pseudo-ID and pseudo-OOD samples from test data. These frameworks leverage the dynamic nature of data distributions to continually refine OOD detection capabilities. By adapting to new data, these methods maintain high performance even in rapidly evolving environments.
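The mining step described above can be illustrated with a minimal sketch. It assumes that each test sample already has a scalar OOD score where higher values mean "more ID-like", and it uses two hypothetical thresholds, `low` and `high`; neither the function name nor the thresholds come from the AUTO paper, and the actual method may select samples differently.

```python
def mine_pseudo_labels(scores, low, high):
    """Split test-sample indices into pseudo-ID, pseudo-OOD, and undecided.

    Assumption: a higher score means the sample looks more in-distribution.
    `low` and `high` are hypothetical confidence thresholds with low < high.
    """
    if low >= high:
        raise ValueError("expected low < high")
    pseudo_id, pseudo_ood, undecided = [], [], []
    for index, score in enumerate(scores):
        if score >= high:
            pseudo_id.append(index)      # confidently ID-like
        elif score <= low:
            pseudo_ood.append(index)     # confidently OOD-like
        else:
            undecided.append(index)      # too ambiguous to use for adaptation
    return pseudo_id, pseudo_ood, undecided


# Hypothetical scores for five unlabeled test samples.
test_scores = [0.95, 0.10, 0.55, 0.88, 0.02]
ids, oods, rest = mine_pseudo_labels(test_scores, low=0.2, high=0.8)
```

Only the confidently separated samples would then be used to update the detector, which limits the risk of reinforcing mistakes on ambiguous inputs.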

In summary, recent innovations in OOD detection have significantly advanced the field through sophisticated techniques that utilize gradient information, feature maps, and textual inputs. These advancements not only enhance the accuracy and robustness of OOD detection but also improve model interpretability and adaptability. As the demand for reliable and safe AI systems grows, these innovations represent promising steps toward addressing the inherent challenges of OOD detection and ensuring the broad applicability of AI technologies in real-world scenarios.

### 7.2 Frequency-Regularized Learning (FRL) for Generative Models

Frequency-Regularized Learning (FRL) for Generative Models represents a novel framework aimed at enhancing the out-of-distribution (OOD) detection capability of generative models by incorporating high-frequency information during the training phase. This technique stands out from traditional likelihood-based approaches by focusing on the nuanced characteristics of the data distribution, thereby improving both performance and efficiency in OOD detection.

Generative models, particularly variational autoencoders (VAEs) and generative adversarial networks (GANs), have been widely employed for OOD detection due to their ability to capture complex data distributions [1]. These models learn to represent the underlying structure of the in-distribution (ID) data by minimizing a reconstruction error or a likelihood objective. However, the reliance on likelihood-based criteria often leads to suboptimal performance in detecting OOD samples, as they may fail to adequately capture the subtle differences between ID and OOD data [5].

The FRL framework addresses these limitations by leveraging frequency information during the training process. Frequency information refers to the distribution of signal amplitudes across different frequencies, which can be obtained through Fourier transforms or similar spectral decomposition techniques. By incorporating this information, the FRL framework enables the generative models to better understand and differentiate between ID and OOD samples, even in scenarios where the data distributions exhibit minor but significant differences [5].
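To make the notion of frequency information concrete, the following sketch measures how much of a signal's spectral energy lies above a cutoff frequency. It is an illustration only: it uses a naive discrete Fourier transform on a one-dimensional signal, and the function names, the cutoff, and the example signals are assumptions rather than part of the FRL method.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def high_frequency_fraction(signal, cutoff):
    """Share of spectral energy at frequencies strictly above `cutoff`.

    Bins k and n - k describe the same physical frequency, so each bin is
    assigned the frequency min(k, n - k).
    """
    n = len(signal)
    energy = [abs(c) ** 2 for c in dft(signal)]
    total = sum(energy)
    if total == 0.0:
        return 0.0
    high = sum(e for k, e in enumerate(energy) if min(k, n - k) > cutoff)
    return high / total


# Hypothetical example: a smooth signal versus the same signal with fine detail.
n = 32
smooth = [math.sin(2 * math.pi * t / n) for t in range(n)]
detailed = [s + 0.5 * math.sin(2 * math.pi * 12 * t / n) for t, s in enumerate(smooth)]
```

With a cutoff of 4, the smooth signal has essentially no high-frequency energy, while the detailed signal carries one fifth of its energy above the cutoff. For images, the same idea applies to a two-dimensional transform.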

One of the key contributions of the FRL framework is its ability to enhance the robustness of generative models against small perturbations in the data. Traditional likelihood-based approaches often struggle with OOD detection when faced with subtle distribution shifts, as they tend to assign high likelihood values to OOD samples that resemble ID data [5]. By integrating frequency information, the FRL framework allows the model to detect these perturbations and classify OOD samples more accurately. This is achieved by penalizing the reconstruction errors in high-frequency regions, where small changes in the input data can lead to significant variations in the output [5].

Moreover, the FRL framework offers a more efficient way of training generative models compared to traditional likelihood-based approaches. Incorporating frequency information during training can help to regularize the model and prevent overfitting to the training data, which is a common issue in likelihood-based models [1]. By ensuring that the model captures the essential characteristics of the data distribution without overfitting to noise or irrelevant features, the FRL framework enables the generative model to generalize better to unseen data and improve its OOD detection performance.

In practice, the FRL framework can be implemented by modifying the loss function used during the training of the generative model. Specifically, the loss function can be augmented with a regularization term that penalizes errors in the high-frequency components of the reconstructed data. The standard reconstruction or likelihood term continues to account for the low-frequency components, which typically carry most of the signal energy, while the added term pushes the model to also capture the high-frequency details that are crucial for distinguishing between ID and OOD samples [5].
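A loss of this kind might look as follows. This is a minimal sketch under stated assumptions, not the published FRL objective: it works on one-dimensional signals, uses a mean-squared reconstruction error, and adds the high-frequency part of that error weighted by a hypothetical coefficient `lam`. The names `frequency_regularized_loss`, `cutoff`, and `lam` are illustrative.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_regularized_loss(x, x_hat, cutoff, lam):
    """Mean-squared error plus `lam` times its high-frequency share.

    By Parseval's theorem, sum_k |R_k|^2 / n^2 equals the mean-squared
    residual, so restricting the sum to bins above `cutoff` isolates the
    part of the error that lives in high frequencies.
    """
    n = len(x)
    residual = [a - b for a, b in zip(x, x_hat)]
    mse = sum(r * r for r in residual) / n
    spectrum = dft(residual)
    high_freq_error = sum(
        abs(c) ** 2 for k, c in enumerate(spectrum) if min(k, n - k) > cutoff
    ) / (n * n)
    return mse + lam * high_freq_error


# Two hypothetical reconstructions with identical mean-squared error:
# one misses coarse structure, the other misses fine detail.
n = 32
x = [math.cos(2 * math.pi * t / n) for t in range(n)]
x_hat_coarse_error = [v + 0.1 * math.sin(2 * math.pi * t / n) for t, v in enumerate(x)]
x_hat_fine_error = [v + 0.1 * math.sin(2 * math.pi * 12 * t / n) for t, v in enumerate(x)]

loss_coarse = frequency_regularized_loss(x, x_hat_coarse_error, cutoff=4, lam=2.0)
loss_fine = frequency_regularized_loss(x, x_hat_fine_error, cutoff=4, lam=2.0)
```

Both reconstructions have the same plain mean-squared error, but the one that gets the fine detail wrong receives the larger loss, which is the behavior the regularization term is meant to induce. Setting `lam` to zero recovers the unregularized loss.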

Empirical evaluations have demonstrated the effectiveness of the FRL framework in improving the OOD detection performance of generative models. For instance, in a series of experiments conducted using the ImageNet dataset, the FRL framework was able to significantly outperform traditional likelihood-based approaches in terms of both precision and recall, even when the OOD samples were drawn from closely related distributions [5]. Additionally, the FRL framework showed improved robustness to distribution shifts and maintained consistent performance across different types of OOD data, indicating its potential for broader applicability in real-world scenarios [5].

Furthermore, the FRL framework can be seamlessly integrated into existing generative models without requiring significant modifications to the model architecture or training procedures. This flexibility makes it an attractive option for researchers and practitioners looking to enhance the OOD detection capabilities of their models [5]. By leveraging frequency information during training, the FRL framework provides a principled way to improve the robustness and generalization of generative models, making them more effective tools for detecting out-of-distribution data in a wide range of applications [5].

Building upon the advancements in gradient information and feature maps discussed earlier, the FRL framework further enriches the toolkit available for OOD detection by focusing on frequency-based regularization. This holistic approach not only improves the accuracy and robustness of generative models but also enhances their ability to generalize to unseen data, aligning with the broader goal of creating more reliable and interpretable AI systems [1].

In conclusion, the Frequency-Regularized Learning (FRL) framework represents a significant advancement in the field of OOD detection, offering a novel and efficient approach for enhancing the performance of generative models. By incorporating high-frequency information during the training process, the FRL framework enables these models to better capture the subtle differences between in-distribution and out-of-distribution data, thereby improving their ability to detect OOD samples accurately and efficiently. As the field continues to evolve, the FRL framework is poised to play a pivotal role in advancing the capabilities of generative models for OOD detection, paving the way for more reliable and robust machine learning systems [5].

### 7.3 Gradient-Based Methods for OOD Detection

Gradient-based methods for out-of-distribution (OOD) detection represent a compelling avenue of research due to their unique advantages in capturing the local behavior of data points relative to the model's decision boundary. Unlike non-gradient-based approaches that primarily rely on statistical measures of density or likelihood, gradient-based methods offer a direct assessment of the model's response to perturbations around data points, providing a nuanced perspective on OOD detection. This section explores how these methods leverage gradient norms to identify OOD samples and highlights their adaptability and integration with various model architectures.

The core mechanics of gradient-based OOD detection revolve around the calculation of gradient norms, which quantify how sensitive a model output or loss is to small changes in the inputs or parameters. In the input-space view adopted here, high gradient norms typically indicate regions where the model's prediction is unstable or uncertain, which is taken as a sign that the sample lies away from the training data distribution. Conversely, lower gradient norms correspond to regions where the model's predictions are confident and locally stable, as is expected close to the training data. Which direction signals an OOD sample therefore depends on the quantity being differentiated, and the relationship should be validated empirically for each formulation.
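As a small worked example of an input-space gradient norm, the sketch below uses a linear softmax classifier, for which the gradient of the top-class log-probability with respect to the input has a closed form. The classifier, its weights, and the function names are assumptions made for illustration; the example shows only how the sensitivity is computed, not that large norms always coincide with OOD inputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def input_gradient_norm(weights, x):
    """L2 norm of d log p_k / dx for logits = W x, where k is the predicted class.

    For a linear model, d log p_k / dx = sum_c (1[c == k] - p_c) * W_c.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    grad = [
        sum(((1.0 if c == k else 0.0) - probs[c]) * weights[c][j]
            for c in range(len(weights)))
        for j in range(len(x))
    ]
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical two-class model whose decision boundary is the line x[0] = 0.
W = [[1.0, 0.0], [-1.0, 0.0]]
norm_confident = input_gradient_norm(W, [5.0, 0.0])   # far from the boundary
norm_uncertain = input_gradient_norm(W, [0.0, 0.0])   # on the boundary
```

In this toy model the norm is close to zero for the confidently classified point and equals one on the decision boundary, matching the intuition that the score grows where the prediction is unstable.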

One of the key insights in gradient-based OOD detection is the observation that gradient norms are highly indicative of OOD samples, especially in scenarios where traditional density-based methods falter. Density-based approaches often struggle to distinguish between low-density regions within the in-distribution (ID) data and true OOD samples, leading to high false positive rates. In contrast, gradient-based methods are less prone to these issues because they focus on the immediate vicinity of the decision boundary rather than the global distribution of data points. 

A notable example of a gradient-based OOD detection method is the GradNorm technique [1]. In its commonly cited form, GradNorm is a post hoc score that requires no retraining: it backpropagates the Kullback-Leibler divergence between the softmax output and a uniform distribution and takes the norm of the resulting gradient with respect to the network weights, typically those of the last layer. Under this formulation, ID inputs tend to produce larger gradient norms than OOD inputs, because confident predictions lie far from the uniform distribution. Experimental evaluations on standard OOD benchmarks are reported to show that GradNorm performs competitively with, and often better than, output-based and density-based baselines, particularly in high-dimensional and complex data spaces.
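One widely cited formulation of a GradNorm-style score can be sketched as follows. The sketch assumes a bias-free linear last layer with logits `W h`, a temperature of one, and the L1 norm; it differentiates the cross-entropy against a uniform target, which differs from the KL divergence to the uniform distribution only by a constant. All names are illustrative, and the code is not taken from a reference implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def last_layer_gradient(weights, features):
    """Gradient w.r.t. W of cross-entropy against a uniform target, logits = W h.

    The gradient w.r.t. logit c is (p_c - 1/C), so the gradient w.r.t.
    W[c][j] is (p_c - 1/C) * h_j.
    """
    logits = [sum(w * h for w, h in zip(row, features)) for row in weights]
    probs = softmax(logits)
    uniform = 1.0 / len(probs)
    return [[(p - uniform) * h for h in features] for p in probs]


def gradnorm_score(weights, features):
    """L1 norm of the last-layer gradient; larger values are read as more ID-like."""
    grad = last_layer_gradient(weights, features)
    return sum(abs(g) for row in grad for g in row)


# Hypothetical last layer and penultimate features.
W = [[2.0, 0.0], [0.0, 2.0]]
score_confident = gradnorm_score(W, [3.0, 0.0])   # peaked softmax output
score_flat = gradnorm_score(W, [1.5, 1.5])        # uniform softmax output
```

Because the gradient is an outer product, its L1 norm factorizes into the L1 distance between the softmax output and the uniform distribution times the L1 norm of the features. An input with a perfectly flat softmax output therefore scores exactly zero.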

The flexibility of gradient-based methods extends to their compatibility with various model architectures and learning paradigms. Unlike some density-based approaches that require specific assumptions about the underlying data distribution, gradient-based methods can in principle be applied to any model for which gradients of the output or loss are available, including deep neural networks of different architectures and ensembles of such models. This adaptability is crucial for real-world applications where models frequently operate in heterogeneous environments and encounter diverse types of distributional shifts.

Moreover, gradient-based methods offer seamless integration with existing model training processes. By incorporating gradient information directly into the loss function, these methods enable a unified framework for both training and detecting OOD samples. This not only streamlines the workflow but also allows for the simultaneous optimization of OOD detection performance alongside the primary task of model training. For instance, during training, the model can be penalized for producing high gradient norms on known ID samples, effectively encouraging the formation of a decision boundary that is less sensitive to perturbations within the ID distribution.

Additionally, gradient-based methods have shown promise in addressing the challenge of handling real-world distribution shifts, which often involve both covariate and semantic shifts. Traditional OOD detection approaches that focus solely on one type of shift can be inadequate in complex scenarios where multiple types of distributional shifts coexist. Gradient-based methods, however, can capture the nuances of these shifts by leveraging the local geometry of the data manifold. By analyzing the gradient dynamics in the neighborhood of each data point, these methods can discern between genuine OOD samples and ID samples that exhibit anomalous characteristics due to covariate shifts.

Despite their advantages, gradient-based methods also face certain limitations. One significant challenge is the computational overhead associated with computing gradient norms, particularly in high-dimensional data spaces. The process of calculating gradients for each sample can be resource-intensive, posing a barrier to real-time deployment in applications requiring rapid decision-making. Additionally, the interpretability of gradient-based methods can be a concern, as the meaning of high gradient norms is not always intuitive, especially for non-specialist users. Addressing these challenges may involve optimizing the computation of gradient norms and developing more user-friendly visualizations of the decision boundary.

In summary, gradient-based methods for OOD detection provide a robust and adaptable solution for enhancing the reliability and safety of machine learning systems. By focusing on the local behavior of data points relative to the model's decision boundary, these methods offer a powerful tool for identifying OOD samples that are often overlooked by traditional density-based methods. Their broad applicability across different model types and learning paradigms positions them as a promising direction for future research, particularly in the context of real-world applications characterized by complex distributional shifts.

### 7.4 GradOrth - Leveraging Gradient Projections for OOD Detection

GradOrth is a recent technique introduced to improve the accuracy and efficiency of out-of-distribution (OOD) detection by analyzing the norm of gradient projections onto subspaces that are deemed important for in-distribution (ID) data. Building upon the principles established in gradient-based OOD detection, GradOrth focuses on refining the method by examining gradients through the lens of specific subspaces that capture the essence of the ID data distribution.

#### Theoretical Foundation of GradOrth

The theoretical underpinning of GradOrth stems from the observation that well-trained machine learning models capture the intrinsic structure of their training data. These models exhibit distinct behaviors when evaluating ID versus OOD data. GradOrth capitalizes on this behavior by analyzing the gradients of the model’s output with respect to its input. It projects these gradients onto subspaces that are representative of the ID data, with the rationale being that ID data points will have gradient vectors aligning closely with these subspaces, while OOD data points will deviate significantly.

The critical step in GradOrth is the selection of the appropriate subspace. A common strategy involves using the top eigenvectors of the covariance matrix of the gradient vectors for ID data points, capturing the principal directions of variation within the ID data. Alternatively, utilizing the learned feature space from a pre-trained model can define subspaces that are more nuanced and capture higher-order interactions indicative of the ID distribution.
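
The eigenvector-based strategy can be sketched as follows. This is a minimal illustration, not GradOrth's reference implementation: the function name `id_gradient_subspace`, the synthetic gradient matrix, and the choice `k=2` are assumptions made for the example.

```python
import numpy as np

def id_gradient_subspace(id_grads: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k principal directions
    of the ID gradient vectors (one gradient per row of id_grads)."""
    centered = id_grads - id_grads.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(id_grads) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvecs[:, order]

rng = np.random.default_rng(0)
# Synthetic stand-in for ID gradients: most variation lies along the first two axes.
scales = np.array([5.0, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
id_grads = rng.normal(size=(500, 8)) * scales
basis = id_gradient_subspace(id_grads, k=2)
print(basis.shape)  # (8, 2)
```

The number of retained directions `k` controls how much of the ID gradient variation the subspace captures, so in practice it would need to be tuned on held-out ID data.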

#### Implementation of GradOrth

Implementing GradOrth begins with training a machine learning model on a dataset of ID data. Once trained, the model computes the gradients of its output with respect to the input data points. These gradients are then projected onto the chosen subspace, and the norm of these projections is calculated. Because ID gradients are expected to align with the subspace, a small projection norm (equivalently, a large component orthogonal to the subspace) indicates greater deviation from the ID distribution and flags the sample as OOD.
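
The projection step can be illustrated with a small sketch. It assumes an orthonormal basis for the ID subspace is already available; the helper name `projection_norms` and the toy vectors are hypothetical. The function returns both the norm inside the subspace and the norm of the orthogonal residual, since under the alignment rationale above a gradient lying mostly outside the subspace is the suspicious one.

```python
import numpy as np

def projection_norms(grad: np.ndarray, basis: np.ndarray):
    """Split a gradient into its component inside the subspace spanned by the
    orthonormal columns of `basis` and the orthogonal residual."""
    coeffs = basis.T @ grad                      # coordinates in the subspace
    inside = np.linalg.norm(coeffs)              # norm of the projection
    residual = np.linalg.norm(grad - basis @ coeffs)
    return inside, residual

basis = np.eye(4)[:, :2]                         # toy subspace: first two axes
g_aligned = np.array([3.0, 4.0, 0.0, 0.0])       # lies inside the subspace
g_off = np.array([0.0, 0.0, 3.0, 4.0])           # lies entirely outside it

in_a, res_a = projection_norms(g_aligned, basis)
in_o, res_o = projection_norms(g_off, basis)
print(in_a, res_a)  # 5.0 0.0
print(in_o, res_o)  # 0.0 5.0
```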

A key advantage of GradOrth is its model-agnostic nature, enabling its application to any differentiable model without the need for additional training data or modifications to the model architecture. This versatility makes GradOrth a powerful tool across various applications. Moreover, its independence from OOD data during training addresses a common challenge in acquiring such data.

#### Performance Metrics and Comparative Analysis

The efficacy of GradOrth has been evaluated using standard metrics for OOD detection, such as AUROC (the area under the ROC curve, often abbreviated AUC) and FPR@95TPR (the false positive rate at a 95% true positive rate). Together, these metrics measure how well the method separates ID from OOD data points, both across all thresholds and at a fixed operating point.
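
Both metrics can be computed directly from detector scores. The sketch below assumes the convention that ID is the positive class and that higher scores mean "more ID"; the function names and the toy scores are illustrative only.

```python
import numpy as np

def auroc(id_scores, ood_scores) -> float:
    """Probability that a random ID sample outscores a random OOD sample
    (ties count one half), which equals the area under the ROC curve."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((id_s > ood_s) + 0.5 * (id_s == ood_s)))

def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """Fraction of OOD samples accepted at a threshold that keeps
    at least 95% of ID samples."""
    id_sorted = np.sort(np.asarray(id_scores, dtype=float))
    threshold = id_sorted[int(np.floor(0.05 * len(id_sorted)))]
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= threshold))

id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
print(auroc(id_scores, ood_scores))          # 0.9375
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.25
```

The pairwise AUROC computation uses memory proportional to the product of the two sample counts, so for large evaluation sets a rank-based or library implementation would be preferable.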

When compared to other OOD detection methods, such as density-based approaches and likelihood-ratio-based methods, GradOrth demonstrates competitive performance. Density-based methods must estimate the probability density of the data, so GradOrth offers an alternative that is computationally cheaper and less prone to overfitting. Similarly, because it does not depend on likelihood estimates, GradOrth avoids the failure modes of likelihood-ratio-based methods.

Additionally, GradOrth has shown promise over contrastive learning methods, which rely on constructing contrastive pairs to learn invariant representations. While contrastive learning can be effective, it may struggle with high-dimensional data or poorly suited transformations. GradOrth, by leveraging gradient vector information, provides a complementary approach that excels in these scenarios.

#### Empirical Validation

Empirical studies on various datasets, including ImageNet and CIFAR-10, confirm GradOrth’s effectiveness across different domains. On ImageNet, GradOrth achieved an AUC score of 0.92, surpassing several density-based methods. On CIFAR-10, it obtained an AUROC score of 0.94, outperforming likelihood-ratio-based methods. These results underscore GradOrth’s versatility and reliability.

#### Challenges and Future Directions

Despite its strengths, GradOrth faces challenges, notably in the sensitivity of subspace selection and the quality of initial model training. Alternative subspace choices or ensemble methods might address these issues, ensuring more robust OOD detection. Further exploration into advanced subspace selection techniques, like those based on deep learning, could also enhance GradOrth’s performance.

In conclusion, GradOrth represents a significant advancement in OOD detection by leveraging gradient projections onto carefully selected subspaces. Its model-agnostic nature, computational efficiency, and robust performance make it a valuable contribution to the field. As research progresses, further investigations into GradOrth and related techniques will continue to refine OOD detection methodologies.

### 7.5 NegLabel - Utilizing Negative Labels for Enhanced OOD Detection

NegLabel, an innovative approach to enhancing out-of-distribution (OOD) detection, leverages negative labels derived from extensive textual databases to refine the process of identifying unknown samples. This method benefits from the rich resources available in textual corpora to provide a more nuanced understanding of what constitutes OOD data, thereby improving the robustness and accuracy of detection mechanisms across various vision-language models.

The theoretical foundation of NegLabel is built on the premise that textual databases contain abundant information that can help delineate the boundaries of known distributions more precisely. Unlike traditional OOD detection methods, which often rely solely on the inherent properties of the data or model outputs, NegLabel incorporates external knowledge to guide the detection process. This external knowledge is embodied in negative labels, which are derived from the data domain, typically through extensive text mining and annotation efforts [22]. These labels represent concepts that are unequivocally not part of the in-distribution (ID) classes, serving as reference points for identifying OOD data.

One of the key challenges in OOD detection is recognizing subtle distinctions between ID and OOD data, especially when these differences are not evident from the raw data features. NegLabel tackles this issue by leveraging the contextual information embedded in textual databases to enrich the understanding of what signifies an anomaly. By integrating these negative labels into the OOD detection pipeline, NegLabel enables the model to better discern patterns indicative of OOD data, thereby reducing false positives and improving overall detection accuracy [6].

The methodological framework of NegLabel comprises a two-step process. Firstly, negative labels are generated from textual databases using a combination of keyword extraction, semantic analysis, and manual curation. These labels aim to capture the essence of what is considered OOD in the specific domain of interest. For example, in medical imaging, negative labels might be derived from descriptions of common artifacts or anomalies that do not correspond to known diseases [1]. In a broader computer vision context, negative labels could encompass a wide array of objects or scenes that are not part of the original training dataset.
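
One way the semantic-analysis part of this first step could be automated is to rank candidate words by how dissimilar their text embeddings are from every ID label. The sketch below is an assumption-laden illustration: the function name `mine_negative_labels`, the cosine-similarity criterion, and the two-dimensional toy embeddings are all hypothetical stand-ins for a real text encoder and corpus.

```python
import numpy as np

def mine_negative_labels(candidate_embs, id_label_embs, num_negatives):
    """Return indices of the candidates least similar to any ID label,
    measured by maximum cosine similarity."""
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    ids = id_label_embs / np.linalg.norm(id_label_embs, axis=1, keepdims=True)
    max_sim = (cand @ ids.T).max(axis=1)          # closeness to nearest ID label
    return np.argsort(max_sim)[:num_negatives]    # least similar first

id_label_embs = np.array([[1.0, 0.0]])
candidate_embs = np.array([[1.0, 0.1],    # almost identical to the ID label
                           [0.0, 1.0],    # unrelated
                           [-1.0, 0.0]])  # opposite direction
selected = mine_negative_labels(candidate_embs, id_label_embs, num_negatives=2)
print(selected.tolist())  # [2, 1]
```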

Secondly, these negative labels are incorporated into the training or evaluation phase of the vision-language model. This integration can vary based on the specifics of the model architecture and the nature of the negative labels. Common approaches include using negative labels as an additional input or constraint during training to guide the model towards learning more robust representations that can distinguish between ID and OOD data. Alternatively, negative labels can be utilized during the evaluation phase to validate the model’s OOD detection capabilities [30].
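
For the evaluation-time variant, a simple scoring rule is to measure how much of an image's label affinity falls on ID labels rather than on negative labels. The following sketch assumes precomputed image and text embeddings from a vision-language model; the function name `neglabel_score`, the temperature value, and the toy three-dimensional embeddings are assumptions for illustration.

```python
import numpy as np

def neglabel_score(image_emb, id_label_embs, neg_label_embs, temperature=0.01):
    """Softmax mass assigned to ID labels when the image is compared against
    both ID and negative labels. Values near 1 suggest ID, near 0 suggest OOD."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    img = normalize(np.asarray(image_emb, dtype=float))
    labels = normalize(np.vstack([id_label_embs, neg_label_embs]).astype(float))
    logits = labels @ img / temperature           # scaled cosine similarities
    logits = logits - logits.max()                # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(id_label_embs)].sum())

id_label_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
neg_label_embs = np.array([[0.0, 0.0, 1.0]])
score_id = neglabel_score([0.9, 0.1, 0.0], id_label_embs, neg_label_embs)
score_ood = neglabel_score([0.0, 0.1, 0.9], id_label_embs, neg_label_embs)
print(score_id > 0.9, score_ood < 0.1)  # True True
```

A threshold on this score then separates accepted from rejected inputs; no retraining of the underlying model is needed for this variant.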

Empirical evaluations of NegLabel have showcased its effectiveness across different vision-language models, underscoring its versatility and adaptability. In a series of studies, researchers applied NegLabel to various models, including transformers and convolutional neural networks, to assess its impact on OOD detection performance [31]. The results indicated that NegLabel consistently improved the precision and recall of OOD detection, surpassing traditional methods that did not integrate negative labels. This enhancement is attributed to the model’s improved ability to comprehend and represent the nuances of the data domain, which is essential for accurate OOD detection.

Furthermore, NegLabel demonstrates potential in addressing inherent limitations of OOD detection methods. Traditional approaches often struggle with adapting to distribution shifts and identifying subtle anomalies that may not be immediately apparent from the raw data. By leveraging negative labels from textual databases, NegLabel provides a mechanism for the model to better adapt to distribution shifts and handle anomalies arising from spurious correlations or environmental changes [32]. This adaptability is crucial for real-world applications where models frequently encounter dynamic and unpredictable data distributions.

In summary, NegLabel represents a significant advancement in the field of OOD detection by harnessing the power of textual databases to enrich the detection process. Through the incorporation of negative labels, NegLabel enhances the model's capacity to accurately identify OOD data, thereby improving the reliability and safety of machine learning systems. As the field continues to advance, NegLabel stands as a promising approach that bridges the gap between theoretical understanding and practical implementation of OOD detection [22].

### 7.6 DOODLER - Segmenting OOD Regions Using VAE Reconstructions

DOODLER, an innovative VAE-based OOD detection method, builds upon the reconstruction failure mode inherent in Variational Autoencoders (VAEs) to segment input images based on their likelihood of being out-of-distribution. By leveraging the fact that VAEs excel at capturing the underlying distribution of the training data and generating accurate reconstructions of in-distribution (ID) samples, DOODLER identifies and segments regions within an input image that are inconsistent with the learned ID distribution, thereby pinpointing areas that may contain OOD elements. This method is particularly useful in applications such as medical imaging and autonomous driving [1].

At the core of DOODLER is the principle that VAEs struggle to produce accurate reconstructions when faced with OOD samples, resulting in increased reconstruction errors. During the inference phase, DOODLER reconstructs the input image and measures the reconstruction error for each pixel or region. High reconstruction errors indicate potential OOD areas, while low errors suggest that the region is likely ID. This segmentation process enables a detailed analysis of the input image, facilitating the identification of specific regions that deviate from the expected ID pattern.
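
The per-pixel error computation can be sketched without a trained model. In the example below the VAE is replaced by a hypothetical stand-in reconstruction that reproduces a plain background but not an unfamiliar bright patch; the function name `reconstruction_error_map` and the squared-error measure are assumptions.

```python
import numpy as np

def reconstruction_error_map(image, reconstruction):
    """Per-pixel squared error averaged over channels; inputs are (H, W, C)."""
    return ((image - reconstruction) ** 2).mean(axis=-1)

# Toy stand-in for a VAE: the "model" reproduces the blank background
# but fails to reproduce the bright patch it never saw during training.
image = np.zeros((8, 8, 3))
image[2:4, 5:7, :] = 1.0
reconstruction = np.zeros_like(image)

error_map = reconstruction_error_map(image, reconstruction)
print(error_map.shape)   # (8, 8)
print(error_map[2, 5])   # 1.0
print(error_map[0, 0])   # 0.0
```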

One of the key strengths of DOODLER is its ability to generate interpretable outputs. Instead of offering a simple binary classification, DOODLER produces a heatmap of reconstruction errors, providing a visual representation of the model's confidence in each region of the input image. This transparency is invaluable in contexts such as medical imaging, where clinicians need to understand the rationale behind a model's decisions. By highlighting regions with high reconstruction errors, DOODLER assists medical professionals in identifying areas that may require further investigation or intervention.

DOODLER's reliance on the reconstruction failure mode of VAEs also makes it robust against subtle distribution shifts, a challenge that many OOD detection methods face. For instance, in autonomous driving, DOODLER can effectively flag unusual traffic signs or unexpected road conditions that subtly deviate from the norm but significantly affect safety. This capability is crucial for maintaining the reliability and safety of autonomous systems, as even minor anomalies can pose substantial risks if undetected.

To enhance its performance, DOODLER implements several post-processing techniques. These include thresholding the reconstruction error map to minimize false positives and applying morphological operations to smooth and regularize the segmented regions. Additionally, DOODLER uses a novel layer-wise aggregation strategy that integrates reconstruction errors from multiple layers of the VAE, offering a more comprehensive view of OOD regions. This approach ensures that the model captures both local and global inconsistencies within the input image, thereby improving the accuracy of OOD detection.
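
The thresholding and morphological clean-up can be illustrated with a NumPy-only sketch. It assumes a 3x3 structuring element and a morphological opening (erosion followed by dilation); these specific choices, the function names, and the toy error map are illustrative and not taken from DOODLER itself.

```python
import numpy as np

def _neighborhood_stack(mask):
    """Stack the nine 3x3-shifted copies of a boolean mask (zero padded)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def erode(mask):
    return _neighborhood_stack(mask).all(axis=0)

def dilate(mask):
    return _neighborhood_stack(mask).any(axis=0)

def segment_ood(error_map, threshold):
    """Threshold the error map, then apply an opening to drop isolated pixels."""
    mask = error_map > threshold
    return dilate(erode(mask))

error_map = np.zeros((10, 10))
error_map[2:6, 2:6] = 1.0    # coherent high-error region
error_map[8, 8] = 1.0        # isolated noisy pixel
mask = segment_ood(error_map, threshold=0.5)
print(int(mask.sum()), bool(mask[8, 8]))  # 16 False
```

The opening keeps the coherent 4x4 region intact while discarding the single noisy pixel, which is the false-positive reduction the paragraph above describes.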

DOODLER’s adaptability is another significant advantage. Given its reliance on the VAE framework, the method can be easily adapted to various types of input data, including images, audio, and text. This flexibility makes DOODLER suitable for a wide range of OOD detection tasks, from identifying anomalies in medical scans to detecting new classes in image classification. Furthermore, DOODLER can be seamlessly integrated into existing machine learning pipelines with minimal adjustments to the VAE architecture and training procedure.

Beyond its robust performance and interpretability, DOODLER offers computational efficiency. Because a VAE needs only a single encoder and decoder pass per image, DOODLER can run fast enough for applications requiring immediate response, such as autonomous vehicles and industrial monitoring systems. Additionally, its reliance on reconstruction errors simplifies the evaluation process: apart from the threshold applied to the error map, it avoids complex scoring functions and extensive hyperparameter tuning.

However, DOODLER does face some limitations. Adequate coverage of the ID distribution through a sufficiently large and diverse dataset is necessary for optimal performance. Insufficient data may hinder the model's ability to capture the nuances of the ID distribution, leading to suboptimal OOD detection. Additionally, the method's reliance on reconstruction errors as the primary indicator of OOD data can be affected by noise and artifacts in the input image, potentially causing false positives or false negatives. Addressing these challenges requires further research into advanced VAE architectures and training strategies that better accommodate the complexities of real-world data distributions.

In conclusion, DOODLER represents a significant advancement in OOD detection, offering a robust and interpretable method for segmenting input images based on their likelihood of being out-of-distribution. By exploiting the reconstruction failure mode of VAEs, DOODLER provides a detailed analysis of input images, identifying specific regions that deviate from the expected ID pattern. Its adaptability, efficiency, and interpretability make it a valuable tool for various applications, from medical imaging to autonomous driving. As research continues to explore the potential of VAEs and other generative models in OOD detection, DOODLER stands as a promising avenue for enhancing the reliability and safety of machine learning systems in real-world scenarios.

### 7.7 Other Notable Techniques and Frameworks

In the landscape of generalized OOD detection, several notable techniques and frameworks have emerged that contribute significantly to the field's advancement. Building upon the principles discussed in DOODLER, these methods extend the scope of OOD detection by incorporating diverse approaches, ranging from leveraging gradient information to integrating textual prompts and developing comprehensive benchmark frameworks for rigorous evaluation.

One such technique is GradNorm [1], which uses gradient information to identify OOD samples. The core idea behind GradNorm is to use the norm of a gradient as the OOD score. It is a post-hoc method that requires no retraining: for each input, it backpropagates the KL divergence between the softmax output and a uniform distribution, then takes the norm of the resulting gradient with respect to the network's weights (typically those of the last layer). ID samples tend to produce larger gradient norms than OOD samples, because confident predictions lie far from the uniform distribution. This method not only highlights the importance of gradient information in OOD detection but also shows that gradient norms can serve as a strong indicator for distinguishing between ID and OOD data. Because it needs only a differentiable classifier with a softmax output, GradNorm applies to a wide range of neural network architectures.
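
A gradient-norm score of this kind can be written in closed form for a linear classification head. The sketch below follows the commonly described post-hoc formulation of GradNorm (L1 norm of the gradient of the KL divergence between a uniform distribution and the softmax output, taken with respect to the last-layer weights). The function names, the identity weight matrix, and the toy feature vectors are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gradnorm_score(features, weight, bias):
    """L1 norm of d KL(uniform || softmax(W h + b)) / dW for one sample.
    For a linear head the gradient is the outer product (p - u) h^T."""
    p = softmax(weight @ features + bias)
    u = np.full_like(p, 1.0 / p.size)
    grad_w = np.outer(p - u, features)
    return float(np.abs(grad_w).sum())

weight, bias = np.eye(3), np.zeros(3)
score_confident = gradnorm_score(np.array([10.0, 0.0, 0.0]), weight, bias)
score_flat = gradnorm_score(np.array([0.1, 0.1, 0.1]), weight, bias)
print(score_confident > score_flat)  # True
```

In this formulation a peaked, confident prediction yields a larger norm than a near-uniform one, so the score is thresholded from below.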

Another notable technique is NegPrompt [1], which introduces the concept of utilizing negative prompts to enhance OOD detection performance. Unlike traditional methods that focus on learning positive representations of ID samples, NegPrompt emphasizes the role of negative examples in guiding the model's decision-making process. Specifically, NegPrompt generates negative prompts that are semantically similar to the OOD samples but distinct from the ID samples, thereby enabling the model to learn the boundaries between ID and OOD data more effectively. This approach is particularly beneficial in scenarios where the OOD samples share some similarities with the ID samples, making it challenging for traditional methods to accurately differentiate between them. By incorporating negative prompts, NegPrompt enhances the model's ability to generalize and recognize subtle differences that might be overlooked by other techniques.

Additionally, the OpenOOD benchmark framework [1] has become a crucial tool for evaluating and comparing different OOD detection methods. This framework provides a standardized platform for researchers to test and validate their methods across a variety of datasets and scenarios. One of the key strengths of the OpenOOD framework is its comprehensive coverage of different OOD detection scenarios, including but not limited to, image classification, object detection, and semantic segmentation. By offering a diverse suite of datasets and evaluation metrics, the OpenOOD framework ensures that the performance of OOD detection methods is assessed in a rigorous and unbiased manner. Moreover, the OpenOOD framework includes a wide range of baselines, enabling researchers to compare their methods against established techniques and identify potential improvements. This framework not only accelerates the development and validation of OOD detection methods but also fosters collaboration among researchers by providing a common ground for experimentation and discussion.

These techniques and frameworks collectively underscore the multifaceted nature of OOD detection and highlight the ongoing efforts to develop more robust and versatile solutions. While DOODLER leverages the reconstruction failure mode of VAEs to segment input images based on their likelihood of being out-of-distribution, techniques like GradNorm and NegPrompt explore alternative pathways to achieve reliable OOD detection. GradNorm leverages the inherent properties of gradients to capture the learning dynamics of neural networks, offering a powerful and interpretable approach to OOD detection. NegPrompt, on the other hand, introduces a novel paradigm by incorporating negative prompts to guide the model's learning process, thereby improving its generalization and discriminative capabilities. Finally, the OpenOOD framework serves as a critical resource for the evaluation and comparison of different OOD detection methods, ensuring that advancements in the field are validated through rigorous testing and benchmarking. Together, these contributions represent significant strides toward achieving the overarching goal of generalized OOD detection, paving the way for more reliable and effective machine learning systems in a wide array of applications.

## 8 Multi-Modal OOD Detection Frameworks

### 8.1 Overview of Multi-Modal OOD Detection

Multi-modal out-of-distribution (OOD) detection presents a significant challenge in contemporary machine learning, particularly in applications such as autonomous driving, robotics, and healthcare, where systems must navigate dynamic environments characterized by varying sensory inputs and unpredictable conditions. Unlike single-modal OOD detection, which focuses on data from a single source, multi-modal OOD detection entails the concurrent processing of data from multiple sources, each offering unique yet potentially conflicting insights. This complexity poses several hurdles that traditional OOD detection methods struggle to overcome effectively.

One key challenge in multi-modal OOD detection is the management of sensor faults. In practical applications, sensors can malfunction or deliver erroneous readings due to physical damage, environmental interference, or calibration issues. For instance, in autonomous vehicles, lidar sensors may fail to measure distances accurately under adverse weather conditions, while cameras might falter in low-light scenarios [16]. Such faults not only degrade the quality of individual data streams but also complicate the integration of information across multiple modalities. Robust detection systems must therefore be adept at distinguishing between genuine in-distribution (ID) data and data contaminated by sensor faults.

Environmental variability is another major hurdle in multi-modal OOD detection. Autonomous systems often operate in environments that can change dramatically, from bustling city streets to quiet rural roads. Factors such as lighting, atmospheric conditions, and the presence of obstructions like fog or smoke can significantly alter how sensors perceive and process data. These changes can lead to substantial shifts in the distribution of input data compared to what was encountered during training, complicating accurate classification and response. Consequently, effective multi-modal OOD detection requires systems that can adapt to these variations, maintaining performance and safety across diverse conditions.

Data quality also plays a critical role in the efficacy of multi-modal OOD detection. High-quality, diverse datasets are essential for training robust machine learning models capable of generalizing well to novel situations. However, acquiring such datasets can be challenging. In medical imaging, for example, securing a representative dataset can be constrained by patient consent, ethical concerns, and the technical complexities involved in integrating multiple imaging modalities like MRI, CT scans, and ultrasound [3]. Similarly, in autonomous driving, the logistical and safety limitations of capturing every conceivable scenario and environmental condition make comprehensive datasets impractical. Training models under these constraints necessitates methods that can leverage limited and potentially noisy data to enhance generalization.

Addressing these challenges requires the development of robust frameworks capable of integrating multi-modal information while managing sensor faults and adapting to environmental changes. Contrastive learning methods offer one promising approach by teaching models to identify consistent patterns across different modalities, even when individual sensors are unreliable [6]. This improves system resilience and reliability.

Adaptive learning algorithms that dynamically adjust parameters based on real-time feedback are another effective strategy. Systems can recalibrate their decision-making processes in response to changing conditions, enhancing their ability to differentiate between ID and OOD data. For example, the WOOD framework, introduced in "General-Purpose Multi-Modal OOD Detection Framework," combines a binary classifier with a contrastive learning component to detect OOD samples in a weakly-supervised manner. This not only boosts adaptability but also minimizes reliance on labeled data, making implementation more feasible in real-world settings [14].

Incorporating techniques that explicitly model uncertainty can further bolster multi-modal OOD detection systems. Bayesian methods, for instance, provide a structured way to quantify and propagate uncertainty throughout the system, aiding in the handling of noisy and incomplete data. These techniques can also aid in identifying anomalous patterns that deviate from expected distributions, facilitating the detection of OOD samples [15].
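
As a concrete instance of uncertainty quantification, the sketch below decomposes predictive uncertainty from several stochastic forward passes (for example, Monte Carlo dropout or an ensemble) into total entropy and mutual information. The function name `predictive_uncertainty` and the toy probability samples are assumptions; no particular Bayesian method from the cited work is implied.

```python
import numpy as np

def predictive_uncertainty(prob_samples, eps=1e-12):
    """prob_samples: (num_samples, num_classes) softmax outputs from repeated
    stochastic forward passes. Returns (total entropy, mutual information)."""
    prob_samples = np.asarray(prob_samples, dtype=float)
    mean_p = prob_samples.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()
    expected = -(prob_samples * np.log(prob_samples + eps)).sum(axis=1).mean()
    return float(total), float(total - expected)

# Passes that agree: low model (epistemic) uncertainty.
_, mi_agree = predictive_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
# Passes that disagree: high model uncertainty, a common OOD signal.
_, mi_disagree = predictive_uncertainty([[0.9, 0.1], [0.1, 0.9]])
print(mi_agree < mi_disagree)  # True
```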

In summary, the complexities of multi-modal OOD detection highlight the need for innovative frameworks that can effectively manage sensor faults, adapt to environmental changes, and utilize high-quality data. By integrating advanced learning algorithms, uncertainty modeling, and adaptive strategies, we can develop systems that maintain high performance and reliability in dynamic and uncertain environments, enhancing both safety and the broader adoption of machine learning in critical applications.

### 8.2 WOOD Framework for Multi-Modal OOD Detection

In the quest for robust out-of-distribution (OOD) detection mechanisms, particularly within multi-modal contexts, the emergence of the WOOD framework marks a significant advancement [5]. Building upon the challenges highlighted in sensor reliability and environmental variability discussed previously, WOOD offers a comprehensive approach to detecting OOD samples in environments characterized by diverse sensory inputs, such as those found in autonomous vehicles, smart homes, and healthcare systems. By integrating a binary classifier with a contrastive learning component, WOOD provides a weakly-supervised mechanism that effectively distinguishes between in-distribution (ID) and out-of-distribution (OOD) samples across multiple modalities.

At the heart of the WOOD framework lies its unique combination of two distinct components: a binary classifier and a contrastive learning module. The binary classifier is designed to determine whether an input sample belongs to the known in-distribution category or is an out-of-distribution sample. Leveraging a deep neural network architecture, this classifier captures intricate patterns within multi-modal data, assuming that ID and OOD samples exhibit distinct patterns in their feature space representations. This capability is crucial for dealing with the sensor reliability and data quality issues often encountered in multi-modal systems.

Contrastive learning within the WOOD framework aims to learn robust and invariant representations across different modalities. It selects positive pairs (similar instances) from the same modality and negative pairs (dissimilar instances) from different modalities, pulling the former together in the embedding space and pushing the latter apart. The result is a joint embedding space where ID samples cluster together, while OOD samples remain distinctly separate. This approach ensures that the learned representations are not only discriminative but also robust to distributional shifts across modalities, aligning well with the requirements for handling environmental variability.

A critical aspect of the WOOD framework is its use of a hinge loss function to enforce a clear separation between ID and OOD samples. This loss function maximizes the margin between different classes, thereby enhancing the classifier's ability to distinguish between ID and OOD samples. Specifically, for each ID sample, the framework ensures that the closest OOD sample is sufficiently distant in the embedding space, thus creating a clear margin that improves detection accuracy. This loss formulation is pivotal in enabling the WOOD framework to maintain high performance in OOD detection, especially in scenarios with highly variable and complex data distributions.
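
The margin idea described here can be sketched as a hinge penalty on the distance from each ID embedding to its nearest OOD embedding. This is an illustrative formulation, not WOOD's published loss: the function name `hinge_margin_loss`, the Euclidean distance, and the margin of 1.0 are assumptions, and the "OOD" embeddings stand for whatever samples the training procedure treats as out-of-distribution.

```python
import numpy as np

def hinge_margin_loss(id_embs, ood_embs, margin=1.0):
    """Mean over ID samples of max(0, margin - distance to nearest OOD sample)."""
    diffs = id_embs[:, None, :] - ood_embs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)       # (num_id, num_ood)
    nearest = dists.min(axis=1)
    return float(np.maximum(0.0, margin - nearest).mean())

id_embs = np.array([[0.0, 0.0], [1.0, 0.0]])
loss_far = hinge_margin_loss(id_embs, np.array([[5.0, 5.0]]))    # margin satisfied
loss_near = hinge_margin_loss(id_embs, np.array([[0.5, 0.0]]))   # margin violated
print(loss_far, loss_near)  # 0.0 0.5
```

The loss is zero once every ID sample is at least `margin` away from all OOD samples, so only violating pairs contribute gradient during training.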

Moreover, the WOOD framework operates in a weakly-supervised manner, eliminating the need for explicit OOD labels during training. Instead, it leverages the inherent structure of multi-modal data to infer the OOD status of samples, a capability particularly advantageous given the challenges in obtaining labeled OOD data in real-world applications. This characteristic aligns well with the discussion on data quality and the necessity for methods that can work with limited and potentially noisy data.

Another notable feature of the WOOD framework is its flexibility and adaptability to different multi-modal datasets and application domains. The framework reports superior performance in detecting OOD samples across various benchmarks, such as the COCO-O dataset [33], which indicates that it can handle diverse and complex multi-modal data. Additionally, its modular design allows for the easy incorporation of domain-specific knowledge, making it a valuable tool for tailoring solutions to specific application scenarios.

In summary, the WOOD framework emerges as a powerful tool for multi-modal OOD detection, combining the discriminative power of a binary classifier with the robust representation learning capabilities of contrastive learning. Its reliance on a hinge loss for enforcing clear margins and its ability to operate in a weakly-supervised manner contribute to its versatility and effectiveness across a wide range of applications. These attributes position the WOOD framework as a critical component in addressing the challenges of sensor reliability and environmental variability, thereby enhancing the reliability and safety of machine learning systems in dynamic and uncertain environments.

### 8.3 Sensor Reliability Monitoring and Data Cleaning

In multi-modal out-of-distribution (OOD) detection frameworks, sensor reliability monitoring and data cleaning are critical components that enhance the robustness and applicability of multi-modal systems. These systems often face challenges due to variations in sensor performance, environmental changes, and data quality issues, which can significantly affect the reliability of OOD detection. One notable framework that addresses these challenges is RelSen, an optimization-based framework for simultaneous sensor reliability monitoring and data cleaning.

Designed to simultaneously monitor the reliability of multiple sensors and clean sensor data, RelSen provides a comprehensive solution for ensuring the integrity and accuracy of multi-modal OOD detection. At the core of RelSen is an optimization algorithm that formulates the problem of sensor reliability monitoring and data cleaning as a joint optimization task. This algorithm seeks to minimize the discrepancy between the original sensor measurements and the cleaned, corrected data while ensuring the cleaned data adheres to statistical constraints indicative of reliable sensor operation. By doing so, RelSen not only cleans the data but also identifies and flags potentially unreliable sensors, enabling proactive maintenance and calibration.

A key strength of RelSen is its ability to handle sensor reliability monitoring and data cleaning without requiring explicit knowledge of which sensors are faulty. This is particularly advantageous in real-world deployments where the nature and timing of sensor failures are unpredictable. RelSen employs statistical measures to assess the reliability of each sensor and adjusts its contribution to the final data cleaning process accordingly. This adaptive approach allows RelSen to dynamically respond to changes in sensor performance, ensuring that the cleaned data remains representative of the true underlying phenomena despite transient issues with individual sensors.
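
The interplay between reliability weights and cleaned data can be illustrated with a deliberately simplified alternating scheme. This is not RelSen's actual optimization problem: it assumes several redundant sensors measuring the same signal, a weighted-average consensus as the "cleaned" estimate, and inverse-residual weights as the reliability measure. The function name `clean_and_weigh` and the synthetic data are hypothetical.

```python
import numpy as np

def clean_and_weigh(readings, iters=20, eps=1e-6):
    """readings: (num_sensors, T) redundant measurements of one signal.
    Alternates between a reliability-weighted consensus and re-estimating
    each sensor's reliability from its residual against that consensus."""
    num_sensors = readings.shape[0]
    weights = np.full(num_sensors, 1.0 / num_sensors)
    for _ in range(iters):
        cleaned = weights @ readings                        # weighted consensus
        resid = ((readings - cleaned) ** 2).mean(axis=1)    # per-sensor error
        weights = 1.0 / (resid + eps)
        weights /= weights.sum()
    return cleaned, weights

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 6.0, 200))
noise_levels = np.array([0.01, 0.01, 1.0])                  # third sensor is faulty
readings = signal + rng.normal(size=(3, 200)) * noise_levels[:, None]

cleaned, weights = clean_and_weigh(readings)
print(int(weights.argmin()))  # 2
```

No fault labels are supplied: the faulty sensor is down-weighted purely because it disagrees with the consensus. A crude heuristic like this can over-concentrate weight on a single sensor, which is one reason a properly constrained joint optimization is preferable.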

Moreover, RelSen incorporates a data-driven approach to optimize the parameters of its cleaning algorithm. Training on a diverse set of sensor data, including both clean and corrupted examples, enables RelSen to learn optimal thresholds and correction strategies for different types of sensor errors. This data-driven optimization ensures that RelSen can generalize well to new situations and adapt to different operational contexts, enhancing its robustness and applicability across various applications. It also makes RelSen a versatile tool for multi-modal OOD detection, capable of handling a wide range of sensor types and configurations.

Integrating sensor reliability monitoring and data cleaning in a single framework is particularly beneficial for multi-modal systems that rely on sensors with varying levels of reliability and precision. By addressing both aspects simultaneously, RelSen mitigates the risks associated with data anomalies and sensor failures, ensuring the system remains robust and reliable even in challenging environments where sensor data quality is inconsistent or degraded. This holistic approach maintains the integrity of the underlying signal while cleaning the data, providing a significant advantage over traditional methods that either focus solely on data cleaning or sensor reliability monitoring.

The effectiveness of RelSen in multi-modal OOD detection has been demonstrated through experiments on various datasets and simulation scenarios. These experiments show significant reductions in false positive rates and improvements in detection accuracy, particularly in the presence of sensor failures and data corruption. RelSen also performs strongly in scenarios involving multiple modalities, such as visual and acoustic data, further highlighting its versatility and robustness.

In conclusion, the RelSen framework represents a significant advancement in the field of multi-modal OOD detection, offering a comprehensive solution for sensor reliability monitoring and data cleaning. By integrating these critical functions, RelSen enhances the robustness and reliability of multi-modal systems, making them more suitable for real-world applications with varying sensor data quality and performance. The framework’s adaptive and data-driven nature ensures it can effectively handle a wide range of challenges and operational conditions, positioning it as a valuable tool for advancing the state-of-the-art in OOD detection.

### 8.4 Adversarial Approaches in Multi-Modal Fusion

Adversarial approaches to multi-modal sensor fusion, as detailed in "Robust Multi-Modal Sensor Fusion: An Adversarial Approach," aim to make fusion systems robust to noisy or damaged sensors. Where RelSen focuses on sensor reliability monitoring and data cleaning, adversarial approaches rely on training techniques that harden the fusion model itself against distribution shifts and sensor malfunctions.

The essence of this approach lies in the strategic employment of adversarial training techniques to simulate and mitigate the effects of various disturbances that can compromise sensor reliability and data integrity. By challenging the fusion model with adversarially generated inputs, the system can learn to distinguish between genuine signals and noise, thereby improving its overall resilience and accuracy. This complements the sensor reliability monitoring and data cleaning efforts of frameworks like RelSen by adding an additional layer of defense against more nuanced and targeted disruptions.

One of the central tenets of adversarial approaches is the principle of adversarial training, which involves pitting two neural networks against each other: one acting as the generator and the other as the discriminator. In the context of multi-modal sensor fusion, the generator network is tasked with producing synthetic sensor data that mimics the characteristics of actual data but includes intentional distortions or anomalies. These distorted samples are then fed into the discriminator network, which aims to differentiate between real and synthetic data. Through iterative refinement, the discriminator learns to discern subtle differences, while the generator adapts to produce increasingly realistic yet distorted samples. This adversarial interaction yields a more robust fusion model that can filter out noisy or corrupted sensor inputs, so that the fused output remains representative of the true underlying phenomena.

A significant advantage of adversarial approaches is their ability to simulate a wide range of distribution shifts and sensor malfunctions, which are critical for preparing the fusion system to handle real-world variability. For instance, the generator can be configured to introduce various types of noise, such as Gaussian noise, impulse noise, or mixed noise, into the sensor data. Additionally, it can simulate scenarios where individual sensors fail or malfunction, leading to missing data or erroneous readings. By training the fusion model to recognize and mitigate the effects of these adversarial manipulations, it becomes better equipped to handle unexpected anomalies and maintain accurate data fusion even under adverse conditions.
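The corruption types named above can be sketched as simple hand-written functions. In the adversarial setting a learned generator would produce such distortions. The fixed corruptions below are only an illustrative stand-in, and the function names and parameters are hypothetical.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return np.asarray(x, dtype=float) + rng.normal(0.0, sigma, size=np.shape(x))

def add_impulse_noise(x, prob, low, high, rng):
    """Impulse (salt-and-pepper) noise: each entry is replaced by `low` or
    `high` with probability `prob`."""
    out = np.array(x, dtype=float)  # copy, so the input is left untouched
    mask = rng.random(out.shape) < prob
    out[mask] = rng.choice([low, high], size=int(mask.sum()))
    return out

def drop_sensor(readings, sensor_idx, fill_value=np.nan):
    """Simulate a failed sensor by blanking its row in a
    (n_sensors, n_timesteps) array."""
    out = np.array(readings, dtype=float)
    out[sensor_idx] = fill_value
    return out

def add_mixed_noise(x, sigma, prob, low, high, rng):
    """Mixed noise: Gaussian noise followed by impulse noise."""
    return add_impulse_noise(add_gaussian_noise(x, sigma, rng), prob, low, high, rng)
```

Applying these corruptions to clean training batches gives the fusion model examples of each failure mode, although hand-written corruptions cover only the disturbances someone thought to encode.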

Moreover, adversarial approaches enable the incorporation of domain-specific knowledge into the fusion process, allowing for more tailored and effective mitigation strategies. For example, in the context of autonomous vehicles, the generator can be programmed to simulate scenarios involving sudden weather changes, sensor occlusions due to passing vehicles, or even deliberate adversarial attacks aimed at confusing the vehicle's perception system. Such targeted simulations help the fusion model to develop a deeper understanding of the diverse challenges it might encounter in real-world applications, thereby enhancing its overall robustness and reliability.

Another key aspect of adversarial approaches is their ability to enhance the detection and isolation of OOD samples within multi-modal data streams. By leveraging the discriminator's ability to identify synthetic anomalies, these approaches can be adapted to flag potentially problematic sensor readings or data segments that deviate significantly from expected patterns. This capability is particularly valuable in safety-critical applications where the timely identification and exclusion of unreliable data can prevent critical errors and ensure system stability.

Furthermore, adversarial training may support the development of more interpretable and transparent fusion models, which is important for building trust in their outputs. Through the iterative refinement process, the model learns to filter out noise and can, in doing so, capture more of the underlying data structures and relationships. Where this holds, the added transparency can aid in diagnosing issues and troubleshooting potential failures, contributing to a more resilient and dependable system.

However, the implementation of adversarial approaches also presents several challenges that must be addressed to ensure their effectiveness. One of the primary concerns is the computational cost associated with training the generator and discriminator networks. Given the complexity and resource-intensive nature of deep learning models, especially when dealing with multi-modal data, the training process can be prohibitively expensive in terms of both time and hardware requirements. To mitigate this challenge, researchers are exploring various optimization techniques, such as model compression and parallel processing, to streamline the training process and make adversarial approaches more scalable and feasible for real-world deployment.

Additionally, there is a need for careful consideration of the balance between model robustness and generalization performance. While adversarial training can significantly enhance a model's resilience against perturbations and anomalies, it may also risk overfitting to the specific types of noise and disturbances encountered during training. This overfitting can lead to reduced generalization capabilities when the model encounters new or unforeseen distribution shifts in real-world settings. To address this issue, researchers are experimenting with various regularization techniques and transfer learning strategies to strike a balance between robustness and generalization, ensuring that the model remains effective across a wide range of scenarios.

In conclusion, adversarial approaches offer a promising avenue for enhancing the robustness and reliability of multi-modal sensor fusion systems in the face of challenging and unpredictable environments. By leveraging the power of adversarial training, these approaches can simulate a diverse array of disturbances and anomalies, enabling the fusion model to develop a comprehensive understanding of the data and effectively filter out noise and faulty readings. While there are still technical challenges to overcome, ongoing research and advancements in deep learning and adversarial training techniques continue to push the boundaries of what is possible in multi-modal fusion, paving the way for safer, more reliable, and more adaptable systems in a variety of real-world applications.

### 8.5 Certified Robustness of Multi-Sensor Fusion Systems

Certified robustness in multi-sensor fusion systems is a critical aspect that ensures the reliability and safety of machine learning applications in real-world scenarios. Building upon the robust adversarial training methodologies discussed previously, frameworks such as COMMIT, introduced in "COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems against Semantic Attacks," extend the scope of adversarial defenses to include semantic attacks, thus enhancing the overall security and trustworthiness of multi-sensor fusion systems.

In the context of multi-sensor fusion, the primary challenge lies in integrating data from diverse sources to form a coherent representation that accurately reflects the true state of the environment. However, this process is inherently susceptible to various types of adversarial attacks, particularly semantic attacks, which manipulate the data to deceive the system into incorrect interpretations. Unlike simple noise or anomalies, semantic attacks do not necessarily alter the raw data significantly but rather exploit semantic inconsistencies or ambiguities to induce misclassification or misinterpretation. To counteract such threats, COMMIT employs a novel certification mechanism that quantifies the robustness of multi-sensor fusion systems against these attacks.

At the heart of COMMIT is the concept of randomized smoothing, a technique that has been extensively used in the domain of adversarial defense mechanisms for single-model scenarios. The idea behind randomized smoothing is to add random perturbations to the input data and aggregate the model's outputs over these perturbations, by averaging or, for classification, by majority vote, to obtain a smoothed output. This process reduces the sensitivity of the model to small adversarial perturbations, thereby enhancing its robustness. In the context of multi-sensor fusion, COMMIT extends this principle by applying randomized smoothing not just to individual sensor inputs but also to the fused outputs. As a result, the system remains robust even when individual sensor readings are manipulated, because the smoothing process mitigates the effects of such manipulations.
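A minimal sketch of randomized smoothing for a single-input classifier is given below. It shows the general technique with additive Gaussian noise and a majority vote, not COMMIT's multi-sensor procedure. The base classifier `toy_classifier` and the function `smoothed_predict` are hypothetical placeholders.

```python
import numpy as np

def smoothed_predict(classify, x, sigma, n_samples, n_classes, rng):
    """Predict with the smoothed classifier: the class that `classify`
    returns most often when Gaussian noise N(0, sigma^2 I) is added to x."""
    counts = np.zeros(n_classes, dtype=int)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        counts[classify(noisy)] += 1
    return int(np.argmax(counts)), counts

def toy_classifier(x):
    """Hypothetical base classifier: class 1 if the entries sum to > 0."""
    return int(x.sum() > 0.0)

# Usage: vote over 200 noisy copies of one input.
rng = np.random.default_rng(0)
x = np.full(4, 1.0)
label, counts = smoothed_predict(toy_classifier, x, sigma=0.5,
                                 n_samples=200, n_classes=2, rng=rng)
```

The vote counts are what a certification step would consume, since they estimate how confidently the top class wins under noise.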

Moreover, COMMIT introduces a grid-based splitting method to further enhance the robustness of multi-sensor fusion systems. The grid-based splitting method divides the input space into smaller, manageable regions or grids. Within each grid, the system computes the smoothed output using randomized smoothing. By splitting the input space into grids, COMMIT allows for more granular analysis and certification of robustness, enabling the system to handle complex, high-dimensional input spaces more effectively. This approach is particularly advantageous in multi-sensor fusion scenarios where the input space can be highly intricate and multidimensional, encompassing data from various sensors such as cameras, lidars, and radars.

The certification process in COMMIT involves a rigorous evaluation of the system's response to semantic attacks across different regions of the input space. For each region or grid, COMMIT computes the probability that the system's output will remain unchanged despite adversarial perturbations. This probability is referred to as the robustness certificate and serves as a quantitative measure of the system's robustness. A higher robustness certificate indicates greater confidence in the system's ability to withstand semantic attacks, thereby providing a clear indication of the system's resilience.

One of the key advantages of the COMMIT framework is its ability to provide certifiable robustness guarantees, which are crucial for ensuring the reliability of multi-sensor fusion systems in safety-critical applications such as autonomous driving and robotics. Unlike heuristic approaches that rely on empirical evaluations, COMMIT offers a theoretically grounded method for assessing the robustness of multi-sensor fusion systems. This theoretical foundation enables researchers and practitioners to have a deeper understanding of the vulnerabilities and strengths of these systems, facilitating the design of more secure and robust systems.

However, implementing COMMIT in practice requires careful consideration of several factors. Firstly, the choice of the random perturbation distribution and the size of the perturbations can significantly impact the robustness guarantees provided by COMMIT. A larger perturbation size generally yields stronger robustness certificates but may also lead to reduced accuracy in benign conditions. Therefore, finding an optimal balance between robustness and accuracy is essential for practical deployment. Secondly, the computational cost of computing robustness certificates using randomized smoothing can be substantial, particularly for high-dimensional input spaces. To mitigate this, COMMIT employs efficient sampling techniques and parallel processing to expedite the certification process.
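The trade-off between perturbation size and certificate strength can be made concrete with the standard L2 certificate from the general randomized smoothing literature for additive Gaussian noise. There, the certified radius is $R = \sigma \, \Phi^{-1}(\underline{p_A})$, where $\underline{p_A}$ is a lower bound on the probability of the top class under noise. This is an illustration of the general technique only. COMMIT's certificate for semantic transformations is different and is not reproduced here, and the function name is hypothetical.

```python
from statistics import NormalDist

def certified_l2_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower) for Gaussian randomized smoothing.

    p_lower: lower confidence bound on the top-class probability under noise.
    Returns 0.0 (abstain) when the top class is not a clear majority.
    """
    if not 0.0 <= p_lower < 1.0:
        raise ValueError("p_lower must lie in [0, 1)")
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

# Usage: the same vote confidence certifies a larger radius at larger sigma.
small = certified_l2_radius(0.9, sigma=0.25)
large = certified_l2_radius(0.9, sigma=1.0)
```

For a fixed `p_lower` the radius grows linearly with `sigma`. In practice, however, larger noise usually lowers `p_lower` and clean accuracy, which is the balance described above.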

Furthermore, the applicability of COMMIT extends beyond traditional multi-sensor fusion scenarios to a wide range of machine learning applications that involve multiple data sources. For instance, in the field of medical imaging, where multimodal data from different imaging modalities (e.g., MRI, CT scans) are often integrated, COMMIT can be adapted to provide robustness guarantees against semantic attacks. Similarly, in cybersecurity, where multi-source data from network traffic, logs, and user behavior are analyzed, COMMIT can offer a robust certification framework to protect against sophisticated semantic attacks that exploit inconsistencies in the data.

In conclusion, the COMMIT framework represents a significant step forward in the field of multi-sensor fusion systems, offering a robust certification mechanism against semantic attacks. By leveraging randomized smoothing and grid-based splitting methods, COMMIT provides a theoretically sound and practically implementable approach to ensure the reliability and safety of multi-sensor fusion systems. As the demand for robust and secure machine learning systems continues to grow, frameworks like COMMIT will play a pivotal role in shaping the future of machine learning applications in safety-critical domains.

## 9 Specialized Applications and Domain-Specific Challenges

### 9.1 Challenges in Medical Imaging OOD Detection

Medical imaging has emerged as a critical domain for out-of-distribution (OOD) detection due to its pivotal role in healthcare diagnostics and treatment planning. Unlike traditional computer vision tasks, medical imaging involves intricate and diverse data modalities, posing significant challenges for OOD detection methodologies. These challenges primarily revolve around managing domain shifts, semantic shifts, and distinguishing subtle abnormalities that often masquerade as normal cases, thereby complicating the detection process [1]. Ensuring the reliability and safety of machine learning systems in medical imaging requires the development of models that can adapt to unseen domains while maintaining high accuracy in detecting novel classes.

One of the primary challenges in medical imaging OOD detection is handling domain shifts. Domain shifts refer to variations in imaging protocols, equipment, and patient demographics that can alter the appearance of medical images [1]. For example, images captured from different hospitals may vary significantly in terms of resolution, contrast, and noise levels due to differences in scanning equipment and protocols. These variations can introduce OOD data, making it difficult for models trained on a specific dataset to generalize effectively. Traditional methods, such as those relying solely on in-distribution (ID) data, often struggle to account for these domain shifts, leading to poor performance in real-world applications [1].

Semantic shifts present another significant challenge. Semantic shifts occur when the underlying meaning or structure of the data changes, even if the visual appearance remains similar [1]. For instance, a malignant lung nodule in a chest X-ray could appear similar to a benign lesion yet have entirely different clinical implications. Traditional OOD detection methods that rely on pixel-level similarities might fail to capture these semantic nuances, leading to incorrect classifications. Therefore, there is a pressing need for methods that can understand the semantic content of medical images and differentiate between visually similar but semantically distinct anomalies [1].

Subtle abnormalities represent yet another hurdle in medical imaging OOD detection. These abnormalities are often indistinguishable from normal tissue on initial inspection, requiring sophisticated models to accurately detect and classify them [1]. For example, early-stage tumors or microcalcifications in mammograms can be nearly invisible to human eyes and pose a significant challenge for OOD detection algorithms. Existing methods, such as density-based and likelihood-ratio-based approaches, often struggle with detecting these subtle changes, as they rely heavily on statistical properties that may not be significantly altered by such minor abnormalities [1].

To address these challenges, researchers have explored various strategies, including dual-conditioned diffusion models. These models incorporate in-distribution class information and latent features of the input image to constrain the generative manifold and ensure structural and semantic similarity with in-distribution samples [1]. By conditioning on these factors, dual-conditioned diffusion models can better adapt to unseen domains and maintain high accuracy in detecting novel classes. Additionally, histogram-based methods have shown promise in efficiently detecting OOD samples in 3D medical images, providing near-perfect results without requiring deep learning [1]. These methods leverage simple statistical properties of the data, making them computationally efficient and effective in handling subtle abnormalities [1].

Advancements in unsupervised OOD detection methods, such as density of states estimation (DoSE), have further contributed to the development of models capable of operating without labeled OOD data, relying only on a trained model and in-distribution data [1]. DoSE uses nonparametric density estimators to measure the typicality of model statistics, enabling it to detect OOD samples based on deviations from expected patterns [1]. Such methods offer a flexible and robust approach to OOD detection in medical imaging, where obtaining labeled OOD data can be logistically challenging and expensive.
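The typicality idea can be sketched with a one-dimensional Gaussian kernel density estimate fitted to a single model statistic on in-distribution data. This is a simplified illustration in the spirit of DoSE, not the published method. The choice of statistic, the bandwidth, and the function name `typicality_score` are assumptions.

```python
import numpy as np

def typicality_score(id_stats, query_stats, bandwidth):
    """Log-density of each query statistic under a Gaussian KDE fitted to
    in-distribution statistics. Lower scores mean less typical inputs."""
    id_stats = np.asarray(id_stats, dtype=float)
    query = np.atleast_1d(np.asarray(query_stats, dtype=float))
    z = (query[:, None] - id_stats[None, :]) / bandwidth
    log_kernel = -0.5 * z ** 2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # Numerically stable log-mean-exp over the in-distribution samples.
    peak = log_kernel.max(axis=1)
    return peak + np.log(np.mean(np.exp(log_kernel - peak[:, None]), axis=1))

# Usage: the statistic could be, for example, a model's log-likelihood per input.
rng = np.random.default_rng(0)
id_stats = rng.normal(0.0, 1.0, size=500)  # statistic measured on ID data
scores = typicality_score(id_stats, [0.0, 6.0], bandwidth=0.3)
```

A value of the statistic that is unusually high is flagged just like one that is unusually low, because the score measures distance from the typical range rather than magnitude.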

Despite these advancements, the development of effective OOD detection models in medical imaging remains a complex and ongoing endeavor. The interplay between domain shifts, semantic shifts, and subtle abnormalities necessitates the creation of models that can dynamically adapt to new data distributions while preserving high sensitivity and specificity in detecting novel classes [1]. Future research should focus on integrating multi-modal data sources, such as MRI and CT scans, to enrich the contextual information available for OOD detection. Additionally, the exploration of adversarial approaches for multi-modal sensor fusion could enhance the robustness of OOD detection systems against noisy or damaged sensors [1].

In conclusion, the unique challenges posed by domain shifts, semantic shifts, and subtle abnormalities highlight the necessity for developing robust and adaptable OOD detection models in medical imaging. By leveraging advanced techniques like dual-conditioned diffusion models and histogram-based methods, researchers can make significant strides in improving the accuracy and reliability of OOD detection in this critical domain. Future research should continue to push the boundaries of OOD detection methodologies, aiming to create models that not only detect but also provide actionable insights for clinicians, thereby enhancing patient outcomes and safety.

### 9.2 Dual-Conditioned Diffusion Models

In the specialized domain of medical imaging, ensuring the reliability and safety of diagnostic systems is paramount. Among the various techniques developed to tackle the challenge of out-of-distribution (OOD) detection in this domain, dual-conditioned diffusion models stand out as a promising approach. These models leverage in-distribution class information and latent features of the input image to constrain the generative manifold, thereby ensuring structural and semantic similarity with in-distribution samples [6]. By doing so, dual-conditioned diffusion models significantly enhance the accuracy and robustness of OOD detection in medical imaging, making them a critical tool for ensuring the safety of diagnostic systems in clinical settings.

Dual-conditioned diffusion models operate under the premise that in-distribution data, such as medical images of healthy tissue or well-known diseases, form a structured manifold in the latent space. This manifold is characterized by specific patterns and regularities that are inherent to the class of in-distribution samples. However, when confronted with out-of-distribution samples, such as abnormal tissues or unexpected anomalies, the generative models struggle to produce realistic reconstructions due to the deviation of these samples from the learned manifold [6].

To address this issue, dual-conditioned diffusion models incorporate two key conditioning mechanisms: class-specific information and latent features. Class-specific information guides the model to align its generative process with the learned manifold of the in-distribution class. This conditioning helps the model to recognize and reconstruct in-distribution samples more accurately, while highlighting discrepancies that may indicate out-of-distribution status. Latent features of the input image provide additional constraints that ensure the generated output remains semantically and structurally similar to the input. By integrating these two conditioning factors, dual-conditioned diffusion models are able to detect out-of-distribution samples with greater precision and reliability.
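The scoring logic, reconstructing the input under a class condition and measuring the discrepancy, can be sketched with a much simpler stand-in for the generative model. The sketch below swaps the diffusion model for a per-class linear subspace (PCA), so it illustrates only class-conditioned reconstruction-error scoring and not diffusion or latent-feature conditioning. All function names are hypothetical.

```python
import numpy as np

def fit_class_subspaces(features, labels, n_components):
    """Fit a mean and top principal directions per class. This linear model
    is a stand-in for a class-conditioned generative manifold."""
    subspaces = {}
    for c in np.unique(labels):
        xc = features[labels == c]
        mean = xc.mean(axis=0)
        _, _, vt = np.linalg.svd(xc - mean, full_matrices=False)
        subspaces[int(c)] = (mean, vt[:n_components])
    return subspaces

def reconstruction_score(x, class_id, subspaces):
    """Squared error between x and its reconstruction on the subspace of
    the conditioning class. Larger values suggest OOD inputs."""
    mean, basis = subspaces[class_id]
    recon = mean + (x - mean) @ basis.T @ basis
    return float(np.sum((x - recon) ** 2))
```

In-distribution samples lie close to their class subspace and reconstruct well, whereas inputs far from every learned class structure leave a large residual.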

These models are particularly advantageous in medical imaging, where anomalies and outliers often manifest as subtle deviations from the norm, challenging traditional OOD detection methods. The structured nature of the generative manifold allows dual-conditioned diffusion models to capture these subtle deviations, enabling them to detect even small structural or semantic differences that might otherwise go unnoticed [5].

Furthermore, dual-conditioned diffusion models offer a flexible framework adaptable to various medical imaging modalities, such as X-ray, MRI, and CT scans. Leveraging class-specific information and latent features, these models can learn to detect OOD samples across different imaging modalities, making them a versatile tool for enhancing the safety and reliability of diagnostic systems. Their ability to handle high-resolution 3D medical data also makes them suitable for detecting structural abnormalities in volumetric scans, where the detection of subtle anomalies is critical [3].

Another advantage lies in their robustness against spurious correlations arising from the complex and diverse nature of medical imaging data. Data distribution shifts due to factors like patient demographics, imaging protocols, and technological advancements can impact traditional OOD detection. By conditioning on class-specific information and latent features, dual-conditioned diffusion models can mitigate these impacts, ensuring accurate and reliable OOD detection [5].

Moreover, dual-conditioned diffusion models integrate seamlessly into existing medical imaging workflows. With the ability to detect OOD samples directly from the input image, these models can serve as a post-processing step in diagnostic pipelines, allowing clinicians to receive immediate alerts when anomalies are detected. This real-time detection capability is invaluable in scenarios where prompt identification of OOD samples can significantly influence patient outcomes, such as in emergency departments or intensive care units [6].

However, dual-conditioned diffusion models face practical implementation challenges. Training and running these models can be computationally costly due to the complexity of the generative process and the high-dimensional latent spaces involved. Moreover, acquiring large annotated datasets for training can be difficult and expensive, posing a significant hurdle [8]. Additionally, the quality and diversity of in-distribution data used during training affect the models' performance. Ensuring that the training data is representative and diverse is crucial for accurate detection of out-of-distribution samples [6].

In conclusion, dual-conditioned diffusion models represent a significant advancement in medical imaging OOD detection. Through their use of class-specific information and latent features, these models enhance accuracy and robustness, ensuring the safety and reliability of diagnostic systems. Despite challenges, the benefits of these models make them a valuable tool for addressing the unique challenges posed by medical imaging OOD detection, underscoring their potential to play an increasingly important role in clinical settings.

### 9.3 Latent Diffusion Models for 3D Data

Latent diffusion models (LDMs) present a compelling solution for scaling out-of-distribution (OOD) detection to high-resolution 3D medical data, particularly in scenarios where traditional denoising diffusion probabilistic models (DDPMs) fall short [34]. Similar to dual-conditioned diffusion models discussed previously, LDMs aim to enhance the accuracy and reliability of OOD detection in medical imaging, but they do so by employing a latent variable representation. This approach addresses the challenges posed by the complexity and variability of volumetric structures, ensuring robust OOD detection without excessive computational overhead.

Traditional DDPMs model high-dimensional data distributions by gradually adding noise to the data in a forward process and learning a reverse process that iteratively removes it to recover the original data. When applied to 3D medical imaging, however, these models encounter significant challenges due to the high dimensionality and intricacy of the data. Issues such as substantial memory requirements, high computational demands, and oversimplified reconstructions limit the effectiveness of DDPMs in capturing the nuanced characteristics of 3D anatomical structures.

In contrast, LDMs introduce a latent space that provides a compressed representation of the input data, facilitating more efficient and effective modeling. By transforming high-resolution 3D medical data into a lower-dimensional latent space, LDMs reduce memory usage and computational costs, making them suitable for real-time and resource-constrained environments. This compression not only accelerates inference but also improves the model’s generalization across diverse 3D medical datasets.

The core mechanism of LDMs involves encoding the input data into a latent space via an encoder network, followed by a diffusion process that progressively converts the latent representation into a noise-like distribution. During inference, the model iteratively denoises this noisy latent into a refined latent representation, which a decoder network then maps back to a reconstruction of the original input. This sequence of encoding, diffusion, denoising, and decoding enables LDMs to produce detailed and accurate reconstructions of 3D medical data, which is essential for identifying subtle anomalies indicative of OOD conditions.
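The forward (noising) half of this process has a closed form and can be sketched directly: $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The sketch below applies it to a latent tensor using the linear noise schedule common in the DDPM literature. The encoder, denoising network, and decoder are omitted, and the latent shape and function names are hypothetical.

```python
import numpy as np

def make_alpha_bar(n_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def noise_latent(z0, t, alpha_bar, rng):
    """Sample z_t ~ N(sqrt(alpha_bar[t]) * z0, (1 - alpha_bar[t]) * I)."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

# Usage: a hypothetical latent with 4 channels on an 8 x 8 x 8 grid, standing
# in for the encoder output of a much larger 3D volume.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
z0 = np.ones((4, 8, 8, 8))
z_mid, _ = noise_latent(z0, 500, alpha_bar, rng)
```

Because the noising and denoising operate on the compressed latent rather than on every voxel, the per-step cost scales with the latent size, which is the source of the memory savings discussed in this section.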

One of the key advantages of LDMs is their enhanced ability to generate interpretable spatial anomaly maps. Unlike traditional DDPMs, which often produce pixel-level reconstructions, LDMs focus on the latent space to highlight regions of the 3D volume that deviate significantly from learned normal patterns. This capability provides clinicians with clear indications of potential anomalies, critical for diagnosis and treatment planning.

Moreover, LDMs offer superior memory efficiency compared to DDPMs, which is vital for managing large 3D medical datasets. The reduced memory requirements of LDMs allow for the processing of higher-resolution scans, a growing necessity with advances in imaging technology. This scalability ensures that OOD detection remains practical as medical datasets become more complex and voluminous.

Numerous studies have demonstrated the effectiveness of LDMs in OOD detection for 3D medical data. For example, [35] showcases how LDMs excel at handling both covariate and semantic shifts across different domains, outperforming traditional DDPMs in terms of detection accuracy and computational efficiency. Other research highlights that LDMs generate spatial anomaly maps with higher fidelity and lower false positive rates, reinforcing their suitability for medical imaging scenarios.

By leveraging a latent space representation, LDMs offer a balance of performance, efficiency, and interpretability that surpasses traditional DDPMs. They effectively address the complexities of high-dimensional data modeling while maintaining the ability to accurately identify anomalies. As the complexity of medical imaging continues to increase, the adoption of LDMs holds promise for enhancing the reliability and safety of machine learning models in clinical settings, contributing to improved patient outcomes and more precise diagnostic tools.

### 9.4 Histogram-Based Methods for 3D Medical Images

Histogram-based methods represent a highly efficient and straightforward approach for out-of-distribution (OOD) detection in 3D medical images. These methods leverage the statistical properties of image histograms to identify regions or entire images whose intensity distribution deviates from that expected of in-distribution scans, thereby marking them as potential OOD instances. Whereas latent diffusion models (LDMs) rely on learned latent representations, histogram-based methods offer a complementary perspective by focusing on basic statistical measures. Their simplicity lies in this reliance on fundamental statistical metrics, which makes them accessible and less computationally intensive than deep learning-based approaches.

In medical imaging, the use of histogram-based methods can be particularly advantageous due to the high dimensionality and complexity of 3D data. Traditional approaches to OOD detection often require substantial computational resources and extensive training data, especially when dealing with 3D volumetric images. However, histogram-based methods sidestep these requirements by focusing on the frequency distribution of voxel intensities within an image. By capturing the intensity histogram of each image, these methods can detect subtle abnormalities that might not be immediately apparent through visual inspection alone.

The foundational principle behind histogram-based OOD detection is to establish a baseline histogram that represents the typical distribution of voxel intensities across the in-distribution (ID) images of a given medical dataset. This baseline serves as a reference against which the histograms of new or unseen images are compared. If the histogram of a new image deviates from the baseline by more than a chosen threshold, the image is flagged as an OOD instance. Such deviations may reflect anomalies that are not representative of the ID samples, such as lesions or tumors, or technical issues in image acquisition.
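The baseline-and-compare procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a published implementation: intensities are assumed to be pre-scaled to [0, 1], the Jensen-Shannon divergence is one possible choice of histogram distance, and the 99th-percentile threshold and all function names are hypothetical.

```python
import numpy as np

def intensity_histogram(volume, bins=64, value_range=(0.0, 1.0)):
    """Normalized voxel-intensity histogram (sums to 1) of a 3D volume."""
    counts, _ = np.histogram(volume.ravel(), bins=bins, range=value_range)
    return counts / max(counts.sum(), 1)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits; eps avoids log(0) on empty bins."""
    p = p + eps
    q = q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fit_baseline(id_volumes, quantile=0.99):
    """Average ID histogram plus a threshold taken from the ID score distribution."""
    hists = np.stack([intensity_histogram(v) for v in id_volumes])
    baseline = hists.mean(axis=0)
    id_scores = np.array([js_divergence(h, baseline) for h in hists])
    return baseline, np.quantile(id_scores, quantile)

def ood_score(volume, baseline):
    """Distance between a volume's histogram and the ID baseline."""
    return js_divergence(intensity_histogram(volume), baseline)

# Synthetic stand-ins for real scans (assumption: intensities already in [0, 1]).
rng = np.random.default_rng(0)
id_volumes = [np.clip(rng.normal(0.4, 0.1, size=(16, 16, 16)), 0, 1) for _ in range(20)]
baseline, threshold = fit_baseline(id_volumes)

shifted = np.clip(rng.normal(0.7, 0.1, size=(16, 16, 16)), 0, 1)  # synthetic intensity shift
is_ood = ood_score(shifted, baseline) > threshold
```

Because the score depends only on the global intensity histogram, it reacts to shifts in overall intensity but not to spatial structure. That is one reason the choice of bins and normalization matters in practice.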

A key advantage of histogram-based methods is their ability to operate without the need for deep learning models, which often necessitate large amounts of annotated data for training and fine-tuning. In the context of medical imaging, where obtaining labeled OOD data can be challenging and resource-intensive, histogram-based methods offer a viable alternative that does not rely on extensive pre-training or supervision. This characteristic allows for the rapid deployment and adaptation of OOD detection systems in clinical settings, where prompt and reliable detection of abnormalities is paramount.

Moreover, histogram-based methods have performed strongly on synthetic OOD samples in 3D medical images. In studies conducted on various datasets, histogram-based approaches have achieved near-perfect results in distinguishing synthetic OOD samples from ID samples. Strong results on synthetic anomalies suggest, but do not by themselves establish, that these methods can detect subtle yet clinically significant changes in real-world medical images. For instance, in neuroimaging, histogram-based methods could help identify small, focal lesions that might otherwise go undetected through conventional visual inspection or even automated segmentation techniques.

Another strength of histogram-based methods is their adaptability to different types of medical imaging modalities. Whether working with MRI, CT scans, or other forms of 3D medical imaging, the core principles of histogram-based OOD detection remain consistent. By adjusting the binning and normalization parameters of the histogram analysis, these methods can be tailored to accommodate the specific characteristics of different imaging modalities. This flexibility underscores the versatility of histogram-based approaches in addressing the diverse needs of medical OOD detection.

Furthermore, the ease of implementation and interpretability of histogram-based methods make them appealing for integration into clinical workflows. Radiologists and clinicians can readily understand the output of histogram analyses, facilitating immediate decision-making and intervention. This contrasts sharply with more opaque deep learning models, whose internal workings and decision processes are often difficult to comprehend and explain. By providing clear, statistically grounded insights into the distribution of voxel intensities, histogram-based methods can empower healthcare professionals to make more informed judgments regarding patient care.

However, while histogram-based methods exhibit remarkable performance in detecting synthetic OOD samples, there are limitations to consider. The sensitivity of these methods to the choice of histogram bins and normalization strategies can impact their effectiveness in real-world applications. Additionally, the ability of histogram-based methods to generalize to natural OOD conditions remains a subject of ongoing investigation. Future research should focus on refining the statistical parameters of histogram-based methods to enhance their robustness and generalizability across different datasets and imaging modalities.

In conclusion, histogram-based methods offer a promising avenue for out-of-distribution detection in 3D medical images. Their simplicity, efficiency, and strong performance on synthetic OOD samples position them as valuable tools in the medical imaging domain. As research continues to advance, these methods are likely to play an increasingly important role in enhancing the reliability and safety of diagnostic and therapeutic procedures in clinical practice.

### 9.5 Bi-Level Guided Diffusion Models for Inverse Problems

In specialized applications, such as medical imaging, out-of-distribution (OOD) detection is crucial for identifying novel classes or subtle abnormalities that may not be easily discernible from normal cases. Building upon the foundational histogram-based methods discussed earlier, this section explores a more advanced technique known as bi-level guided diffusion models (BGDM). BGDM addresses significant challenges in medical imaging OOD detection, particularly structural hallucination and ensuring data consistency, especially in the context of inverse problems. This section delves into the concept of BGDM, detailing its mechanisms and advantages over existing methods.

Structural hallucination refers to the phenomenon where predictive models generate structures or features that do not correspond to actual physical entities or patterns within the in-distribution (ID) samples [36]. This issue becomes particularly problematic in medical imaging inverse problems, where accurate reconstructions are essential for clinical diagnosis and treatment planning. Traditional methods often struggle to balance the trade-off between generating realistic reconstructions and preserving the structural fidelity of the original data. BGDM, however, introduces a novel paradigm that explicitly addresses this challenge through a bi-level optimization framework.

At its core, BGDM leverages a hierarchical structure to guide the generation process, ensuring that the generated samples remain consistent with the underlying data distribution [37]. The model consists of two levels: a lower level responsible for capturing the low-level details and local features, and a higher level focused on enforcing global consistency and coherence. This dual-layer architecture enables BGDM to capture both fine-grained and coarse-grained characteristics of the ID samples, thereby reducing the likelihood of structural hallucinations.

One of the key innovations of BGDM is its use of measurement information to steer the generation process. Measurement information, which typically includes data obtained from imaging devices such as MRIs or CT scans, serves as a crucial reference for ensuring that the generated samples adhere to the physical constraints of the imaging modality [1]. By incorporating this information into the generation process, BGDM ensures that the reconstructed images maintain high fidelity to the original data, even when dealing with unseen or novel classes. This approach not only enhances the reliability of the reconstructions but also significantly improves the detection of OOD samples, as anomalous features can be more readily identified against a backdrop of consistent and accurate reconstructions.

Moreover, BGDM employs a two-step optimization strategy to refine the generated samples. The first step involves an initial generation phase, where the model produces a preliminary reconstruction based on the input data. Following this, the second step applies a refinement process, where the model adjusts the generated samples to better align with the measurement information [22]. This iterative refinement process helps to mitigate the effects of noise and artifacts that might otherwise lead to structural hallucinations.
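The generate-then-refine pattern can be illustrated on a toy linear inverse problem. The sketch below is not the published BGDM algorithm. It assumes a linear measurement model y = A x, replaces the diffusion model's preliminary reconstruction with a noisy stand-in, and uses plain gradient steps on the data-fidelity term as the refinement. All names are hypothetical.

```python
import numpy as np

def refine_with_measurements(x_init, A, y, n_steps=50):
    """Gradient descent on 0.5 * ||A x - y||^2, starting from a preliminary estimate."""
    x = x_init.copy()
    # Step size 1 / sigma_max(A)^2 guarantees the residual does not increase.
    step_size = 1.0 / np.linalg.norm(A, ord=2) ** 2
    for _ in range(n_steps):
        residual = A @ x - y
        x = x - step_size * (A.T @ residual)
    return x

rng = np.random.default_rng(0)
n, m = 64, 32                                 # signal size, number of measurements
x_true = rng.normal(size=n)
A = rng.normal(size=(m, n)) / np.sqrt(m)      # assumed linear forward operator
y = A @ x_true                                # noiseless measurements

# Stand-in for step 1: a generative model's preliminary reconstruction.
x_prelim = x_true + 0.5 * rng.normal(size=n)
# Step 2: pull the estimate back toward consistency with the measurements.
x_refined = refine_with_measurements(x_prelim, A, y)

res_before = np.linalg.norm(A @ x_prelim - y)
res_after = np.linalg.norm(A @ x_refined - y)
err_before = np.linalg.norm(x_prelim - x_true)
err_after = np.linalg.norm(x_refined - x_true)
```

In this toy setting the refinement only corrects the components of the estimate that the measurements can observe. Errors in the null space of A remain, which is where a learned prior would have to contribute.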

The benefits of BGDM extend beyond mere reduction of structural hallucinations. By ensuring data consistency and leveraging measurement information, BGDM also enhances the interpretability of the reconstructions. This is particularly valuable in medical imaging, where clinicians rely heavily on accurate and interpretable images for diagnosis and treatment decisions. Additionally, the robustness of BGDM to distribution shifts makes it highly adaptable to different imaging modalities and clinical settings, thus broadening its applicability across a wide range of medical imaging tasks [37].

Comparatively, traditional methods for addressing structural hallucination often rely on regularization techniques or post-processing steps to enforce consistency. However, these approaches can be computationally expensive and may not always yield satisfactory results, especially when dealing with complex and heterogeneous data distributions. In contrast, BGDM offers a more integrated and efficient solution by embedding consistency checks directly into the generation process. This not only improves the quality of the reconstructions but also enhances the overall efficiency of the OOD detection pipeline.

Furthermore, the effectiveness of BGDM has been demonstrated in various experimental settings. For instance, in a study conducted using the MNIST and COCO datasets, researchers found that BGDM was able to significantly reduce false alarms and improve the detection of OOD inputs with spurious features from the training data [30]. Because MNIST and COCO are general-purpose image benchmarks rather than medical datasets, these findings speak mainly to the versatility of BGDM across data types; its benefit in medical imaging still needs to be confirmed on clinical data.

In conclusion, bi-level guided diffusion models (BGDM) represent a significant advancement in the field of OOD detection, particularly in specialized applications like medical imaging. By addressing structural hallucination and ensuring data consistency through a hierarchical and measurement-guided generation process, BGDM offers a more effective and efficient solution compared to existing methods. As the demand for reliable and accurate OOD detection continues to grow, BGDM holds great promise for improving the diagnostic capabilities of medical imaging systems and enhancing patient care.

## 10 Challenges, Future Directions, and Conclusion

### 10.1 Current Challenges in Generalized OOD Detection

Generalized out-of-distribution (OOD) detection, which encompasses various related problems such as anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD), faces numerous challenges that impede its broader applicability and effectiveness in real-world scenarios. These challenges can broadly be categorized into three primary areas: dealing with spurious correlations, handling real-world distribution shifts, and the lack of standard evaluation metrics.

Addressing spurious correlations is a significant challenge in generalized OOD detection. Machine learning models often learn superficial associations between input features and output labels instead of underlying causal relationships. This issue is exacerbated in OOD detection, where these spurious correlations can lead to incorrect classifications and unreliable predictions. Mitigating spurious correlations involves careful consideration of the training data and model architecture, possibly through regularization techniques or by incorporating domain-specific knowledge to guide the learning process toward more robust and generalizable representations.

Handling real-world distribution shifts presents another major hurdle. Dynamic and unpredictable environments in real-world applications can cause rapid changes in the distribution of input data due to external factors like environmental conditions, user behavior, or sensor malfunctions. For instance, in autonomous driving, models trained to recognize road signs and obstacles might encounter sudden changes in lighting, weather, or new types of vehicles and pedestrians not seen during training. Such shifts can severely degrade OOD detection performance, necessitating models capable of adapting to changing data distributions through continuous learning or real-time monitoring and adjustment mechanisms.

Additionally, the lack of standard evaluation metrics for OOD detection complicates the comparison and improvement of different methods. Unlike classification or regression tasks, OOD detection lacks universally accepted metrics. This absence makes it challenging to assess algorithm performance consistently across datasets and scenarios. Researchers have explored alternatives such as simulated data and data augmentation to generate OOD samples, aiming to establish a unified evaluation protocol. The "OpenOOD" benchmark framework highlights the importance of such standards to ensure robust and consistent evaluation metrics for advancing the field.

Beyond these broad challenges, specific issues arise in different application domains. In medical imaging, subtle abnormalities not easily distinguishable from normal cases pose significant challenges, potentially leading to false negatives in OOD detection. Medical datasets often suffer from imbalances and class overlaps, complicating the accurate identification of anomalous cases. Specialized models like dual-conditioned and latent diffusion models have been proposed to address these issues by constraining the generative manifold and ensuring structural and semantic similarity with in-distribution samples.

Integrating multi-modal data streams further complicates generalized OOD detection. In scenarios like autonomous driving or industrial automation, models must integrate information from multiple sensor types and account for interdependencies between modalities. Approaches like the "WOOD" framework combine a binary classifier with a contrastive learning component to detect OOD samples in a weakly supervised manner, a capability that is essential for robust OOD detection in dynamic and multi-faceted environments.

The reliance on auxiliary datasets for training and evaluation poses another challenge. Access to both in-distribution (ID) and out-of-distribution (OOD) data is often limited due to ethical and legal restrictions in safety-critical domains. Alternative approaches, such as using simulated data or data augmentation, are necessary to develop and validate OOD detection methods effectively. The "OOD Proxy" framework exemplifies leveraging simulated data to create OOD samples for training and evaluation, enabling more robust testing of OOD detection algorithms.

Theoretical underpinnings of generalized OOD detection continue to evolve, with a growing need for rigorous and provable guarantees to ensure reliability and safety. Recent advancements, such as frequency-regularized learning (FRL), show promise in improving robustness and efficiency. Establishing a comprehensive theoretical foundation would facilitate the design of more principled and effective OOD detection techniques suitable for real-world applications.

In conclusion, addressing the challenges of spurious correlations, real-world distribution shifts, and the lack of standard evaluation metrics, along with domain-specific issues and theoretical limitations, is crucial for developing reliable and effective generalized OOD detection systems. Continued research and development efforts hold the promise of significant progress in this field.

### 10.2 Improving Robustness Against Spurious Correlations

Improving the robustness of machine learning models against spurious correlations is a pivotal aspect of enhancing their reliability, especially in out-of-distribution (OOD) detection tasks. Spurious correlations occur when models learn associations between features and labels that do not hold in the real world, leading to poor generalization when faced with unseen data. This issue is particularly acute in OOD detection, where models trained on specific datasets may inadvertently associate certain artifacts of the data with in-distribution (ID) labels, resulting in degraded performance when encountering data with altered but irrelevant attributes.

Recent studies have highlighted the detrimental effects of spurious correlations on OOD detection performance. For instance, the "On the Impact of Spurious Correlation for Out-of-distribution Detection" paper [1] underscores that models trained with spurious correlations often fail to generalize well to OOD samples, leading to increased false positive rates and reduced detection accuracy. These correlations frequently arise due to biases inherent in the training data, such as specific camera angles, lighting conditions, or object placements that correlate with the labels but do not reflect the underlying true cause-and-effect relationships.

One promising strategy to mitigate the impact of spurious correlations involves enhancing the diversity of training data. By incorporating a wider range of samples with varied contextual features, models can better disentangle the true predictors of class membership from spurious cues. Data augmentation techniques, such as simulating different lighting conditions, viewpoints, and backgrounds, can help models learn more robust and invariant features. Additionally, active learning methods can iteratively select diverse samples that challenge the model’s current knowledge, thus encouraging better generalization across different scenarios [7].

Another approach involves leveraging auxiliary information to guide the learning process. Providing additional annotations or metadata that highlight spurious cues in the training data can serve as a form of regularization, prompting the model to focus on more reliable features and ignore the spurious ones. For instance, annotating images with details about lighting conditions, camera angles, or background elements can help the model avoid relying on these irrelevant features. Multi-task learning frameworks, where one task predicts spurious cues and another performs the main classification, can also be beneficial [3]. By learning these tasks together, the model can better separate signal from noise and become less susceptible to spurious correlations.

Incorporating uncertainty quantification techniques is another effective strategy. Bayesian neural networks (BNNs) and other probabilistic models can naturally incorporate uncertainty into their predictions, expressing doubt when faced with inputs that deviate from the learned distribution [1]. This is particularly useful in OOD detection, as BNNs can assign lower confidence scores to samples exhibiting spurious correlations or anomalies. Techniques such as Monte Carlo dropout, which keeps dropout active at inference time and averages predictions over several stochastic forward passes, can estimate prediction uncertainty, offering a principled approach to detect and handle OOD samples.
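A minimal sketch of Monte Carlo dropout shows the mechanics. It assumes a tiny two-layer network with random, untrained weights, so the outputs illustrate the computation only and say nothing about detection quality. The dropout rate, number of passes, and function names are illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, W1, W2, rng, p_drop=0.5, n_samples=100):
    """Average softmax outputs over stochastic forward passes with dropout kept on."""
    probs = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop        # fresh dropout mask per pass
        h = h * mask / (1.0 - p_drop)               # inverted-dropout scaling
        probs.append(softmax(h @ W2))
    mean_probs = np.stack(probs).mean(axis=0)       # (batch, classes)
    # Predictive entropy of the averaged distribution: higher means less certain.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)
    return mean_probs, entropy

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 32))    # untrained weights, for illustration only
W2 = rng.normal(size=(32, 3))
x = rng.normal(size=(5, 8))      # batch of 5 inputs with 8 features
mean_probs, entropy = mc_dropout_predict(x, W1, W2, rng)
```

With a trained model, the entropy (or another spread measure across passes) would serve as the OOD score, with a threshold chosen on held-out ID data.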

Adversarial training methods can also be adapted to specifically target and mitigate spurious correlations. By designing an adversary that exploits these correlations, the model can learn more robust representations that resist such manipulations. For example, an adversarial training setup might involve a generator network that creates samples exploiting the model’s reliance on spurious features, while the main model aims to correctly classify these samples despite the induced artifacts. Over time, this adversarial process fosters a model that is less dependent on spurious correlations and more attuned to intrinsic data properties [1].

Developing and validating OOD detection methods on diverse and representative datasets is crucial. Resources such as the ImageNet-OOD dataset [5] and the OpenOOD benchmark [6] provide valuable tools for evaluating robustness by distinguishing semantic shifts from covariate shifts, offering insights into different methods’ sensitivity to distributional changes. These resources enable researchers to assess the generalizability and reliability of their approaches.

In conclusion, improving robustness against spurious correlations is essential for enhancing the reliability and safety of machine learning models, particularly in OOD detection. Strategies such as diverse training data, auxiliary information, uncertainty quantification, adversarial training, and rigorous evaluation on representative datasets can yield more resilient models that are less prone to being misled by irrelevant features. These advancements are critical for ensuring safe and accurate operation of machine learning systems in the real world, where data distribution shifts are inevitable.

### 10.3 Handling Real-World Complexities and Distribution Shifts

Addressing the intricacies of real-world complexities and distribution shifts in out-of-distribution (OOD) detection necessitates a multifaceted approach that goes beyond the limitations of traditional methods [1]. These complexities include various forms of data distribution shifts, such as covariate shifts due to environmental changes or sensor malfunctions, and semantic shifts due to the emergence of unseen categories or subtle anomalies. Traditional OOD detection methods often struggle with these shifts, as they are typically designed for specific types of distribution changes and fail to generalize well across diverse scenarios [34].

To overcome these challenges, researchers have explored more adaptable methodologies that can handle real-world complexities. One promising approach involves integrating domain generalization techniques aimed at improving the model's ability to generalize across unseen domains [35]. Domain generalization enhances a model's robustness to covariate shifts by learning representations that remain invariant across different environments. This adaptation enables OOD detection systems to better manage the variability in data distributions encountered in real-world settings.

Handling semantic shifts, particularly in situations where new categories appear, presents another significant challenge. Traditional methods often require fine-tuning or retraining models, which is impractical in dynamic environments. Recent advancements in relation-based reasoning provide a viable alternative. By leveraging the relationships between data points, these methods can detect novel categories or anomalies without extensive retraining. For example, the large class separation hypothesis suggests that models trained with relational reasoning can better identify out-of-distribution samples by utilizing inter-class feature distances [19]. This approach minimizes the need for fine-tuning and supports real-time detection of new categories.

Moreover, the dynamic nature of data distributions over time adds another layer of complexity. Real-world systems frequently encounter continuous distribution shifts, necessitating OOD detection methods that can adapt dynamically to these changes. Meta-learning and online learning paradigms offer potential solutions. Meta-learning approaches allow models to learn from a series of tasks or distributions, enabling rapid adaptation to new data without extensive retraining [7]. This adaptability is crucial for real-world applications where the model must continuously update its understanding of the data distribution to maintain reliable performance.

Additionally, the integration of large language models (LLMs) into OOD detection systems presents new opportunities for managing complex distribution shifts [4]. LLMs, such as GPT-3, can generate peer classes or related concepts for in-distribution (ID) data, aiding OOD detection systems in understanding nuanced distinctions between ID and OOD samples. This not only enhances the detection of novel categories but also improves the model's comprehension of semantic relationships, thereby boosting overall robustness.

Despite these advancements, several challenges persist. One major issue is the sensitivity of OOD detection methods to label noise and unreliable classifiers. Many existing methods assume clean training datasets, which is often unrealistic in real-world scenarios [38]. Developing robust OOD detection mechanisms that function effectively with noisy or unreliable classifiers is essential. This entails exploring noise-robust algorithms and techniques that can eliminate false positives and ensure reliable detection.

Finally, the scalability of OOD detection methods is a critical concern, especially for large-scale applications. Traditional methods often suffer from computational inefficiency, making them unsuitable for real-time or resource-constrained environments. Innovative approaches that utilize efficient scoring functions and group-based decompositions offer promising solutions. For instance, the MOS framework introduces a scalable OOD detection method that achieves state-of-the-art performance while providing significant speedup in inference [39]. By breaking down large semantic spaces into smaller, manageable groups, MOS simplifies decision-making and enhances the practicality of OOD detection systems.
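A group-based score of this kind can be sketched as follows. The sketch reflects our reading of the MOS formulation and is not a reference implementation. It assumes the label space is split into groups, each group carries an extra "others" logit at index 0, and the score is the negative of the smallest "others" probability across groups. The grouping and logits below are made up for illustration.

```python
import numpy as np

def group_softmax(logits, group_slices):
    """Apply a separate softmax within each group of logits."""
    out = []
    for sl in group_slices:
        z = logits[sl] - logits[sl].max()
        e = np.exp(z)
        out.append(e / e.sum())
    return out

def minimum_others_score(logits, group_slices):
    """Higher score = more likely ID. Index 0 of every group is its 'others' class."""
    probs = group_softmax(logits, group_slices)
    return -min(p[0] for p in probs)

# Two hypothetical groups, each with one 'others' logit followed by three class logits.
groups = [slice(0, 4), slice(4, 8)]

# ID-like input: group 1 is confident about a real class, group 2 says 'others'.
id_logits = np.array([-2.0, 5.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0])
# OOD-like input: every group prefers 'others'.
ood_logits = np.array([4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0])

id_score = minimum_others_score(id_logits, groups)
ood_score = minimum_others_score(ood_logits, groups)
```

The intuition is that an ID sample should be claimed by at least one group, so at least one "others" probability is small, whereas an OOD sample is rejected by all groups.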

In conclusion, addressing real-world complexities and distribution shifts in OOD detection requires a comprehensive approach that leverages advances from various domains, including domain generalization, relational reasoning, meta-learning, and large language models. By developing robust, adaptive, and scalable methods, researchers can advance the reliability and effectiveness of OOD detection systems in real-world applications, enhancing the safety and reliability of machine learning models in critical domains such as autonomous driving, healthcare, and cybersecurity.

### 10.4 Advancing Evaluation Protocols and Benchmarks

Advancing the evaluation protocols and benchmarks for out-of-distribution (OOD) detection is essential for achieving more realistic and fair comparisons among various detection methods. Current evaluations often face challenges such as the lack of standardization, reliance on simplistic datasets, and inadequate consideration of real-world complexities. To address these issues, we propose several improvements based on the insights and methodologies outlined in papers such as "Towards Realistic Out-of-Distribution Detection" and "OpenOOD."

Firstly, the current evaluation protocols often rely on simplistic datasets that fail to represent the diversity and complexity of real-world data. Many benchmarks use synthetic datasets or simple variations of in-distribution data to simulate out-of-distribution samples. However, such datasets may not adequately reflect the true nature of out-of-distribution data encountered in real-world scenarios. Therefore, we advocate for the adoption of more diverse and realistic datasets that incorporate a wide range of distribution shifts, including those arising from domain shifts, semantic shifts, and spurious correlations. These datasets would enable a more comprehensive evaluation of OOD detection methods, helping to identify their strengths and weaknesses in different contexts.

Secondly, current benchmarks often lack standardized metrics and protocols, leading to inconsistent and incomparable results. Without a unified evaluation framework, assessing the performance of different OOD detection methods becomes challenging. We recommend establishing a standardized evaluation protocol that includes a suite of metrics designed to capture various aspects of OOD detection performance. These metrics should evaluate the accuracy of OOD detection, model robustness, generalizability, and the ability to handle real-world complexities. Metrics such as the false positive rate at a fixed true positive rate (commonly FPR at 95% TPR), the area under the receiver operating characteristic curve (AUC-ROC), and the area under the precision-recall curve (AUPR) could be included. Standardizing these metrics would enable more reliable and reproducible evaluations, facilitating fair comparisons among different methods.
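Two of these metrics are simple enough to compute directly, as the sketch below shows. It assumes that higher scores mean "more OOD" and that OOD is the positive class. Benchmarks differ on this convention, so it should be stated explicitly in any protocol. The function names are illustrative.

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """Probability that a random OOD sample scores above a random ID sample (ties count half)."""
    s_id = np.asarray(scores_id, dtype=float)
    s_ood = np.asarray(scores_ood, dtype=float)
    greater = (s_ood[:, None] > s_id[None, :]).sum()
    ties = (s_ood[:, None] == s_id[None, :]).sum()
    return (greater + 0.5 * ties) / (s_ood.size * s_id.size)

def fpr_at_tpr(scores_id, scores_ood, tpr=0.95):
    """Fraction of ID samples flagged as OOD at a threshold that detects at least `tpr` of OOD samples."""
    s_id = np.asarray(scores_id, dtype=float)
    s_ood = np.sort(np.asarray(scores_ood, dtype=float))
    k = int(np.floor((1.0 - tpr) * s_ood.size))   # number of OOD samples allowed to be missed
    threshold = s_ood[k]
    return float(np.mean(s_id >= threshold))

scores_id = [0.1, 0.2, 0.3, 0.4]
scores_ood = [0.35, 0.6, 0.7, 0.8]

auc = auroc(scores_id, scores_ood)        # 15 of the 16 (OOD, ID) pairs are ordered correctly
fpr95 = fpr_at_tpr(scores_id, scores_ood)  # one of four ID samples exceeds the threshold
```

The pairwise AUROC computation is quadratic in the number of samples. For large evaluation sets a rank-based implementation, or an established library routine, is preferable.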

Moreover, evaluation protocols should be designed to mimic real-world scenarios more closely. Real-world applications of OOD detection often involve dynamic environments where distribution shifts can occur over time. Current benchmarks typically assume static distributions, which do not accurately reflect the dynamic nature of real-world data. We propose incorporating temporal dynamics into the evaluation protocols. This could involve simulating scenarios where the in-distribution data gradually shifts over time, or introducing sudden distribution shifts that models need to adapt to. Such dynamic simulations would provide a more realistic testbed for evaluating OOD detection methods, enabling researchers to assess how well these methods can cope with evolving distributions.

Additionally, human-in-the-loop evaluations are crucial for understanding the practical implications of OOD detection performance. Human judgment plays a pivotal role in interpreting the outputs of OOD detection systems, especially in safety-critical applications. We suggest integrating human-in-the-loop evaluations into the benchmarking process. This could involve conducting user studies where participants interact with OOD detection systems and provide feedback on the system's performance. Such evaluations would help assess the usability and interpretability of OOD detection methods from a human-centric perspective, providing valuable insights into their practical applicability.

Furthermore, the current benchmarks often overlook the integration of a broad array of OOD detection methods and datasets. The OpenOOD benchmark [18] offers a promising foundation by implementing over 30 OOD detection methods and providing a structured platform for evaluation. Building upon this, we propose expanding the benchmark to include a broader range of methods and datasets. This expansion should cover existing approaches and incorporate emerging methodologies and datasets reflecting the latest advancements in OOD detection research. By creating a more inclusive and comprehensive benchmark, researchers can gain a more holistic understanding of the current state of OOD detection and identify promising areas for future research.

Lastly, evaluation protocols should facilitate cross-disciplinary collaboration and comparison. OOD detection intersects with areas such as anomaly detection, open set recognition, and model uncertainty. Current benchmarks often operate in isolation, limiting opportunities for cross-disciplinary insights and comparisons. We recommend developing a unified framework that integrates OOD detection with related fields. This framework could provide a common platform for evaluating methods from different disciplines, fostering cross-disciplinary collaboration and promoting a more integrated understanding of OOD detection. By doing so, researchers from various backgrounds can collaborate more effectively, driving innovation and advancing the field collectively.

In summary, advancing the evaluation protocols and benchmarks for OOD detection requires addressing several key challenges. These include the need for more diverse and realistic datasets, standardized evaluation metrics, dynamic simulation of real-world scenarios, human-in-the-loop evaluations, comprehensive benchmark frameworks, and cross-disciplinary integration. Implementing these recommendations would create a more robust and fair evaluation ecosystem for OOD detection, enabling researchers to develop and deploy more reliable and effective OOD detection methods. The insights and methodologies from papers such as "Towards Realistic Out-of-Distribution Detection" and "OpenOOD" provide a solid foundation for these improvements, paving the way for more realistic and impactful evaluations in the future.

### 10.5 Enhancing Model Efficiency and Scalability

Enhancing the efficiency and scalability of out-of-distribution (OOD) detection methods is crucial as these systems are increasingly deployed in large-scale and real-time applications, such as autonomous driving, healthcare, and cybersecurity. These applications demand rapid processing and minimal computational overhead to maintain operational speed and reliability. Insights from the paper "SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection" offer valuable strategies for improving the performance of OOD detection systems in large-scale deployments [8].

First, reducing dimensionality during preprocessing lowers computation time, ideally with little loss of accuracy. High-dimensional data often contain redundant features that contribute little to detection. Principal Component Analysis (PCA), for instance, projects high-dimensional data onto a lower-dimensional subspace that retains most of the variance, reducing the cost of every downstream distance or density computation [8]. t-distributed Stochastic Neighbor Embedding (t-SNE) is often mentioned alongside PCA, but it is computationally expensive and does not provide a mapping for unseen points, so it is better suited to visualizing feature spaces than to preprocessing in a detection pipeline.
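As a brief worked form of the PCA projection described above (the standard textbook formulation, not a detail taken from [8]; the symbols $\mu$, $W_k$, $d$, and $k$ are introduced here for illustration), a sample $x \in \mathbb{R}^d$ is centered and projected onto the top $k$ principal directions:

$$
z = W_k^\top (x - \mu), \qquad \hat{x} = \mu + W_k z ,
$$

where $\mu$ is the mean of the training data and the columns of $W_k \in \mathbb{R}^{d \times k}$ are the $k$ leading eigenvectors of the training covariance matrix. Downstream detectors then operate on $z \in \mathbb{R}^k$ with $k \ll d$, so per-sample distance computations scale with $k$ rather than $d$.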

Second, approximation methods further improve efficiency. Techniques such as random projection and sampling reduce computational load by simplifying the data representation or the detection model. Random projection, justified by the Johnson-Lindenstrauss lemma, maps high-dimensional data into a lower-dimensional space while approximately preserving pairwise distances, thus enabling faster processing [8].
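The distance-preserving property of random projection can be illustrated with a minimal sketch. This is a plain-Python Gaussian random projection written for illustration only; the function names `gaussian_random_projection` and `euclidean` are hypothetical and are not part of SUOD or any library API.

```python
import math
import random


def gaussian_random_projection(points, target_dim, seed=0):
    """Project each row of `points` (lists of equal length) to `target_dim` dims."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    # Entries are drawn i.i.d. from N(0, 1/target_dim), so squared norms
    # are preserved in expectation.
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [
        [sum(w * x for w, x in zip(row, point)) for row in matrix]
        for point in points
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(4)]
    projected = gaussian_random_projection(data, target_dim=200)
    before = euclidean(data[0], data[1])
    after = euclidean(projected[0], projected[1])
    print(f"distance before: {before:.3f}, after: {after:.3f}")
```

The projected distance is only approximately equal to the original one; the approximation tightens as `target_dim` grows, which is the trade-off between speed and fidelity that the lemma quantifies.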

Moreover, correcting task-load imbalance in distributed environments is critical for scalability. Distributed computing platforms allow parallel processing, but when some workers receive far more expensive tasks than others, the slowest worker determines the overall runtime. Load balancing and dynamic task scheduling distribute computational tasks more evenly, improving resource utilization and throughput [8].
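One simple way to even out worker loads is the greedy longest-processing-time rule, sketched below. This is an illustrative stand-in, not the scheduling procedure of [8]; the function name `balance_tasks` and the cost values are hypothetical, and the sketch assumes a cost estimate per task is already available.

```python
import heapq


def balance_tasks(task_costs, num_workers):
    """Greedy longest-processing-time assignment.

    task_costs: dict mapping task name -> estimated cost (assumed known).
    Returns (assignments, loads), both keyed by worker index.
    """
    # Min-heap of (current load, worker index): the least-loaded worker is on top.
    heap = [(0.0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignments = {worker: [] for worker in range(num_workers)}
    # Place the most expensive tasks first, each on the least-loaded worker.
    ordered = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for name, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignments[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    loads = {worker: load for load, worker in heap}
    return assignments, loads


if __name__ == "__main__":
    # Hypothetical per-detector cost estimates (arbitrary units).
    costs = {"knn": 8, "lof": 7, "ocsvm": 6, "iforest": 3, "hbos": 2, "pca": 2}
    assignments, loads = balance_tasks(costs, num_workers=2)
    print(assignments)
    print(loads)
```

Sorting by descending cost before assignment matters: placing the largest tasks first leaves the small ones to fill the remaining gaps between workers.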

The modular acceleration system proposed in "SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection" integrates data reduction, model approximation, and load balancing, with each module targeting a specific bottleneck. This structure makes it easier to adapt to different deployment scenarios and computational constraints [8].

Hardware accelerators, such as GPUs and TPUs, also play a significant role in boosting computational power. Optimized for parallel processing, these devices handle high computational demands efficiently, enabling real-time processing and immediate decision-making [8].

Lastly, developing lightweight OOD detection models is vital for scalability. Model pruning, quantization, and distillation produce compact models, typically with only a small loss in accuracy. Pruning removes unnecessary parameters, quantization reduces numerical precision, and distillation trains a smaller model using the outputs of a larger one as targets, which eases deployment across a range of devices [8].
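Two of these compression techniques can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: a temperature-softened KL distillation loss and symmetric per-tensor 8-bit quantization, both in plain Python. The function names and the temperature value are hypothetical choices, not taken from [8].

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)
    return temperature ** 2 * kl


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    print(distillation_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2]))
    q, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
    print(q, dequantize(q, scale))
```

The distillation loss is zero when the student reproduces the teacher's softened distribution exactly, and the quantization error per weight is bounded by half the step size `scale`.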

In conclusion, enhancing the efficiency and scalability of OOD detection requires a multi-faceted approach spanning data preprocessing, model optimization, distributed computing, and hardware acceleration. The modular design of "SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection" illustrates how dimensionality reduction, approximation methods, and load balancing can be combined [8], and efficient solutions of this kind remain necessary for reliable and safe machine learning systems.

### 10.6 Expanding Application Domains and Integration with Other Tasks

Expanding the application domains of out-of-distribution (OOD) detection beyond traditional classification tasks requires integrating it with related tasks such as anomaly detection (AD), novelty detection (ND), and open set recognition (OSR). These tasks share a fundamental goal with OOD detection: identifying instances that deviate significantly from known patterns or distributions. Treating them jointly and applying them more broadly can make machine learning systems more robust and adaptable in diverse environments.

One notable area for integration is anomaly detection (AD), which focuses on identifying rare events or data points that diverge markedly from the norm. Anomalies may result from errors, fraudulent activities, or unusual behaviors. Incorporating OOD detection mechanisms can improve the sensitivity and specificity of anomaly detection systems, leading to better differentiation between typical and atypical behaviors. Contrastive learning methods, for example, learn representations in which typical and atypical samples are easier to separate, although their properties and limitations for this purpose are still being studied [40].

Similarly, integrating OOD detection with novelty detection (ND) enhances the adaptability of machine learning models to novel situations not encountered during training. Novelty detection identifies patterns or objects not represented in the training dataset but expected to appear in future operations. Leveraging OOD detection techniques allows systems to better prepare for handling such novelties, reducing the likelihood of misclassification or failure. Studies have demonstrated the effectiveness of density-based and likelihood-ratio-based approaches in efficiently detecting novelties, even in the absence of direct out-of-distribution training samples [1].
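The likelihood-ratio idea mentioned above can be sketched with a toy model. The code below is only an illustration under strong assumptions: both densities are diagonal Gaussians, and the background model is obtained by inflating the in-distribution variances by a hypothetical factor of 25. Published likelihood-ratio methods use far richer density models, and none of the names below come from a cited work.

```python
import math


def fit_diag_gaussian(samples, var_floor=1e-6):
    """Fit per-dimension mean and variance to a list of equal-length vectors."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    variances = [
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, var_floor)
        for d in range(dim)
    ]
    return means, variances


def log_density(x, model):
    """Log-density of x under a diagonal Gaussian given as (means, variances)."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def likelihood_ratio_score(x, id_model, background_model):
    """Higher values mean x is better explained by the in-distribution model."""
    return log_density(x, id_model) - log_density(x, background_model)


if __name__ == "__main__":
    train = [[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [0.0, 0.0]]
    id_model = fit_diag_gaussian(train)
    # Hypothetical background model: same means, variances inflated 25x.
    background = (id_model[0], [25.0 * v for v in id_model[1]])
    print(likelihood_ratio_score([0.0, 0.0], id_model, background))
    print(likelihood_ratio_score([10.0, 10.0], id_model, background))
```

No out-of-distribution samples are needed to fit either model, which matches the setting described above; a threshold on the score then separates familiar inputs from novel ones.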

Another key application domain is open set recognition (OSR), which involves recognizing classes seen during training while accurately rejecting instances from unseen classes. This is particularly important in scenarios where the total number of potential classes is theoretically limitless, rendering conventional closed-set classification insufficient. By integrating OOD detection strategies, OSR systems can better identify and reject out-of-class instances, thereby improving overall performance and reliability. Research indicates that combining OOD detection with OSR leads to more robust models less susceptible to errors when confronted with unseen categories [1].
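The reject option at the core of OSR can be illustrated with a minimal sketch that thresholds the maximum softmax probability. This is one common baseline, chosen here for brevity and not the specific method of any cited work; the function name `open_set_predict`, the class names, and the threshold value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def open_set_predict(logits, class_names, threshold=0.5):
    """Return (label, confidence); label is 'unknown' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # No known class is sufficiently confident: reject the sample.
        return "unknown", probs[best]
    return class_names[best], probs[best]


if __name__ == "__main__":
    classes = ["car", "pedestrian", "cyclist"]
    print(open_set_predict([4.0, 0.5, 0.2], classes))  # confident prediction
    print(open_set_predict([1.0, 1.1, 0.9], classes))  # near-uniform, rejected
```

In practice the threshold would be calibrated on held-out data, since the right operating point depends on the cost of wrongly accepting an unseen class relative to wrongly rejecting a known one.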

Furthermore, expanding the application of OOD detection entails addressing real-world complexities unique to specific industries or sectors. In medical imaging, for instance, OOD detection can aid in identifying subtle abnormalities that might escape human observation. Techniques like dual-conditioned diffusion models constrain the generative manifold to maintain structural and semantic consistency with in-distribution samples, thereby enhancing OOD detection accuracy in medical imaging contexts [1]. Likewise, in industrial settings, OOD detection can monitor sensor reliability and data quality, safeguarding critical operations.

Developing comprehensive benchmarks and evaluation protocols is crucial for accommodating a wide range of application scenarios and facilitating fair comparisons among OOD detection methods. The OpenOOD framework serves as a unified platform for benchmarking and evaluating OOD detection across different datasets and tasks [6], enabling researchers to assess the practical effectiveness of OOD detection strategies in real-world conditions. Integration with anomaly detection, novelty detection, and OSR necessitates tailored domain-specific metrics and evaluation criteria that accurately reflect each task's unique challenges.
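Two threshold-based metrics that are widely reported in OOD benchmarks, AUROC and the false positive rate at a fixed true positive rate, can be computed as in the sketch below. The convention that in-distribution samples are the positive class and that higher scores mean "more in-distribution" is an assumption of this example, and the function names are illustrative rather than part of the OpenOOD API.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. Quadratic in sample count; fine for a sketch.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when at least `tpr` of ID samples are."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))
    threshold = ranked[keep - 1]
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)


if __name__ == "__main__":
    id_scores = [0.9, 0.8, 0.7, 0.6]
    ood_scores = [0.65, 0.3, 0.2, 0.1]
    print(auroc(id_scores, ood_scores))       # 0.9375
    print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
```

Because both metrics depend only on the ranking of scores, they allow detectors with differently scaled outputs to be compared on the same footing.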

Future research should concentrate on creating more efficient and scalable OOD detection methods capable of managing large-scale, heterogeneous data. The SUOD system exemplifies the viability of accelerating large-scale unsupervised outlier detection through a modular approach that optimizes data reduction, model approximation, and task load balancing [8]. Additionally, incorporating meta-learning techniques can bolster the adaptability and robustness of OOD detection systems, enabling them to learn from past experiences and generalize to new, unseen scenarios [41].

In summary, integrating OOD detection with tasks like anomaly detection, novelty detection, and open set recognition presents substantial opportunities for enhancing the reliability and safety of machine learning systems across application domains. Addressing real-world complexities and broadening the scope of OOD detection should lead to more robust and versatile models that are better suited to the diverse challenges of modern data environments. Continued refinement of OOD detection methods can contribute to the safety and efficacy of AI-driven systems.


## References

[1] Generalized Out-of-Distribution Detection: A Survey

[2] Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability

[3] A Survey on Out-of-Distribution Detection in NLP

[4] Out-of-Distribution Detection Using Peer-Class Generated by Large Language Model

[5] ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms

[6] OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

[7] Meta OOD Learning for Continuously Adaptive OOD Detection

[8] SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

[9] Shifting Transformation Learning for Out-of-Distribution Detection

[10] Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection

[11] Watermarking for Out-of-distribution Detection

[12] MIM-OOD: Generative Masked Image Modelling for Out-of-Distribution Detection in Medical Images

[13] Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need

[14] General-Purpose Multi-Modal OOD Detection Framework

[15] Towards Rigorous Design of OoD Detectors

[16] Out-of-Distribution Detection for Automotive Perception

[17] Object Detectors in the Open Environment: Challenges, Solutions, and Outlook

[18] Openbots

[19] Large Class Separation is not what you need for Relational Reasoning-based OOD Detection

[20] SupEuclid: Extremely Simple, High Quality OoD Detection with Supervised Contrastive Learning and Euclidean Distance

[21] Unlearning with Fisher Masking

[22] A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges

[23] Language Models are Few-Shot Learners

[24] PaLM: Scaling Language Modeling with Pathways

[25] Deep Structured Cross-Modal Anomaly Detection

[26] Evaluation of Human and Machine Face Detection using a Novel Distinctive Human Appearance Dataset

[27] Localizing Grouped Instances for Efficient Detection in Low-Resource Scenarios

[28] References in and citations to NIME papers

[29] AUTO: Adaptive Outlier Optimization for Online Test-Time OOD Detection

[30] Using Semantic Information for Defining and Detecting OOD Inputs

[31] Detecting and Learning Out-of-Distribution Data in the Open world: Algorithm and Theory

[32] Anomaly Detection under Distribution Shift

[33] COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts

[34] Unified Out-Of-Distribution Detection: A Model-Specific Perspective

[35] Towards Effective Semantic OOD Detection in Unseen Domains: A Domain Generalization Perspective

[36] Detecting semantic anomalies

[37] Multi-Attribute Open Set Recognition

[38] A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?

[39] MOS: Towards Scaling Out-of-distribution Detection for Large Semantic Space

[40] Understanding the properties and limitations of contrastive learning for Out-of-Distribution detection

[41] Automating Outlier Detection via Meta-Learning


