Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of natural language processing and multimodal document understanding systems, where domain and semantic shifts are unavoidable. While many post-hoc OOD detection methods have been developed for vision models, their direct transfer to textual and multimodal Transformer architectures remains poorly understood. We show that, unlike in vision benchmarks, the feature space provides the dominant OOD signal for text and document models, consistently outperforming logit-based and hybrid detectors.
Building on this observation, we introduce \textbf{VECO} (\emph{VEctor COnformity}), a geometry-aware, purely feature-based OOD scoring framework that implements a stable soft contrast between in-distribution conformity and residual-space deviation.
We instantiate VECO using principal-subspace conformity for multimodal document models and Mahalanobis distance conformity for text classifiers, reflecting modality-aligned representation structure.
VECO achieves state-of-the-art results and consistent performance improvements on multimodal document and text classification benchmarks. These results highlight the modality-dependent nature of OOD detection and the importance of adapting score design to representation cues.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
We thank the reviewers for their constructive feedback. The revised manuscript incorporates several improvements aimed at clarifying the methodology, strengthening the empirical study, and improving presentation. All additions are highlighted in blue in the revised manuscript.
**1. Paper organization and clarity improvements.**
We reorganized the manuscript to improve readability and the logical flow of the presentation. In particular, the description of the experimental setup was moved to follow the full presentation of the VECO method. A passage was also rewritten to clarify the monotonicity properties of the VECO score. Specifically, we now explicitly state that the score is strictly increasing in the principal-space energy $s_p$ and strictly decreasing in the residual-space energy $s_r$.
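The stated monotonicity properties can be illustrated with a minimal sketch. The linear contrast form below (`alpha * s_p - s_r`) is an assumption chosen to satisfy the stated properties, not the paper's exact score; `U_k` and `U_r` denote the principal and residual subspace bases.

```python
import numpy as np

def veco_score(f, U_k, U_r, alpha):
    """Hypothetical VECO-style score: strictly increasing in the
    principal-space energy s_p, strictly decreasing in the
    residual-space energy s_r (a sketch, not the paper's formula)."""
    s_p = np.sum((U_k.T @ f) ** 2)  # in-distribution conformity energy
    s_r = np.sum((U_r.T @ f) ** 2)  # residual-space deviation energy
    return alpha * s_p - s_r
```

Under this form, a feature lying in the principal subspace scores higher than one lying in the residual subspace, matching the intended contrast.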
**2. Expanded description of baselines.**
To improve clarity and ensure reproducibility, we added a dedicated appendix section providing detailed descriptions of all baseline OOD detectors used in the experiments. This section explains the underlying mechanisms of the methods (logit-based, feature-based, and hybrid approaches) and clarifies how they differ from the proposed VECO framework.
**3. Clarification of hyperparameters and implementation details.**
We added explicit explanations of the VECO hyperparameters and how they are selected. In particular, we clarified the roles of the principal dimension $k$, the residual dimension $m$, and the calibration factor $\alpha$. We now explicitly state that $m=512$ is fixed across experiments while $k$ is selected per benchmark. The calibration factor is computed from training data statistics:
$$
\alpha = \frac{\sum_i \|U_k^\top f_i^{\text{train}}\|^2}{\sum_i \|U_r^\top f_i^{\text{train}}\|^2}.
$$
This formulation emphasizes that $\alpha$ simply calibrates the relative magnitudes of principal and residual energies and does not require tuning.
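The calibration above can be sketched as follows. The details of how the bases are obtained are assumptions: here $U_k$ is taken as the top-$k$ left-singular vectors of the training feature matrix and $U_r$ as the next $m$ directions, which may differ from the paper's exact recipe.

```python
import numpy as np

def calibrate_alpha(F_train, k, m):
    """Compute the calibration factor alpha from training features.

    F_train: (n, d) array of training features f_i.
    k: principal dimension; m: residual dimension (fixed to 512 in the paper).
    Assumes variance-ordered directions from an SVD of the feature matrix,
    with the residual basis taken as the m directions after the top-k.
    """
    # Left-singular vectors of F_train.T give directions ordered by variance.
    U, _, _ = np.linalg.svd(F_train.T, full_matrices=False)  # U: (d, min(n, d))
    U_k, U_r = U[:, :k], U[:, k:k + m]
    s_p = np.sum((F_train @ U_k) ** 2)  # sum_i ||U_k^T f_i||^2
    s_r = np.sum((F_train @ U_r) ** 2)  # sum_i ||U_r^T f_i||^2
    return s_p / s_r
```

Since the top-$k$ directions carry the most variance by construction, the resulting $\alpha$ simply rescales the residual energy to be comparable with the principal energy, with no tuning involved.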
**4. Notation corrections and theoretical clarification.**
We revised the notation used in the derivation of the probabilistic motivation to improve clarity. In particular, the coefficients appearing in the idealized likelihood-ratio expression are now explicitly defined in the main text.
**5. Hyperparameter ablation studies.**
Following reviewer suggestions, we added an extensive ablation study evaluating the sensitivity of VECO to the hyperparameters $k$, $m$, and $\alpha$. The appendix now reports numerical ablations and sensitivity plots demonstrating that the method remains stable across a wide range of parameter values and does not require precise tuning.
**6. Broader impact statement.**
A broader impact section was added to discuss the potential implications of the proposed method. The section explains how improved OOD detection can enhance the reliability of NLP and multimodal document systems, while also acknowledging limitations such as potential dataset biases and the fact that OOD detectors cannot guarantee safe deployment.
**7. Minor presentation improvements.**
We corrected formatting issues in figures.
Overall, these changes improve the clarity, completeness, and reproducibility of the paper while addressing the reviewers’ suggestions.
Assigned Action Editor: ~bo_han2
Submission Number: 6928