Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

TMLR Paper9362 Authors

01 Jun 2026 (modified: 12 Jun 2026) | Under review for TMLR | Readers: Everyone | License: CC BY 4.0
Abstract: Reliable models should not only predict correctly, but also base their decisions on acceptable evidence. However, conventional supervised learning typically provides only class-level labels, which allows models to achieve high accuracy by exploiting shortcut correlations rather than the intended decision evidence. Human priors, such as bounding boxes or target interface elements, can help constrain such behavior, but aligning model evidence with these priors remains challenging because learned decision evidence often diverges from human perception. In this work, we study attribution-guided human-prior alignment with subset-selection-based attribution. Motivated by earlier deletion and insertion evaluations showing that subset-selection attribution can identify compact decision-supporting regions, we use it as a training-time signal to expose the model’s decision evidence. When the top-attributed evidence deviates substantially from the prior region, we penalize off-prior reliance and encourage the model to shift its evidence toward the intended regions. This yields a selective prior-constrained objective that avoids uniformly suppressing all non-prior regions. We validate our method on both image classification and click decision tasks in MLLM-based GUI agents. Across discriminative classification and autoregressive decision-making settings, our method improves task accuracy while enhancing attribution reasonability.
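The selective penalty described in the abstract can be illustrated with a small sketch. This is not the authors' objective: the function name `selective_off_prior_penalty`, the parameters `top_frac` and `overlap_threshold`, and the normalization are all hypothetical, and a precomputed attribution map stands in for the subset-selection attribution used in the paper. The sketch only shows the gating idea: nothing is penalized unless the top-attributed evidence falls mostly outside the prior region.

```python
import numpy as np

def selective_off_prior_penalty(attribution, prior_mask,
                                top_frac=0.2, overlap_threshold=0.5):
    """Hypothetical sketch of a selective off-prior penalty.

    attribution: non-negative attribution scores (any shape).
    prior_mask:  boolean array of the same shape, True inside the human prior.
    top_frac and overlap_threshold are illustrative assumptions, not values
    taken from the paper.
    """
    attribution = np.asarray(attribution, dtype=float)
    prior_mask = np.asarray(prior_mask, dtype=bool)
    if attribution.shape != prior_mask.shape:
        raise ValueError("attribution and prior_mask must have the same shape")

    flat_attr = attribution.ravel()
    flat_prior = prior_mask.ravel()

    # Take the top-attributed elements as the model's decision evidence.
    k = max(1, int(round(top_frac * flat_attr.size)))
    top_idx = np.argsort(flat_attr)[::-1][:k]

    # Fraction of the top evidence that lies inside the prior region.
    overlap = flat_prior[top_idx].mean()
    if overlap >= overlap_threshold:
        # Evidence already agrees with the prior: apply no penalty.
        return 0.0

    total = flat_attr.sum()
    if total <= 0:
        return 0.0

    # Penalize only the top-attributed mass outside the prior,
    # rather than suppressing every non-prior element uniformly.
    off_prior_idx = top_idx[~flat_prior[top_idx]]
    return float(flat_attr[off_prior_idx].sum() / total)
```

In a training loop, a differentiable version of such a term would be added to the task loss with a weighting coefficient; the NumPy version above is meant only to make the gating logic concrete.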
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xingchen_Wan1
Submission Number: 9362