Track: long paper (up to 10 pages)
Keywords: Interpretability, Preference Optimization, Concept Bottleneck Models
Abstract: Concept Bottleneck Models (CBMs) aim to enhance the trustworthiness of
AI systems by constraining their decisions to depend on a set of human-understandable
concepts. However, CBMs typically assume that datasets contain accurate concept
labels, an assumption that is often violated in practice and that we show can
significantly degrade performance (by 25% in some cases). To address this, we introduce the
Concept Preference Optimization (CPO) objective, a new loss function based on
Direct Preference Optimization, which effectively mitigates the negative impact
of concept mislabeling on CBM performance. We analyze
key properties of the CPO objective, showing that it directly optimizes for the concepts'
posterior distribution, and contrast it with Binary Cross Entropy (BCE), showing
that CPO is inherently less sensitive to concept noise. We empirically confirm
our analysis, finding that CPO consistently outperforms BCE on three real-world
datasets, both with and without added label noise.
Submission Number: 31