Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Concept Bottleneck Models (CBMs) aim to enhance the trustworthiness of AI systems by constraining their decisions to a set of human-understandable concepts. However, CBMs typically rely on datasets with presumed-accurate concept labels, an assumption that is often violated in practice and which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization that effectively mitigates the negative impact of concept mislabeling on CBM performance. We analyze key properties of the CPO objective, showing that it directly optimizes the concept posterior distribution, and contrast it with Binary Cross Entropy (BCE), showing that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets, with and without added label noise.
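For intuition, the sketch below shows how a DPO-style preference objective can be applied to binary concept labels in a CBM. It is a minimal illustration, not the paper's exact formulation: the function name `cpo_style_loss`, the use of a frozen reference model, and the construction of the dispreferred response by flipping the observed label are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def cpo_style_loss(concept_logits, ref_logits, concepts, beta=1.0):
    """DPO-style preference loss over binary concept labels (illustrative sketch).

    concept_logits: (batch, n_concepts) logits from the CBM concept head
    ref_logits:     (batch, n_concepts) logits from a frozen reference model
    concepts:       (batch, n_concepts) observed (possibly noisy) 0/1 labels
    """
    # Log-probability of the observed label under the model and the reference.
    # For a Bernoulli concept, log p(c|x) = -BCE(logit, c).
    logp_obs = -F.binary_cross_entropy_with_logits(
        concept_logits, concepts, reduction="none")
    logp_obs_ref = -F.binary_cross_entropy_with_logits(
        ref_logits, concepts, reduction="none")

    # Assumption: the dispreferred label is the flipped concept value.
    flipped = 1.0 - concepts
    logp_flip = -F.binary_cross_entropy_with_logits(
        concept_logits, flipped, reduction="none")
    logp_flip_ref = -F.binary_cross_entropy_with_logits(
        ref_logits, flipped, reduction="none")

    # Standard DPO margin: prefer the observed label over its flip,
    # measured relative to the reference model; the sigmoid softens the
    # penalty when the preference is ambiguous (e.g., under label noise).
    margin = beta * ((logp_obs - logp_obs_ref) - (logp_flip - logp_flip_ref))
    return -F.logsigmoid(margin).mean()
```

Compared with plain BCE on the observed labels, the saturating log-sigmoid bounds the gradient contribution of any single (possibly mislabeled) concept, which is the behavior the abstract attributes to CPO's reduced sensitivity to concept noise. For the authors' actual implementation, see the linked repository.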
Lay Summary: Concept Bottleneck Models (CBMs) are a type of machine learning model that first predict human-understandable concepts — like “has a beak” or “is smiling” — and then use those concepts to make a final decision. This design makes the model’s reasoning easier to inspect and, importantly, allows users to intervene by correcting mispredicted concepts. Unfortunately, like many machine learning models, CBMs assume all concept labels are accurate — which isn’t realistic. Real-world data is often contaminated with labeling errors due to subjectivity, labeler fatigue, or even standard training tricks like cropping images that can accidentally hide important features. Our work introduces a new training method called Concept Preference Optimization (CPO) that makes CBMs more reliable when labels aren’t perfect. Instead of treating every label as correct, CPO compares pairs of labels during training and teaches the model to favor those that seem more trustworthy. We show that CPO improves CBM performance even when many concept labels are wrong. It also helps the model better recognize when it’s unsure — a critical ability in high-stakes fields like healthcare or law enforcement.
Link To Code: https://github.com/Emilianopp/ConceptPreferenceOptimization
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Concept Bottleneck Models, Interpretable AI, XAI
Submission Number: 13713