Trustworthiness and co-cognition in artificial intelligence systems
Keywords: Causal reasoning, Causal alignment, Trustworthiness in AI, Interpretability
Abstract: Trustworthiness in artificial intelligence requires more than
performance on a fixed task distribution:
a system may satisfy a formal objective while failing to track what
matters to users when conditions change. We argue that trustworthiness requires
\emph{co-cognition}: the system reacts specifically to the
same objects that structure the user's experience. Here an
object is any region of experience that an agent individuates, and a
reaction is \emph{specific} when the object's state change is its minimal cause
relative to the observer's causal model.
We formalize co-cognition through an object-correspondence condition
and a minimality condition, and identify stability, generalization, and robustness as
its operational signatures. We further argue that co-cognition is a
common cause of both trustworthiness and consciousness attribution, which explains
why the two are correlated: trust arises through an active
construction process in which specific reactions lead the observer
to attribute a shared point of view. We show that current ML
systems systematically fail to achieve co-cognition due to correlational
learning, and diagnose distribution shift, reward hacking, adversarial vulnerability,
and hallucination as distinct co-cognition failures. This provides a principled account
of trustworthiness grounded in causal alignment rather than performance alone.
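The minimality condition can be made concrete with a toy sketch. The abstract does not give the formalization, so everything below is an illustrative assumption rather than the authors' definition: the observer's causal model is reduced to a dictionary of variable states, a system is a function `react` from such a state to a reaction, and the hypothetical helpers `minimal_causes` and `is_specific` treat a reaction as specific to an object when changing that object alone, with everything else held fixed, changes the reaction.

```python
from itertools import combinations


def minimal_causes(react, state, alternatives):
    """Return the smallest-cardinality sets of variables whose change alters the reaction.

    react: function mapping a dict of variable states to a reaction.
    state: dict with the baseline state of each variable.
    alternatives: dict with one alternative state per variable.
    (All three are assumptions of this toy model, not part of the paper.)
    """
    baseline = react(state)
    names = sorted(state)
    for size in range(1, len(names) + 1):
        found = []
        for subset in combinations(names, size):
            changed = dict(state)
            for name in subset:
                changed[name] = alternatives[name]  # intervene on this subset only
            if react(changed) != baseline:
                found.append(set(subset))
        if found:
            return found  # stop at the first size that changes the reaction
    return []


def is_specific(react, state, alternatives, obj):
    """True if changing `obj` alone is enough to change the reaction."""
    return {obj} in minimal_causes(react, state, alternatives)


# Two toy systems that agree whenever shape and background co-vary.
def shape_reader(s):
    return "cow" if s["shape"] == "cow" else "other"


def background_reader(s):
    return "cow" if s["background"] == "grass" else "other"


state = {"shape": "cow", "background": "grass"}
alternatives = {"shape": "camel", "background": "sand"}

shape_ok = is_specific(shape_reader, state, alternatives, "shape")
background_ok = is_specific(background_reader, state, alternatives, "shape")
```

Both toy systems give the same reaction on the baseline state, so performance on correlated data cannot tell them apart. Only the intervention on `shape` alone shows that `background_reader` does not react specifically to the object the user cares about, which is the kind of failure the abstract attributes to correlational learning.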
Submission Number: 7