Abstract: Neural networks have achieved remarkable success across various fields. However, their lack of interpretability limits their practical use, particularly in critical decision-making scenarios. Post-hoc interpretability, which provides explanations for pretrained models, often suffers from limited fidelity and robustness. This has inspired rising interest in self-interpretable neural networks (SINNs), which inherently reveal the prediction rationale through their model structures. Despite this progress, existing research remains fragmented, relying on intuitive designs tailored to specific tasks. To bridge these efforts and foster a unified framework, we first collect and review existing works on SINNs and provide a structured summary of their methodologies from five key perspectives: attribution-based, function-based, concept-based, prototype-based, and rule-based self-interpretation. We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios, including image, text, and graph data, as well as deep reinforcement learning (DRL). Additionally, we summarize existing evaluation metrics for self-interpretation and identify open challenges in this field, offering insights for future research. To support ongoing developments, we provide a publicly accessible resource to track advancements in this domain: https://github.com/yangji721/Awesome-Self-Interpretable-Neural-Network
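To make the notion of self-interpretation concrete, the sketch below shows a toy prototype-based classifier in PyTorch, one of the five categories listed above. It is a minimal illustration under assumptions of our own (the class name, dimensions, and similarity transform are not taken from any specific surveyed method): predictions are linear combinations of similarities to learned prototype vectors, so the model's rationale is exposed by its own structure rather than recovered post hoc.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Toy prototype-based self-interpretable classifier (illustrative only).

    Each prediction is a weighted sum of similarities to learned prototype
    vectors, so a decision can be explained by pointing to the prototypes
    that contributed most to the winning class.
    """

    def __init__(self, input_dim: int, num_prototypes: int, num_classes: int):
        super().__init__()
        # Learnable prototype vectors in the input/feature space.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, input_dim))
        # Linear layer mapping prototype similarities to class logits.
        self.classifier = nn.Linear(num_prototypes, num_classes, bias=False)

    def forward(self, x: torch.Tensor):
        # Squared Euclidean distances between inputs and prototypes.
        dists = torch.cdist(x, self.prototypes) ** 2
        # Map distances to similarity scores (closer prototype = larger score).
        similarities = torch.log((dists + 1.0) / (dists + 1e-4))
        logits = self.classifier(similarities)
        # Returning the similarities exposes the built-in explanation.
        return logits, similarities


# Usage: the per-prototype similarities form the model's own explanation.
model = PrototypeClassifier(input_dim=16, num_prototypes=8, num_classes=3)
x = torch.randn(4, 16)
logits, similarities = model(x)
print(logits.shape, similarities.shape)  # torch.Size([4, 3]) torch.Size([4, 8])
```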
DOI: 10.1109/jproc.2025.3635153