Exploring Verification Frameworks for Social Choice Alignment

Published: 29 Aug 2025, Last Modified: 29 Aug 2025
Venue: NeSy 2025 - Phase 2 Poster
License: CC BY 4.0
Keywords: Social Choice Theory, Computational Ethics, Formal Verification
TL;DR: The paper explores how deep neural networks can be verified against culturally informed moral norms and suggests future directions using reachability analysis tools to model evolving social preferences over time.
Abstract: The deployment of autonomous agents that interact with humans in safety-critical situations raises new research problems as we move towards fully autonomous systems in domains such as autonomous vehicles or search and rescue. If autonomous agents are placed in a dilemma, how would they act? The literature in computational ethics has explored the actions and learning methods that emerge in ethical dilemmas. However, this position paper argues that ethical dilemmas do not occur in a social vacuum. Our central claim is that enabling trust among all human users requires neurosymbolic verification of moral preference alignment. We propose applying formal robustness properties to social choice modelling and outline how these properties can help validate the formation of stable social preference clusters in deep neural network classifiers. Our initial results highlight the vulnerability of models to perturbations in moral-critical scenarios, suggesting a verification-training loop for improved robustness. Based on these initial results, we position this work as an inquiry into the viability of verifying moral preference alignment. Ultimately, we aim to contribute to the broader interdisciplinary effort that integrates formal methods, social choice theory, and empirical moral psychology for interpretable computational ethics.
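To make the abstract's proposal concrete, the sketch below illustrates (under our own assumptions, not the authors' code) how a robustness property over symbolically bounded categorical features might be checked for a neural classifier trained on Moral Machine-style dilemmas. The feature names, the toy labelling rule, the per-feature bounds, and the sampling-based check are all illustrative; an actual verification tool would use exact reachability analysis rather than sampling.

```python
# Minimal sketch: empirical stability of a neural classifier's decision under
# semantically bounded perturbations of symbolic dilemma features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Symbolic features per dilemma, e.g. [n_young, n_elderly, n_pedestrians, crossing_legally]
X = rng.integers(0, 5, size=(500, 4)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)          # toy label: spare the larger young group
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

def robust_under_bounds(model, x, eps, n_samples=2000):
    """Proxy for a robustness property: the predicted class must not change for
    any perturbation within the per-feature half-widths `eps` (assumed here to
    be derived from the standard error of human moral preference data)."""
    base = model.predict(x.reshape(1, -1))[0]
    noise = rng.uniform(-eps, eps, size=(n_samples, x.size))
    preds = model.predict(np.clip(x + noise, 0, None))
    return bool(np.all(preds == base))

x0 = np.array([3.0, 1.0, 4.0, 1.0])
eps = np.array([0.5, 0.5, 0.5, 0.1])         # assumed semantic bounds per feature
print("stable within bounds:", robust_under_bounds(clf, x0, eps))
```

A failure of this check plays the role described in the abstract: the counterexample region can be fed back into training, forming the suggested verification-training loop.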
Track: Neurosymbolic Methods for Trustworthy and Interpretable AI
Paper Type: Short Paper
Resubmission: Yes
Changes List: We are very grateful for all the reviewers' time; their input has significantly strengthened the paper from the first submission to the current draft. We have integrated every point and outline the changes below.

Meta Review: In response to the meta-review, we strengthened the neurosymbolic framing by explicitly clarifying that while our classifier is neural, it operates over structured symbolic inputs: semantically tagged categorical features (e.g., age, gender, crossing behaviour) derived from the Moral Machine dataset. We emphasised the use of logical constraints over these features for robustness verification and included a description of how input features are symbolically bounded in the introduction. We also addressed the robustness-norm alignment gap by explicitly noting that geometric robustness does not equate to full moral alignment; rather, it serves as a proxy for value generalisation under morally meaningful perturbations, offering bounded behavioural guarantees. To constructively reframe our negative results, we argued that the failure of robustness outside the Western cluster validates our hypothesis that cultural variance affects generalisation, underscoring the need for culturally adaptive or value-weighted training. Finally, we provided descriptions of each robustness metric.

Review 1: We revised the abstract and conclusion to more clearly emphasise the central claim: that robustness verification can serve as a proxy for evaluating moral generalisation in culturally sensitive settings. We corrected the terminology throughout the paper from "neural-symbolic" to the more accurate "neurosymbolic". To improve interdisciplinary grounding, we added this discussion to the conclusion. We expanded the results section by integrating key insights from the supplementary material (formerly Section 6) directly into the main text. Additionally, we elaborated on the empirical findings, especially the contrast between the Western cluster's successful robustness and the failures observed in the Southern and Eastern clusters, framing this as evidence of the need for culturally adaptive training methods.

Review 2: We explicitly clarified what is symbolic in our pipeline: the classification model is a neural network, but the input space is derived from symbolic representations of moral dilemmas. We now explain that robustness properties are defined using symbolic constraints based on the standard error in human moral preference data, forming semantically bounded perturbations for verification. To improve clarity, we revised the introduction to distinguish between moral preferences (individual-level responses) and social norms (aggregated cultural structures), and refined the definition of "moral-critical" tasks as those with ethical consequences. We strengthened the core contribution statement to highlight our formal verification framework for semantically grounded robustness checks in morally sensitive domains. We expanded Section 3.3 to include more details on the dataset (Moral Machine), model training, verification methodology, and the significance of verification failures; these failures are now framed not as flaws, but as indicative of the moral complexity of the task. We also cleaned up technical notation throughout, defined all variables and acronyms on first use, explained robustness metrics in plain language, and corrected logical errors (e.g., fixing incorrect quantifier use in argmax expressions). Finally, we considered adding a neurosymbolic diagram but opted against it due to space constraints; however, we did add two tables for clarity.

Review 3: We clarified that our method is not only traditional epsilon-ball verification, but also geometric verification grounded in semantically relevant feature bounds derived from the standard error of real-world moral preference data. We emphasised that this approach aims to align geometric robustness with social norm alignment by introducing semantic weight into otherwise abstract perturbation techniques. We revised the framing to explicitly state that, while the model is a black box and cannot guarantee interpretable internal representations, the use of symbolic input features and bounded verification defines meaningful external behavioural constraints; Table 1 was added for this purpose. We clarified our hypothesis that semantically grounded perturbations improve the interpretability of robustness testing and can be generalised to future models trained on other questionnaire data. The paper is now positioned more explicitly as a neurosymbolic contribution: the neural network processes symbolically structured inputs, while formal verification tools apply logical and geometric constraints based on social data, laying the groundwork for interpretable moral reasoning. We lowered expectations around verification success, making clear that failures are not flaws but evidence of the need for future training-verification loops. We now present this work as a foundation for future research comparing robustness-based formal guarantees with local explainability tools such as LIME or SHAP, and exploring how formal bounds might support interpretable training objectives. Finally, we adjusted the title and abstract to reflect that the core contribution lies in the novel application of robustness verification to socially critical, morally sensitive decision-making contexts.
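Since the revisions repeatedly reference perturbation bounds derived from the standard error of human moral preference data, the short self-contained sketch below shows one way such per-cluster bounds could be computed. The cluster names, simulated response distributions, and the use of the standard error of the mean as the half-width of the verified region are assumptions for illustration, not the paper's actual procedure.

```python
# Minimal sketch: deriving semantic perturbation bounds from the standard error
# of aggregated (simulated) moral preference responses, per cultural cluster.
import numpy as np

rng = np.random.default_rng(1)
# Simulated respondent preferences for one symbolic feature (e.g. "spare the young"),
# grouped by cultural cluster; values stand for preference strengths in [0, 1].
responses = {"Western": rng.normal(0.70, 0.15, 300),
             "Eastern": rng.normal(0.55, 0.20, 300),
             "Southern": rng.normal(0.60, 0.25, 300)}

for cluster, prefs in responses.items():
    se = prefs.std(ddof=1) / np.sqrt(len(prefs))   # standard error of the mean
    # The standard error becomes the semantic half-width of the region verified
    # around that feature's mean value for the given cluster.
    print(f"{cluster}: mean={prefs.mean():.3f}, bound=+/-{se:.4f}")
```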
Publication Agreement: pdf
Submission Number: 21