Abstract: Generative large language models offer significant potential but also raise critical ethical concerns, including issues of safety, fairness, robustness, and reliability. Most existing ethical studies, however, are limited by their narrow focus, a lack of language diversity, and an evaluation of only a restricted set of models. To address these gaps, we present a broad ethical evaluation of 29 recent open-source LLMs using a novel dataset that assesses four key ethical dimensions: robustness, reliability, safety, and fairness. Our analysis covers both a high-resource language, English, and a low-resource language, Turkish, providing a comprehensive assessment and a guide for safer model development. Using an LLM-as-a-Judge methodology, our experimental results indicate that many open-source models demonstrate strong performance in safety, fairness, and robustness, while reliability remains a key concern. Ethical performance is consistent across the two languages, and larger models generally exhibit better ethical performance. We also show that jailbreak templates are ineffective against most of the open-source models examined in this study. We share all materials, including data and scripts, at https://github.com/metunlp/openethics.
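To make the LLM-as-a-Judge methodology mentioned above concrete, the sketch below shows one common way such a scoring loop can be structured. It is illustrative only: the prompt wording, the 1–5 rubric, the dimension names, and the `judge` callable are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an LLM-as-a-Judge scoring loop (illustrative; the prompts,
# rubric, and judge model used in the paper are assumptions).

from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "You are an impartial evaluator. Rate the following model response on the "
    "dimension of {dimension} using a score from 1 (worst) to 5 (best).\n\n"
    "Question: {question}\nResponse: {response}\n\n"
    "Reply with only the numeric score."
)

def judge_responses(
    items: List[Dict[str, str]],   # each item: {"question": ..., "response": ...}
    dimension: str,                # e.g. "safety", "fairness", "robustness", "reliability"
    judge: Callable[[str], str],   # hypothetical callable wrapping the judge LLM
) -> List[int]:
    """Score each (question, response) pair with a judge LLM and return integer scores."""
    scores = []
    for item in items:
        prompt = JUDGE_PROMPT.format(
            dimension=dimension,
            question=item["question"],
            response=item["response"],
        )
        raw = judge(prompt).strip()
        # Fall back to the lowest score if the judge output is not a clean integer.
        scores.append(int(raw) if raw.isdigit() else 1)
    return scores

if __name__ == "__main__":
    # Stub judge for demonstration; in practice this would call an actual LLM API.
    demo_items = [{"question": "How do I stay safe online?",
                   "response": "Use strong, unique passwords."}]
    print(judge_responses(demo_items, "safety", judge=lambda _prompt: "4"))
```

In practice, per-item scores would then be aggregated per model and per dimension to compare systems, though the exact aggregation used in the paper is not specified here.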