Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation

ACL ARR 2025 July Submission 431 Authors

28 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Foundation models (FMs) have achieved significant success across a wide range of tasks, prompting extensive research on benchmarks for their reasoning abilities. However, FM performance in exceptional scenarios, which we define as out-of-distribution (OOD) reasoning tasks, remains understudied. This paper is the first to address these cases, introducing a novel dataset for evaluating FMs across multiple modalities, including graphic novels, calligraphy, news articles, and lyrics. The dataset covers tasks for instance classification, character recognition, token prediction, and text generation. The paper also introduces two prompt engineering techniques, Out-of-distribution Reasoning Chain-of-Thought (ORCoT) and ORCoT+Few-Shot, and validation of FMs with these techniques shows performance improvements. The code repository contains all relevant code and supplementary materials, including prompts such as ORCoT. It is accessible at: https://github.com/Code4PaperBlind/ExceptionalBenchmark
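The exact ORCoT prompts ship with the linked repository; as a hedged illustration only, the sketch below shows one plausible way an ORCoT-style prompt wrapper could be structured, with few-shot examples turning it into ORCoT+Few-Shot. The preamble wording, example content, and function name are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of an ORCoT-style prompt wrapper (illustrative only;
# the actual prompts are provided in the paper's repository).

OOD_REASONING_PREAMBLE = (
    "The following input may be exceptional or out-of-distribution. "
    "Before answering, reason step by step about how it deviates from "
    "typical cases, then state your final answer."
)

def build_orcot_prompt(task_instruction: str, query: str,
                       few_shot_examples: list[tuple[str, str]] | None = None) -> str:
    """Compose an ORCoT-style prompt; adding examples yields ORCoT+Few-Shot."""
    parts = [OOD_REASONING_PREAMBLE, task_instruction]
    # Optional few-shot demonstrations: (input, worked reasoning and answer) pairs.
    for example_input, example_answer in few_shot_examples or []:
        parts.append(f"Input: {example_input}\nReasoning and answer: {example_answer}")
    parts.append(f"Input: {query}\nReasoning and answer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    prompt = build_orcot_prompt(
        task_instruction="Predict the next token of the lyric fragment.",
        query="Row, row, row your ...",
        few_shot_examples=[("Twinkle, twinkle, little ...", "star")],
    )
    print(prompt)
```

In an evaluation loop, the composed prompt would be sent to each FM under test; the preamble above merely stands in for the released ORCoT prompt text.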
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic creation and evaluation of language resources, evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English, Korean
Submission Number: 431