Abstract: We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We propose a novel conjunction-based segmentation strategy to generate coherent sentence-completion pairs, enabling broad topical and structural diversity. To create challenging distractors, we introduce DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest model performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset's difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://anonymized_for_review.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Benchmarking, automatic creation and evaluation of language resources, NLP datasets
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Persian, English
Previous URL: https://openreview.net/forum?id=n8XoZ4u2rP
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous submission was desk rejected due to an extra line on the ninth page. This submission adheres to the ACL format by respecting the 8-page limit.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics section
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We clearly cite every model that we used for evaluation. We also created the dataset from publicly available web sources.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics section
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Ethics section
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The dataset was built from publicly available web sources.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix B.5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4. Experiment
C3 Descriptive Statistics: Yes
C3 Elaboration: 4. Experiment
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: We explain the process in Appendix B.4. We could also provide a link to the Label Studio instance used for annotation, for transparency.
D2 Recruitment And Payment: No
D2 Elaboration: Annotation was performed voluntarily by acquaintances of the authors, without payment.
D3 Data Consent: No
D3 Elaboration: We used publicly available web sources.
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Ethics section
D5 Characteristics Of Annotators: No
D5 Elaboration: Annotators were acquaintances of the authors who participated anonymously; we did not collect their demographic characteristics.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used an AI assistant only to polish the written text for a more professional tone and better flow.
Author Submission Checklist: yes
Submission Number: 636