Abstract: We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We propose a novel conjunction-based segmentation strategy to generate coherent sentence-completion pairs, enabling broad topical and structural diversity. To create challenging distractors, we introduce DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest model performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset's difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://anonymized_for_review.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Benchmarking, automatic creation and evaluation of language resources, NLP datasets
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Persian, English
Previous URL: https://openreview.net/forum?id=n8XoZ4u2rP
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous submission was desk rejected due to an extra line on the ninth page. This submission adheres to the ACL format by respecting the 8-page limit.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics section
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We clearly cite every model that we used for evaluation. We also created the dataset from publicly available web sources.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics section
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Ethics section
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The dataset was built from publicly available web sources.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix B.5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4. Experiment
C3 Descriptive Statistics: Yes
C3 Elaboration: 4. Experiment
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: We explain the process in Appendix B.4. We could also provide a link to the Label Studio instance used for annotation, for transparency.
D2 Recruitment And Payment: No
D2 Elaboration: Annotation was performed voluntarily by acquaintances of the authors, without payment.
D3 Data Consent: No
D3 Elaboration: We used publicly available web sources.
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Ethics section
D5 Characteristics Of Annotators: No
D5 Elaboration: Annotators were acquaintances of the authors who participated anonymously; we did not collect their demographic characteristics.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used an AI assistant only to polish the written text for a more professional tone and better flow.
Author Submission Checklist: yes
Submission Number: 636