Mitigating Annotation Artifacts in Physical Commonsense Reasoning Benchmark

Wing-Lam Mok, SungHo Kim

Published: 2023, Last Modified: 26 May 2026, IEEE Big Data 2023, CC BY-SA 4.0

Abstract: Physical commonsense reasoning plays a pivotal role in achieving human-like AI, since it lets machines understand the real world. Accordingly, physical commonsense reasoning benchmarks [1]–[3] have been developed to evaluate how well language models understand physical characteristics and their corresponding uses. While pre-trained language models (PLMs) have shown remarkable performance on such benchmarks, recent studies [4] have found that these models tend to exploit artifacts in the benchmarks as shortcuts, producing overestimated results while bypassing actual reasoning. In this paper, we propose a novel crowdsourcing MAA framework to mitigate the artifacts that lead to the shortcut effect revealed by the stress test on partial input in physical commonsense reasoning, thereby disclosing the real commonsense reasoning ability of PLMs.

External IDs: dblp:conf/bigdataconf/MokK23