Abstract: Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, and often no better than random guessing. To verify whether this finding generalizes across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, challenging previous conclusions. We find that the major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in prompts can partially bridge the gaps in over 50% of cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: human-oriented evaluation, multilingual MGT analysis, human preferences
Contribution Types: Data resources, Data analysis
Languages Studied: Arabic, Chinese, English, Hindi, Italian, Japanese, Kazakh, Russian, Vietnamese
Previous URL: https://openreview.net/forum?id=60EbrOffPP
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We revised the paper and changed its track from Human-Centered NLP to Resources and Evaluation
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethical Statement section on page 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 2, References
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 2
B6 Statistics For Data: Yes
B6 Elaboration: 2
C Computational Experiments: No
C1 Model Size And Budget: N/A
C1 Elaboration: This is a human-oriented case study of machine-generated text detection; we only call APIs to generate data and run no computational experiments on GPUs.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 2, 5
C3 Descriptive Statistics: Yes
C3 Elaboration: 3, 4, 5
C4 Parameters For Packages: N/A
C4 Elaboration: We did not use such packages for processing.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: 2, Appendix B
D2 Recruitment And Payment: N/A
D2 Elaboration: The authors performed all annotations; we did not recruit annotators externally.
D3 Data Consent: Yes
D3 Elaboration: 1
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: 2
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 97