Humans vs ChatGPT: Uncovering the Non-trivial Distinctions by Evaluating Parallel Responses

23 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: ChatGPT, Natural Language Processing, Machine Learning, Roget's Thesaurus
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: ChatGPT and human text differ distinctly in the concepts they convey and in their lexicographical structure, but are very similar in their syntactic and semantic features.
Abstract: The advent of ChatGPT and similar Large Language Models has caused an uproar because these models can generate human-like natural language. Given the high similarity between human text and ChatGPT text, this raises the question of whether the two are truly indistinguishable. In this study, human-generated content is compared with content from ChatGPT-3.5, ChatGPT-4, and Davinci-3, using the same technical questions found on StackOverflow and general questions found on Yahoo Answers. We leveraged Roget's thesaurus to uncover thematic similarities and differences between the human corpora and the GPT corpora. We performed a chi-square test on Roget's 1034 categories and found a significant difference in the appearance of words for 365 of them. To uncover differences in the neighborhoods of the word embeddings, we used the MIT Embedding Comparator to compare the base GloVe vectors with their versions trained on the human and ChatGPT corpora. Pre-trained BERT and Sentence-BERT were used to measure the semantic similarity between the answers given by humans and by ChatGPT to the same questions; the answers turned out to be highly similar. While that might indicate difficulty in distinguishing ChatGPT text from human text, the significant differences in the appearance of words motivated a move towards classification using machine learning models. We observed that various machine learning models performed very well. In summary, we identify disparities and parallels that can be attributed to conceptual, contextual, or lexicographic factors, and we endeavor to connect each methodology to these respective categories.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7357
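
Illustrative sketch (not from the submission): the abstract reports a chi-square test on word appearance per Roget category, but does not say how the contingency tables were built, which significance level was used, or whether any correction was applied. The following minimal Python sketch shows one plausible way such a per-category test could look, assuming a 2x2 table of "words in the category" versus "words not in the category" for a human corpus and a GPT corpus. The function names, the 0.05 threshold, and the counts in the usage note are all hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("degenerate table: a row or column sums to zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def category_differs(human_hits, human_total, gpt_hits, gpt_total, alpha=0.05):
    """Assumed setup: *_hits counts tokens that fall in one thesaurus
    category, *_total counts all tokens in that corpus. alpha is an
    assumed threshold, not a value taken from the paper."""
    stat, p_value = chi_square_2x2(
        human_hits, human_total - human_hits,
        gpt_hits, gpt_total - gpt_hits,
    )
    return stat, p_value, p_value < alpha
```

For example, with made-up counts, `category_differs(20, 100, 40, 100)` yields a statistic of roughly 9.52 and flags the category as different at the assumed 0.05 level, while identical proportions in both corpora give a statistic of 0.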