Domain Independent Deception Detection: Feature Sets, LIWC Efficacy, and Synthetic Data Challenges

Casey Hanks, Shanina Ko, Emily Nguyen, Rakesh M. Verma

Published: 2024, Last Modified: 14 Jun 2024. Venue: IWSPA@CODASPY 2024. License: CC BY-SA 4.0

Abstract: Deception is increasingly prevalent in the modern world, appearing in many different forms (domains), from phishing emails to fictitious product reviews and even false political statements. Many researchers are looking for ways to identify deception within these different domains or to characterize deception through new datasets and techniques. Most researchers focus on one domain of deception at a time; we, however, are interested in whether domain-independent deception detection is possible. To this end, we recreate a linguistic cue-based feature set proven to be effective in the business-to-business communication (B2B) domain and compare it to state-of-the-art linguistic cue feature sets in five other domains of deception. This B2B feature set was created with LIWC2007, which gives us the unique opportunity to use that version alongside the modern version, LIWC22, to ascertain how much linguistic cue extraction has improved over the years. Finally, we leverage our best-performing feature set to investigate the efficacy of a technique that has gained popularity in recent years: crowd-sourced synthetic datasets.