Abstract: High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models.
Although this capability has a variety of benign applications, paraphrasing attacks---paraphrases applied to machine-generated texts---are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion: given paraphrased text, recover an approximation of the original text. The closer the approximation is to the original, the better machine-text detectors will perform. We propose an approach that frames the problem as translation from paraphrased text back to the original text; training the inversion model therefore requires examples of original texts paired with their paraphrases. Fortunately, such training data can easily be generated from a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing, and that an inversion model can learn these biases from a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time.
Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
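The data-generation step described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' code): `toy_paraphrase` is a stand-in for an actual paraphrasing model such as GPT-4 or Llama-3, and the pair format assumes a standard sequence-to-sequence setup where the inversion model learns to translate a paraphrase back to its original.

```python
def toy_paraphrase(text: str) -> str:
    """Hypothetical paraphraser; a real pipeline would call an LLM here."""
    swaps = {"easy": "simple", "produce": "generate"}
    return " ".join(swaps.get(w, w) for w in text.split())

def build_inversion_pairs(corpus):
    """Build (source, target) pairs for paraphrase inversion.

    Inversion is framed as translation: the paraphrase is the model's
    input (source) and the original text is the training target.
    """
    return [{"source": toy_paraphrase(t), "target": t} for t in corpus]

corpus = ["paraphrases are easy to produce"]
pairs = build_inversion_pairs(corpus)
print(pairs[0]["source"])  # the paraphrased input
print(pairs[0]["target"])  # the original text to recover
```

Any off-the-shelf encoder-decoder or instruction-tuned model could then be fine-tuned on these pairs; the sketch only shows how supervision comes for free from the corpus plus a paraphraser.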
Paper Type: Long
Research Area: Generation
Research Area Keywords: paraphrasing, rumor/misinformation detection, prompting, generative models, applications, text-to-text generation, style analysis
Languages Studied: English
Submission Number: 5555