Do Language Models Plagiarize?

Jooyoung Lee, Thai Le, Jinghui Chen, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

37 Scopus citations

Abstract

Past literature has illustrated that language models (LMs) often memorize parts of training instances and reproduce them in natural language generation (NLG) processes. However, it is unclear to what extent LMs "reuse" a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, we therefore study three types of plagiarism (i.e., verbatim, paraphrase, and idea) in GPT-2-generated texts, compared against its training data, and further analyze the plagiarism patterns of LMs fine-tuned on domain-specific corpora, which are extensively used in practice. Our results suggest that (1) all three types of plagiarism occur widely in LMs beyond memorization, (2) both the size and the decoding methods of LMs are strongly associated with the degree of plagiarism they exhibit, and (3) fine-tuned LMs' plagiarism patterns vary with their corpus similarity and homogeneity. Given that the majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. These patterns are likely to worsen as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. Overall, these findings cast doubt on the practicality of current LMs in mission-critical writing tasks and call for more discussion of the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.
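To make the verbatim category concrete: a minimal sketch of how verbatim reuse can be approximated is shown below via word n-gram overlap between a generated text and training documents. This is an illustrative assumption, not the authors' detection pipeline (their code is at the repository above); the function names, the n-gram length n, and the flagging threshold are all hypothetical choices. Paraphrase and idea plagiarism would instead require semantic similarity measures and are not covered by this sketch.

```python
# Illustrative sketch (assumed, not the paper's method): flag candidate
# verbatim plagiarism by measuring word n-gram overlap between a
# generated text and each training document.

from typing import List, Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Set of word n-grams from a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the generation's n-grams that also occur in the training doc."""
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & word_ngrams(training_doc, n)) / len(gen)


def flag_verbatim(generated: str, corpus: List[str],
                  n: int = 8, threshold: float = 0.2) -> List[int]:
    """Indices of corpus documents whose overlap exceeds the (assumed) threshold."""
    return [i for i, doc in enumerate(corpus)
            if verbatim_overlap(generated, doc, n) >= threshold]
```

Longer n-grams make matches more conservative (fewer false positives from common phrases), which is why a value like n = 8 is a typical starting point for verbatim checks; the threshold trades recall against precision.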

Original language: English (US)
Title of host publication: ACM Web Conference 2023 - Proceedings of the World Wide Web Conference, WWW 2023
Publisher: Association for Computing Machinery, Inc
Pages: 3637-3647
Number of pages: 11
ISBN (Electronic): 9781450394161
DOIs
State: Published - Apr 30 2023
Event: 2023 World Wide Web Conference, WWW 2023 - Austin, United States
Duration: Apr 30 2023 – May 4 2023

Publication series

Name: ACM Web Conference 2023 - Proceedings of the World Wide Web Conference, WWW 2023

Conference

Conference: 2023 World Wide Web Conference, WWW 2023
Country/Territory: United States
City: Austin
Period: 4/30/23 – 5/4/23

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
