AI Model Plagiarism and Why It Matters for Education

October 17, 2023

In the ever-evolving world of education, technology has presented opportunities for enhancing the learning experience as well as challenging it.

Since ChatGPT, built on GPT-3.5, was released in November 2022, education has understandably shifted toward setting rules and regulations around where AI-generated content belongs within the curriculum, in the classroom, and beyond.

Furthermore, discussions arose around whether using AI models was cheating. While students argued that the generated content was original and had never been organized in that particular form before, educators argued that it wasn’t original work and that it compromised the development of essential skills.

As these new conversations around AI models moved to the forefront, another seemed to fade into the background: plagiarism. Before AI, plagiarism was one of the most oft-discussed challenges facing education. However, the relevance of plagiarism came into question following the explosion of AI model use. After all, if students could now have an AI write entire essays for them, wasn’t plagiarism a moot point?

Not exactly. Plagiarism didn’t go away after AI; it simply changed shape.

As AI continued to grow, the issues surrounding it grew as well. One in particular was the question of copyright infringement and plagiarism. AI models like ChatGPT were trained on human-written content from the internet and other vast databases. This raises the question: where did that content come from?

As it turns out, it came from everywhere, including novels, scientific journals, memoirs, and magazine articles, and permission to use the content to train the AI models was rarely granted.

Therein lies the problem. If these AI models were trained on vast amounts of human-created content, could some portions of what they generate be construed as plagiarized? In short, could AI-generated student essays, computer science projects, and other assignments still present the original challenge of plagiarism?

To answer that question, we set out to determine how much AI-generated content is original and how much contains potential plagiarism. Our process was simple: we asked GPT-3.5 to write 1,045 pieces of content, averaging 412 words across all outputs, in 26 popular academic subjects, including World History, Art, Physics, Law, Mathematics, Music, Philosophy, Social Science, and more.

Here is what we discovered.

59.7% of GPT-3.5 Outputs Contained Some Form of Plagiarized Content

Similarity Score

The Similarity Score is a Copyleaks-specific scoring method within our LMS and API integrations, aggregating the rate of identical text, minor changes, paraphrased text, and more. Educators and students utilize the Similarity Score to determine levels of potential plagiarism present within a scanned piece of content. A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original.

Here is what we found for the 1,045 pieces of content written by GPT-3.5.

Highest Average

The subject with the highest average Similarity Score is Physics at 31.3%, followed closely by Psychology at 27.7% and Science at 26.7%.

Lowest Average

The subjects with the lowest average Similarity Score are Theater at 0.9%, Humanities at 2.8%, and English Language at 5.4%.

Highest Overall

The analysis found that the individual GPT-3.5 output with the highest Similarity Score was in Computer Science, with an astounding 100%.
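For readers who want to run the same kind of breakdown on their own scan results, here is a minimal sketch in Python. It is not the tooling used for this analysis: the records are made-up values for illustration only, the `summarize` helper is hypothetical, and treating any score above 0% as "contains some form of plagiarized content" is an assumption on our part.

```python
from collections import defaultdict

# Hypothetical records: (subject, Similarity Score in percent).
# These values are invented for illustration; they are NOT the study data.
records = [
    ("Physics", 40.0),
    ("Physics", 20.0),
    ("Theater", 0.0),
    ("Theater", 2.0),
    ("Computer Science", 100.0),
    ("Computer Science", 0.0),
]


def summarize(records):
    """Return per-subject averages, the share of flagged outputs, and the top output."""
    by_subject = defaultdict(list)
    for subject, score in records:
        by_subject[subject].append(score)

    # Average Similarity Score per subject.
    averages = {s: sum(v) / len(v) for s, v in by_subject.items()}

    # Assumption: an output counts as flagged when its score is above 0%.
    flagged = sum(1 for _, score in records if score > 0)
    share_flagged = 100 * flagged / len(records)

    # The single output with the highest Similarity Score.
    top = max(records, key=lambda r: r[1])
    return averages, share_flagged, top


averages, share_flagged, top = summarize(records)
for subject, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{subject}: {avg:.1f}%")
print(f"Outputs with a score above 0%: {share_flagged:.1f}%")
print(f"Highest individual score: {top[1]:.0f}% ({top[0]})")
```

Sorting the averages in descending order gives the highest-average and lowest-average subjects at the top and bottom of the list, while the maximum over individual records corresponds to the "highest overall" figure.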

Types of Plagiarism Found

As shown by the analysis, plagiarism is not dead. In fact, thanks to AI, it’s moving back to the forefront of conversation within education.

None of this is to say that educational institutions should not use AI models or GenAI platforms like Bard and ChatGPT. In fact, as with other technological advancements, there is an opportunity to bring AI models into the learning journey and to have constructive conversations with students about how to adopt AI responsibly.

Instead, the analysis implies that we need to be more informed about what’s in AI-generated content, further emphasizing the importance of the human touch. Considering that nearly 60% of the GPT-3.5 outputs in this analysis potentially contained some form of plagiarism, it’s vital to take the necessary steps to ensure originality, given how critical it is to the learning process overall.

With the right tools that address AI and plagiarism, not just one or the other, educational institutions can empower authenticity and originality within all content and supply the necessary data to open up the conversation between educators and students about AI use and responsible adoption.