Copyleaks Blog

Your learning destination for all things responsible AI, plagiarism and beyond.

Nearly 60% of GPT-3.5 Outputs Contained Some Form of Plagiarized Content

Copyleaks Research Finds Nearly 60% of GPT-3.5 Outputs Contained Some Form of Plagiarized Content

An unprecedented amount of AI-generated content is now saturating the internet. A 2023 report predicts that by 2026, nearly 90% of all online content will be AI-generated. This saturation raises the prospect of data pollution and inevitable model collapse, which in turn raises concerns about the overall quality and reliability of AI-generated text.

Furthermore, broader concerns about originality have also emerged. In the wake of several lawsuits alleging that AI infringes on copyright and potentially plagiarizes, educational institutions and enterprises across the globe are questioning the authenticity of AI text: Where did it originate? Is it safe to use as original content?

Ultimately, does AI plagiarize?

To find out, Copyleaks conducted an analysis to determine the degree to which AI-generated content is original and free of potential plagiarism.

Number of Papers Tested for Each Subject



To conduct this analysis:

We asked GPT-3.5 to write 1,045 outputs, averaging 412 words across all outputs, in 26 subjects.


59.7% of GPT-3.5 Outputs Contained Some Form of Plagiarized Content


Identical Text*

Physics: 83.7%
Chemistry: 68.0%
Science: 67.3%
Psychology: 63.3%
Law: 57.5%
Economics: 57.1%
Biology: 55.1%
Business Studies: 51.4%
Engineering: 51.4%
Accounting: 50.0%
Geography: 49.0%
Mathematics: 49.0%
Computer Science: 47.5%
Sports: 42.1%
World History: 39.6%
Philosophy: 37.5%
English Language: 37.1%
Art: 35.0%
Physical Education: 35.0%
Statistics: 32.5%
Social Science: 28.6%
Nature: 25.0%
Music: 22.9%
Sociology: 22.9%
Humanities: 15.0%
Theater: 14.3%


Minor Changes**

Mathematics: 67.4%
Physics: 57.1%
Psychology: 53.1%
Science: 51.0%
Biology: 49.0%
Chemistry: 46.0%
Economics: 38.8%
Business Studies: 37.1%
Computer Science: 35.0%
Law: 30.0%
Statistics: 30.0%
Physical Education: 22.5%
Sports: 21.1%
Accounting: 20.0%
Art: 20.0%
Engineering: 20.0%
Philosophy: 17.5%
Geography: 16.3%
Nature: 15.0%
World History: 12.5%
Sociology: 11.4%
English Language: 8.6%
Social Science: 8.6%
Music: 5.7%
Theater: 5.7%
Humanities: 0.0%


Paraphrased Text***

Physics: 79.6%
Psychology: 79.6%
Chemistry: 66.0%
Science: 65.3%
Biology: 63.3%
Computer Science: 62.5%
Economics: 59.2%
Business Studies: 57.1%
Mathematics: 49.0%
Philosophy: 47.5%
Statistics: 47.5%
Sports: 47.4%
World History: 45.8%
Accounting: 42.5%
Law: 42.5%
Nature: 40.0%
Physical Education: 40.0%
Art: 35.0%
Engineering: 34.3%
Geography: 32.7%
Sociology: 31.4%
English Language: 28.6%
Music: 25.7%
Social Science: 20.0%
Humanities: 15.0%
Theater: 5.7%


*Identical Text: A one-for-one copying of someone else’s text that is passed off as your own

**Minor Changes: Content with minor alterations to the source material, such as altering a verb within a sentence (e.g., slow to slowly)

***Paraphrased Text: Putting someone else’s idea into your own words without crediting the original source
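
The three categories above can presumably overlap within a single output, so a headline figure such as 59.7% would count an output once if it shows any of them, rather than summing the per-category rates. The Python sketch below illustrates that counting logic. The `OutputFlags` type, the function name, and the toy data are hypothetical and are not Copyleaks' data or method; the only figures taken from the analysis are 59.7% and 1,045.

```python
from dataclasses import dataclass


@dataclass
class OutputFlags:
    """Hypothetical per-output result: one boolean per plagiarism category."""
    identical: bool
    minor_changes: bool
    paraphrased: bool


def share_with_any_plagiarism(outputs):
    """Return the percentage of outputs flagged in at least one category."""
    if not outputs:
        return 0.0
    flagged = sum(
        1 for o in outputs
        if o.identical or o.minor_changes or o.paraphrased
    )
    return 100.0 * flagged / len(outputs)


# Toy data (illustrative only): 3 of 5 outputs carry at least one flag.
sample = [
    OutputFlags(identical=True, minor_changes=False, paraphrased=True),
    OutputFlags(identical=False, minor_changes=False, paraphrased=False),
    OutputFlags(identical=False, minor_changes=True, paraphrased=False),
    OutputFlags(identical=False, minor_changes=False, paraphrased=False),
    OutputFlags(identical=False, minor_changes=False, paraphrased=True),
]
print(share_with_any_plagiarism(sample))  # 60.0

# Applying the reported share to the reported sample size:
approx_flagged = round(0.597 * 1045)
print(approx_flagged)  # 624
```

In other words, 59.7% of 1,045 outputs corresponds to roughly 624 outputs with at least one form of flagged content.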


Copyleaks then conducted an in-depth analysis to identify the specific outputs with the highest levels of identical text, minor changes, and paraphrasing across all 26 subjects.

Identical Text

Our analysis found that the individual GPT-3.5 output with the highest percentage of identical text was in Physics, where 27.0% of the text was identical. This was followed by an individual Chemistry output where 24.7% of the text was identical.

Outputs With the Highest Percentages of Identical Text for Each Subject


Minor Changes

The individual GPT-3.5 outputs with the highest percentages of minor changes were from Physics and Psychology, where 25.2% of each respective output contained minor changes.

Outputs With the Highest Percentages of Minor Changes for Each Subject


Paraphrased

The individual GPT-3.5 output with the highest percentage of paraphrasing was in Computer Science, where a surprising 80.7% of the text was paraphrased. This was followed by an individual Physics output where 76.3% of the text was paraphrased.

Outputs With the Highest Percentage of Paraphrasing for Each Subject


Similarity Score

The Similarity Score is a Copyleaks-specific scoring method aggregating the rate of identical text, minor changes, paraphrased text, and more. A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original.
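
As a rough illustration of how such an aggregate could behave at its two endpoints, here is a minimal Python sketch. It assumes, purely for illustration, that every word of an output falls into at most one category and that the score is simply the matched share of words. The actual Copyleaks formula is not given here and also includes other signals ("and more"), so the function and its parameters are hypothetical.

```python
def similarity_score(total_words, identical_words, minor_change_words, paraphrased_words):
    """Hypothetical aggregate: percentage of words matched in any category.

    Assumes the three word counts do not overlap. This is NOT the actual
    Copyleaks formula, only a sketch of a 0%-to-100% aggregate.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    matched = identical_words + minor_change_words + paraphrased_words
    matched = min(matched, total_words)  # guard against inconsistent counts
    return round(100.0 * matched / total_words, 1)


# A 412-word output (the average length in the analysis) with no matches:
print(similarity_score(412, 0, 0, 0))      # 0.0, all content original
# The same output with every word matched:
print(similarity_score(412, 412, 0, 0))    # 100.0, no content original
# A made-up mixed case: 40 identical, 20 minor changes, 43 paraphrased words.
print(similarity_score(412, 40, 20, 43))   # 25.0
```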

Subjects With the Highest and Lowest Average Similarity Scores

The subject with the highest average Similarity Score is Physics at 31.3%, followed closely by Psychology at 27.7% and Science at 26.7%. The subjects with the lowest average Similarity Score are Theater at 0.9%, Humanities at 2.8%, and English Language at 5.4%.


Outputs With the Highest Similarity Score for Each Subject

Across all subjects, our analysis found that the individual GPT-3.5 output with the highest Similarity Score was in Computer Science, with an astounding 100%, followed by Physics with 92% and Psychology with 88%.


Key Takeaways

With AI-generated content expanding and continuing to saturate the internet, having reliable detection solutions in place is critical. As the Copyleaks data shows, nearly 60% of the GPT-3.5 outputs analyzed contained some form of plagiarism.

The insights provided by the analysis can help educational institutions and organizations put emphasis on certain subjects when checking for plagiarism, allowing them to tailor their approach as needed to ensure all potential risks and concerns are addressed. For example, Physics, Chemistry, Mathematics, and Psychology might require a more in-depth look to identify plagiarized text, while other subjects, including Theater and Humanities, may require less scrutiny.
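
One way to turn this kind of subject-level insight into a review policy is sketched below in Python. The scores are the average Similarity Scores reported above for the three highest and three lowest subjects. The 20% threshold, the tier names, and the `review_tier` function are hypothetical choices for illustration, not a Copyleaks recommendation.

```python
# Average Similarity Scores reported in the analysis (percent).
AVERAGE_SIMILARITY = {
    "Physics": 31.3,
    "Psychology": 27.7,
    "Science": 26.7,
    "English Language": 5.4,
    "Humanities": 2.8,
    "Theater": 0.9,
}


def review_tier(subject, threshold=20.0):
    """Map a subject to a hypothetical review tier.

    `threshold` is an arbitrary illustrative cutoff; subjects without a
    reported average are returned as "unknown" rather than guessed.
    """
    score = AVERAGE_SIMILARITY.get(subject)
    if score is None:
        return "unknown"
    return "in-depth" if score >= threshold else "standard"


# Subjects ordered from highest to lowest average score.
ranked = sorted(AVERAGE_SIMILARITY, key=AVERAGE_SIMILARITY.get, reverse=True)

for subject in ranked:
    print(f"{subject}: {review_tier(subject)}")
```

An institution would choose its own cutoff and would likely combine several signals rather than rely on a single average per subject.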

Furthermore, the data underscores the need for organizations to adopt solutions that detect the presence of AI-generated content and provide the necessary transparency surrounding potential plagiarism within the AI content. Full-spectrum protection that includes AI and plagiarism detection ensures compliance with copyright and licensing and empowers authenticity and originality within all content.

Do AI Models Plagiarize?


Is it starting to feel like AI is everywhere? That might be because, in many ways, it is. There’s an unprecedented amount of AI-generated content now saturating the internet.

A 2023 report predicts that by 2026, nearly 90% of online content will be AI-generated. 

With so much AI content in circulation, quality has become a hot topic of discussion worldwide. Add to the mix that this saturation creates data pollution, which inevitably leads to model collapse, and there’s even more to be concerned about.

In reality, it seems that the more AI is around, the more reasons we find to pay closer attention.

Recently, another question has come up. In the wake of several lawsuits alleging that AI infringes on copyright and potentially plagiarizes, educational institutions and enterprises across the globe are questioning the authenticity of AI text: Do AI models plagiarize?

Plagiarism in the traditional sense has led to some hefty lawsuits in the past. So why wouldn’t we wonder what’s in that AI-generated content? Should we be concerned about where it came from and if it’s safe to use?

To find an answer to those questions, we here at Copyleaks set out to determine the degree to which AI-generated content is original and free of potential plagiarism. Our process was simple: we asked GPT-3.5 to write 1,045 pieces of content, averaging 412 words across all outputs, in 26 subjects, including accounting, world history, art, physics, law, mathematics, music, philosophy, social science, and more.

What we found surprised us.

59.7% of GPT-3.5 Outputs Contained Some Form of Plagiarized Content

Types of Plagiarism Found



*Identical Text: A one-for-one copying of someone else’s text that is passed off as your own

**Minor Changes: Content with minor alterations to the source material, such as altering a verb within a sentence (e.g., slow to slowly)

***Paraphrased Text: Putting someone else’s idea into your own words without crediting the original source


Copyleaks then conducted an in-depth analysis to identify the specific outputs with the highest levels of identical text, minor changes, and paraphrasing across all 26 subjects.

Identical Text

The analysis found that the individual GPT-3.5 output with the highest percentage of identical text was in Physics, where 27.0% of the text was identical. This was followed by an individual Chemistry output where 24.7% of the text was identical.

Minor Changes

The individual GPT-3.5 outputs with the highest percentages of minor changes were from Physics and Psychology, where 25.2% of each respective output contained minor changes.

Paraphrased

The individual GPT-3.5 output with the highest percentage of paraphrasing was in Computer Science, where a surprising 80.7% of the text was paraphrased. This was followed by an individual Physics output where 76.3% of the text was paraphrased.

Similarity Score

The Similarity Score is a Copyleaks-specific scoring method aggregating the rate of identical text, minor changes, paraphrased text, and more. A score of 0% signifies that all of the content is original, whereas a score of 100% means that none of the content is original.

Highest Average

The subject with the highest average Similarity Score is Physics at 31.3%, followed closely by Psychology at 27.7% and Science at 26.7%.

Lowest Average

The subjects with the lowest average Similarity Score are Theater at 0.9%, Humanities at 2.8%, and English Language at 5.4%. 

Highest Overall

The analysis found that the individual GPT-3.5 output with the highest Similarity Score was in Computer Science, with an astounding 100%.

It’s clear that with AI-generated content expanding and continuing to saturate the internet, we need to be more informed about what’s in AI-generated content, especially before we publish it or submit it in an assignment. As the data shows, nearly 60% of the GPT-3.5 outputs we analyzed contained some form of plagiarism, and nothing ruins a career or academic achievement faster than a plagiarism accusation.

The insights provided by the analysis can help educational institutions, content creators, marketing teams, and everyone else who uses AI models put the necessary emphasis on certain subjects when checking for plagiarism. Doing so allows them to tailor their approach as needed to ensure all potential risks and concerns are addressed.

Furthermore, the data underscores the need for adopting solutions that detect the presence of AI-generated content and provide the necessary transparency surrounding potential plagiarism within the AI content. Full-spectrum protection that includes AI and plagiarism detection ensures compliance with copyright and licensing and empowers authenticity and originality within all content.