Copyleaks’ Innovative Text Fingerprinting Research Uncovers Key Insights Into AI Model Reliance
NEW YORK, NY – (March 3, 2025) – Copyleaks, a pioneering force in AI-based text analysis, AI governance, and plagiarism detection, today revealed that nearly three-quarters (74.2%) of texts generated by DeepSeek-R1 match OpenAI’s stylistic fingerprints, indicating a possible reliance on OpenAI’s model during its training process.
This discovery raises concerns about DeepSeek-R1’s resemblance to OpenAI’s model, particularly regarding data sourcing, intellectual property rights, and transparency. Undisclosed reliance on existing models can reinforce biases, limit diversity, and pose legal or ethical risks. Beyond these technical issues, DeepSeek’s claims of a groundbreaking, low-cost training method, if in fact based on unauthorized distillation of OpenAI’s model, may have misled the market, contributing to NVIDIA’s $593 billion single-day loss in market value and giving DeepSeek an unfair advantage.
Using a highly rigorous approach, the research combined three advanced AI classifiers, each trained on texts from four major models: Claude, Gemini, Llama, and OpenAI. These classifiers identified subtle stylistic features like sentence structure, vocabulary, and phrasing. What made the method particularly effective was its “unanimous jury” system, where all three classifiers had to agree before a classification was made. This ensured a robust check against false positives, resulting in an impressive 99.88% precision rate and just a 0.04% false-positive rate, accurately identifying texts from both known and unknown AI models.
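To illustrate the “unanimous jury” logic described above, here is a minimal Python sketch. The model labels, classifier functions, and names below are placeholders assumed for illustration; Copyleaks has not published its actual implementation.

```python
# Hypothetical sketch of a "unanimous jury" ensemble: attribute a text to a
# model's stylistic fingerprint only when all classifiers agree. All names
# here (KNOWN_MODELS, unanimous_jury) are illustrative, not Copyleaks' code.

KNOWN_MODELS = {"openai", "claude", "gemini", "llama"}

def unanimous_jury(classifiers, text):
    """Return a model label only if every classifier casts the same vote.

    `classifiers` is a list of callables, each mapping a text to one of
    the labels in KNOWN_MODELS (or "unknown"). A single dissenting vote
    voids the attribution, which is what suppresses false positives.
    """
    votes = [clf(text) for clf in classifiers]
    if len(set(votes)) == 1 and votes[0] in KNOWN_MODELS:
        return votes[0]   # unanimous match to a known fingerprint
    return "unknown"      # any disagreement -> no attribution
```

Requiring unanimity trades recall for precision: a text is attributed only when the evidence is consistent across all three independently trained classifiers, consistent with the high precision and low false-positive rate reported above.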
When testing this ensemble on DeepSeek-R1, the results were notable:
- 74.2% of the generated texts aligned with OpenAI’s stylistic fingerprints, raising important questions about originality and the future of AI-generated content.
- In contrast, Microsoft’s Phi-4 model demonstrated a 99.3% disagreement rate, showing no resemblance to any known model and confirming its independent training.
“With this research, we have moved beyond general AI detection as we knew it and into model-specific attribution, a breakthrough that fundamentally changes how we approach AI content,” said Shai Nisan, Chief Data Scientist at Copyleaks. “This capability is crucial for multiple reasons, including improving overall transparency, ensuring ethical AI training practices, and, most importantly, protecting the intellectual property rights of AI technologies and, hopefully, preventing their potential misuse.”
About the Study
The Copyleaks Data Science Team conducted the research, led by Yehonatan Bitton, Shai Nisan, and Elad Bitton. The methodology involved a “unanimous jury” approach, relying on three distinct detection systems to classify AI-generated texts, with a judgment made only when all systems agreed. This technique enables the identification of major AI models like ChatGPT, Claude, Gemini, and Llama while also detecting the unique stylistic fingerprints of unseen models.
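Building on the sketch above, a short, hypothetical usage example shows how such a jury could be run across a corpus to estimate agreement rates of the kind reported in the study. The corpus and classifiers are placeholders, not the study’s actual data or code.

```python
# Illustrative only: tallies unanimous attributions vs. "unknown" verdicts
# across a corpus, reusing the unanimous_jury() sketch above.

def attribution_report(classifiers, corpus):
    """Return the share of texts attributed to each label, plus "unknown"."""
    tally = {}
    for text in corpus:
        label = unanimous_jury(classifiers, text)
        tally[label] = tally.get(label, 0) + 1
    return {label: count / len(corpus) for label, count in tally.items()}

# A result like {"openai": 0.742, "unknown": 0.258} would mirror the
# DeepSeek-R1 finding; one like {"unknown": 0.993, ...} the Phi-4 result.
```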
This research has significant implications. It provides transparency regarding AI authorship and addresses concerns about the increasing prevalence of AI-generated content. It also establishes a framework for protecting intellectual property rights and preventing misinformation and misuse of AI technologies.
“Copyleaks is dedicated to advancing AI-generated text verification,” Nisan added. “As AI technologies evolve, it is crucial for stakeholders to accurately discern the origins of AI-generated content. Our approach not only enhances fair use protection but also improves security and tracks the evolution of AI writing styles.”
###
About Copyleaks
Copyleaks is a leading AI text analysis platform empowering businesses and educational institutions to navigate the ever-evolving landscape of genAI confidently. With an award-winning suite of AI-powered tools trusted by millions, Copyleaks ensures AI governance, empowers responsible AI adoption, safeguards intellectual property, and maintains academic integrity with comprehensive AI and plagiarism detection.
For additional information, visit our website or follow us on LinkedIn.