Resource

Evaluating the Accuracy of the Copyleaks AI Detector

A Step-by-Step Methodology

We believe it is more important than ever to be fully transparent about the AI Detector’s accuracy, its false positive and false negative rates, and its areas for improvement, to ensure responsible use and adoption. This analysis documents the testing methodology behind our AI Detector’s V10 model.

Test date: October 16, 2025

Publish date: November 12, 2025

Model tested: V10

The Copyleaks Data Science and QA teams independently performed testing to ensure unbiased and accurate results. Testing data differed from training data and contained no content previously submitted to the AI Detector for AI detection.

Testing data consisted of human-written text sourced from verified datasets and AI-generated text from various AI models. The test was performed with the Copyleaks API.

Metrics include overall accuracy, based on the rate of correct and incorrect text identification, and ROC-AUC (Receiver Operating Characteristic – Area Under the Curve), which examines true positive rates (TPR) and false positive rates (FPR). Additional metrics include the F1 score, true negative rate (TNR), and confusion matrices.

Testing verifies that the AI Detector displays a high detection accuracy for distinguishing between human-written and AI-generated text while maintaining a low false positive rate.

Evaluation Process

We designed our evaluation process around a dual-department system to ensure top-level quality, standards, and reliability. Two independent departments evaluate the model: the Data Science team and the QA team. Each works with its own evaluation data and tools and has no access to the other’s evaluation process. This separation keeps the results unbiased, objective, and accurate while capturing every dimension of the model’s performance. It is also essential to note that the testing data is kept separate from the training data: we test our models only on new data they have never seen before.

Methodology

Copyleaks’ QA and Data Science teams independently gathered a variety of testing datasets, each consisting of a finite number of texts. The expected label—a marker indicating whether a specific text was written by a human or by AI—of each dataset is determined by the source of the data. Human texts were collected either from material published before the rise of modern generative AI systems or, for later material, from trusted sources re-verified by the team. AI-generated texts were produced using a variety of generative AI models and techniques.

The tests were executed against the Copyleaks API. We checked whether the API’s output was correct for each text based on the target label, and then aggregated the scores to calculate the confusion matrix.
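The aggregation step described above can be sketched as follows. This is a hypothetical illustration: the actual Copyleaks API client, endpoint, and response format are not shown in this document, so `predicted_label` simply stands in for whatever the API returns.

```python
# Hypothetical sketch of the aggregation step: tally each prediction
# against its expected (target) label into a confusion matrix.
# "ai" is treated as the positive class, "human" as the negative class.

def update_confusion(matrix, expected_label, predicted_label):
    """Tally one (expected, predicted) pair into the confusion matrix."""
    if expected_label == "ai":
        key = "TP" if predicted_label == "ai" else "FN"
    else:
        key = "TN" if predicted_label == "human" else "FP"
    matrix[key] += 1

# Usage with stand-in predictions (a real run would call the detection API):
matrix = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
labeled_results = [("ai", "ai"), ("ai", "human"),
                   ("human", "human"), ("human", "ai")]
for expected, predicted in labeled_results:
    update_confusion(matrix, expected, predicted)
print(matrix)  # {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```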

Results: Data Science Team

The Data Science team conducted the following independent test:

  • The language of the texts was English, and 300,000 human-written texts and 200,000 AI-generated texts from various LLMs were tested in total.
  • Text lengths vary, but the datasets contain only texts with lengths larger than 350 characters—the minimum our product accepts.

Evaluation Metrics

The metrics used in this text classification task are:

1. Confusion Matrix: A table showing the counts of TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives).

2. Accuracy: The proportion of correct results (both true positives and true negatives) among the total number of texts checked.

3. True Negative Rate (TNR): The proportion of actual negative instances that are correctly predicted as negative, TN / (TN + FP). In the context of AI detection, TNR is the model’s accuracy on human-written texts.

4. True Positive Rate (TPR), also known as Recall: The proportion of actual positive instances that are correctly predicted as positive, TP / (TP + FN). In the context of AI detection, TPR is the model’s accuracy on AI-generated texts.

5. F-beta Score: The weighted harmonic mean of precision (TP / (TP + FP)) and recall. With beta < 1 (here beta = 0.5), it weights precision more heavily, since we want to favor a lower false positive rate.
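These definitions translate directly into code. A minimal sketch (not Copyleaks’ implementation), computing each metric from raw confusion-matrix counts:

```python
# Metrics from raw confusion-matrix counts; the AI class is positive.

def accuracy(tp, fp, tn, fn):
    # Proportion of correct results among all texts checked.
    return (tp + tn) / (tp + fp + tn + fn)

def tnr(tn, fp):
    # True negative rate: accuracy on human (negative) texts.
    return tn / (tn + fp)

def tpr(tp, fn):
    # True positive rate / recall: accuracy on AI (positive) texts.
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=0.5):
    # Weighted harmonic mean of precision and recall; beta < 1
    # weights precision more heavily (favoring a low FPR).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts (not taken from the results below):
print(accuracy(tp=95, fp=1, tn=99, fn=5))  # 0.97
print(tnr(tn=99, fp=1))                    # 0.99
print(tpr(tp=95, fn=5))                    # 0.95
```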

Combined AI and Human Datasets

| Dataset's Name | Number of texts | Number of Human texts | Number of AI texts | TPR | TNR | F-beta(0.5) |
|---|---|---|---|---|---|---|
| Internal extra-hard datasets, including adversarial attacks and special tools | 500,000 | 300,000 | 200,000 | 0.988 | 0.999 | 0.997 |
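As a sanity check, the reported F-beta(0.5) can be approximately reconstructed from the table’s own TPR, TNR, and class sizes. Because the published rates are rounded to three decimals, the reconstruction lands slightly below the published 0.997:

```python
# Reconstruct approximate confusion-matrix counts from the table row
# (the rates are rounded, so the result is approximate).
n_ai, n_human = 200_000, 300_000
tp = round(0.988 * n_ai)     # TPR * number of AI texts
fn = n_ai - tp
tn = round(0.999 * n_human)  # TNR * number of human texts
fp = n_human - tn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
b2 = 0.5 ** 2
f_half = (1 + b2) * precision * recall / (b2 * precision + recall)
print(round(f_half, 3))  # 0.996, close to the published 0.997
```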

Results: QA Team

The QA team conducted the following independent test:

  • The language of the texts was English, and 229,843 human-written texts and 18,712 AI-generated texts from various LLMs were tested in total.
  • Text lengths vary, but the datasets contain only texts with lengths larger than 350 characters—the minimum our product accepts.

Human-Only Datasets

| Dataset's Name | Number of texts | Correctly identified as Human | Incorrectly identified as AI | Accuracy |
|---|---|---|---|---|
| General texts | 9,979 | 9,979 | 0 | 1 |
| Articles, news, blogs, social posts | 9,991 | 9,982 | 9 | 0.9991 |
| Internet Web Pages Dataset | 99,921 | 99,918 | 3 | 0.9999 |
| Student essays | 10,000 | 9,998 | 2 | 0.9998 |
| Scholarly papers | 99,952 | 99,906 | 46 | 0.9995 |
| Total | 229,843 | 229,783 | 60 | 0.9997 |

AI-Only Datasets

| Dataset's Name | Number of texts | Incorrectly identified as Human | Correctly identified as AI | Accuracy |
|---|---|---|---|---|
| OpenAI family models - other models | 12,880 | 129 | 12,751 | 0.9899 |
| GPT-5 | 1,207 | 11 | 1,196 | 0.9909 |
| Gemini family models | 1,978 | 7 | 1,971 | 0.9964 |
| Claude family models | 1,072 | 1 | 1,071 | 0.9991 |
| Grok family models | 1,575 | 0 | 1,575 | 1 |
| Total | 18,712 | 148 | 18,564 | 0.992 |
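The Total rows of both QA tables can be cross-checked by summing the per-dataset counts:

```python
# Per-dataset (correct, incorrect) counts copied from the two tables above.
human_rows = [(9979, 0), (9982, 9), (99918, 3), (9998, 2), (99906, 46)]
ai_rows = [(12751, 129), (1196, 11), (1971, 7), (1071, 1), (1575, 0)]

def totals(rows):
    """Return (correct, total, accuracy rounded to 4 places)."""
    correct = sum(c for c, _ in rows)
    total = correct + sum(e for _, e in rows)
    return correct, total, round(correct / total, 4)

print(totals(human_rows))  # (229783, 229843, 0.9997)
print(totals(ai_rows))     # (18564, 18712, 0.9921)
```

This reproduces both Total rows (the AI-only total accuracy is shown to three decimals, 0.992, in the table).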

*Model versions may change over time. Texts were generated using the versions of the above companies’ generative AI models that were current at the time of testing.

Sensitivity Levels

Since v7.1, the AI-detection model has offered three sensitivity levels. Here are the test results for the sensitivity levels of model v10.

| ID | Sensitivity | Definition | False Positives | False Negatives |
|---|---|---|---|---|
| 1 | Extra Safe | Designed to minimize false positives by using additional AI detection-based filters. Good for detecting AI-generated text with little to no human modification. | 0.009% | 1.36% |
| 2 | Balanced (default; this is the version shown in the results above) | Ideal for detecting AI content while minimizing false positives. Good for detecting AI-generated text with a moderate amount of human modification. | 0.026% | 0.79% |
| 3 | Extra Sensitive | Our most sensitive model, designed to flag AI text that was put through a "humanizer" or text spinner. | 0.05% | 0.53% |

True Positives (AI-texts) and True Negatives (Human-texts) Accuracy by Sensitivity Level

| Sensitivity | True positives (AI texts) | True negatives (human texts) |
|---|---|---|
| Minimum False Positives (sensitivity 1) | 98.64% | 99.99% |
| Balanced (sensitivity 2) | 99.21% | 99.97% |
| Extra-sensitive (sensitivity 3) | 99.47% | 99.95% |
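These per-level accuracy figures follow directly from the sensitivity table above, since TNR = 1 - FPR and TPR = 1 - FNR:

```python
# (FPR, FNR) per sensitivity level, taken from the sensitivity table above.
levels = {
    "Extra Safe":      (0.00009, 0.0136),
    "Balanced":        (0.00026, 0.0079),
    "Extra Sensitive": (0.0005,  0.0053),
}

for name, (fpr, fnr) in levels.items():
    tnr = 1 - fpr  # accuracy on human texts (true negatives)
    tpr = 1 - fnr  # accuracy on AI texts (true positives)
    print(f"{name}: TNR={tnr * 100:.2f}%, TPR={tpr * 100:.2f}%")
```

This reproduces the 99.99%/98.64%, 99.97%/99.21%, and 99.95%/99.47% pairs shown for the three sensitivity levels.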

Human and AI Text Error Analysis

During the evaluation process, we identified and analyzed incorrect assessments made by the model and created a detailed report that enables the data science team to correct the underlying causes, without exposing the incorrect assessments themselves to that team. All errors are systematically logged and categorized by nature in a root cause analysis process, which aims to understand the underlying causes and identify recurring patterns. This process is ongoing, ensuring continuous improvement and adaptability of our model over time.

One example of such a test is our analysis of internet data from 2013–2024 using our V4 model. We sampled 1M texts from each year, starting in 2013, and used any false positives detected on 2013–2020 content (published before the release of modern generative AI systems) to further improve the model.

| Year | Texts flagged as AI (of 1M sampled) |
|---|---|
| 2013 | 0 |
| 2014 | 2 |
| 2015 | 3 |
| 2016 | 1 |
| 2017 | 0 |
| 2018 | 2 |
| 2019 | 1 |
| 2020 | 2 |
| 2021 | 34 |
| 2022 | 48 |
| 2023 | 579 |
| 2024 | 15,101 |
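Since 1M texts were sampled per year, the pre-2021 counts translate directly into a per-million false positive rate. A quick tally under that assumption:

```python
# Texts flagged as AI per year, out of 1M sampled per year
# (figures from the year-by-year analysis above).
flags_by_year = {
    2013: 0, 2014: 2, 2015: 3, 2016: 1, 2017: 0, 2018: 2,
    2019: 1, 2020: 2, 2021: 34, 2022: 48, 2023: 579, 2024: 15_101,
}

# 2013-2020 content predates modern generative AI systems, so every
# flag in those years is a false positive.
pre_ai_years = [y for y in flags_by_year if y <= 2020]
false_positives = sum(flags_by_year[y] for y in pre_ai_years)
sampled = 1_000_000 * len(pre_ai_years)
print(false_positives, sampled)  # 11 false positives across 8000000 texts
```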

Just as researchers worldwide have tested, and continue to test, different AI detector platforms to gauge their capabilities and limitations, we fully encourage our users to conduct real-world testing. As new models are released, we will continue to share our testing methodologies, accuracy results, and other important considerations.