Evaluating the Accuracy of the Copyleaks Text Moderation

A Detailed Methodology

We believe it is important to be fully transparent about our Text Moderation Model’s accuracy, its rates of false positives and false negatives, areas for improvement, and more, to ensure responsible use and adoption. This analysis documents the testing methodology behind those figures.

Test date: June 29, 2025

Publish date: September 16, 2025

Model tested: V1

Executive Summary

Copyleaks Text Moderation Model v1 was subjected to a blind, dual-team evaluation on a total of 120,000 English texts (50% violating, 50% non-violating), all fully held out from the training dataset.

The assessment shows the model can identify harmful content with very high recall while almost never flagging innocent text. 

When the identical dataset was processed through three leading commercial moderation APIs (OpenAI, Azure, and Google) at their default thresholds, Copyleaks produced fewer false positives and false negatives, with an advantage of 4%-30% across key metrics.

 

Key figures (QA test set, N = 20,000)

  • Accuracy: 99.23%
  • Precision: 99.97% (3 false positives in 10,000 non-violating texts)
  • Recall (TPR): 98.48%
  • F-Beta(0.5) score: 99.67%
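As a quick arithmetic check (assuming the QA split of 10,000 violating and 10,000 non-violating passages described in the Methodology), the confusion-matrix counts implied by these figures can be reconstructed and verified:

```python
# Sanity check: reconstruct the confusion matrix implied by the reported
# QA figures. These counts are derived from the published rates, not from
# official raw data.

violating, non_violating = 10_000, 10_000

fp = 3                            # stated: 3 false positives among non-violating texts
tn = non_violating - fp           # 9,997
tp = round(0.9848 * violating)    # recall = TP / (TP + FN) = 98.48% -> 9,848
fn = violating - tp               # 152

accuracy = (tp + tn) / (violating + non_violating)
precision = tp / (tp + fp)

print(accuracy)             # 0.99225, i.e. the reported 99.23%
print(round(precision, 4))  # 0.9997, matching the reported precision
```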

Methodology

A dual-team, blind evaluation was carried out in order to obtain an unbiased picture of the model’s performance. The Data-Science and QA teams worked in full isolation: different machines, different scripts, and no shared data.

1. Test-set construction

Data-Science test set

  • 100,000 English passages (50,000 non-violating / 50,000 violating)

  • Texts randomly sampled without replacement from four vetted sources: public social-media dumps, news articles, public-domain literature, and Copyleaks-generated edge cases. All material is either in the public domain or used under explicit licenses

  • Cross-check with two external LLMs; only unanimous items kept

  • Coverage across all Copyleaks policy categories

  • Label-certainty filter: only passages whose moderation status was 100% definitive were kept; any borderline texts were discarded. This maximizes fairness in head-to-head comparisons and removes subjectivity from the ground truth
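The unanimity filter described above can be sketched as follows. The record layout and the `llm_a_label` / `llm_b_label` field names are illustrative assumptions standing in for the two external LLM cross-checks:

```python
# Minimal sketch of the label-certainty filter: a passage is kept only when
# the original annotation and both external LLM cross-checks agree.

def keep_if_unanimous(records):
    """Return only records whose three labels are identical."""
    kept = []
    for rec in records:
        labels = {rec["annotation"], rec["llm_a_label"], rec["llm_b_label"]}
        if len(labels) == 1:  # unanimous: all three sources agree
            kept.append(rec)
    return kept

candidates = [
    {"text": "...", "annotation": "violating",
     "llm_a_label": "violating", "llm_b_label": "violating"},
    {"text": "...", "annotation": "violating",
     "llm_a_label": "non-violating", "llm_b_label": "violating"},
]
print(len(keep_if_unanimous(candidates)))  # 1 -- the borderline item is discarded
```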

QA test set

  • 20,000 English passages (10,000 non-violating / 10,000 violating) crafted independently by the QA department

  • Minimum length of 10 characters; otherwise the same sampling, labeling, license provenance, 100% definitive rule, and category protocol as the DS set

The Data-Science set was strictly held-out from the original corpora used for training. The QA set comprises passages intentionally crafted after model training; these texts were never seen during training and were not drawn from the training corpora.

2. Tool chain and execution details

  • Copyleaks API v1, queried 24 June 2025

  • Competitor endpoints (queried with identical pre-processing on 24 June 2025)
    • OpenAI Moderation v2, default threshold
    • Azure AI Content Safety build 2025-06-15
    • Google Perspective API rev. 2025-06-12, toxicity threshold = 0.50

  • Pre-processing: emoji preservation, no stemming or lower-casing

  • For each run we recorded the raw JSON response, derived a binary verdict, built a confusion matrix (TP, FP, TN, FN), and then computed Accuracy, Precision, Recall, TNR, and F-Beta(0.5)
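The per-run evaluation loop can be sketched as below. The response shape (`{"score": ...}`) and the 0.50 cut-off are illustrative assumptions; each vendor’s actual JSON schema and default threshold differ:

```python
# Sketch of the evaluation loop: derive a binary verdict from each raw API
# response, then accumulate a confusion matrix over the gold labels.

def to_verdict(response: dict, threshold: float = 0.50) -> bool:
    """True means 'violating' (flag for moderation)."""
    return response["score"] >= threshold

def confusion_matrix(responses, gold_labels, threshold=0.50):
    tp = fp = tn = fn = 0
    for resp, gold in zip(responses, gold_labels):
        pred = to_verdict(resp, threshold)
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and not gold:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Toy run: two violating and two non-violating passages
responses = [{"score": 0.91}, {"score": 0.42}, {"score": 0.05}, {"score": 0.77}]
gold      = [True,            True,            False,           False]
print(confusion_matrix(responses, gold))  # (1, 1, 1, 1)
```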

Moderation Categories Definition

The following definitions were used for the moderation categories:

  1. Adult: Explicit descriptions, references, or portrayals of sexual acts or behavior intended to evoke sexual arousal.

  2. Toxic: Harmful language that insults, demeans, or degrades in a general way, not necessarily aimed at a specific person. This includes any language intended to cause emotional harm.

  3. Violent: Language that incites or glorifies physical harm or injury.

  4. Profanity: Use of strong or offensive swear words.

  5. Self-Harm: References that encourage or normalize self-injurious behavior.

  6. Harassment: Targeted abuse that insults or degrades a specific person or group, focusing on personal traits or beliefs.

  7. Hate Speech: Language that demonizes or incites harm toward a group or individual based on inherent traits, often calling for violence or systemic discrimination.

  8. Drugs Usage: References, descriptions, or endorsements of the use, abuse, or distribution of drugs in a harmful context, including illegal substances or the misuse of legal drugs.

  9. Firearms: Content discussing the use, possession, or distribution of guns and other weapons, especially when such discussions could promote or cause violence or unsafe practices.

  10. Cybersecurity: Content related to computer security, including discussions of hacking, data breaches, and methods used to compromise digital systems or gain unauthorized access.

  11. Other: Any other content deemed inappropriate, harmful, or offensive not covered by the above categories.

Metrics Definitions

The following metrics are used in this text moderation task:

1.  Confusion Matrix: A table summarizing the performance of the model, displaying:

    1. True Positives (TP): Violating texts correctly identified as needing moderation.
    2. False Positives (FP): Non-violating texts incorrectly identified as needing moderation.
    3. True Negatives (TN): Non-violating texts correctly identified as not needing moderation.
    4. False Negatives (FN): Violating texts incorrectly identified as not needing moderation.

2. Accuracy: The proportion of correctly classified instances (both true positives and true negatives) out of the total number of texts evaluated.

Accuracy = (TP + TN) / Total texts

3. True Negative Rate (TNR): The proportion of actual negative instances that are correctly identified as negative. In the context of Text Moderation, TNR measures the model’s performance on non-violating texts.

TNR = TN / (TN + FP)

4. True Positive Rate (TPR) / Recall: The proportion of actual positive instances that are correctly identified as positive. In the context of Text Moderation, TPR measures the model’s performance on violating texts.

TPR = TP / (TP + FN)

5. Precision: The proportion of correctly predicted positive observations out of all positive predictions. In the context of Text Moderation, Precision measures the model’s reliability when it does flag content; it tells us how many of the texts the model identified as violating were actually violating.

Precision = TP / (TP + FP)

6. F-beta Score: A weighted harmonic mean of precision and recall, where the beta parameter (here β = 0.5) is set to favor precision. This prioritization helps in achieving a lower false-positive rate.

F-Beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
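The standard F-beta formula can be written out directly; a minimal sketch, applied to the reported QA precision and recall:

```python
# F-beta: weighted harmonic mean of precision and recall.
# beta < 1 weighs precision more heavily; beta = 0.5 is the setting
# used throughout this report.

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.9997, 0.9848), 4))  # 0.9967 -- matches the reported score
```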

Results

Data-Science team

[Table: Data Science Team Results — confusion matrix]

QA team

[Table: QA Team Results]

QA Test Metrics Summary:

  • Overall accuracy: 0.9923

  • Precision: 0.9997

  • Recall: 0.9848

  • F-beta (β = 0.5): 0.9967

Head-to-head benchmark

[Figure: Head-to-Head Model Comparison]

Limitations

  • Language scope: this model and its evaluation cover English only.

  • Context scope: moderation is performed on a single “passage” at a time, where a passage is a self-contained chunk of text with a certain number of tokens. The system maintains no memory across passages, chapters, or conversation turns; therefore references such as “as we explained earlier” or pronouns that depend on earlier context may be missed.

  • Modality scope: this product assesses only text; no image, audio, or video inputs were included.