We believe it is important to be transparent about our Text Moderation Model’s accuracy, its false-positive and false-negative rates, and its areas for improvement, to ensure responsible use and adoption. This analysis describes the methodology used to test the model.
Copyleaks Text Moderation Model v1 was subjected to a blind, dual-team evaluation on a total of 120,000 English texts (50% violating, 50% non-violating) that were completely separated from the training dataset.
The assessment shows that the model identifies harmful content with very high recall while almost never flagging non-violating text.
When the identical dataset was processed through three leading commercial moderation APIs (OpenAI, Azure, and Google) at their default thresholds, Copyleaks produced fewer false positives and false negatives, with an advantage of 4%-30% across key metrics.
Key figures (QA test set, N = 20,000)
A dual-team, blind evaluation was carried out to obtain an unbiased picture of the model’s performance. The Data-Science and QA teams worked in full isolation: different machines, different scripts, and no shared data.
Data-Science test set
QA test set
The Data-Science set was strictly held out from the original corpora used for training. The QA set comprises passages intentionally written after model training, so they could not have appeared in the training data.
The following definitions were used for the moderation categories:
The metrics that are used in this text moderation task are:
1. Confusion Matrix: A table summarizing the performance of the model, displaying the counts of true positives, false positives, true negatives, and false negatives.
2. Accuracy: The proportion of correctly classified instances (both true positives and true negatives) out of the total number of texts evaluated.
3. True Negative Rate (TNR): The proportion of actual negative instances that are correctly identified as negative. In the context of Text Moderation, TNR measures the model’s performance on non-violating texts.
4. True Positive Rate (TPR) / Recall: The proportion of actual positive instances that are correctly identified as positive. In the context of Text Moderation, TPR measures the model’s performance on violating texts.
5. Precision: The proportion of correctly predicted positive observations out of all positive predictions. In the context of Text Moderation, Precision measures the model’s reliability when it does flag content; it tells us how many of the texts identified as violating by the model were actually moderated.
6. F-beta Score: A weighted harmonic mean of precision and recall, where the beta parameter is set below 1 to weight precision more heavily than recall. This prioritization helps in achieving a lower False Positive Rate.
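The metrics above can be sketched as a single function over confusion-matrix counts. The counts below are purely illustrative, not the actual test results:

```python
def moderation_metrics(tp, fp, tn, fn, beta=0.5):
    """Compute the evaluation metrics defined above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tnr = tn / (tn + fp)        # True Negative Rate: performance on non-violating texts
    recall = tp / (tp + fn)     # TPR / Recall: performance on violating texts
    precision = tp / (tp + fp)  # reliability of the model's "violating" predictions
    # F-beta: weighted harmonic mean of precision and recall; beta < 1 favors precision
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return {"accuracy": accuracy, "tnr": tnr, "recall": recall,
            "precision": precision, "f_beta": f_beta}

# Illustrative counts for a balanced 20,000-text set (hypothetical numbers)
print(moderation_metrics(tp=9800, fp=10, tn=9990, fn=200))
```

Note that with beta = 0.5, a single false positive costs the F-beta score more than a single false negative, which matches the stated goal of keeping the False Positive Rate low.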
Data-Science team
Confusion Matrix
QA team
QA Test Metrics Summary:
Overall accuracy: 0.9923
Precision: 0.9997
Recall: 0.9848
F-beta (β = 0.5): 0.9967
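As a consistency check, the reported F-beta value follows directly from the reported precision and recall under the F-beta formula with β = 0.5:

```python
# Reported QA-set figures from the summary above
precision, recall = 0.9997, 0.9848
beta = 0.5

# Weighted harmonic mean of precision and recall (beta < 1 favors precision)
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(round(f_beta, 4))  # → 0.9967, matching the reported value
```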
Head-to-head benchmark