Before reporting eval numbers.
The team needs to know whether held-out examples, answer keys, benchmark rows, or paraphrased eval material have drifted into training or fine-tuning data.
Check whether evaluation data can still be trusted before results are reported.
Use this before a benchmark, eval refresh, model comparison, or investor/customer report where leaked examples can make the model look better than it really is.
Datascreen surfaces overlap, answer-key residue, suspicious metadata, and source context so eval owners can review the evidence before trusting the score.
Source: Data Contamination Report from the 2024 CONDA Shared Task.
Illustration: an eval set upload becoming a leakage review queue, with suspicious benchmark overlap and answer-key residue attached to the underlying rows.
Surface exact and near-exact candidates that can compromise a held-out evaluation.
Flag labels, answer markers, and solution metadata that survived preprocessing.
Keep benchmark, source, and project context attached to each finding.
Export what was reviewed before the score is shared.
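To make the overlap checks above concrete, here is a minimal sketch of how exact and near-exact candidates could be surfaced. It is not Datascreen's implementation. The function names, the word-trigram Jaccard measure, and the 0.5 threshold are all illustrative assumptions, and the pairwise loop is only practical for small sets.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def ngrams(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        # Short rows fall back to a single gram made of all their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def digest(text: str) -> str:
    """Hash of the normalized text, used for exact-match lookup."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(eval_rows, train_rows, threshold=0.5):
    """Pair each eval row with training rows that match exactly or nearly.

    `threshold` is a hypothetical cutoff; tune it on reviewed examples.
    """
    train_hashes = {digest(t): i for i, t in enumerate(train_rows)}
    train_grams = [ngrams(t) for t in train_rows]
    findings = []
    for ei, row in enumerate(eval_rows):
        hit = train_hashes.get(digest(row))
        if hit is not None:
            findings.append(
                {"eval_row": ei, "train_row": hit, "reason": "exact", "score": 1.0}
            )
            continue  # an exact hit is already the strongest evidence
        grams = ngrams(row)
        for ti, tg in enumerate(train_grams):
            score = jaccard(grams, tg)
            if score >= threshold:
                findings.append(
                    {"eval_row": ei, "train_row": ti,
                     "reason": "near_exact", "score": round(score, 3)}
                )
    # Highest-scoring candidates first, so reviewers see them at the top.
    return sorted(findings, key=lambda f: -f["score"])
```

Each finding keeps both row indices and a reason, so a reviewer can trace it back to the underlying eval and training rows. Paraphrased leakage would need a semantic comparison that this lexical sketch does not attempt.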
A prioritized list of rows and clusters that deserve human review.
The row, source, neighborhood, and reason shown together.
A record of what reviewers kept, removed, fixed, or escalated.
A workflow-ready handoff that states what was reviewed and what remains uncertain.