Datascreen

Eval Set Leakage

Check whether evaluation data can still be trusted before results are reported.

Use this before a benchmark, eval refresh, model comparison, or investor/customer report where leaked examples can make the model look better than it really is.

When this helps

Before reporting eval numbers.

The team needs to know whether held-out examples, answer keys, benchmark rows, or paraphrased eval material have drifted into training or fine-tuning data.

What Datascreen shows

Leakage candidates with row-level evidence.

Datascreen surfaces overlap, answer-key residue, suspicious metadata, and source context so eval owners can review the evidence before trusting the score.

566 reported entries

The 2024 CONDA contamination report compiled 566 evidence entries.

91 contaminated sources

Those reports covered 91 contaminated benchmark or dataset sources.

23 contributors

The first CONDA compilation came from 23 contributors and remains open to additional reports.

Source: Data Contamination Report from the 2024 CONDA Shared Task.

Watch walkthrough

Product demo: inspect an eval refresh before the result is used.

The walkthrough shows an eval set upload becoming a leakage review queue, with suspicious benchmark overlap and answer-key residue attached to the underlying rows.

Review depth

What reviewers need to see for the decision to be obvious.

01

Overlap review

Surface exact and near-exact candidates that can compromise a held-out evaluation.

02

Answer residue

Flag labels, answer markers, and solution metadata that survived preprocessing.

03

Eval context

Keep benchmark, source, and project context attached to each finding.

04

Report boundary

Export what was reviewed before the score is shared.
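
For readers who want a concrete picture of the overlap and answer-residue checks described above, here is a minimal sketch in Python. It is not Datascreen's implementation. The function names, the 5-gram window, the 0.8 threshold, and the residue pattern are all illustrative assumptions. It uses word n-gram overlap as a simple stand-in for near-exact matching and a regular expression as a stand-in for answer-marker detection.

```python
import re

# Hypothetical residue markers; a real pipeline would tune these per dataset.
ANSWER_RESIDUE = re.compile(r"(?i)\b(answer|solution|label|gold)\s*[:=]")


def normalize(text):
    """Lowercase and drop punctuation so trivial edits do not hide overlap."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (the whole text if shorter than n)."""
    tokens = normalize(text)
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(eval_text, train_text, n=5):
    """Fraction of the eval row's n-grams that also appear in the training row."""
    eval_grams = word_ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & word_ngrams(train_text, n)) / len(eval_grams)


def review_queue(eval_rows, train_rows, threshold=0.8):
    """Pair eval rows with overlapping training rows, highest overlap first.

    eval_rows and train_rows are dicts of row id -> text (an assumed shape).
    Each finding keeps the row ids and the reason, so a reviewer sees evidence
    rather than just a score.
    """
    findings = []
    for eval_id, eval_text in eval_rows.items():
        for train_id, train_text in train_rows.items():
            score = overlap_score(eval_text, train_text)
            if score >= threshold:
                findings.append({
                    "eval_id": eval_id,
                    "train_id": train_id,
                    "overlap": score,
                    "answer_residue": bool(ANSWER_RESIDUE.search(train_text)),
                })
    return sorted(findings, key=lambda f: f["overlap"], reverse=True)
```

The all-pairs loop is only practical for small sets. At scale, an index over n-grams or a MinHash-style approach would be the usual substitute, and paraphrased leakage would need a semantic similarity check that this sketch does not attempt.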

What the team gets

Review queue

A prioritized list of rows and clusters that deserve human review.

Evidence context

The row, source, neighborhood, and reason shown together.

Decision log

A record of what reviewers kept, removed, fixed, or escalated.

Exportable report

A workflow-ready handoff that states what was reviewed and what remains uncertain.