Create a generic dataset that includes some PHI/PII to evaluate the first three modes

**Is your feature request related to a problem? Please describe.**
Create a curated benchmark dataset (`benchmark_samples.yaml` or similar) together with a benchmark script. The deliverables are:

- 15-20 annotated text samples of varying complexity (simple one-liners, medium clinical notes, long discharge summaries)
- Diverse PHI/PII entity types: PERSON, EMAIL, PHONE, SSN, dates, medical IDs (MRN, NPI, accession numbers), locations, organizations
- Ground truth annotations with the expected entity spans and types
- Samples specifically designed to test edge cases: inverted names ("dr nakamura kenji"), hyphenated names, inline IDs, HIPAA-specific entities (ages 90+)
- A benchmark script that reports precision/recall/F1 per mode, with a side-by-side comparison
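To make the scoring step concrete, here is a minimal sketch of how the benchmark script could compute precision/recall/F1 per mode. It assumes exact matching on `(start, end, entity_type)`; the `Span` class, the function names, and the mode names `mode_a` / `mode_b` are placeholders for illustration and are not taken from the project. The sample text and predictions are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotated or predicted entity (hypothetical structure)."""
    start: int
    end: int
    entity_type: str


def score(expected, predicted):
    """Exact-match precision/recall/F1 over (start, end, entity_type)."""
    expected_set, predicted_set = set(expected), set(predicted)
    true_positives = len(expected_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_modes(expected, predictions_by_mode):
    """Score every mode against the same ground truth."""
    return {mode: score(expected, spans)
            for mode, spans in predictions_by_mode.items()}


def format_report(results):
    """Render one row per mode so the metrics can be read side by side."""
    lines = [f"{'mode':<10}{'precision':>10}{'recall':>10}{'f1':>10}"]
    for mode, m in results.items():
        lines.append(
            f"{mode:<10}{m['precision']:>10.2f}{m['recall']:>10.2f}{m['f1']:>10.2f}"
        )
    return "\n".join(lines)


# Made-up sample: ground truth plus the output of two placeholder modes.
text = "Contact Jane Doe at jane@example.com"
expected = [Span(8, 16, "PERSON"), Span(20, 36, "EMAIL")]
predictions_by_mode = {
    "mode_a": [Span(8, 16, "PERSON")],                      # misses the email
    "mode_b": expected + [Span(0, 7, "ORGANIZATION")],      # one false positive
}

results = compare_modes(expected, predictions_by_mode)
print(format_report(results))
```

Exact matching is the strictest choice; a partial-overlap criterion may be more forgiving for long clinical documents, and which one to use is left open here.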

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Sample categories should include:

- Simple: single entity detection
- Medium: clinical text with multiple entity types
- Long/Complex: full medical documents (ultrasound reports, discharge summaries)
- LLM-specific: entities only detectable by SLM (e.g., ages over 89 for HIPAA)
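As a rough illustration of how samples in these categories might be recorded, the sketch below builds annotations by locating each entity's surface text in the sample, so character offsets are computed rather than counted by hand. The `Sample` fields, the category keys, the `AGE` and `MRN` labels, and the example sentences are all assumptions made for illustration; the actual schema of `benchmark_samples.yaml` is not defined in this issue.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical keys for the four categories listed above.
CATEGORIES = ("simple", "medium", "long_complex", "llm_specific")


@dataclass
class Sample:
    sample_id: str
    category: str
    text: str
    entities: list = field(default_factory=list)  # dicts: start, end, type


def annotate(text, mentions):
    """Turn (surface, entity_type) pairs into span dicts.

    Uses the first occurrence of each surface string and fails loudly
    if the annotated text is not actually present in the sample.
    """
    spans = []
    for surface, entity_type in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in sample text")
        spans.append({"start": start, "end": start + len(surface),
                      "type": entity_type})
    return spans


def make_sample(sample_id, category, text, mentions):
    """Build a Sample, rejecting unknown categories."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return Sample(sample_id, category, text, annotate(text, mentions))


# Made-up examples, one per category except long_complex.
samples = [
    make_sample("s1", "simple", "Call 555-0100 for results.",
                [("555-0100", "PHONE")]),
    make_sample("m1", "medium", "Seen by dr nakamura kenji, MRN 00123.",
                [("nakamura kenji", "PERSON"), ("00123", "MRN")]),
    make_sample("l1", "llm_specific", "The patient is a 93-year-old woman.",
                [("93-year-old", "AGE")]),
]

# Quick coverage check: how many samples each category has so far.
per_category = Counter(s.category for s in samples)
print(dict(per_category))
```

Computing offsets from the surface text keeps the ground truth consistent when a sample is edited, at the cost of needing extra handling when the same string appears more than once in a document.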


Create a generic dataset that includes some PHI/PII to evaluate the first three modes #1810
