Create a generic dataset that includes some PHI / PII to evaluate the first three modes #1810

Description

@RonShakutai

Is your feature request related to a problem? Please describe.
Create a curated benchmark dataset (benchmark_samples.yaml or similar) containing:

15-20 annotated text samples with varying complexity (simple one-liners, medium clinical notes, long discharge summaries)
Diverse PHI/PII entity types: PERSON, EMAIL, PHONE, SSN, dates, medical IDs (MRN, NPI, accession numbers), locations, organizations
Ground truth annotations with expected entity spans and types
Samples specifically designed to test edge cases: inverted names ("dr nakamura kenji"), hyphenated names, inline IDs, HIPAA-specific entities (ages 90+)
A benchmark script that reports precision/recall/F1 per mode with side-by-side comparison
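To make the scoring step concrete, here is a minimal sketch of how per-mode precision/recall/F1 could be computed from the ground truth annotations. It assumes exact-match scoring (a prediction counts only if sample, offsets, and type all match) and micro-averaging over all samples; the `Span` tuple, the function names, and the mode labels in the usage note are hypothetical and not taken from the project.

```python
from collections import namedtuple

# Hypothetical span record: sample_id keeps spans from different samples apart,
# so all samples can be pooled into one set (micro-averaging).
Span = namedtuple("Span", ["sample_id", "start", "end", "entity_type"])


def score(predicted, expected):
    """Return (precision, recall, f1) using exact span-and-type matching."""
    pred, gold = set(predicted), set(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def compare_modes(predictions_by_mode, expected):
    """Score every mode against the same ground truth."""
    return {
        mode: score(spans, expected)
        for mode, spans in predictions_by_mode.items()
    }


def format_report(scores_by_mode):
    """Render the per-mode scores as a side-by-side text table."""
    lines = [f"{'mode':<12}{'precision':>10}{'recall':>10}{'f1':>10}"]
    for mode, (precision, recall, f1) in scores_by_mode.items():
        lines.append(f"{mode:<12}{precision:>10.2f}{recall:>10.2f}{f1:>10.2f}")
    return "\n".join(lines)
```

A caller would collect the spans each mode produces, e.g. `compare_modes({"mode_a": spans_a, "mode_b": spans_b}, gold_spans)`, and print `format_report(...)` on the result. Exact matching is the strictest choice; a partial-overlap criterion may be more appropriate for long entities, and that decision is left open here.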

Additional context
Sample categories should include:

Simple: single entity detection
Medium: clinical text with multiple entity types
Long/Complex: full medical documents (ultrasound reports, discharge summaries)
LLM-specific: entities only detectable by SLM (e.g., ages over 89 for HIPAA)
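One possible shape for an annotated sample, sketched under assumptions: each sample carries one of the four categories above plus a list of ground truth entities with character offsets. The class names, field names, category identifiers, and the `AGE` label are all hypothetical. The `validate` helper checks that every annotated span actually matches the text, which guards against off-by-one offsets in hand-written annotations.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the four sample categories listed above.
CATEGORIES = {"simple", "medium", "long_complex", "llm_specific"}


@dataclass
class Entity:
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    entity_type: str  # e.g. "PERSON", "EMAIL"
    value: str        # the exact text expected at text[start:end]


@dataclass
class Sample:
    sample_id: str
    category: str
    text: str
    entities: list = field(default_factory=list)


def annotate(text, value, entity_type):
    """Build an Entity for the first occurrence of value in text."""
    start = text.index(value)  # raises ValueError if value is absent
    return Entity(start, start + len(value), entity_type, value)


def validate(sample):
    """Return a list of human-readable problems; empty means the sample is consistent."""
    errors = []
    if sample.category not in CATEGORIES:
        errors.append(f"{sample.sample_id}: unknown category {sample.category!r}")
    for entity in sample.entities:
        if sample.text[entity.start:entity.end] != entity.value:
            errors.append(
                f"{sample.sample_id}: span {entity.start}-{entity.end} "
                f"does not match {entity.value!r}"
            )
    return errors
```

For instance, `Sample("s1", "llm_specific", text, [annotate(text, "93", "AGE")])` with `text = "Patient is 93 years old."` would pass validation, while a hand-entered span with wrong offsets would be reported. The same records could be serialized to `benchmark_samples.yaml` or a similar file.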
