feat(analyzer): add recognizer-level threshold config #2116

Open
rodboev wants to merge 7 commits into data-privacy-stack:main from rodboev:pr/recognizer-threshold-config

rodboev commented Jun 28, 2026

Change Description

Adds analyzer-level YAML configuration for recognizer-specific score thresholds in presidio-analyzer. The new configuration lets users set a threshold for a recognizer and optionally override it for a specific entity type, while preserving the current scalar score_threshold and default_score_threshold behavior when the new config is absent.

Proposed YAML shape:

recognizer_score_thresholds:
  CreditCardRecognizer:
    default: 0.4
    CREDIT_CARD: 0.7
  SpacyRecognizer:
    PERSON: 0.6

Recognizer and entity thresholds are applied in the existing result-filtering path after recognizer execution and context enhancement, before duplicate removal collapses equivalent spans. Explicit score_threshold arguments remain a global per-call override, and unmatched recognizers continue to fall back to default_score_threshold.
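The precedence described above can be sketched in a few lines. This is an illustration only, not the code in this PR: `Result`, `resolve_threshold`, and `filter_results` are hypothetical names, the `>=` comparison is an assumption, and so is the fallback to default_score_threshold when a configured recognizer has neither a matching entity key nor a `default` key.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for an analyzer result (not the presidio class)."""

    recognizer_name: str
    entity_type: str
    score: float


def resolve_threshold(
    result: Result,
    recognizer_thresholds: Dict[str, Dict[str, float]],
    default_score_threshold: float,
    score_threshold: Optional[float] = None,
) -> float:
    """Pick the threshold for one result, most specific rule first."""
    # 1. An explicit per-call score_threshold overrides everything.
    if score_threshold is not None:
        return score_threshold
    per_recognizer = recognizer_thresholds.get(result.recognizer_name)
    # 2. Recognizers without an entry fall back to the global default.
    if per_recognizer is None:
        return default_score_threshold
    # 3. An entity-specific value beats the recognizer's own default.
    if result.entity_type in per_recognizer:
        return per_recognizer[result.entity_type]
    # 4. Assumed: with no "default" key, use the global default.
    return per_recognizer.get("default", default_score_threshold)


def filter_results(
    results: List[Result],
    recognizer_thresholds: Dict[str, Dict[str, float]],
    default_score_threshold: float,
    score_threshold: Optional[float] = None,
) -> List[Result]:
    """Drop low-scoring results; intended to run before deduplication."""
    return [
        r
        for r in results
        if r.score
        >= resolve_threshold(
            r, recognizer_thresholds, default_score_threshold, score_threshold
        )
    ]


thresholds = {
    "CreditCardRecognizer": {"default": 0.4, "CREDIT_CARD": 0.7},
    "SpacyRecognizer": {"PERSON": 0.6},
}
results = [
    Result("CreditCardRecognizer", "CREDIT_CARD", 0.5),
    Result("SpacyRecognizer", "PERSON", 0.65),
    Result("SpacyRecognizer", "LOCATION", 0.35),
    Result("UrlRecognizer", "URL", 0.2),
]
kept = filter_results(results, thresholds, default_score_threshold=0.3)
print([(r.recognizer_name, r.entity_type) for r in kept])
```

With the proposed YAML values and a global default of 0.3, only the two SpacyRecognizer results survive; passing `score_threshold=0.5` would instead apply 0.5 to every result regardless of recognizer.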

This also updates the analyzer docs, the no-code tutorial example, and the root changelog.

Issue reference

Fixes #1572

Tests

  • poetry run pytest tests/test_analyzer_engine.py tests/test_analyzer_engine_provider.py tests/test_configuration_validator.py -q
  • poetry run ruff check presidio_analyzer/analyzer_engine.py presidio_analyzer/analyzer_engine_provider.py presidio_analyzer/input_validation/schemas.py tests/test_analyzer_engine.py tests/test_analyzer_engine_provider.py tests/test_configuration_validator.py

Note on CHANGELOG

Updated CHANGELOG.md under [unreleased] > Analyzer > Added.

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Copilot AI review requested due to automatic review settings June 28, 2026 04:08

rodboev commented Jun 28, 2026

@microsoft-github-policy-service agree

Copilot AI left a comment

Pull request overview

Adds support in presidio-analyzer for configuring score thresholds per recognizer (and optionally per entity within that recognizer) via YAML, while preserving the existing global default_score_threshold / request-level score_threshold behavior.

Changes:

  • Introduces the recognizer_score_thresholds configuration and applies it during result filtering (before duplicate collapse) when no request-level score_threshold is provided.
  • Adds validation/normalization for the new configuration shape (including numeric shorthand) and expands unit test coverage for precedence and error cases.
  • Updates analyzer configuration docs, the no-code tutorial, and the root changelog to document the new option.
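To make the validation and normalization step concrete, here is a minimal sketch of what such a routine might look like. It is not taken from the PR: `normalize_recognizer_thresholds` is a hypothetical name, the "numeric shorthand" is assumed to mean that a bare number under a recognizer name stands for `{default: <number>}`, and the accepted range is assumed to be 0 to 1 inclusive.

```python
from numbers import Real
from typing import Any, Dict


def _check_score(recognizer: str, key: str, value: Any) -> float:
    """Return value as a float, or raise if it is not a number in [0, 1]."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise ValueError(f"{recognizer}.{key}: threshold must be a number")
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{recognizer}.{key}: threshold must be in [0, 1]")
    return float(value)


def normalize_recognizer_thresholds(raw: Any) -> Dict[str, Dict[str, float]]:
    """Turn the parsed YAML value into {recognizer: {entity_or_default: score}}."""
    if raw is None:
        return {}  # key absent: existing behavior is unchanged
    if not isinstance(raw, dict):
        raise ValueError("recognizer_score_thresholds must be a mapping")
    normalized: Dict[str, Dict[str, float]] = {}
    for recognizer, value in raw.items():
        if isinstance(value, Real) and not isinstance(value, bool):
            # Assumed shorthand: `SomeRecognizer: 0.5` means a recognizer default.
            value = {"default": value}
        elif not isinstance(value, dict):
            raise ValueError(f"{recognizer}: expected a number or a mapping")
        normalized[recognizer] = {
            str(key): _check_score(recognizer, str(key), score)
            for key, score in value.items()
        }
    return normalized


print(
    normalize_recognizer_thresholds(
        {"CreditCardRecognizer": 0.5, "SpacyRecognizer": {"PERSON": 0.6}}
    )
)
```

Normalizing to a single nested shape up front keeps the filtering path free of type checks; the actual schema code in the PR may structure this differently.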

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/analyzer_engine.py | Adds recognizer/entity-specific threshold support and applies filtering before deduplication. |
| presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py | Loads recognizer_score_thresholds from analyzer YAML and passes it into AnalyzerEngine. |
| presidio-analyzer/presidio_analyzer/input_validation/schemas.py | Validates and normalizes recognizer_score_thresholds in analyzer configuration validation. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Documents the new YAML option with a commented example. |
| presidio-analyzer/tests/test_analyzer_engine.py | Adds deterministic tests covering precedence, shorthand, invalid values, and dedupe ordering. |
| presidio-analyzer/tests/test_analyzer_engine_provider.py | Verifies provider passes/normalizes thresholds from YAML and that defaults remain unchanged otherwise. |
| presidio-analyzer/tests/test_configuration_validator.py | Adds validator tests for valid shorthand + invalid threshold structures/types/ranges. |
| docs/tutorial/08_no_code.md | Updates no-code YAML example and explains when to use recognizer-level thresholds. |
| docs/analyzer/analyzer_engine_provider.md | Documents the new configuration key and provides a YAML example. |
| CHANGELOG.md | Adds an Unreleased entry documenting the new analyzer YAML capability. |

Comment thread presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Comment thread presidio-analyzer/presidio_analyzer/analyzer_engine.py Outdated
Comment thread presidio-analyzer/presidio_analyzer/analyzer_engine.py
Comment thread presidio-analyzer/presidio_analyzer/analyzer_engine.py

Development

Successfully merging this pull request may close these issues.

Recognizer-level thresholds