feat: add no-op NLP engine by ultramancode · Pull Request #2071 · data-privacy-stack/presidio

ultramancode · 2026-06-17T15:22:37Z

Change Description

Adds a NoOpNlpEngine for analyzer configurations where the active recognizers do not need artifacts produced by an NLP engine.

This is mainly useful for recognizers such as HuggingFaceNerRecognizer, which runs model inference directly and does not need NLP artifacts from spaCy. In that setup, loading a spaCy model only adds startup cost.

With this change, users can select nlp_engine_name: no_op in analyzer configuration instead of configuring a real NLP model that will not be used.
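A rough sketch of what selecting an engine by name could look like follows. It does not import Presidio; `ENGINE_FACTORIES`, `create_engine`, and both stand-in classes are hypothetical names used only for illustration, and the real `NlpEngineProvider` may differ. The `nlp_engine_name` key comes from the description above; the `models` and `supported_languages` keys and their defaults are assumptions.

```python
from typing import Callable, Dict, List


class SpacyLikeEngine:
    """Stand-in for an engine that must load at least one model."""

    def __init__(self, models: List[dict]):
        if not models:
            raise ValueError("a model-backed engine needs at least one model")
        self.models = models


class NoOpLikeEngine:
    """Stand-in for an engine that loads nothing at startup."""

    def __init__(self, supported_languages: List[str]):
        self.supported_languages = supported_languages


# Hypothetical registry mapping the configured name to a constructor.
ENGINE_FACTORIES: Dict[str, Callable[[dict], object]] = {
    "spacy": lambda conf: SpacyLikeEngine(conf.get("models", [])),
    "no_op": lambda conf: NoOpLikeEngine(conf.get("supported_languages", ["en"])),
}


def create_engine(conf: dict) -> object:
    """Build the engine named by conf["nlp_engine_name"]."""
    name = conf.get("nlp_engine_name")
    if name not in ENGINE_FACTORIES:
        raise ValueError(f"unknown nlp_engine_name: {name!r}")
    return ENGINE_FACTORIES[name](conf)


# No "models" entry is needed when the no-op engine is selected.
engine = create_engine({"nlp_engine_name": "no_op"})
```

In this sketch the no-op path never touches a model list, which is the startup saving the description refers to.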

Includes regression tests and an updated HuggingFaceNerRecognizer example.

Issue reference

Refs #2012

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a NoOpNlpEngine option to Presidio Analyzer to support running analyzers/recognizers without loading NLP models/artifacts (returning empty artifacts instead), and integrates it into engine/recognizer providers and model installation.

Changes:

Introduces NoOpNlpEngine and a default YAML config for it (conf/no_op.yaml).
Updates NlpEngineProvider, model installation, and recognizer registry/provider to support/guard no-op behavior.
Adds comprehensive tests for initialization, batch/text processing, and Analyzer/provider integration.
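The registry guard mentioned in the list above might work along these lines. This is a simplified sketch, not the PR's code: the `Recognizer` dataclass, its `needs_nlp_artifacts` flag, and `select_recognizers` are hypothetical, and the actual change may raise an error instead of filtering silently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Recognizer:
    name: str
    needs_nlp_artifacts: bool  # hypothetical flag, not a Presidio attribute


def select_recognizers(
    candidates: Iterable[Recognizer], engine_name: str
) -> List[Recognizer]:
    """Drop recognizers that depend on NLP artifacts when the engine is no-op."""
    if engine_name != "no_op":
        return list(candidates)
    return [r for r in candidates if not r.needs_nlp_artifacts]


candidates = [
    Recognizer("SpacyRecognizer", needs_nlp_artifacts=True),
    Recognizer("HuggingFaceNerRecognizer", needs_nlp_artifacts=False),
]
active = select_recognizers(candidates, "no_op")
```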

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/nlp_engine/no_op_nlp_engine.py | Implements the new no-op NLP engine returning empty artifacts. |
| presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py | Registers `no_op` as an available engine and constructs it from config. |
| presidio-analyzer/presidio_analyzer/nlp_engine/\_\_init\_\_.py | Exposes `NoOpNlpEngine` in the package exports. |
| presidio-analyzer/presidio_analyzer/conf/no_op.yaml | Adds a config preset for the no-op engine. |
| presidio-analyzer/install_nlp_models.py | Skips model installation for `no_op`. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py | Skips NLP recognizer registration and blocks NLP recognizer retrieval for `NoOpNlpEngine`. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py | Prevents configuring NLP recognizers when using `NoOpNlpEngine`. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py | Updates docs example to use no-op engine for recognizer-only flows. |
| presidio-analyzer/tests/conftest.py | Adds `no_op` to the parametrized NLP engines used in tests. |
| presidio-analyzer/tests/test_no_op_nlp_engine.py | Adds unit/integration tests for the no-op engine and its provider/analyzer behavior. |

ultramancode · 2026-06-17T15:43:33Z

+    def _create_empty_nlp_artifacts(self, language: str) -> NlpArtifacts:
+        return NlpArtifacts(
+            entities=[],
+            tokens=Doc(self._vocab, words=[]),
+            tokens_indices=[],
+            lemmas=[],
+            nlp_engine=self,
+            language=language,
+            scores=[],
+        )


NlpArtifacts does not accept a keywords constructor argument. Keywords are derived internally from lemmas in NlpArtifacts.__init__, so passing lemmas=[] keeps artifacts.keywords as an empty list.
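To illustrate the point about derived keywords, here is a simplified stand-in. `ArtifactsSketch` and its stopword set are hypothetical; the real `NlpArtifacts` asks the NLP engine about stopwords and punctuation and takes more constructor arguments. Only the shape of the derivation is shown.

```python
from typing import FrozenSet, List


class ArtifactsSketch:
    """Stand-in showing keywords computed from lemmas inside __init__."""

    def __init__(
        self,
        lemmas: List[str],
        stopwords: FrozenSet[str] = frozenset({"the", "a", "be"}),
    ):
        self.lemmas = lemmas
        # There is no `keywords` parameter: the list is always derived here.
        self.keywords = [
            lemma.lower() for lemma in lemmas if lemma.lower() not in stopwords
        ]


empty = ArtifactsSketch(lemmas=[])
filled = ArtifactsSketch(lemmas=["The", "Phone", "Number"])
```

With an empty lemma list the comprehension has nothing to iterate over, so `keywords` is an empty list without any special casing.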

omri374 · 2026-06-17T18:05:03Z

Thanks! Have you seen the slim nlp engine? https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/nlp_engine/slim_spacy_nlp_engine.py

Can we use this as is / adapt it?

ultramancode · 2026-06-18T10:07:16Z

Thanks for the pointer! I looked at SlimSpacyNlpEngine, and I agree it is relevant here.

The main difference is that slim is still spaCy-backed: it disables NER, but still initializes a spaCy pipeline for token/lemma artifacts. The use case here is a bit stricter: recognizers that do not need those artifacts at all, so we can skip initializing a spaCy model or blank pipeline.

We could technically adapt slim with a no-op mode, but that would broaden its contract: depending on configuration, slim would either provide token/lemma artifacts or return empty artifacts.

I went with a separate no_op engine to make the path clearer for cases where the recognizers themselves don’t need NLP artifacts, but I can adapt the PR if you’d prefer to keep that under slim.
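The contract difference described above can be sketched with two stand-in engines. None of these classes are Presidio's: `EngineSketch`, `SlimLikeEngine`, `NoOpLikeEngine`, and `Artifacts` are hypothetical, and whitespace splitting stands in for a spaCy pipeline.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Artifacts:
    tokens: List[str] = field(default_factory=list)
    lemmas: List[str] = field(default_factory=list)


class EngineSketch(ABC):
    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def process_text(self, text: str, language: str) -> Artifacts: ...

    def process_batch(
        self, texts: Iterable[str], language: str
    ) -> Iterator[Tuple[str, Artifacts]]:
        # Yield each text together with its artifacts.
        for text in texts:
            yield text, self.process_text(text, language)


class SlimLikeEngine(EngineSketch):
    """No NER, but still produces token and lemma artifacts."""

    def load(self) -> None:
        self.pipeline_loaded = True  # placeholder for initializing a pipeline

    def process_text(self, text: str, language: str) -> Artifacts:
        words = text.split()
        return Artifacts(tokens=words, lemmas=[w.lower() for w in words])


class NoOpLikeEngine(EngineSketch):
    """Initializes nothing and always returns empty artifacts."""

    def load(self) -> None:
        pass

    def process_text(self, text: str, language: str) -> Artifacts:
        return Artifacts()


slim_result = SlimLikeEngine().process_text("Call Dana", "en")
no_op_result = NoOpLikeEngine().process_text("Call Dana", "en")
```

Keeping the two as separate classes means each one has a single, predictable output shape, which is the argument made above for not adding a no-op mode to the slim engine.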

feat: add no-op NLP engine

9af8439

Copilot AI review requested due to automatic review settings June 17, 2026 15:22

github-actions Bot added the external label Jun 17, 2026

Copilot AI reviewed Jun 17, 2026

fix: address no-op NLP review comments

abe4c95

ultramancode wants to merge 2 commits into data-privacy-stack:main from ultramancode:feature/no-op-nlp-engine

3 participants
