
feat: add no-op NLP engine #2071

Open
ultramancode wants to merge 2 commits into data-privacy-stack:main from ultramancode:feature/no-op-nlp-engine

Conversation

@ultramancode

Contributor

Change Description

Adds a NoOpNlpEngine for analyzer configurations where the active recognizers do not need artifacts produced by an NLP engine.

This is mainly useful for recognizers such as HuggingFaceNerRecognizer, which runs model inference directly and does not need NLP artifacts from spaCy. In that setup, loading a spaCy model only adds startup cost.

With this change, users can select nlp_engine_name: no_op in analyzer configuration instead of configuring a real NLP model that will not be used.
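
To illustrate the selection path, here is a minimal Python sketch of an engine factory keyed on `nlp_engine_name`. It is not Presidio's provider code: `NoOpEngine`, `EmptyArtifacts`, `ENGINE_FACTORIES`, `create_engine`, and the `supported_languages` key are hypothetical names chosen for this example, and the sketch uses only the standard library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EmptyArtifacts:
    """Hypothetical stand-in for the artifacts an NLP engine returns."""

    language: str
    entities: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    lemmas: List[str] = field(default_factory=list)


class NoOpEngine:
    """Hypothetical engine: loads no model and always returns empty artifacts."""

    def __init__(self, supported_languages: List[str]) -> None:
        self.supported_languages = supported_languages

    def process_text(self, text: str, language: str) -> EmptyArtifacts:
        # The text is intentionally ignored; no tokenization or NER happens.
        return EmptyArtifacts(language=language)


# Hypothetical registry mapping engine names to constructors.
ENGINE_FACTORIES: Dict[str, Callable[[dict], NoOpEngine]] = {
    "no_op": lambda conf: NoOpEngine(conf.get("supported_languages", ["en"])),
}


def create_engine(conf: dict) -> NoOpEngine:
    """Build the engine named by conf["nlp_engine_name"]."""
    name = conf["nlp_engine_name"]
    if name not in ENGINE_FACTORIES:
        raise ValueError(f"Unknown NLP engine: {name!r}")
    return ENGINE_FACTORIES[name](conf)


config = {"nlp_engine_name": "no_op", "supported_languages": ["en"]}
engine = create_engine(config)
artifacts = engine.process_text("My name is Dana", language="en")
```

Under these assumptions, `artifacts` carries the requested language and empty lists for entities, tokens, and lemmas, so no model has to be downloaded or loaded at startup.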

Includes regression tests and an updated HuggingFaceNerRecognizer example.

Issue reference

Refs #2012

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Copilot AI review requested due to automatic review settings June 17, 2026 15:22

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a NoOpNlpEngine option to Presidio Analyzer so that recognizers can run without loading an NLP model; the engine returns empty artifacts instead. The change also integrates the engine into the NLP engine provider, the recognizer registry and its provider, and model installation.

Changes:

  • Introduces NoOpNlpEngine and a default YAML config for it (conf/no_op.yaml).
  • Updates NlpEngineProvider, model installation, and recognizer registry/provider to support/guard no-op behavior.
  • Adds comprehensive tests for initialization, batch/text processing, and Analyzer/provider integration.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `presidio-analyzer/presidio_analyzer/nlp_engine/no_op_nlp_engine.py` | Implements the new no-op NLP engine returning empty artifacts. |
| `presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py` | Registers no_op as an available engine and constructs it from config. |
| `presidio-analyzer/presidio_analyzer/nlp_engine/__init__.py` | Exposes NoOpNlpEngine in the package exports. |
| `presidio-analyzer/presidio_analyzer/conf/no_op.yaml` | Adds a config preset for the no-op engine. |
| `presidio-analyzer/install_nlp_models.py` | Skips model installation for no_op. |
| `presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py` | Skips NLP recognizer registration and blocks NLP recognizer retrieval for NoOpNlpEngine. |
| `presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py` | Prevents configuring NLP recognizers when using NoOpNlpEngine. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py` | Updates docs example to use no-op engine for recognizer-only flows. |
| `presidio-analyzer/tests/conftest.py` | Adds no_op to the parametrized NLP engines used in tests. |
| `presidio-analyzer/tests/test_no_op_nlp_engine.py` | Adds unit/integration tests for the no-op engine and its provider/analyzer behavior. |
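
The registry behavior summarized above (skip NLP recognizers at registration, refuse to hand one out on request) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Recognizer` dataclass, its `needs_nlp_artifacts` flag, and both helper functions are hypothetical, and the recognizer names are used purely as labels.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Recognizer:
    """Hypothetical recognizer descriptor used only for this sketch."""

    name: str
    needs_nlp_artifacts: bool


def load_recognizers(candidates: Iterable[Recognizer], engine_name: str) -> List[Recognizer]:
    """Register every recognizer, except NLP-dependent ones under no_op."""
    if engine_name != "no_op":
        return list(candidates)
    return [r for r in candidates if not r.needs_nlp_artifacts]


def get_nlp_recognizer(candidates: Iterable[Recognizer], engine_name: str) -> Recognizer:
    """Return the first NLP-dependent recognizer, or fail loudly under no_op."""
    if engine_name == "no_op":
        raise ValueError("NLP recognizers are not available with the no_op engine")
    for recognizer in candidates:
        if recognizer.needs_nlp_artifacts:
            return recognizer
    raise LookupError("No NLP recognizer registered")


candidates = [
    Recognizer("SpacyRecognizer", needs_nlp_artifacts=True),
    Recognizer("HuggingFaceNerRecognizer", needs_nlp_artifacts=False),
]
kept_names = [r.name for r in load_recognizers(candidates, "no_op")]
```

In this sketch only the recognizer that runs its own inference survives registration under `no_op`, and an explicit request for an NLP recognizer raises instead of silently returning one that would receive empty artifacts.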

Comment thread presidio-analyzer/tests/test_no_op_nlp_engine.py Outdated
Comment thread presidio-analyzer/tests/test_no_op_nlp_engine.py
Comment on lines +157 to +166
def _create_empty_nlp_artifacts(self, language: str) -> NlpArtifacts:
    return NlpArtifacts(
        entities=[],
        tokens=Doc(self._vocab, words=[]),
        tokens_indices=[],
        lemmas=[],
        nlp_engine=self,
        language=language,
        scores=[],
    )

Contributor Author


NlpArtifacts does not accept a keywords constructor argument. Keywords are derived internally from lemmas in NlpArtifacts.__init__, so passing lemmas=[] keeps artifacts.keywords as an empty list.
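
As a rough illustration of that point, the sketch below uses a hypothetical `Artifacts` class (not the real `NlpArtifacts`) whose constructor derives `keywords` from `lemmas`. The stop-word filter and the `STOP_WORDS` set are assumptions made for the example; the actual derivation logic may differ.

```python
from typing import List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "my"}


class Artifacts:
    """Hypothetical container: keywords is derived in __init__, never passed in."""

    def __init__(self, lemmas: List[str]) -> None:
        self.lemmas = lemmas
        self.keywords = self._derive_keywords(lemmas)

    @staticmethod
    def _derive_keywords(lemmas: List[str]) -> List[str]:
        # Lowercase each lemma and drop stop words.
        return [lemma.lower() for lemma in lemmas if lemma.lower() not in STOP_WORDS]


empty = Artifacts(lemmas=[])
populated = Artifacts(lemmas=["my", "card", "number"])

# Passing keywords explicitly is rejected, because __init__ has no such parameter.
try:
    Artifacts(lemmas=[], keywords=[])
    rejected = False
except TypeError:
    rejected = True
```

With an empty `lemmas` list the derived `keywords` list is necessarily empty too, which is why the empty-artifacts helper does not need to set it separately.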

Comment thread presidio-analyzer/presidio_analyzer/nlp_engine/no_op_nlp_engine.py Outdated
Comment thread presidio-analyzer/presidio_analyzer/nlp_engine/no_op_nlp_engine.py
Comment thread presidio-analyzer/presidio_analyzer/conf/no_op.yaml
@omri374

omri374 commented Jun 17, 2026

Collaborator

Thanks! Have you seen the slim nlp engine? https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/nlp_engine/slim_spacy_nlp_engine.py

Can we use this as is / adapt it?

@ultramancode

Contributor Author

Thanks for the pointer! I looked at SlimSpacyNlpEngine, and I agree it is relevant here.

The main difference is that slim is still spaCy-backed: it disables NER, but still initializes a spaCy pipeline for token/lemma artifacts. The use case here is a bit stricter: recognizers that do not need those artifacts at all, so we can skip initializing a spaCy model or blank pipeline.

We could technically adapt slim with a no-op mode, but that would broaden its contract: depending on configuration, slim would either provide token/lemma artifacts or return empty artifacts.

I went with a separate no_op engine to make the path clearer for cases where the recognizers themselves don’t need NLP artifacts, but I can adapt the PR if you’d prefer to keep that under slim.
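
A small sketch of the contract difference, under simplifying assumptions: whitespace splitting stands in for a real spaCy pipeline, and `FlaggedSlimEngine`, `SlimEngine`, and `NoOpEngine` are hypothetical classes, not the ones in the codebase.

```python
from typing import Dict, List

Artifacts = Dict[str, List[str]]


def _tokenize(text: str) -> Artifacts:
    # Stand-in for a real pipeline: whitespace tokens, lowercased "lemmas".
    tokens = text.split()
    return {"tokens": tokens, "lemmas": [t.lower() for t in tokens]}


class FlaggedSlimEngine:
    """One class, two contracts: the output shape depends on configuration."""

    def __init__(self, no_op: bool = False) -> None:
        self.no_op = no_op

    def process_text(self, text: str) -> Artifacts:
        if self.no_op:
            return {"tokens": [], "lemmas": []}
        return _tokenize(text)


class SlimEngine:
    """Always provides token and lemma artifacts."""

    def process_text(self, text: str) -> Artifacts:
        return _tokenize(text)


class NoOpEngine:
    """Always returns empty artifacts and never builds a pipeline."""

    def process_text(self, text: str) -> Artifacts:
        return {"tokens": [], "lemmas": []}


flagged_empty = FlaggedSlimEngine(no_op=True).process_text("Hello World")
slim_result = SlimEngine().process_text("Hello World")
no_op_result = NoOpEngine().process_text("Hello World")
```

With the flagged variant, a caller has to inspect configuration to know whether tokens and lemmas will be present; with two classes, the engine type alone tells them.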
