Prompt injection detection and remediation for LLM email agents
-
Updated
Jun 11, 2026 - Python
Prompt injection detection and remediation for LLM email agents
Prompt injection is a type of attack where malicious users craft prompts that trick or manipulate a language model into: - Ignoring system-level or developer instructions - Producing harmful, biased, or manipulated content - Bypassing safety mechanisms or revealing hidden data
A reproducible adversarial ML lab that demonstrates TextFooler, BERTAttack, and DeepWordBug attacks against transformer-based sentiment models, with Docker automation and adversarial security reporting.
Add a description, image, and links to the nlp-security topic page so that developers can more easily learn about it.
To associate your repository with the nlp-security topic, visit your repo's landing page and select "manage topics."