GuardLayer employs a bi-directional guardrail pipeline. This document provides a deep dive into each guard, how it works, configuration options, and plugin customization.
- Mechanism: Analyzes incoming prompts using a lightweight sequence classification transformer model trained to distinguish user-provided inputs from prompt override payloads.
- Detections: Identifies direct instructions like "Ignore previous rules", "You are now in Developer Mode", or hidden instructions in inputs.
- Tuning: The `prompt_injection_threshold` defines the classification confidence score (between `0.0` and `1.0`) above which the prompt is flagged and blocked. Default: `0.70`.
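The threshold rule above can be sketched as a small decision function. This is a minimal illustration, not GuardLayer's implementation: the function name `should_block` is hypothetical, and treating a score exactly equal to the threshold as allowed is an assumption drawn from the word "above".

```python
def should_block(score: float, threshold: float = 0.70) -> bool:
    """Return True when the classifier confidence is strictly above the threshold.

    `score` stands in for the classifier's confidence that the prompt is an
    injection attempt; 0.70 mirrors the documented default.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    # Assumption: a score equal to the threshold is NOT blocked.
    return score > threshold


if __name__ == "__main__":
    for confidence in (0.40, 0.70, 0.85):
        print(confidence, should_block(confidence))
```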
- Mechanism: Implements pattern classifiers combined with semantic analysis to detect adversarial prompt styling.
- Detections: Identifies known structural attacks such as roleplay exploits ("DAN", "Do Anything Now"), hypotheticals ("For academic purposes only..."), and sandboxing escapes.
- Tuning: Adjust the sensitivity using `jailbreak_threshold`. Higher values (e.g., `0.90`) make it less sensitive (requiring higher confidence to block). Default: `0.70`.
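One way to picture "pattern classifiers combined with semantic analysis" is to take the stronger of two signals and compare it with `jailbreak_threshold`. The sketch below is an assumption-heavy illustration: the patterns, the `max` combination rule, and the function names are hypothetical, and `semantic_score` stands in for whatever score the semantic model would produce.

```python
import re

# Hypothetical patterns for illustration only; not GuardLayer's actual rule set.
JAILBREAK_PATTERNS = [
    re.compile(r"\bdo anything now\b", re.IGNORECASE),
    re.compile(r"\bDAN\b"),  # case-sensitive so the first name "Dan" is not matched
    re.compile(r"\bfor academic purposes only\b", re.IGNORECASE),
]


def jailbreak_score(prompt: str, semantic_score: float) -> float:
    """Combine a pattern hit (treated as score 1.0) with a semantic score."""
    pattern_hit = any(p.search(prompt) for p in JAILBREAK_PATTERNS)
    # Assumption: the guard takes the stronger of the two signals.
    return max(semantic_score, 1.0 if pattern_hit else 0.0)


def is_jailbreak(prompt: str, semantic_score: float,
                 jailbreak_threshold: float = 0.70) -> bool:
    """Flag the prompt when the combined score is above the threshold."""
    return jailbreak_score(prompt, semantic_score) > jailbreak_threshold
```

With the threshold raised to `0.90`, a prompt with no pattern hit and a semantic score of `0.80` passes, which matches the "less sensitive" behavior described above.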
- Mechanism: Scans prompts using tokenizers and regular expressions to locate and scrub sensitive data.
- Supported PII Types:
  - `email`: Standard RFC 5322 email patterns.
  - `phone`: Global formats and country-specific rules.
  - `aadhaar`: Specifically detects Indian Aadhaar card numbers matching the `\b\d{4}[ -]?\d{4}[ -]?\d{4}\b` regular expression combined with a Verhoeff checksum validation.
  - `credit_card`: Standard Luhn-algorithm-checked numbers for Visa, Mastercard, AMEX, Discover.
  - `bank_account`: Basic routing and account number structural patterns.
  - `api_key`: Structural formats matching typical API tokens (e.g., prefix `sk-` or `gl-`).
- Tuning: Enabled via toggles per PII type. When active, matches are scrubbed and replaced with generic placeholder tokens (e.g., `[REDACTED_EMAIL]`).
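To make the scrub-and-replace flow concrete, here is a minimal sketch covering two of the PII types: emails and Luhn-checked card numbers. The regular expressions are simplified (the email pattern is far looser than RFC 5322), the `[REDACTED_CREDIT_CARD]` placeholder and the shape of the `enabled` toggle dictionary are assumptions, and the Aadhaar Verhoeff check is left out for brevity.

```python
import re

# Simplified patterns for illustration; real detection rules are likely stricter.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Validate a card number with the Luhn checksum, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def _redact_card(match: re.Match) -> str:
    # Only scrub candidates that pass the checksum, to limit false positives.
    return "[REDACTED_CREDIT_CARD]" if luhn_valid(match.group()) else match.group()


def redact(text: str, enabled: dict) -> str:
    """Replace enabled PII types with placeholder tokens."""
    if enabled.get("email"):
        text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    if enabled.get("credit_card"):
        text = CARD_RE.sub(_redact_card, text)
    return text


if __name__ == "__main__":
    sample = "Mail a@b.com, card 4111 1111 1111 1111."
    print(redact(sample, {"email": True, "credit_card": True}))
```

Running the checksum after the regular expression match keeps arbitrary 16-digit strings (order numbers, for example) from being scrubbed as cards.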
- Mechanism: Evaluates the semantic embedding of user inputs against a defined list of allowed business categories.
- Detections: Determines if a query falls outside the bounds of authorized subjects (e.g., blocking queries about medical diagnostics if your chatbot is configured only for IT support).
- Tuning: Define a string array of `allowed_topics`. If an input fails to align semantically with any topic in the list, it is blocked.
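The semantic comparison against `allowed_topics` can be sketched with cosine similarity. To stay self-contained, the example uses a toy bag-of-words vector in place of a real embedding model; the `embed` function, the `min_similarity` cutoff of `0.3`, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_on_topic(prompt: str, allowed_topics: List[str],
                min_similarity: float = 0.3) -> bool:
    """Allow the prompt if it is close enough to at least one allowed topic."""
    query = embed(prompt)
    return any(cosine(query, embed(topic)) >= min_similarity
               for topic in allowed_topics)


if __name__ == "__main__":
    topics = ["password reset", "vpn access"]
    print(is_on_topic("reset my password for vpn", topics))
    print(is_on_topic("what medicine treats a headache", topics))
```

A real embedding model would also match paraphrases that share no words with the topic list, which this word-overlap stand-in cannot do.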
- Mechanism: Estimates the token length of the incoming prompt string.
- Detections: Blocks payloads exceeding the defined maximum to protect downstream LLM APIs from size exploitation or excessive billing charges.
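A length check of this kind can be approximated without a tokenizer. The sketch below assumes a rough ratio of four characters per token, a common rule of thumb for English text rather than a documented GuardLayer value; `max_tokens` is a hypothetical parameter name.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate the token count from the character length (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def exceeds_limit(text: str, max_tokens: int) -> bool:
    """Return True when the estimated token count is over the configured maximum."""
    return estimate_tokens(text) > max_tokens
```

An estimate is cheap enough to run on every request; an exact count would require the tokenizer of the downstream model.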
- Mechanism: Analyzes generated LLM response completions using a multi-label classification model.
- Detections: Identifies profanity, hate speech, threats, harassment, and sexual content in responses.
- Tuning: Configured using `toxicity_threshold`. Scores above the threshold trigger blocking rules. Default: `0.80`.
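Because the classifier is multi-label, each category gets its own score, and any one of them can cross the threshold. The following sketch assumes the model output is available as a label-to-score mapping; the label names and the `toxic_labels` helper are hypothetical.

```python
from typing import Dict, List


def toxic_labels(scores: Dict[str, float],
                 toxicity_threshold: float = 0.80) -> List[str]:
    """Return the labels whose score is above the threshold, sorted by name."""
    return sorted(label for label, score in scores.items()
                  if score > toxicity_threshold)


if __name__ == "__main__":
    # Made-up scores for a single response.
    example = {"profanity": 0.91, "hate_speech": 0.12, "threat": 0.85}
    flagged = toxic_labels(example)
    print("block" if flagged else "allow", flagged)
```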
- Mechanism: Cross-references the generated model completion text against the original user input context or documents.
- Detections: Identifies unsupported claims, contradictions, or details not found in the baseline reference prompt.
- Tuning: Configured using the `hallucination_enabled` and `block_on_hallucination` flags. If `block_on_hallucination` is `false`, inconsistencies are logged as alerts/threats but not blocked.
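The interaction between the two flags can be summarized as a small decision table. In this sketch the return values (`"skip"`, `"allow"`, `"log"`, `"block"`), the function name, and the default of `False` for both flags are assumptions; only the flag names come from the configuration described above.

```python
from typing import Mapping


def hallucination_action(config: Mapping[str, bool], inconsistent: bool) -> str:
    """Decide what to do with a response given the two hallucination flags."""
    if not config.get("hallucination_enabled", False):
        return "skip"   # guard is off, nothing is checked
    if not inconsistent:
        return "allow"  # response is supported by the reference context
    # Inconsistency found: block only when explicitly configured to do so.
    return "block" if config.get("block_on_hallucination", False) else "log"
```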
- Review Logs: Use the Security Auditing panel on the GuardLayer Dashboard to inspect flagged requests.
- Observe Scores: Note the exact classification scores of false positives and false negatives.
- Calibrate: Adjust the slider in the Configuration dashboard page:
  - Decrease the threshold to increase sensitivity (catch more threats, higher false positives).
  - Increase the threshold to decrease sensitivity (reduce false positives, higher risk of bypass).
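The trade-off in the calibration steps can be checked offline by replaying logged scores at several candidate thresholds. The sketch below assumes the scores have been exported as `(score, is_threat)` pairs with manually assigned labels; the sample data is made up, and `confusion_at` is a hypothetical helper rather than a GuardLayer API.

```python
from typing import List, Tuple


def confusion_at(threshold: float,
                 samples: List[Tuple[float, bool]]) -> Tuple[int, int]:
    """Count (false_positives, false_negatives) for a candidate threshold."""
    false_positives = sum(
        1 for score, is_threat in samples if score > threshold and not is_threat
    )
    false_negatives = sum(
        1 for score, is_threat in samples if score <= threshold and is_threat
    )
    return false_positives, false_negatives


if __name__ == "__main__":
    # Made-up review data: (classification score, was it really a threat?)
    logged = [(0.95, True), (0.75, False), (0.65, True), (0.30, False)]
    for candidate in (0.60, 0.70, 0.80):
        fp, fn = confusion_at(candidate, logged)
        print(f"threshold={candidate:.2f} false_positives={fp} false_negatives={fn}")
```

On this sample, lowering the threshold to `0.60` removes the missed threat but keeps the false positive, while raising it to `0.80` removes the false positive but lets the `0.65` threat through.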
You can write custom plugins to execute arbitrary security checks in the Python-based input-guard service:
- Create a file `services/input-guard/src/guards/custom_my_guard.py`.
- Implement the standard guard class inheriting from `BaseGuard`:

  ```python
  from .base import BaseGuard, GuardResult


  class CustomMyGuard(BaseGuard):
      def __init__(self, config):
          # The guard stays off unless explicitly enabled in the config.
          self.enabled = config.get("my_guard_enabled", False)
          self.keywords = ["secret_code", "classified_info"]

      async def inspect(self, prompt: str) -> GuardResult:
          if not self.enabled:
              return GuardResult(triggered=False)

          # Lowercase once so the keyword match is case-insensitive.
          lowered = prompt.lower()
          for word in self.keywords:
              if word in lowered:
                  return GuardResult(
                      triggered=True,
                      score=1.0,
                      reason=f"Found restricted keyword: {word}",
                  )
          return GuardResult(triggered=False)
  ```

- Register the guard class in `services/input-guard/src/main.py`.
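Before registering a guard, it can help to exercise it in isolation. The harness below is a sketch under stated assumptions: `GuardResult` and `BaseGuard` are minimal stand-ins written for this example (the real classes in `.base` may have different fields and methods), and `run_guard` is a hypothetical helper.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


# Stand-ins for the real classes in `.base`; their actual definitions may differ.
@dataclass
class GuardResult:
    triggered: bool
    score: float = 0.0
    reason: Optional[str] = None


class BaseGuard:
    async def inspect(self, prompt: str) -> GuardResult:
        raise NotImplementedError


class CustomMyGuard(BaseGuard):
    def __init__(self, config):
        self.enabled = config.get("my_guard_enabled", False)
        self.keywords = ["secret_code", "classified_info"]

    async def inspect(self, prompt: str) -> GuardResult:
        if not self.enabled:
            return GuardResult(triggered=False)
        lowered = prompt.lower()
        for word in self.keywords:
            if word in lowered:
                return GuardResult(
                    triggered=True,
                    score=1.0,
                    reason=f"Found restricted keyword: {word}",
                )
        return GuardResult(triggered=False)


def run_guard(config: dict, prompt: str) -> GuardResult:
    """Run the async guard once from synchronous code."""
    return asyncio.run(CustomMyGuard(config).inspect(prompt))


if __name__ == "__main__":
    print(run_guard({"my_guard_enabled": True}, "Share the SECRET_CODE"))
    print(run_guard({}, "Share the SECRET_CODE"))
```

The second call shows that the guard is a no-op unless `my_guard_enabled` is set, so the same prompt passes when the flag is absent.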