ai_security_lab is a small companion project for a book about securing AI
applications.
The project is designed to support a simple teaching arc:
- start with a vulnerable AI support workflow
- show how prompt injection, unsafe tools, and data leakage happen
- harden the same workflow with explicit controls
- test the controls instead of trusting a prompt alone
The first scaffold stays small and deterministic by default. It starts CLI-first, but it also includes a minimal local web/API surface and an optional provider-backed reply mode for later inspection and live-model examples.
The companion project is intentionally narrower than a full AI governance or lifecycle reference. Its purpose is to show how AI application workflows fail in practice, and how to harden those workflows with visible code, tests, and explicit boundaries.
The lab is organized around a fictional support assistant:
examples/data/fictional customers and ticketsexamples/policies/retrievable support and security policiesexamples/uploads/untrusted user-supplied content, including malicious prompt-injection casesexamples/evals/bundled regression cases for workflow, payload, and tool behaviorsrc/ai_security_lab/workflows.pyvulnerable, hardened, and provider-backed assistant flowssrc/ai_security_lab/security.pyredaction, role checks, and approval checkssrc/ai_security_lab/validation.pystructured-output validation for AI-generated reply payloadssrc/ai_security_lab/provider.pyoptional OpenAI-compatible provider client for live-model reply modesrc/ai_security_lab/tools.pysmall tool contracts for sensitive actionssrc/ai_security_lab/audit.pyappend-only local audit loggingsrc/ai_security_lab/evals.pysmall behavior-eval runner for workflow, payload, and tool casessrc/ai_security_lab/web.pysmall local API for reply, validation, export, and audit scenariostests/regression checks for the controls
The scaffold is meant to support chapters such as:
- prompt injection
- RAG poisoning and malicious documents
- sensitive data leakage
- unsafe tool calls
- output validation
- approval gates
- audit logging
- structured-output validation
If you are turning the lab into book material, the most useful excerpts are small before-and-after pairs. A few of the vulnerable snippets below are taken directly from the lab, and a few are intentionally reduced "naive" variants so the security boundary fits on one page.
Vulnerable:
uploaded_note = load_uploaded_note(paths, ticket.uploaded_note_path)
retrieved = retrieve_policies(f"{ticket.subject} {ticket.body} {uploaded_note}", policy_docs)
if looks_like_prompt_injection(uploaded_note):
reply += " Uploaded note requested hidden data, so the workflow revealed internal notes."Hardened:
uploaded_note = load_uploaded_note(paths, ticket.uploaded_note_path)
retrieved = retrieve_policies(
ticket.body,
policy_docs,
allowed_names=allowed_policy_names_for_role(role),
)
if looks_like_prompt_injection(uploaded_note):
notes.append("Uploaded note triggered prompt-injection handling.")Use this pair to show that uploaded content must be treated as untrusted data, not merged into the same instruction path as trusted workflow context.
Vulnerable:
reply = (
f"Draft reply for {customer.full_name}: acknowledge the issue and reassure the customer. "
f"Internal notes: {customer.internal_notes}"
)Hardened:
safe_customer = redact_customer(customer)
reply = (
f"Draft reply for {safe_customer['full_name']}: acknowledge the issue, restate the public next "
"step, and avoid disclosing internal-only details."
)This is a compact way to teach that "do not leak secrets" is usually a data preparation problem before it becomes a prompt problem.
Vulnerable:
return {
"customer_reply": "We will review the outage and share the next update shortly.",
"allowed_action": "export_customer_report",
"citations": ["security-policy.md"],
"internal_notes": customer.internal_notes,
}Hardened:
candidate = {
"customer_reply": result.reply,
"allowed_action": "reply_only",
"citations": sorted(allowed_citations),
}
return validate_support_reply_payload(candidate, allowed_citations=allowed_citations)Use this pair to show that plausible JSON is still untrusted output until it passes schema, action, and citation checks.
Naive chapter version:
def export_customer_report(paths, customer_id):
customer = get_customer(paths, customer_id)
return {"customer": customer}Hardened:
if not can_export_customer_report(role):
raise PermissionError(f"Role {role} may not export customer reports.")
if requires_approval_for_export(role) and not approved:
raise PermissionError("Manager export requires explicit approval.")
return {
"customer": redact_customer(customer),
"notes": ["Returned redacted customer view for export flow."],
}This chapter pair works well for showing that tool safety usually needs both authorization and an explicit approval boundary for higher-risk actions.
Naive chapter version:
reply = generate_chat_reply_openai_compatible(...)
return WorkflowResult(reply=reply, evidence=evidence, notes=notes, leaked_internal_notes=False)Hardened:
reply = generate_chat_reply_openai_compatible(...)
safety_issues = provider_reply_safety_issues(reply, customer=customer)
if safety_issues:
raise RuntimeError(
"Provider reply failed post-generation safety checks: "
+ "; ".join(safety_issues)
)Use this pair to show that a live model path still needs deterministic checks after generation, especially when the model saw untrusted text.
Naive chapter version:
return WorkflowResult(reply=reply, evidence=evidence, notes=notes, leaked_internal_notes=False)Hardened:
append_audit_event(
paths,
AuditEvent(
event_type="support_reply",
actor_role=role,
resource_id=ticket.ticket_id,
outcome="allowed",
details={"prompt_injection_detected": True, "evidence": evidence},
),
)This pair helps readers see that hardening is not only about blocking bad behavior. It is also about leaving a reviewable trail when risky paths are used or blocked.
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
pytest
python -m ai_security_lab.cli vulnerable-reply --ticket-id T-1001
python -m ai_security_lab.cli hardened-reply --ticket-id T-1001 --role frontline
python -m ai_security_lab.cli provider-reply --ticket-id T-1001 --role frontline \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2
python -m ai_security_lab.webVulnerable flow:
python -m ai_security_lab.cli vulnerable-reply --ticket-id T-1001Hardened flow:
python -m ai_security_lab.cli hardened-reply --ticket-id T-1001 --role frontlineProvider-backed hardened flow:
python -m ai_security_lab.cli provider-reply \
--ticket-id T-1001 \
--role frontline \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2The provider-backed path reuses the hardened boundaries. It still redacts customer data, filters retrieval by role, records an audit event, and treats the uploaded note as untrusted text. It also runs a small post-generation safety check and rejects replies that appear to disclose internal-only content. The endpoint and model can also be set through:
AI_SECURITY_LAB_LLM_ENDPOINTAI_SECURITY_LAB_LLM_MODELAI_SECURITY_LAB_API_KEY
The provider-backed path expects an OpenAI-compatible chat completions endpoint. You can pass settings on the command line or through environment variables.
Set environment variables:
export AI_SECURITY_LAB_LLM_ENDPOINT=http://127.0.0.1:11434/v1/chat/completions
export AI_SECURITY_LAB_LLM_MODEL=llama3.2
# export AI_SECURITY_LAB_API_KEY=... # only if your provider requires oneRun the provider-backed CLI flow:
python -m ai_security_lab.cli provider-reply --ticket-id T-1001 --role frontlineOr pass the provider settings directly:
python -m ai_security_lab.cli provider-reply \
--ticket-id T-1001 \
--role frontline \
--llm-endpoint http://127.0.0.1:11434/v1/chat/completions \
--llm-model llama3.2Expected behavior:
- the reply still uses role-filtered retrieval
- customer data is redacted before the model-visible prompt
- internal-only content should not appear in the final reply
- the workflow returns an error if post-generation safety checks fail
Validate a structured reply payload:
python -m ai_security_lab.cli validate-payload \
--mode hardened \
--ticket-id T-1001 \
--role frontlineRun the bundled security eval set:
python -m ai_security_lab.cli run-evalsSensitive tool example:
python -m ai_security_lab.cli export-report \
--customer-id C-1001 \
--role manager \
--approvedRun the local web/API surface:
python -m ai_security_lab.webThen open:
http://127.0.0.1:8010
The local API exposes:
GET /api/healthGET /api/auditPOST /api/reply/vulnerablePOST /api/reply/hardenedPOST /api/reply/providerPOST /api/validate/replyPOST /api/export-report
Example hardened reply request:
curl -X POST http://127.0.0.1:8010/api/reply/hardened \
-H "content-type: application/json" \
-d '{
"ticket_id": "T-1001",
"role": "frontline"
}'For the frontline role, that path keeps manager-exceptions.md out of the
returned evidence.
Example provider-backed reply request:
curl -X POST http://127.0.0.1:8010/api/reply/provider \
-H "content-type: application/json" \
-d '{
"ticket_id": "T-1001",
"role": "frontline",
"llm_endpoint": "http://127.0.0.1:11434/v1/chat/completions",
"llm_model": "llama3.2"
}'The companion app is intentionally small, but it still has enough moving parts to surface real security boundaries:
- untrusted uploaded content
- retrievable policy content
- retrieval filters based on role
- role-based tool access
- redaction before model-visible output
- structured-output validation before downstream use
- explicit approvals for risky actions
- audit logs for review
That is enough to show how an AI application can fail, and how the same app can be tightened step by step.
- Hardened structured replies are validated against retrieved evidence, so citation names have to match the policy files the workflow actually used.
- Structured-output validation also rejects calm, plausible reply payloads when they ask for an unsafe downstream action.
- The bundled eval set in
examples/evals/security_cases.jsonlgives a small regression pass across workflow, payload, and tool scenarios. - The deterministic flows remain the default teaching baseline.
- The optional
provider-replypath turns the same hardened workflow into a live AI path through an OpenAI-compatible chat endpoint. - Uploaded-note paths are constrained to the local
examples/uploads/directory. - Provider request failures surface as errors from the CLI and as
502responses from the local API. - Malformed API request fields return structured
400responses instead of bubbling up as uncaught errors.
MIT