SentrySkills is a self-guarding security framework for AI agents. The current version uses a rule-first frontend and a conditional model backend:
base_rule -> extra_rule -> rule_gate -> risk assessment -> model_stage(sync or async) -> end-of-task proposal sweep
- All tasks go through the rule frontend first.
base_ruleandextra_ruleare always synchronous.rule_gateusesblock > downgrade > allow.model_stageis only entered when the rule stage does not block.- Knowledge writeback is only allowed after a completed
model_stage. - The main framework agent performs one proposal sweep at task end.
- Dynamic calls to other skills are gated at runtime before the skill executes.
- Planned and skill actions are gated individually; downgrade means execute allowed actions only.
- Preflight, runtime, and output each follow rule-then-model evaluation.
- Runtime state is workspace-local under
.sentryskills/baseand.sentryskills/extra.
using-sentryskillsEntry skill and execution contractsentryskills-preflightBase-rule pre-execution checkssentryskills-runtimeBase-rule runtime monitoringsentryskills-outputBase-rule output protectionsentryskills-extraExtra-rule detection plus post-model knowledge managementshared/scripts/self_guard_runtime_hook_template.pyMain runtime script
The system always runs:
base_ruleextra_rulerule_gate
If rule_stage_action == block, the turn ends immediately. No model stage and no knowledge writeback are allowed.
If rule_stage_action != block, the main framework agent may enter model_stage.
Dispatch policy:
- assign
framework_risk_level = high | low high -> synclow + subagent support -> asynclow + no stable subagent support -> sync
Subagent capability may exist at all times, but actual dispatch is still decided by the main framework agent after risk assessment.
Other skills may be invoked dynamically while a task is already running. Before
calling any non-SentrySkills skill, the framework should submit a
skill_invocation payload to the runtime hook. The hook returns a skill_gate
section with:
allowed_skill_stepsblocked_skill_stepsrequires_confirmationsafe_substitute_steps
downgrade means selective execution: run only the allowed low-risk steps and
skip the blocked skill actions. Runtime events produced by that skill should
carry skill_name or skill_invocation_id so the hook can verify that the
actual actions match the allowed steps. Ungated or not-allowed skill actions are
reported as runtime risk signals.
The runtime hook reports an action_gate with allowed_actions,
blocked_actions, blocked_action_reasons, and execution_directive.
allow: execute all declared actions.downgrade: execute only allowed actions.block: execute no declared actions.
If a task declares multiple actions and only some are risky, the framework
should skip the blocked actions and continue with the allowed subset. If every
declared action is blocked, the final action becomes block.
Each major stage has a rule layer and an optional model layer:
preflight_rule -> preflight_modelruntime_rule -> runtime_modeloutput_rule -> output_model
Rules run first. Model results can add stricter findings but cannot relax a
rule-layer block. Extra rules may declare phase_scope; prompt-injection rules
normally apply to preflight/runtime, while output rules focus on leakage and
unsafe final text.
Only a completed model_stage may generate:
- candidate extra rules
- textual memory
- dedup audit
- validation audit
- promoted active extra rules
Pure rule hits do not create new knowledge.
If model_stage is completed by an async subagent, the result is first written as a proposal file. The main framework agent later sweeps proposal files at task end and performs the actual rule update pipeline. Proposal sweep only affects subsequent turns and never rewrites the already finalized current turn.
Rule learning uses a unified promotion context, not experiment-specific modes:
promotion_context.source_type:online,evolution,benchmark,manual, orshadow.promotion_context.update_mode:learn,candidate_only, orread_only.promotion_context.promotion_policy:conservative,variant_validated,shadow_confirmed, ormanual_reviewed.promotion_context.snapshot_request = true: writes.sentryskills/extra/memory/rule_snapshot_manifest.json.
Actual use and experiments share the same pipeline: model-stage observations become candidate rules with evidence, candidates are validated, validated rules are promoted into the active store, and snapshots make a fixed rule version available for serving or evaluation. Evaluation runs should use update_mode = read_only against a selected snapshot so ASR/TSR is measured without rule growth.
Missed-risk learning does not require the model to identify the risk first. The hook can trigger a rule-proposer task from:
- short user feedback such as
feedback_text = "刚才被攻击了,不要再允许这种行为"orfeedback = "this should not have executed" feedback.known_risk = trueorfeedback.attack_success = true- explicit
outcome_signals - automatic outcome signals such as blocked actions, runtime alerts, sensitive reads, or output leakage
When this happens and rules are not read-only, the summary records model_stage_status = rule_proposer_required or knowledge_writeback_status = awaiting_rule_proposer, with pending_rule_proposer_task containing the evidence bundle. The framework model should complete that task and call the hook again with model_stage.rule_candidates.
User feedback can be a single sentence:
{
"feedback_text": "刚才被攻击了,不要再允许这种 skill 指令"
}Frameworks and benchmark runners may provide the richer structured label:
{
"promotion_context": {
"source_type": "evolution",
"update_mode": "learn",
"source_case_id": "case-001"
},
"feedback": {
"known_risk": true,
"attack_success": true,
"risk_source": "benchmark_label",
"attack_family": "skill_injection",
"failed_guard_stage": "model_stage",
"why_failed": "The model allowed a malicious instruction embedded in a skill.",
"positive_evidence": ["malicious instruction text", "risky action trace"],
"negative_evidence": ["nearby benign skill usage"]
}
}The runtime script now exposes these stage fields in summaries and logs:
base_rule_actionextra_rule_actionrule_stage_actionframework_risk_levelmodel_dispatch_modemodel_stage_statusmodel_stage_actionmodel_executormodel_stage_result_availableaction_gateskill_gateexecution_directiveproposal_sweep_effectknowledge_writeback_statuspromotion_contextfinal_action
model_stage_status values:
skipped: rule stage blocked, so model stage was intentionally not runpending: async/subagent model stage is waiting for a resultrequired_not_provided: non-blocked turn needs model stage, but the framework has not yet called the hook again with a completedmodel_stagecompleted: model stage result was provided and consumed
final_action is always the executable decision for the current turn. Async model results do not retroactively rewrite an already finished turn.
New extra rules are promoted only after validation. The framework model should attach validation_cases.positive and validation_cases.negative to each proposed rule; the runtime hook deterministically runs the candidate rule against those cases and rejects rules that miss generated positives or match generated negatives. Candidates may also include source_case_id, source_cases, evidence_source, attack_family, generalization_basis, positive_variants, negative_variants, promotion_target, and promotion_rationale.
.sentryskills/base/- unified logs
- turn results
- session state
- index
.sentryskills/extra/- active extra rules
- candidate extra rules
- textual memory
- dedup audit
- validation audit
- promotion audit
- rule snapshot manifest
- Claude Code Prefer hook-enforced rule-first execution; model stage should be dispatched after framework risk assessment.
- Codex / OpenClaw
Use
SKILL.md+AGENTS.mddiscipline. Only low-risk turns may use async/subagent model-stage execution; otherwise treatmodel_stageas synchronous.
See:
- install/claude_code_install.md
- install/codex_install.md
- install/openclaw_install.md
- install/experiment_protocol.md
- Python 3.8+
- no external Python dependencies for the core runtime path