[REVIEW] zero-trust-assessment: add policy decision continuity and fail-secure gates

## Skill Being Reviewed

**Skill name:** `zero-trust-assessment`
**Skill path:** `skills/identity/zero-trust-assessment/`

## False Positive Analysis

**Benign evidence that can be under-scored if every degraded mode is treated as a failure:**

```yaml
access_path: "on-call engineer -> incident ticketing system"
policy_engine: "IdP conditional access"
policy_administrator: "ZTNA controller"
policy_enforcement_point: "resource gateway"
failure_mode:
  policy_engine_unreachable: "deny new sessions; keep existing read-only sessions for 15 minutes"
  stale_signal_action: "deny privileged actions"
break_glass:
  scope: "ticketing read/write only"
  approval: "incident commander plus security lead"
  alerting: "security channel and SIEM"
  session_capture: true
```

**Why this should not be over-reported:**

Zero trust systems can have a tightly bounded degraded mode for business continuity. A short, read-only cached session with explicit TTL, denial of new sessions, approval-based break-glass, alerting, and session capture is different from a PEP that silently allows traffic when the decision plane is unreachable.

The current skill has strong PE/PA/PEP and maturity language, but it does not require reviewers to collect failure-mode evidence. That can lead to two bad outcomes: flagging every degraded mode as a failure, or over-scoring a dynamic policy architecture without proving fail-secure behavior.
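One way to make the distinction concrete is a small predicate over the collected failure-mode evidence. The sketch below is illustrative only: the `DegradedMode` fields and the 15-minute default are assumptions borrowed from the benign example above, not fields the skill currently defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DegradedMode:
    # Hypothetical evidence fields; names are illustrative, not part of the skill.
    allows_new_sessions: bool
    existing_sessions_read_only: bool
    session_ttl_minutes: Optional[int]  # None means no explicit TTL was documented
    privileged_actions_denied: bool


def is_bounded_degraded(mode: DegradedMode, max_ttl_minutes: int = 15) -> bool:
    """Return True only when every bound on the degraded mode is evidenced."""
    return (
        not mode.allows_new_sessions
        and mode.existing_sessions_read_only
        and mode.session_ttl_minutes is not None
        and mode.session_ttl_minutes <= max_ttl_minutes
        and mode.privileged_actions_denied
    )


# Benign case from the example above versus a PEP that silently fails open.
benign = DegradedMode(False, True, 15, True)
fail_open = DegradedMode(True, False, None, False)
```

Under these assumptions `is_bounded_degraded(benign)` is true and `is_bounded_degraded(fail_open)` is false, so a reviewer neither flags the first nor credits the second.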

## Coverage Gaps

**Missed variant 1: cached allow decisions survive revocation**

```yaml
signal_freshness:
  max_allowed_age: "15 minutes"
  last_device_signal_age: "9 hours"
failure_mode:
  policy_engine_unreachable: "allow existing and new sessions"
  policy_cache_ttl: "24 hours"
observed_test:
  disabled_user_continued_access: true
  non_compliant_device_continued_access: true
```

**Why it should be caught:**

NIST SP 800-207 describes the policy engine as the component that grants, denies, or revokes access, and the policy administrator as the component that signals PEPs to allow or shut down sessions. If cached allow decisions outlive user disablement, device non-compliance, or high-risk signals, the system is not enforcing dynamic policy in the failure window.
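A minimal sketch of how the evidence in variant 1 could be checked mechanically. The function names and the rule that a cache TTL should not exceed the signal freshness window are assumptions for illustration; the skill does not define them today.

```python
import re
from datetime import timedelta

_UNIT_KWARGS = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_age(text: str) -> timedelta:
    """Parse strings such as '15 minutes' or '9 hours' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*(minute|hour|day)s?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return timedelta(**{_UNIT_KWARGS[match.group(2)]: int(match.group(1))})


def cached_allow_findings(evidence: dict) -> list:
    """List continuity problems found in one access path's evidence."""
    findings = []
    max_age = parse_age(evidence["signal_freshness"]["max_allowed_age"])
    signal_age = parse_age(evidence["signal_freshness"]["last_device_signal_age"])
    cache_ttl = parse_age(evidence["failure_mode"]["policy_cache_ttl"])
    if signal_age > max_age:
        findings.append("device signal older than max_allowed_age")
    if cache_ttl > max_age:
        # Assumed rule: cached decisions should not outlive the freshness window.
        findings.append("policy cache TTL exceeds signal freshness window")
    for test_name, continued in evidence["observed_test"].items():
        if continued:
            findings.append(f"access continued after revocation: {test_name}")
    return findings


variant_1 = {
    "signal_freshness": {"max_allowed_age": "15 minutes", "last_device_signal_age": "9 hours"},
    "failure_mode": {"policy_engine_unreachable": "allow existing and new sessions",
                     "policy_cache_ttl": "24 hours"},
    "observed_test": {"disabled_user_continued_access": True,
                      "non_compliant_device_continued_access": True},
}
```

Run against `variant_1`, this sketch reports the stale device signal, the oversized cache TTL, and both failed revocation tests.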

**Missed variant 2: PEP fail-open behavior is undocumented**

```text
Access path: remote contractor -> finance reports
PEP: application proxy
Policy decision source: SaaS IdP
Observed outage behavior: unknown
Cache TTL: unknown
Break-glass boundary: unknown
```

**Why it should be caught:**

A maturity scorecard can claim "Advanced" or "Optimal" because a policy engine and PEP exist, while the real system may allow new or existing sessions when the decision plane is down. The skill should require each critical access path to be recorded as fail-closed, bounded degraded, or Not Evaluable.
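As a rough illustration, evidence written in the `Key: value` form of variant 2 could be scanned for unknowns before any maturity credit is given. The helper names below are hypothetical, and treating an empty value the same as `unknown` is an assumption.

```python
def unknown_fields(evidence_text: str) -> list:
    """Return the keys whose value is empty or 'unknown' in 'Key: value' lines."""
    unknown = []
    for line in evidence_text.strip().splitlines():
        key, _, value = line.partition(":")
        if value.strip().lower() in ("", "unknown"):
            unknown.append(key.strip())
    return unknown


def path_status(evidence_text: str) -> str:
    # Any unknown field keeps the whole access path out of the scored set.
    return "Not Evaluable" if unknown_fields(evidence_text) else "Evaluable"


variant_2 = """
Access path: remote contractor -> finance reports
PEP: application proxy
Policy decision source: SaaS IdP
Observed outage behavior: unknown
Cache TTL: unknown
Break-glass boundary: unknown
"""
```

For `variant_2` the three unknown fields are returned by name, which gives the reviewer a concrete evidence request instead of a silent pass.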

**Missed variant 3: stale trust signals continue granting access**

```yaml
required_signals:
  - MDM compliance
  - EDR health
  - IdP risk
  - threat intelligence
last_successful_update:
  mdm: "22 hours ago"
  edr: "unknown"
stale_signal_action: "continue granting normal access"
```

**Why it should be caught:**

The skill already asks whether device state changes trigger access re-evaluation, but it does not require a continuity table for every trust signal feeding the policy engine. Stale CDM/MDM/EDR/IdP/threat-intel signals can reduce "dynamic policy" to a static cached decision.
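The continuity table mentioned above might look like the following sketch. The function name, the row layout, and the 15-minute threshold (borrowed from variant 1) are assumptions; variant 3 itself does not state a maximum age.

```python
def signal_continuity_table(required, last_update_hours, max_age_hours):
    """Build one (signal, age_hours, status) row per required trust signal."""
    rows = []
    for signal in required:
        # None covers both a missing entry and an explicit "unknown".
        age = last_update_hours.get(signal)
        if age is None:
            status = "not evaluable"
        elif age > max_age_hours:
            status = "stale"
        else:
            status = "fresh"
        rows.append((signal, age, status))
    return rows


required_signals = ["MDM compliance", "EDR health", "IdP risk", "threat intelligence"]
last_update = {"MDM compliance": 22.0, "EDR health": None}  # hours since last update

# Assumed freshness threshold of 0.25 hours (15 minutes).
table = signal_continuity_table(required_signals, last_update, max_age_hours=0.25)
```

With the variant 3 evidence, MDM compliance comes back stale and the other three signals come back not evaluable, because their last update time is unknown or was never collected.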

## Edge Cases

- Some emergency access is appropriate, but it needs scope, approval, duration, alerting, session capture, post-use review, and credential/session rotation evidence.
- Some PEPs can enforce last-known deny/default-deny locally while the PE/PA is down; that should be credited differently from unknown or unbounded cached allow behavior.
- Offline/BYOD scenarios can require degraded access, but privileged actions and sensitive data paths should still fail secure or require explicit emergency workflow.
- A policy engine can be highly available while a required trust-signal source, such as MDM or EDR, is stale or unreachable.
- Long-lived ZTNA tunnels or refresh tokens can bypass risk changes unless revocation propagation and TTL evidence are checked.
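The emergency-access evidence listed in the first bullet can be expressed as a completeness check. This is only a sketch: the key names are hypothetical, and a real skill might accept richer evidence than a truthy value.

```python
# Evidence items named in the first edge case; key names are illustrative.
BREAK_GLASS_EVIDENCE = (
    "scope",
    "approval",
    "duration",
    "alerting",
    "session_capture",
    "post_use_review",
    "rotation",
)


def missing_break_glass_evidence(record: dict) -> list:
    """Return the evidence items that are absent or empty in a break-glass record."""
    return [item for item in BREAK_GLASS_EVIDENCE if not record.get(item)]


# Break-glass block from the benign example in the False Positive Analysis.
benign_break_glass = {
    "scope": "ticketing read/write only",
    "approval": "incident commander plus security lead",
    "alerting": "security channel and SIEM",
    "session_capture": True,
}
```

Under this sketch, the benign break-glass block still lacks duration, post-use review, and rotation evidence, which a reviewer could request without treating the access path as a failure.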

## Remediation Quality

- [x] Fix resolves the vulnerability
- [x] Fix doesn't introduce new security issues
- [x] Fix doesn't break functionality
- **Issues found:** Add a `Policy Decision Continuity and Fail-Secure Behavior` section requiring PE/PA/PEP dependency inventory, failure mode, cache/token TTL, trust-signal freshness, stale-data action, revocation propagation, outage test evidence, and break-glass boundary evidence.

Recommended scoring changes:

1. Cap Identity, Devices, Networks, and Applications & Workloads at Initial when critical access paths cannot show PE/PA/PEP failure-mode evidence.
2. Treat unbounded cached access after deny, revocation, disabled account, or stale high-risk signal as High; raise to Critical for privileged, production, or regulated-data access.
3. Mark outage behavior, cache TTL, stale-signal action, and revocation propagation as Not Evaluable when evidence for them is missing.
4. Do not penalize tightly scoped, time-bounded emergency access with approval, alerting, session capture, post-use review, and rotation evidence.
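The scoring changes above can be sketched as two small functions. The function names, parameter names, and rule ordering are assumptions for illustration; the stage names follow CISA ZTMM v2.0.

```python
STAGES = ["Traditional", "Initial", "Advanced", "Optimal"]  # CISA ZTMM v2.0 stages


def capped_maturity(claimed: str, has_failure_mode_evidence: bool) -> str:
    """Cap a pillar at Initial when failure-mode evidence is missing (change 1)."""
    if has_failure_mode_evidence:
        return claimed
    # Pick whichever of the two stages comes first in the maturity order.
    return min(claimed, "Initial", key=STAGES.index)


def continuity_severity(unbounded_cached_access: bool,
                        sensitive_path: bool,
                        evidence_missing: bool) -> str:
    """Apply changes 2 to 4; sensitive_path means privileged, production, or regulated data."""
    if unbounded_cached_access:
        return "Critical" if sensitive_path else "High"
    if evidence_missing:
        return "Not Evaluable"
    # Bounded, evidenced emergency access falls through without a penalty.
    return "No finding"
```

For example, `capped_maturity("Optimal", False)` yields `"Initial"`, while an observed unbounded cached allow on a production path yields `"Critical"` regardless of what other evidence is missing.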

## Comparison to Other Tools

| Tool / Framework | Catches this? | Notes |
|------|:---:|------|
| NIST SP 800-207 | Partial | Defines PE/PA/PEP, grant/deny/revoke decisions, and dynamic policy, but the local skill needs failure-mode evidence fields. |
| NIST SP 800-53 SA-8 / SC-24 | Partial | Covers secure defaults, secure failure, and fail in known state, but the skill needs to translate those into zero-trust access-path checks. |
| CISA ZTMM v2 | Partial | Visibility, automation, and governance support this, but maturity scoring still needs concrete continuity evidence. |
| Vendor ZTNA dashboards | Partial | Can show component health, but may not prove access behavior during PE/PA/PEP or signal-source outages. |

## Overall Assessment

**Strengths:**

- Strong NIST SP 800-207 and CISA ZTMM pillar structure.
- Good emphasis on dynamic policy, continuous verification, telemetry, and not treating zero trust as a product purchase.
- Useful maturity scorecard and roadmap output.

**Needs improvement:**

- The skill does not require failure-mode evidence for PE/PA/PEP, trust-signal sources, or cached decisions.
- Maturity can be overstated when the decision plane exists but fail-open, stale-signal, or revocation-propagation behavior is unknown.
- Emergency access should be evaluated as bounded continuity evidence, not treated as either automatically safe or automatically noncompliant.

**Priority recommendations:**

1. Add a policy decision continuity matrix for each critical access path.
2. Add `ZT-CONT-*` findings for PE/PA/PEP outage behavior, stale signal handling, cache TTL, revocation propagation, and unbounded degraded modes.
3. Add output fields for failure mode, signal freshness, cache/token TTL, break-glass boundary, last outage test, and maturity impact.
4. Add a small calibration fixture for vulnerable cached-allow and benign bounded-degraded cases.
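The output fields in recommendation 3 could be carried in a record like the one below. This is a hypothetical shape: the class name, field names, and the `ZT-CONT-001` numbering are assumptions that only follow the proposed `ZT-CONT-*` pattern.

```python
from dataclasses import asdict, dataclass


@dataclass
class ContinuityFinding:
    finding_id: str            # proposed ZT-CONT-* pattern; the numbering is hypothetical
    access_path: str
    failure_mode: str
    signal_freshness: str
    cache_token_ttl: str
    break_glass_boundary: str
    last_outage_test: str
    maturity_impact: str


# Example built from the undocumented fail-open variant in Coverage Gaps.
finding = ContinuityFinding(
    finding_id="ZT-CONT-001",
    access_path="remote contractor -> finance reports",
    failure_mode="unknown",
    signal_freshness="unknown",
    cache_token_ttl="unknown",
    break_glass_boundary="unknown",
    last_outage_test="none recorded",
    maturity_impact="Not Evaluable",
)

report_row = asdict(finding)  # plain dict, ready for a scorecard or JSON export
```

Keeping every field mandatory means a reviewer has to write `unknown` explicitly, which makes evidence gaps visible in the report instead of leaving them blank.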

## Sources Checked

- NIST SP 800-207 Zero Trust Architecture: https://csrc.nist.gov/publications/detail/sp/800-207/final
- NIST SP 800-53 Rev. 5 update 1: https://csrc.nist.gov/Pubs/sp/800/53/r5/upd1/Final
- CISA Zero Trust Maturity Model v2.0: https://www.cisa.gov/zero-trust-maturity-model
- Existing related reviews checked: #85, #441, #609, #670, #1005

This review is distinct from #85 because it focuses on fail-secure continuity and outage behavior, not source freshness or pillar evidence generally. It is distinct from #670 because it covers PE/PA/PEP failure modes, cache TTL, stale signals, and revocation propagation across access paths, not only device posture freshness and CAE. It is distinct from #1005 because it focuses on failure-mode evidence, not coverage denominators and exceptions.

## Bounty Info

- [x] I have read and agree to the CONTRIBUTING.md bounty terms
- **Preferred payment method:** Payment details can be provided privately after maintainer acceptance.

