test: e2e tests harness + simple_prompt scenario #173

Open
luca-iachini wants to merge 67 commits into main from fir-368-e2e-tests

@luca-iachini luca-iachini commented Jun 19, 2026

Contributor

Summary

Black-box E2E harness validating the OpenFirma enforcement boundary against real coding agents (Claude Code + Codex CLI).

  • tests/e2e/ — new test target, hooked into the firma crate via [[test]] (no separate workspace member).
  • Each scenario runs two phases:
    1. Baseline — agent runs directly (no firma). Confirms it can complete the task + reach the mock server unconfined.
    2. Enforcement — agent runs under firma run. Confirms the expected ALLOW/DENY outcome + correct audit events.
  • Agent traffic is intercepted with a wiremock MockServer; LLM provider calls are stubbed so tests need no live API.
  • Audit trail parsed from the sidecar's JSONL log and asserted via insta snapshots (per agent + scenario).
  • scenario_tests! macro generates one #[tokio::test] per (agent, scenario); all #[ignore] — run with --include-ignored.
  • One scenario wired: simple_prompt (greeting → LLM provider → ALLOW). Remaining scenarios land in the follow-up PR.
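The per-(agent, scenario) test generation mentioned in the list above could look roughly like the macro below. This is a minimal sketch, not the PR's actual scenario_tests! implementation: the invocation syntax and the run_case and case_label helpers are assumptions, and it emits plain #[test] functions instead of #[tokio::test] so that it compiles without extra dependencies.

```rust
/// Builds the `agent::scenario` label that doubles as the test path.
/// (Hypothetical helper, not taken from the PR.)
pub fn case_label(agent: &str, scenario: &str) -> String {
    format!("{agent}::{scenario}")
}

/// Stand-in for the real harness entry point (assumed name and signature).
#[allow(dead_code)]
fn run_case(agent: &str, scenario: &str) {
    println!("running {}", case_label(agent, scenario));
}

/// Expands to one module per agent and one ignored test per scenario.
macro_rules! scenario_tests {
    ($($agent:ident => [$($scenario:ident),* $(,)?]),* $(,)?) => {
        $(
            mod $agent {
                $(
                    #[test]
                    #[ignore = "spawns a real agent; run with --include-ignored"]
                    fn $scenario() {
                        super::run_case(stringify!($agent), stringify!($scenario));
                    }
                )*
            }
        )*
    };
}

// One test per (agent, scenario) pair: claude::simple_prompt, codex::simple_prompt.
scenario_tests! {
    claude => [simple_prompt],
    codex => [simple_prompt],
}
```

With this shape the generated test paths are `claude::simple_prompt` and `codex::simple_prompt`, so a name filter such as `claude::` selects every scenario for a single agent.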

Modules

| File | Role |
| --- | --- |
| setup.rs | ScenarioSetup builder, FirmaConfigBuilder, git workspace init, mock server |
| scenario.rs | EnforcementScenario trait + PhaseOutput |
| runner.rs | run_scenario — drives baseline + enforcement phases |
| agent.rs | AgentKind (Claude/Codex), spawn args |
| audit.rs | FirmaAuditTrail — parse JSONL audit log, snapshot assertions |
| config.rs / policy.rs | firma config + Cedar policy builders for scenarios |
| scenarios/simple_prompt.rs | the simple_prompt scenario |
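To make the division of labour between scenario.rs and runner.rs concrete, here is a rough sketch of a two-phase runner. The type names (AgentKind, EnforcementScenario, PhaseOutput, run_scenario) come from the table above, but every field, method, and signature is an assumption made for illustration, as are the Decision enum and the CannedScenario stub. The real harness also asserts on audit events, which this sketch leaves out.

```rust
/// Agents the harness can drive (variant names follow the table above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentKind { Claude, Codex }

/// Outcome the enforcement phase is expected to produce (assumed type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision { Allow, Deny }

/// What one phase observed. The fields are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PhaseOutput {
    pub task_completed: bool,
    pub reached_mock_server: bool,
}

pub trait EnforcementScenario {
    fn name(&self) -> &'static str;
    fn expected(&self) -> Decision;
    /// Runs the agent once; `under_firma` selects baseline vs enforcement.
    fn run_phase(&self, agent: AgentKind, under_firma: bool) -> PhaseOutput;
}

pub fn run_scenario(scenario: &dyn EnforcementScenario, agent: AgentKind) -> Result<(), String> {
    // Phase 1 (baseline): the unconfined agent must finish and reach the mock server.
    let baseline = scenario.run_phase(agent, false);
    if !baseline.task_completed || !baseline.reached_mock_server {
        return Err(format!("{}: baseline failed for {:?}", scenario.name(), agent));
    }
    // Phase 2 (enforcement): compare the observed outcome with the expected one.
    let enforced = scenario.run_phase(agent, true);
    let observed = if enforced.reached_mock_server { Decision::Allow } else { Decision::Deny };
    if observed != scenario.expected() {
        return Err(format!(
            "{}: expected {:?}, observed {:?}",
            scenario.name(), scenario.expected(), observed
        ));
    }
    Ok(())
}

/// Canned scenario used only to exercise the runner (hypothetical).
pub struct CannedScenario {
    pub expected: Decision,
    pub blocked_under_firma: bool,
}

impl EnforcementScenario for CannedScenario {
    fn name(&self) -> &'static str { "canned" }
    fn expected(&self) -> Decision { self.expected }
    fn run_phase(&self, _agent: AgentKind, under_firma: bool) -> PhaseOutput {
        // Only the enforcement phase can be blocked; baseline always succeeds.
        let blocked = under_firma && self.blocked_under_firma;
        PhaseOutput { task_completed: !blocked, reached_mock_server: !blocked }
    }
}
```

Running the baseline first means a DENY in the enforcement phase can be attributed to firma rather than to a broken agent or mock setup.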

Supporting changes

  • e2e-tests.yml CI: matrix ubuntu-latest (bwrap) × macos-latest (vz) × claude + codex; on v*.*.* tags + workflow_dispatch; actions pinned to commit SHAs.
  • nextest e2e profile + make e2e entry point; builds firma (debug) unless FIRMA_BIN is set.
  • firma-run fixes for spawned sidecar config: pin ca.dir to marker dir, strip TLS/ephemeral port in resolve_persisted_paths, wrap authority config in [authority].
  • normalizer: classify *.chatgpt.com subdomains as communication.external.send.
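As an illustration of the normalizer change in the last bullet, a suffix match on a label boundary might look like the sketch below. The function name classify_host and its return type are assumptions, not the normalizer's real API. The sketch deliberately matches subdomains only, since that is all the description states.

```rust
/// Action class named in the PR description.
pub const EXTERNAL_SEND: &str = "communication.external.send";

/// Returns the action class for hosts under `*.chatgpt.com`, or `None`.
/// (Hypothetical signature; the real normalizer may differ.)
pub fn classify_host(host: &str) -> Option<&'static str> {
    const SUFFIX: &str = ".chatgpt.com";
    // Normalise case and drop a trailing root dot before comparing.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    // Requiring the leading dot keeps look-alikes such as `evilchatgpt.com` out;
    // the length check rejects a bare `.chatgpt.com` with no subdomain label.
    if host.len() > SUFFIX.len() && host.ends_with(SUFFIX) {
        Some(EXTERNAL_SEND)
    } else {
        None
    }
}
```

Whether the apex domain chatgpt.com should also be classified is not stated in the description, so the sketch leaves it unmatched.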

Run

make e2e

# single agent / scenario
cargo nextest run -p firma --test e2e --profile e2e -E 'test(claude::)'
cargo test --test e2e -- 'claude::simple_prompt' --include-ignored

Prerequisites

  • At least one agent on PATH: claude or codex
  • bwrap on Linux; vz sandbox on macOS (OS-provided)

Add the full integration test infrastructure: harness, config, audit
utilities, CI workflow, and supporting crate changes. Wire up one
scenario (normal_llm_call) to validate the end-to-end flow before
the remaining scenarios land in the follow-up PR.

Add 7 scenarios covering the key enforcement policies:
block_paste_service, block_unlisted_host, tool_call_exfil,
direct_tcp_bypass, fs_read_deny, fs_delete_deny, code_fibonacci.

supervisor writes flat AuthorityConfig TOML; firma authority --config
calls load_section(..., "authority") which expects a section wrapper.

Per-run authority always runs plaintext on loopback. User config may
have TLS cert paths and a fixed listen_addr; carrying those into the
spawned process causes FRAME_SIZE_ERROR (h2c client vs TLS server).
Clear tls config and select an ephemeral loopback port up front.
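
The "clear tls config and select an ephemeral loopback port" step can be sketched with the standard library alone. The struct fields and the function name for_spawned_authority below are assumptions made for illustration, not the shape of the real AuthorityConfig or of resolve_persisted_paths.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddr, TcpListener};
use std::path::PathBuf;

/// TLS material a user config might carry (assumed shape).
#[derive(Debug, Clone)]
pub struct TlsPaths {
    pub cert: PathBuf,
    pub key: PathBuf,
}

/// Reduced stand-in for the authority config (assumed fields).
#[derive(Debug, Clone)]
pub struct AuthorityConfig {
    pub listen_addr: SocketAddr,
    pub tls: Option<TlsPaths>,
}

/// Derives the config for a per-run authority: plaintext, loopback, fresh port.
pub fn for_spawned_authority(user: &AuthorityConfig) -> io::Result<AuthorityConfig> {
    // Binding port 0 asks the OS for a currently free port on 127.0.0.1.
    let probe = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let listen_addr = probe.local_addr()?;

    let mut cfg = user.clone();
    cfg.tls = None; // the spawned authority speaks plaintext (h2c)
    cfg.listen_addr = listen_addr;
    // `probe` is dropped on return, which releases the port for the child.
    Ok(cfg)
}
```

Probing and then releasing the port leaves a small window in which another process could claim it. That is usually acceptable in a test harness, but it is a trade-off of this sketch.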

Default ca.dir is "./firma-ca/", resolved relative to the sidecar's CWD
(firma run's CWD), but sidecar_trust_env_overrides expects firma-ca.crt
at <marker_dir>/firma-ca/firma-ca.crt. Because of this path mismatch the
cert was never found, the env vars were not injected into the agent, and
the agent rejected the MITM CA with x509 unknown authority.
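
The mismatch is easy to reproduce with plain path arithmetic. In the sketch below all three function names are hypothetical; only the "./firma-ca/" default and the <marker_dir>/firma-ca/firma-ca.crt location come from the commit message.

```rust
use std::path::{Path, PathBuf};

/// Resolves a configured `ca.dir` the way a process would: relative to `base`.
pub fn resolve_ca_dir(ca_dir: &Path, base: &Path) -> PathBuf {
    if ca_dir.is_absolute() {
        ca_dir.to_path_buf()
    } else {
        base.join(ca_dir)
    }
}

/// Location where the trust-env logic looks for the CA certificate.
pub fn expected_cert_path(marker_dir: &Path) -> PathBuf {
    marker_dir.join("firma-ca").join("firma-ca.crt")
}

/// The fix: pin `ca.dir` to a directory under the marker dir.
pub fn pinned_ca_dir(marker_dir: &Path) -> PathBuf {
    marker_dir.join("firma-ca")
}
```

With the default, the certificate is written under the current working directory. With the pinned value, the write location and the lookup location agree regardless of where firma run was started.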

Remaining scenarios land on fir-368-integration-tests.

firma run hard-errors without structural network enforcement unless
this flag is set. Needed on macOS and Linux without bwrap.

Kill the process and collect buffered stdout/stderr instead of
returning empty strings, making timeout failures debuggable.
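
One way to get this behaviour with only the standard library is to drain both pipes on background threads and kill the child at the deadline. This is a sketch under assumptions: run_with_timeout and Captured are hypothetical names, and the harness itself may well use async process handling instead.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Output collected from a child, whether or not it finished in time.
pub struct Captured {
    pub timed_out: bool,
    pub stdout: String,
    pub stderr: String,
}

/// Reads a pipe to EOF on its own thread so the child never blocks on a full pipe.
fn drain<R: Read + Send + 'static>(mut pipe: R) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = pipe.read_to_end(&mut buf); // keep whatever was read on error
        buf
    })
}

pub fn run_with_timeout(mut cmd: Command, timeout: Duration) -> io::Result<Captured> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let out = drain(child.stdout.take().expect("stdout was piped"));
    let err = drain(child.stderr.take().expect("stderr was piped"));

    let deadline = Instant::now() + timeout;
    let timed_out = loop {
        if child.try_wait()?.is_some() {
            break false; // exited on its own
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // ignore "already exited" races
            child.wait()?; // reap the child so the pipes close
            break true;
        }
        thread::sleep(Duration::from_millis(20));
    };

    // The reader threads finish once the pipes hit EOF.
    let text = |h: thread::JoinHandle<Vec<u8>>| {
        String::from_utf8_lossy(&h.join().unwrap_or_default()).into_owned()
    };
    Ok(Captured { timed_out, stdout: text(out), stderr: text(err) })
}
```

A caveat of this approach: if the child has spawned grandchildren that inherited the pipes, the reader threads only finish once those processes exit as well, so a real harness may need to kill the whole process group.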

Snapshot dynamic fields (ids, timestamps, latency) so failures show
a structured diff of the full audit event rather than a bare string.

- .config/nextest.toml: e2e profile builds firma automatically unless
  FIRMA_BIN is set to a prebuilt binary
- Makefile: add `make e2e` target
- README: drop firma binary prereq (handled by nextest), update run
  commands to use nextest

codecov Bot commented Jun 20, 2026

Codecov Report

❌ Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/firma-run/src/sidecar/config.rs | 55.00% | 5 Missing and 4 partials ⚠️ |

@LukeMathWalker

Contributor

Ahead of merging, let's make sure that it succeeds in CI via workflow dispatch.


2 participants