
feat(inspect): SWIP-14 Inspect surface + preflight#1

Merged
wu-sheng merged 3 commits into main from feat/inspect-menu
May 11, 2026
@wu-sheng
Member

Summary

End-to-end binding to SkyWalking's SWIP-14 Inspect API plus the supporting infrastructure (preflight, server-time, MQE-target resolution, source attribution) and a new /inspect page that lets operators browse the metric catalog, pick the entity that holds values, and plot the MQE series — all without leaving Studio.

Two commits:

  • feat(inspect): SWIP-14 Inspect surface + preflight + server-TZ handling — the feature.
  • fix(review): RBAC gate on revertToBundled; wire-log body cap; docker /data ownership + demo selectors — the post-implementation review-pass findings.

What landed

@vantage-studio/api-client

  • InspectClient (GET /inspect/metrics, GET /inspect/entities) + the wire types — MqeEntity / EntityRow / ExpressionResult mirror SkyWalking's GraphQL Entity input and execExpression response so the BFF can paste mqeEntity straight into the query.
  • formatInspectDate / isInspectDate helpers for the three step-specific formats (yyyy-MM-dd / yyyy-MM-dd HH / yyyy-MM-dd HHmm).
  • 34 round-trip tests against the e2e fixtures.
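
The three step-specific date shapes can be illustrated with a small sketch. The helper names come from the bullet above, but the signatures and bodies here are assumptions, not the package's actual code; in particular, formatting from the UTC fields and validating the shape only (not the value ranges) are choices made for the example.

```typescript
// Sketch only: the real helpers live in @vantage-studio/api-client and may differ.
type Step = "DAY" | "HOUR" | "MINUTE";

const pad2 = (n: number): string => (n < 10 ? "0" : "") + String(n);

// Formats the UTC fields of `date`. A caller targeting server-local time
// would shift the instant by the server offset before calling this.
function formatInspectDate(date: Date, step: Step): string {
  const day = [
    String(date.getUTCFullYear()),
    pad2(date.getUTCMonth() + 1),
    pad2(date.getUTCDate()),
  ].join("-");
  if (step === "DAY") return day; // yyyy-MM-dd
  const hour = `${day} ${pad2(date.getUTCHours())}`;
  if (step === "HOUR") return hour; // yyyy-MM-dd HH
  return `${hour}${pad2(date.getUTCMinutes())}`; // yyyy-MM-dd HHmm
}

// Shape check only: digit counts and separators, not calendar validity.
const INSPECT_DATE_SHAPES: Record<Step, RegExp> = {
  DAY: /^\d{4}-\d{2}-\d{2}$/,
  HOUR: /^\d{4}-\d{2}-\d{2} \d{2}$/,
  MINUTE: /^\d{4}-\d{2}-\d{2} \d{4}$/,
};

function isInspectDate(value: string, step: Step): boolean {
  return INSPECT_DATE_SHAPES[step].test(value);
}
```

A pre-validation step in front of /inspect/entities could call `isInspectDate` with the request's step and reject mismatches before the request reaches OAP.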

BFF (apps/bff/src/)

| Route | Purpose |
| --- | --- |
| GET /api/inspect/metrics | proxy of admin /inspect/metrics. |
| GET /api/inspect/entities | proxy with date / limit pre-validation. |
| GET /api/inspect/catalog | metrics + Studio-derived rule attribution (OAL / MAL·OTEL / MAL·Telegraf / LAL→MAL / unknown). |
| GET /api/inspect/mqe-target | resolved GraphQL base via /debugging/config/dump. studio.yaml's oap.mqe.{host,port} (each optional) overrides the discovered values — covers k8s setups where admin and REST land on different ingresses. |
| POST /api/inspect/exec | proxies query execExpression against the resolved base. |
| GET /api/inspect/server-time | caches OAP's getTimeInfo. SPA uses the offset to display browser-local dates while sending server-TZ strings on the wire. |
| GET /api/preflight | per-module enablement report (admin-server, receiver-runtime-rule, dsl-debugging, inspect). |

inspect:read is the new RBAC verb gating every /api/inspect/* route.
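
The mqe-target resolution described above (discovered values, each overridable on its own) could look roughly like the sketch below. The `Discovered` shape is hypothetical: it stands in for whatever the BFF extracts from /debugging/config/dump, whose real layout is not shown in this PR. Treating a missing or zero sharing-server port as "not bound" is also an assumption.

```typescript
// Hypothetical, pre-extracted view of the config dump (not the real dump format).
interface Discovered {
  sharingRestPort?: number;
  coreRestPort?: number;
  boundHost?: string;
}

// Mirrors studio.yaml's oap.mqe.{host,port}; both fields are optional.
interface MqeOverride {
  host?: string;
  port?: number;
}

interface MqeTarget {
  host: string;
  port: number;
}

function resolveMqeTarget(
  adminHost: string,
  found: Discovered,
  override: MqeOverride = {},
): MqeTarget {
  // Assumption: sharing-server's port wins when set, otherwise core's port applies.
  const discoveredPort = found.sharingRestPort || found.coreRestPort;
  // A wildcard bind address is not routable, so fall back to the admin host.
  const bound = found.boundHost;
  const discoveredHost =
    bound !== undefined && bound !== "" && bound !== "0.0.0.0" ? bound : adminHost;

  // Each override replaces only its own piece.
  const host = override.host ?? discoveredHost;
  const port = override.port ?? discoveredPort;
  if (port === undefined) {
    throw new Error("no REST port discovered and no oap.mqe.port override");
  }
  return { host, port };
}
```

Keeping host and port independent means an operator who only needs a different ingress hostname can set `oap.mqe.host` and still rely on the discovered port.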

SPA (/inspect)

  • Widget board, 1 / 3 / 5 per row (default 3), per-card chart toggle (line / bar / area).
  • Catalog drawer: file tree + alphabetical metric list, "+ all" per-file shortcut, "select all visible" / "clear" breadcrumb actions.
  • Per-widget entity editor: scope-aware form fields (the metric's scope is fixed by /inspect/metrics, only the relevant name fields are exposed), multi-select over resolved entities, custom-entity add.
  • Preset range: last 10m (default, MINUTE) / last 5h (HOUR) / last 2d (DAY) — preset sets step + range together.
  • Bucket-count guardrail before OAP's DurationUtils.MAX_TIME_RANGE = 500 cap: widget refuses to fire and shows an actionable message instead of letting OAP 502.
  • localStorage persistence of the board layout (widgets, selected entities, custom entities, chart kind, density, preset). Reset button clears widgets + storage.
  • ECharts with dispose-on-host-detach (otherwise stale instances kept their canvas after the v-if cycled through loading/error and entity changes drew nothing).
  • Cluster status (the landing page) grows a REQUIRED MODULES table fed by /api/preflight — actionable "OAP-side selectors Studio needs" instead of a header chip + modal.
  • Catalog.vue and OalCatalog.vue grow explicit refresh buttons.
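
The bucket-count guardrail from the list above can be sketched as a pure check that runs before a widget fires its query. The 500 cap is the figure quoted in this PR; the function names, the message text, and the inclusive bucket count (both endpoints counted) are assumptions for illustration.

```typescript
type Step = "MINUTE" | "HOUR" | "DAY";

const STEP_MS: Record<Step, number> = {
  MINUTE: 60000,
  HOUR: 3600000,
  DAY: 86400000,
};

const MAX_BUCKETS = 500; // cap quoted in the PR text

type GuardResult =
  | { ok: true; buckets: number }
  | { ok: false; buckets: number; message: string };

// Assumption: both endpoints count as buckets, hence the +1.
function bucketCount(startMs: number, endMs: number, step: Step): number {
  return Math.floor((endMs - startMs) / STEP_MS[step]) + 1;
}

function checkRange(startMs: number, endMs: number, step: Step): GuardResult {
  if (endMs < startMs) {
    return { ok: false, buckets: 0, message: "end is before start" };
  }
  const buckets = bucketCount(startMs, endMs, step);
  if (buckets > MAX_BUCKETS) {
    return {
      ok: false,
      buckets,
      message:
        `${buckets} ${step} buckets exceeds the ${MAX_BUCKETS}-bucket cap; ` +
        "pick a coarser step or a shorter range",
    };
  }
  return { ok: true, buckets };
}
```

With this shape, the three presets stay far below the cap, while a two-day range at MINUTE step is refused client-side with a message the operator can act on.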

Docs

  • docs/inspect.md — operator guide.
  • docs/install.md — lists all four required SW_* selectors with a "what breaks if missing" table; aligns the demo image tag with apache/skywalking-oap-server:admin-server.
  • docs/configure.md — oap.mqe.* schema fields; inspect:read in the verb table.
  • docs/auth.md — inspect:read in the verb table + role examples.

Review-pass fixes (second commit)

| # | Severity | What | Where |
| --- | --- | --- | --- |
| 1 | high | revertToBundled now requires rule:write:structural (was rule:delete). Audit log carries the actually-checked verb. New regression test asserts a role with only rule:delete gets 403. | apps/bff/src/oap/routes.ts |
| 2 | high | docker-compose demo oap service was missing three of the four required selectors. Added SW_ADMIN_SERVER, SW_DSL_DEBUGGING, SW_INSPECT; aligned the image tag with install.md. | deploy/docker/docker-compose.yml |
| 3 | high | pnpm format:check was failing on the new files. Ran prettier across the tree. | many |
| 4 | medium | /api/inspect/server-time GraphQL fetch now honours oap.timeoutMs via AbortController (matches the rest of the OAP-bound calls). | apps/bff/src/oap/server-time.ts |
| 5 | medium | wire/fetch.ts no longer buffers the entire response body before truncating: streamy content types are skipped, Content-Length > max*4 returns a header-only marker, and the cloned reader bails after max*4 bytes via reader.cancel(). | apps/bff/src/wire/fetch.ts |
| 6 | medium | Dockerfile pre-creates /data with nonroot (65532:65532) ownership in the builder stage and copies it into the runtime image. Docker propagates that to the named volume on first mount, so the BFF can seed studio.yaml / audit.jsonl without an operator-side chown. | deploy/docker/Dockerfile |
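
The capped read from the wire/fetch.ts fix can be sketched as follows. This is not the code in wire/fetch.ts: `ByteReader` is a hypothetical minimal interface shaped like the reader a cloned response body would provide, and `readCapped` is an illustrative name. It shows the core idea only, reading up to a byte cap and then cancelling instead of draining the stream.

```typescript
// Minimal structural stand-in for a stream reader (assumption for this sketch).
interface ByteReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(): Promise<void>;
}

interface CappedBody {
  bytes: Uint8Array;
  capped: boolean; // true when reading stopped at the cap
}

function concat(chunks: Uint8Array[], total: number): Uint8Array {
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

async function readCapped(reader: ByteReader, capBytes: number): Promise<CappedBody> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  while (total < capBytes) {
    const { done, value } = await reader.read();
    if (done || value === undefined) {
      // Stream ended before the cap was reached: nothing to cancel.
      return { bytes: concat(chunks, total), capped: false };
    }
    const room = capBytes - total;
    const piece = value.length > room ? value.subarray(0, room) : value;
    chunks.push(piece);
    total += piece.length;
  }
  // Cap reached: stop pulling from upstream rather than buffering the rest.
  await reader.cancel();
  return { bytes: concat(chunks, total), capped: true };
}
```

Because only the clone's reader is cancelled, the caller's own copy of the response stream is left alone, which matches the behaviour the fix describes.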

Test plan

  • pnpm lint
  • pnpm format:check
  • pnpm -F @vantage-studio/{ui,bff,api-client} typecheck
  • pnpm test — 140 BFF + 63 UI + 34 api-client tests green.
  • pnpm -F @vantage-studio/ui build
  • pnpm -F @vantage-studio/bff build
  • Live verification against a local OAP container (SW_ADMIN_SERVER=default + the three SWIP-13/14 selectors): /inspect resolves 1746 metrics with 477 attributed to OAL + 1150 to MAL·OTEL + 4 to LAL→MAL + 115 unknown; service_cpm for e2e-service-provider returns a non-empty TIME_SERIES_VALUES; preflight reports all four modules enabled.
  • Verify the demo docker compose up brings up Studio + OAP + BanyanDB with every required selector wired (smoke against the updated compose file).

wu-sheng added 2 commits May 11, 2026 21:13
Adds Studio's binding to SkyWalking's SWIP-14 Inspect API along with
the supporting infrastructure (preflight, server-time, MQE-target
resolution) and an end-to-end Inspect page on /inspect.

Why: SWIP-13's runtime-rule editor only answers "which rules are
loaded". The natural follow-up — "which metrics did that rule produce
and which entities are emitting values right now" — required
operators to drop into MQE by hand. SWIP-14 exposes the metric
catalog + entity enumeration on admin-server (port 17128); this
change wires it into Studio so operators can browse, pick an entity,
and plot the MQE series without leaving the UI.

Wire types (@vantage-studio/api-client):
  - InspectClient (GET /inspect/metrics + /inspect/entities)
  - MqeEntity / EntityRow / ExpressionResult mirror SkyWalking's
    GraphQL Entity input + execExpression response, so the BFF
    can paste mqeEntity straight into the query
  - formatInspectDate / isInspectDate helpers for the step-specific
    yyyy-MM-dd / yyyy-MM-dd HH / yyyy-MM-dd HHmm shapes OAP parses
  - 34 round-trip tests against the e2e fixtures

BFF (apps/bff/src/):
  - oap/inspect-routes.ts — /api/inspect/{metrics,catalog,entities,
    mqe-target,server-time,exec}; all gated on a new `inspect:read`
    verb; 404s on /inspect/* promote to `inspect_not_enabled`
  - oap/inspect-exec.ts — POST /api/inspect/exec proxies
    `query execExpression` against the resolved MQE base (corrected
    from the initial `mutation` — execExpression lives on Query in
    metrics-v3.graphqls)
  - oap/mqe-target.ts — resolves the GraphQL base via
    /debugging/config/dump: sharing-server.restPort → core.restPort,
    host falls back to the admin URL when the bound host is 0.0.0.0.
    studio.yaml `oap.mqe.{host,port}` (both independently optional)
    overrides either piece — k8s setups where admin and REST land on
    different hostnames.
  - oap/server-time.ts — caches `getTimeInfo` from OAP's GraphQL.
    The SPA uses the offset to render dates in browser-local time
    while sending server-TZ strings on the wire. Accepts both the
    legacy integer (`800`) and current string (`"+0800"`/`"-05:00"`)
    timezone shapes; falls back to BFF local clock when OAP is
    unreachable. AbortController honours oap.timeoutMs.
  - oap/preflight.ts + preflight-routes.ts — /api/preflight reads
    /debugging/config/dump and reports which of Studio's four
    required selectors (admin-server, receiver-runtime-rule,
    dsl-debugging, inspect) are loaded.
  - inspect/attribution.ts + parser-oal.ts + parser-mal.ts —
    /api/inspect/catalog joins /inspect/metrics with a metric-name →
    {source,file} index built from /runtime/oal/files +
    /runtime/rule/list + /runtime/rule/bundled per MAL catalog. Best
    effort per side: when SW_RECEIVER_RUNTIME_RULE is off, the
    attribution gracefully degrades to source: "unknown" rather than
    failing the whole catalog merge.
  - config/schema.ts — `oap.mqe.{host,port}` schema additions.
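
The two timezone shapes that server-time.ts accepts can be normalised to a minute offset along these lines. `parseOffsetMinutes` is a hypothetical name, and the reading of the legacy integer as hours times 100 plus minutes (so `800` means +08:00) is an assumption drawn from the examples in the bullet above.

```typescript
// Returns the offset east of UTC in minutes, or undefined when unparseable.
function parseOffsetMinutes(tz: number | string): number | undefined {
  if (typeof tz === "number") {
    // Legacy shape: 800 means +08:00, -530 means -05:30 (assumed encoding).
    if (!Number.isInteger(tz)) return undefined;
    const abs = Math.abs(tz);
    return toMinutes(tz < 0 ? -1 : 1, Math.floor(abs / 100), abs % 100);
  }
  // Current shape: "+0800" or "-05:00" (sign, two-digit hours, optional colon).
  const m = /^([+-])(\d{2}):?(\d{2})$/.exec(tz.trim());
  if (m === null) return undefined;
  return toMinutes(m[1] === "-" ? -1 : 1, Number(m[2]), Number(m[3]));
}

function toMinutes(sign: number, hours: number, minutes: number): number | undefined {
  // Sanity bound: +/-18:00 is the widest offset java.time.ZoneOffset allows.
  if (hours > 18 || minutes > 59) return undefined;
  return sign * (hours * 60 + minutes);
}
```

An `undefined` result leaves the fallback decision to the caller, for example using the BFF's local clock as the commit message describes for the unreachable-OAP case.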

SPA (apps/ui/src/views/Inspect.vue + supporting):
  - New /inspect route with widget board (1 / 3 / 5 per row,
    default 3) and per-card chart toggle (line / bar / area).
  - Catalog drawer (two-pane): file tree on the left, alphabetical
    metric list on the right, "+ all" per-file shortcut, "select
    all visible" / "clear" breadcrumb actions.
  - Per-widget entity editor: scope-aware form fields (not JSON
    paste — the metric's scope is fixed by the catalog, only the
    relevant name fields are exposed), multi-select over resolved
    entities, custom-entity add for hand-built MQE Entities.
  - Preset row: last 10m (default, step MINUTE) / 5h (HOUR) / 2d
    (DAY); preset selection sets step + range together.
  - Bucket-count guardrail: OAP's DurationUtils caps at 500 buckets
    per query, so widgets refuse to fire when start/end/step would
    exceed that — operator gets an actionable message instead of a
    502.
  - Browser-local date inputs; the BFF gets server-TZ strings via
    formatForServer(date, step, offsetMinutes). Mirrors the
    skywalking-booster-ui pattern but at minute precision.
  - localStorage persistence of the board layout (widgets, selected
    entities, custom entities, chart kind, density, preset). Reset
    button clears widgets + storage. Hydration uses watchEffect so
    a vue-query-cached catalog on re-entry still triggers the
    restore.
  - Cluster status (the landing page) grows a "Required modules"
    table fed by /api/preflight. Replaces the header chip + modal
    drafts.
  - Catalog.vue and OalCatalog.vue get explicit refresh buttons.
  - SPA-side BffClient methods + InspectCatalogResponse /
    InspectServerTimeResponse / PreflightResponse types.
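
The browser-local to server-TZ conversion mentioned in the list above can be sketched like this. The signature `formatForServer(date, step, offsetMinutes)` is taken from the commit message; the body is an assumption showing one common way to do it, shifting the instant by the server offset and then reading the UTC fields.

```typescript
type Step = "DAY" | "HOUR" | "MINUTE";

// offsetMinutes is the server's offset east of UTC, e.g. 480 for +08:00.
function formatForServer(date: Date, step: Step, offsetMinutes: number): string {
  // Shift the instant so that its UTC fields read as the server's wall clock.
  const shifted = new Date(date.getTime() + offsetMinutes * 60000);
  const iso = shifted.toISOString(); // e.g. "2026-05-11T21:07:00.000Z"
  const day = iso.slice(0, 10);
  if (step === "DAY") return day;
  const hour = iso.slice(11, 13);
  if (step === "HOUR") return `${day} ${hour}`;
  return `${day} ${hour}${iso.slice(14, 16)}`;
}
```

The browser's own timezone never enters the calculation: the `Date` carries an absolute instant, so only the server offset decides which wall-clock string goes on the wire.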

Config / docs:
  - docs/inspect.md — operator guide.
  - docs/install.md — lists all four required SW_* selectors with a
    "what breaks if missing" table; aligns the demo image tag with
    apache/skywalking-oap-server:admin-server.
  - docs/configure.md — `oap.mqe.*` schema fields, `inspect:read`
    in the verb table.
  - docs/auth.md — `inspect:read` added to the verb table + role
    examples.

Build / CI:
  - prettier format pass on every new file + a handful of touched
    pre-existing ones (CI runs format:check before lint).
  - 117 BFF + 63 UI + 34 api-client tests green; lint clean;
    typecheck clean across all three workspaces.
…/data ownership + demo selectors

Five fixes surfaced by the post-implementation review pass; four land in this commit:

1. `mode=revertToBundled` on /api/rule/delete is a structural change
   (storage-identity flip — same write-class /api/rule already gates
   on `rule:write:structural` for `allowStorageChange=true` /
   `force=true`). The handler used to pick `rule:delete` for every
   mode, so a caller with only `rule:delete` could call the revert
   path directly and have it logged as `rule:delete` in audit. Now
   the handler picks `rule:write:structural` when mode is
   revertToBundled and `rule:delete` for the default mode; the audit
   record carries the actually-checked verb. Added a regression test
   that asserts a `reader` role with rule:delete (but no structural)
   gets 403.

2. wire/fetch.ts used to `await cloned.text()` on every response —
   reading the full body before truncating to maxBodyChars. A
   multi-MB /api/dump response would buffer entirely in BFF memory
   on every call. Replaced with a streaming reader that:
     - skips body capture for known-streamy content types
       (application/x-yaml, application/octet-stream, multipart/...,
       compression formats);
     - returns a "<N-byte response — capped>" marker when the
       upstream Content-Length is already past max*4;
     - reads at most max*4 bytes from the cloned body and aborts
       early via reader.cancel().
   The caller's response stream is untouched.

3. /api/inspect/server-time GraphQL fetch now honours
   oap.timeoutMs via AbortController, matching every other
   OAP-bound BFF call. Previously a hung getTimeInfo call could stay
   pending indefinitely.

4. Dockerfile pre-creates /data with nonroot (65532:65532)
   ownership in the builder stage and `COPY --from=builder
   /seed-data /data` into the runtime image. Docker propagates the
   image's directory ownership to a named volume on first mount, so
   the BFF can write to studio.yaml / audit.jsonl without an
   operator-side chown. The `.keep` file inside is required —
   Docker skips entirely empty directories during the seed.

5. deploy/docker/docker-compose.yml's `oap` service was missing
   three of the four selectors install.md documents as required.
   Added SW_ADMIN_SERVER, SW_DSL_DEBUGGING, SW_INSPECT alongside the
   existing SW_RECEIVER_RUNTIME_RULE. The image tag is also aligned
   with install.md (`:admin-server`).
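
The verb selection from fix 1 can be illustrated with a small sketch. The verb names and the `revertToBundled` mode come from the commit message; the function names, the audit record shape, and the use of a plain set of granted verbs are assumptions, not the handler in apps/bff/src/oap/routes.ts.

```typescript
type DeleteVerb = "rule:delete" | "rule:write:structural";

interface AuditEntry {
  verb: DeleteVerb; // the verb that was actually checked
  allowed: boolean;
}

// Reverting to the bundled rule is treated as a structural write.
function verbForDelete(mode?: string): DeleteVerb {
  return mode === "revertToBundled" ? "rule:write:structural" : "rule:delete";
}

function authorizeDelete(
  granted: ReadonlySet<string>,
  mode?: string,
): { status: 200 | 403; audit: AuditEntry } {
  const verb = verbForDelete(mode);
  const allowed = granted.has(verb);
  // The audit entry records the checked verb, not a fixed "rule:delete".
  return { status: allowed ? 200 : 403, audit: { verb, allowed } };
}
```

A role that holds only `rule:delete` therefore passes the default mode but is rejected on the revert path, which is the case the regression test covers.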

Fix 3 (server-time timeout) lands in the inspect commit above
because it's strictly internal to the new feature.

Tests: 140 BFF + 63 UI + 34 api-client all green; format:check
clean; eslint clean; typecheck clean.
@wu-sheng wu-sheng added the enhancement New feature or request label May 11, 2026
The "live · 5s" indicator on cluster status was hardcoded — an
operator who wanted a tighter pulse during a flap or a looser one
during a quiet period had to live with it. Adds a small dropdown
next to the indicator with off / 5s / 15s / 60s options. Selection
is persisted per-browser (localStorage `vs:cluster:poll:v1`); falls
back to 5s when no preference exists.

Wiring uses vue-query's function-form `refetchInterval`, so the
dropdown takes effect on the next tick without remounting the
queries. Applies to both cluster-state and dsl-debugging-status
panes (they share a cadence by design — both are "what's happening
right now" views). Preflight is unchanged — that one stays at 30s
because its underlying data (which OAP modules are loaded) only
flips on a restart.

When `off` is selected the indicator switches from "live · 5s" to
"manual" with a dimmed dot, and the "refresh now" button is still
the explicit escape hatch.
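
The dropdown-to-interval mapping described above can be sketched with a few pure helpers. The storage key and the four options come from the commit message; the helper names and the idea of storing the option label as the persisted value are assumptions. The `KeyValueStore` interface is a stand-in for `localStorage` so the sketch runs outside a browser.

```typescript
type PollChoice = "off" | "5s" | "15s" | "60s";

const POLL_STORAGE_KEY = "vs:cluster:poll:v1";

const POLL_MS: Record<PollChoice, number | false> = {
  off: false,
  "5s": 5000,
  "15s": 15000,
  "60s": 60000,
};

// Minimal read-only view of localStorage (assumption for testability).
interface KeyValueStore {
  getItem(key: string): string | null;
}

// Falls back to 5s when nothing valid is stored.
function loadPollChoice(store: KeyValueStore): PollChoice {
  const raw = store.getItem(POLL_STORAGE_KEY);
  return raw === "off" || raw === "5s" || raw === "15s" || raw === "60s" ? raw : "5s";
}

// Value suitable for a query library's polling option: milliseconds, or false to disable.
function refetchIntervalFor(choice: PollChoice): number | false {
  return POLL_MS[choice];
}

function indicatorLabel(choice: PollChoice): string {
  return choice === "off" ? "manual" : `live · ${choice}`;
}
```

Passing something like `() => refetchIntervalFor(choice.value)` as the function-form `refetchInterval` would let a change to the reactive `choice` take effect without remounting the query, which is the behaviour the commit message describes.
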
@wu-sheng wu-sheng merged commit 572a546 into main May 11, 2026
2 checks passed
@wu-sheng wu-sheng deleted the feat/inspect-menu branch May 11, 2026 13:42