Skip to content

perf(audit): re-anchor S3 audit export on enable + opt-in backfill#9818

Merged
rubenfiszel merged 10 commits into
mainfrom
audit-export-woes
Jun 26, 2026
Merged

perf(audit): re-anchor S3 audit export on enable + opt-in backfill#9818
rubenfiszel merged 10 commits into
mainfrom
audit-export-woes

Conversation

@rubenfiszel

@rubenfiszel rubenfiszel commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

EE companion: windmill-labs/windmill-ee-private#634

Problem (production incident)

The S3/GCS audit-log export's steady-state query selects rows by age(xmin) — a system column that cannot be indexed — so the only thing bounding the scan is the timestamp floor. Two paths made that floor sit far in the past:

  • Fresh enable used an epoch floor (the bootstrap sentinel).
  • Re-enable after a disable preserved the old cursor (ON CONFLICT DO NOTHING) and tried to backfill the entire disabled window.

Either way the first run sequentially scans the whole audit_partitioned table. Under a statement_timeout (e.g. Aiven) the scan is cancelled before a single batch finishes — canceling statement due to statement timeout — so the cursor never advances, nothing reaches the bucket, and the repeated full scans saturate the DB (100% load; the v2_job/job_logs vacuum then also times out as collateral). The pg_read_all_stats grant can't help because the adaptive-floor branch is never reached.

Fix

1. Re-anchor on enable (EE ee.rs + trigger migration)

  • The anchor (trigger / startup / missing-checkpoint) now records a recent timestamp floor (oldest in-flight xact_start when the stats privilege is available, else a bounded 7-day window) instead of epoch, so the first run prunes to recent partitions.
  • The trigger ON CONFLICT DO UPDATEs the cursor to the current snapshot xmin on re-enable, so the export always resumes from ~now and never rescans history. One-time fixup migrates legacy epoch-sentinel checkpoints on upgrade.
  • The bootstrap/epoch machinery is removed; backlog draining is handled uniformly by the per-run xid cap.

2. Data-loss fix (EE ee.rs)

  • While draining (a run capped at MAX_XID_INTERVAL), the floor stored for the next run is now carried forward, not advanced to the recent in-flight value. Advancing it pruned the still-un-drained old rows and let the xid cursor skip them — silent audit-row loss. New regression test verified red→green.

3. Opt-in historical backfill (new audit_logs_s3_backfill module + endpoints)

  • Exports a chosen [from, to) window on demand, scanning strictly by timestamp (the partition key) in bounded keyset pages. Each query is an index scan capped at one page and short-circuits at the LIMIT, so it stays well under any statement timeout regardless of window size.
  • POST /settings/audit_logs_s3_backfill {from,to} (super-admin + Enterprise), GET /settings/audit_logs_s3_backfill_status. Writes alongside the steady-state objects under logs/audit/, without touching the xmin cursor.

EXPLAIN of the page query (20k rows / 5 daily partitions, LIMIT 10000):

Limit -> Incremental Sort -> Append (Subplans Removed: 4)
  Index Scan Backward using audit_2026MMDD_timestamp_idx  (3 partitions scanned)
  ... audit_20260624 (never executed)   <-- short-circuits at the limit
  ... audit_20260625 (never executed)
Buffers: shared hit=178   Execution Time: 10.9 ms

Tests / validation

  • cargo check --features enterprise,private,parquet
  • EE audit_export_end_to_end updated for the re-anchor contract (recent floor; re-enable advances the cursor) ✅
  • New EE audit_export_backlog_drain_no_floor_loss regression test (test-only MAX_XID_INTERVAL override) — fails without the carry-floor fix, passes with it
  • Trigger behavior, legacy fixup, and backfill query plan verified directly against Postgres
  • sqlx offline cache regenerated; frontend client regenerated from OpenAPI

Notes

  • The only frontend/ change is a one-line settings description string (the EE-gated "Store audit logs in object storage" toggle) pointing operators at the backfill endpoint — no component/layout change, and it only renders under an EE license, so no screenshot.
  • Operational follow-up (independent of this PR): job_perms/v2_job_queue table bloat still needs VACUUM FULL/pg_repack in a maintenance window.

🤖 Generated with Claude Code


Summary by cubic

Re-anchored the S3/GCS audit export to start from “now” on enable/re-enable to prevent full-table scans/timeouts, and added a safe, opt-in historical backfill that exports a chosen window via bounded, index-backed scans.

  • Bug Fixes

    • On enable, record a recent timestamp floor; on re-enable, DO UPDATE the cursor to the current snapshot xmin (monotonic). One-time re-anchor of legacy epoch-sentinel checkpoints to the current snapshot xmin.
    • While draining, carry the timestamp floor forward between runs to avoid skipping undrained rows. GET /settings/audit_logs_s3_status: bootstrapping now means “draining backlog”.
    • Backfill persists progress after every object PUT to keep the lease heartbeat fresh.
    • Backfill requires a trustworthy settled cutoff and rejects unstable windows: to must be ≤ oldest in-flight xact_start, and the DB role must have pg_read_all_stats/superuser with no prepared (2PC) txns.
    • Backfill rejects any window that overlaps rows in the legacy audit table (covers empty audit_partitioned cases).
    • Backfill object keys include the requested window (e.g. logs/audit/dt=<day>/audit_backfill_<from,to>_<min_id>.ndjson) to prevent narrower runs from overwriting broader ones.
  • New Features

    • Historical backfill API (super-admin + Enterprise): POST /settings/audit_logs_s3_backfill {from,to} and GET /settings/audit_logs_s3_backfill_status. Scans by (timestamp,id) with LIMIT 10000, writes alongside steady-state objects, and does not touch the xmin cursor. OpenAPI updated to include these endpoints.

Written for commit 8e1af7f. Summary will update on new commits.

Review in cubic

The S3/GCS audit-log export's steady-state query filters by `age(xmin)`
(unindexable), so the only scan bound is the timestamp floor. On a fresh
enable the floor was epoch, and on a re-enable the cursor resumed from its
pre-disable position — either way the first run scanned the whole
`audit_partitioned` table. Under a `statement_timeout` (e.g. Aiven) that scan
never completes: the cursor never advances, nothing is exported, and the
repeated full scans saturate the database.

Re-anchor on enable (EE companion, windmill-ee-private#634):
- New trigger migration records a recent timestamp floor instead of the epoch
  sentinel and `DO UPDATE`s the cursor to the current snapshot xmin on
  re-enable, so the export always resumes from ~now and never rescans history.
  Includes a one-time fixup for legacy epoch-sentinel checkpoints on upgrade.

Opt-in historical backfill (new `audit_logs_s3_backfill` module + endpoints):
- Exports a chosen `[from, to)` window on demand, scanning strictly by
  `timestamp` (the partition key) in bounded keyset pages — each query is an
  index scan capped at one page (verified via EXPLAIN: later partitions
  `never executed`, ~11ms/page), so it stays well under any statement timeout
  regardless of window size. Writes alongside the steady-state objects under
  logs/audit/, without touching the xmin cursor.
- POST /settings/audit_logs_s3_backfill {from,to} (super-admin + Enterprise),
  GET /settings/audit_logs_s3_backfill_status.

Also repurposes the status endpoint's `bootstrapping` flag to mean "draining a
backlog" (the cursor is capped and catching up), and updates the setting
description to point operators at the backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 26, 2026

Copy link
Copy Markdown

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: 8e1af7f
Status:⚡️  Build in progress...

View logs

@rubenfiszel rubenfiszel marked this pull request as ready for review June 26, 2026 16:35
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @rubenfiszel's task in 3m 16s —— View job


Good to merge.

Reviewed the OSS-side diff (the EE ee.rs companion is reviewed separately in windmill-ee-private#634). I cross-checked the steady-state exporter in windmill-ee-private/src/ee.rs to validate the contracts this PR depends on.

What I verified

Backfill module (audit_logs_s3_backfill.rs)

  • Keyset pagination is correct. Cursor starts at (from, -1); since audit id is a positive bigserial, the -1 sentinel correctly includes rows at exactly from. The (timestamp, id) > ($3, $4) row-value predicate combined with timestamp >= $1 AND timestamp < $2 advances strictly and never skips or re-emits within a run. Day-grouping via by_day.last_mut() is sound because ORDER BY timestamp, id keeps each UTC day contiguous within a page.
  • Object keys are collision-free and deterministic. A day spanning multiple pages yields distinct audit_backfill_<min_id>.ndjson keys (min_id advances with the cursor); re-running the same window reproduces identical keys (append-only history) and overwrites rather than duplicating — matching the doc comment.
  • Row-shape parity confirmed. The subquery (workspace_id, id, timestamp, username, operation, action_kind::text AS action_kind, resource, parameters, email, spanrow_to_json) is identical to the steady-state exporter's batch projection in ee.rs:475, so the "consumers dedupe by id uniformly" claim holds.
  • Lease handling is robust. try_start claims atomically (one backfill at a time, even across replicas via background_task::try_claim); spawn_backfill wraps the task in catch_unwind + always release()s, so a panic or missing object store can't strand the lease.
  • Security/validation: super-admin + Enterprise gate on the POST, from < to validated, all queries parameterized, PAGE_ROWS bounds memory and statement time. row_to_json produces single-line JSON (embedded newlines escaped), so ndjson.lines().count() accounting is accurate.

Migration (...reanchor_on_enable.up.sql)

  • ON CONFLICT DO UPDATE ... WHERE last_xmin < EXCLUDED.last_xmin correctly advances the cursor monotonically on re-enable (HA-safe, never backwards). The privilege-gated min(xact_start) floor with a bounded 7-day fallback is a sound lower bound. The one-time epoch-sentinel fixup is appropriately guarded (last_ts <= 'epoch'). Down migration restores the prior function (the data fixup is intentionally one-way, which is standard).

Status flag repurposing (audit_logs_s3.rs)bootstrapping now derives from the draining key set by the EE exporter when a run is xid-capped; consistent with the re-anchor design (no more epoch backlog).

Minor (non-blocking)

  • run_backfill only heartbeats once per page (STALE_HEARTBEAT_SECS = 120). If the object store were pathologically slow on a page with many day-objects, the lease could be considered stale. The worst case is benign though — a concurrent re-POST would write idempotent (same-key) objects — so not worth changing.
  • rows_written/objects_written count re-emitted rows when a backfill window overlaps already-exported steady-state objects; this is by design (the doc notes consumers dedupe by id), just be aware the status figures aren't a deduped total.

Test coverage

  • Backend: The substantive logic (steady-state re-anchor, carry-floor drain regression, trigger/legacy-fixup behavior) is covered by the EE tests called out in the PR (audit_export_end_to_end, audit_export_backlog_drain_no_floor_loss) — I confirmed MAX_XID_INTERVAL_OVERRIDE and the carry-floor logic exist in ee.rs. The new OSS audit_logs_s3_backfill module itself has no direct unit test; its keyset/paging logic is the kind that would benefit from one, but the PR notes the page query plan was verified directly against Postgres, which is reasonable given it's I/O-bound and hard to unit-test without a DB fixture.
  • Frontend: single description string — no test expected.
  • OpenAPI/sqlx: regenerated and consistent with the handlers.
  • Manual verification before merge: enable the toggle on a large instance and confirm the first steady-state run prunes to recent partitions (no full scan); then POST /settings/audit_logs_s3_backfill {from,to} over a multi-day window and confirm objects land under logs/audit/dt=<day>/audit_backfill_*.ndjson while GET /settings/audit_logs_s3_backfill_status advances last_ts/rows_written and the xmin cursor is untouched.
    · Branch

@github-actions

Copy link
Copy Markdown
Contributor

Now I have all the context needed. Let me compile the review.

Pi Review

Good to merge.

Summary

This PR fixes a production incident where S3 audit export's steady-state query — scanning by unindexable age(xmin) against an epoch floor — caused repeated full-table scans, statement timeouts, and DB saturation. The fix re-anchors the cursor on enable/re-enable, fixes a silent data-loss bug in backlog draining, and adds an opt-in historical backfill API. The code is well-structured, properly gated behind #[cfg(feature = "parquet")], and follows the EE workflow per docs/enterprise.md (companion PR, [ee] prefix, ee-repo-ref.txt updated).

What was reviewed

All 10 changed files across ~592 added / 7 removed lines:

  • Migration SQL (up + down) with trigger re-anchor and one-time fixup
  • New audit_logs_s3_backfill module (319 lines)
  • Updated audit_logs_s3 status struct
  • Route registration in lib.rs, handler functions, OpenAPI spec
  • Frontend description string in instanceSettings.ts

Public-surface checklist

New public functions (audit_logs_s3_backfill):

  • pub async fn try_start — validates from < to, delegates to background_task::try_claim (atomic lease). ✅
  • pub async fn get_status — read-only, delegates to background_task::get. ✅
  • pub fn spawn_backfill — spawns a tokio::spawn with catch_unwind + release; no auth needed at this layer (caller already authorized). ✅
  • pub struct BackfillRequest — caller-controlled from/to validated in try_start. ✅

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + Enterprise license check ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin (read-only, consistent with existing /audit_logs_s3_status pattern) ✅

Auth/authorization: Both endpoints gate on super-admin. The POST additionally gates on LicensePlan::Enterprise (runtime check). No auth bypass.

Input validation: from >= to rejected. Large windows handled safely by keyset pagination (bounded 10K-row pages). No injection surface — all SQL parameterized.

Module placement: audit_logs_s3_backfill lives alongside audit_logs_s3 in windmill-api-settings, whose purpose matches "instance-level settings and background operations." Module doc comment clearly states the relationship to the steady-state exporter. ✅

No issues found

  • Panic safety: spawn_backfill wraps execution in AssertUnwindSafecatch_unwind and always calls release, preventing stuck leases. Stale heartbeat (120s) provides a second safety net.
  • Deterministic re-runs: Object keys audit_backfill_{min_id}.ndjson are deterministic — re-running the same window overwrites, not duplicates.
  • Migration: down.sql restores the old trigger function. One-time fixup isn't reversed (standard for data fixups). Naming follows cargo sqlx migrate add -r convention.
  • SQL injection: All queries use $N parameters via sqlx::query!. The migration SQL is static.
  • Feature gates consistent: Both audit_logs_s3.rs and audit_logs_s3_backfill.rs carry #![cfg(feature = "parquet")]; mod declarations and route registrations are gated accordingly.

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module has no OSS-side unit tests. The PR description states tests are in the EE companion repo (audit_export_end_to_end updated, new audit_export_backlog_drain_no_floor_loss regression test), which is the standard pattern for EE-gated features. The background_task module (pre-existing dependency) also has no tests — that's outside this PR's scope.
  • Migration / trigger behavior: Described as verified directly against Postgres in the PR body.
  • Frontend: Single-line description-string change only; no component tests expected.
  • CI / OpenAPI / sqlx: cargo check --features enterprise,private,parquet passes; sqlx offline cache regenerated.

Manual verification needed

  • Trigger behavior on enable/re-enable should be verified on a real Postgres instance: enable the store_audit_logs_s3 setting, confirm the cursor row in background_task_state has a recent last_ts (not epoch) and a last_oldest_inflight_ts within the 7-day window, then disable and re-enable, confirming the cursor advances.
  • Backfill endpoint should be exercised: POST /settings/audit_logs_s3_backfill with a short {from, to} window (as super-admin on Enterprise), confirm objects appear under logs/audit/dt=<day>/audit_backfill_<min_id>.ndjson, and confirm GET status returns progress.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: backfill accepts unstable windows that can silently miss audit rows, re-enable anchor can be overwritten by an in-flight exporter

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:146 - The backfill request only validates from < to, so callers can start a backfill whose upper bound is live or in the future. run_backfill then pages only rows visible at query time by (timestamp, id) and marks the job done once the current scan is empty. Audit rows use timestamp DEFAULT now() and can be inserted in caller transactions, so a transaction that started inside [from, to) but commits after the scan has passed its timestamp, or any future row when to is in the future, is silently omitted even though status reports completion. This endpoint should reject unstable windows, ideally by requiring to to be older than the oldest in-flight transaction/start-time cutoff used elsewhere, or otherwise make the cursor visibility-safe.

  2. [P1] windmill-ee-private/src/ee.rs:577 - The exporter still writes the checkpoint with an unconditional ON CONFLICT DO UPDATE SET value = $2. The new trigger re-anchors the same background_task_state row on re-enable, but it does not coordinate with an already-running export. If an export starts, then the setting is disabled/re-enabled while that export is still uploading, the trigger can store the new snapshot xmin and then the older exporter completion overwrites it with the stale checkpoint it read before re-enable. That defeats the re-anchor guarantee and can bring back the disabled-window scan this PR is meant to avoid. The checkpoint update should be monotonic or conditional so it cannot overwrite a newer last_xmin, or both paths should share a lock.

Test coverage

Backend coverage is needed for the two edge cases above: a backfill request against a non-stable window, and a re-enable while an export run is in progress where the final checkpoint must not move backwards. The PR notes cover the steady-state re-anchor and backlog-drain floor regression, but those do not exercise these races.

No frontend automated tests are expected for the one settings-description string. Manual verification should include starting a safe historical backfill and confirming status plus object keys/row counts, then disabling/re-enabling audit export during an active export and confirming the stored cursor remains anchored at the re-enable snapshot.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 10 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql">

<violation number="1" location="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:36">
P2: `min(xact_start)` is computed without restricting to sessions that actually hold an xid. A long read-only transaction can make the floor very old and reintroduce large partition scans/timeouts on enable.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread backend/windmill-api-settings/src/audit_logs_s3_backfill.rs
Address review (cubic): persist progress (refreshing the lease heartbeat) after
every object PUT in the backfill page loop, not only once per page, so the gap
between heartbeats stays well under STALE_HEARTBEAT_SECS even on slow uploads
and another replica can't re-claim mid-page and run a concurrent backfill.

Bumps ee-repo-ref.txt to pull in the EE test-race fix (folding the backlog-drain
regression into the single audit e2e test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

Summary

This PR fixes a production incident where the S3 audit export's age(xmin) scan with an epoch floor caused repeated full-table scans and DB saturation. The fix re-anchors the cursor on enable/re-enable, fixes a silent data-loss bug in backlog draining, and adds an opt-in historical backfill API. The code is well-structured, properly gated behind #[cfg(feature = "parquet")], follows the EE workflow per docs/enterprise.md (companion PR, [ee] prefix, ee-repo-ref.txt updated). All 10 changed files across ~594 added lines reviewed.

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + Enterprise license check (LicensePlan::Enterprise at runtime) ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All input-validated (from >= to rejected, parameterized SQL), module placement appropriate (alongside audit_logs_s3 in windmill-api-settings), lease-handling correct (catch_unwind + always release, 120s heartbeat). ✅

Migration: up.sql correctly re-anchors via ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin (monotonic, HA-safe). One-time fixup for legacy epoch sentinels is properly gated (last_ts <= 'epoch'). down.sql restores the prior trigger function. Naming follows cargo sqlx migrate add -r convention. ✅

Two P2 nitpicks (non-blocking)

P2 — Backfill accepts unstable windows with no guard (backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:147)

try_start validates only from < to. A caller can pass to = now() (or a future timestamp), and run_backfill will scan rows visible up to that point via (timestamp, id) keyset pagination. Audit rows whose inserting transaction commits after the keyset cursor has passed their timestamp are silently omitted from the backfill — the scan exits when rows.is_empty(), and uncommitted rows that later land within [from, to) with a timestamp the cursor already passed are invisible. The doc comment says "historical [from, to) window" but the code doesn't enforce it. Since the steady-state exporter covers live/post-enable data and this endpoint requires super-admin + Enterprise, the impact is a misuse vector rather than data loss. Reject to >= now() - <safety_margin> or document the stability requirement in the OpenAPI description field.

P2 — Re-enable trigger anchor can be overwritten by an in-flight exporter (windmill-ee-private/src/ee.rs:578)

The EE exporter's final checkpoint write uses unconditional ON CONFLICT (name) DO UPDATE SET value = $2. If an admin disables/re-enables the export while an export run is in-flight, the trigger fires and writes a newer last_xmin, but the in-flight exporter's completion then unconditionally overwrites it with its own (stale, lower) last_xmin. The trigger itself is monotonic (WHERE last_xmin < EXCLUDED.last_xmin in up.sql:54), but the exporter side lacks an equivalent guard. Requires explicit admin action during an active export run; in the capped/draining case the overwrite actually restores the correct drained cursor (the trigger's advance would skip the tail), so the worst case is temporary scan inefficiency, not data loss. Still, the checkpoint write should be monotonic: WHERE ... ->'last_xmin' < ($2->>'last_xmin') or similar.

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module has no OSS-side unit tests. The PR description states tests live in the EE companion repo (audit_export_end_to_end updated, new audit_export_backlog_drain_no_floor_loss regression test), which is the standard pattern for EE-gated features. The background_task module (pre-existing dependency) also has no tests — outside this PR's scope.
  • Migration / trigger behavior: Described as verified directly against Postgres in the PR body.
  • Frontend: Single-line description-string change only; no component tests expected.
  • CI / sqlx: cargo check --features enterprise,private,parquet passes; sqlx offline cache regenerated with two new query JSON files.

Manual verification needed

No in-app surface to exercise beyond the frontend settings description string (which updates inline help text for the "Store audit logs in object storage" toggle — EE-gated, so it only renders under a valid Enterprise license). Backfill endpoint behavior is a pure backend API; verifying with a real POST /settings/audit_logs_s3_backfill against a running instance with object storage configured would confirm the keyset pagination, object-key determinism, and status reporting.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: backfill accepts unstable windows that can silently miss audit rows, re-enable anchor can be overwritten by an in-flight exporter

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:146 - try_start only validates from < to, so the new backfill API accepts live/future upper bounds. run_backfill then keyset-pages the rows visible at query time and reports completion once the current scan is empty. Audit rows get timestamp DEFAULT now() from the inserting transaction, so a transaction that started inside [from, to) but commits after the scan has passed its timestamp, or any future audit row when to is in the future, is silently omitted from a “completed” backfill. The endpoint should reject unstable windows, ideally requiring to to be older than the oldest in-flight transaction/start-time cutoff used by the exporter, or otherwise make the cursor visibility-safe.

  2. [P1] windmill-ee-private/src/ee.rs:577 - The current EE ref still persists the steady-state audit export checkpoint with an unconditional ON CONFLICT DO UPDATE SET value = $2. The new trigger re-anchors the same background_task_state row on re-enable, but it does not coordinate with an already-running exporter. If an export starts, then the setting is disabled/re-enabled while that export is still uploading, the trigger can store the newer snapshot xmin and then the older exporter completion overwrites it with the stale checkpoint it read before re-enable. That defeats the re-anchor guarantee and can bring back the disabled-window scan this PR is meant to avoid. The checkpoint write should be monotonic/conditional on last_xmin, or both paths should share a lock.

Test coverage

Backend coverage should include the two edge cases above: rejecting or safely handling a backfill whose upper bound is not visibility-stable, and a disable/re-enable while an export run is in progress where the final checkpoint must not move backwards. The PR notes cover the steady-state re-anchor and backlog-drain floor regression, but not these races.

No frontend automated tests are expected for the one settings-description string. Manual verification should include starting a safe historical backfill and confirming status plus object keys/row counts, then disabling/re-enabling audit export during an active export and confirming the stored cursor remains anchored at the re-enable snapshot.

Address review (P1): the backfill keyset-pages over rows visible at scan time
and declares completion when the scan runs dry, but a row's `timestamp` is its
inserting transaction's `xact_start`. A window whose upper bound is recent or in
the future could silently omit a transaction that started inside `[from, to)`
but commits after the scan passed that timestamp. `try_start` now rejects any
`to` newer than the oldest in-flight `xact_start` (everything strictly older
than the oldest running transaction is committed and stable), using the same
trustworthy stats gating as the exporter's floor (restricted role / 2PC → a
7-day-old cutoff).

Bumps ee-repo-ref.txt for the EE monotonic-checkpoint fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Addressed both P1s.

[P1] Backfill accepted unstable windows — fixed in e8ba447. try_start now rejects a to that isn't settled yet. A row's timestamp is its inserting txn's xact_start, so a window whose upper bound is recent/future could silently omit a transaction that started inside [from, to) but commits after the keyset scan passed that timestamp. It now requires to ≤ the oldest in-flight xact_start (everything strictly older than the oldest running transaction is committed and stable), using the same trustworthy stats gating as the exporter's floor (restricted role / prepared-2PC ⇒ a 7-day-old cutoff). The error tells the operator the maximum allowed to.

[P1] Re-enable anchor clobbered by an in-flight exporter — fixed in EE 6b191b7 (ref bumped in e8ba447). The export's advisory lock excludes other exporters but not the enable trigger, so a disable/re-enable mid-upload could re-anchor the checkpoint and then be overwritten by the exporter's stale prev_xmin-derived write. The checkpoint ON CONFLICT DO UPDATE is now guarded with WHERE (value->>'last_xmin')::bigint <= eff_cur_xmin, keeping last_xmin monotonic (matching the trigger's own monotonic guard). A skipped write means a concurrent re-anchor won — logged, cursor not regressed.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs">

<violation number="1" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:165">
P2: No-active-transaction case is treated as unsafe fallback, incorrectly forcing a 7-day cutoff. This can reject valid backfill requests even when everything is already settled.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/windmill-api-settings/src/audit_logs_s3_backfill.rs
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

Both P1 findings from the prior review round have been addressed in commit e8ba447:

  • Unstable-window validation (try_start): now queries the oldest in-flight xact_start (gated on pg_read_all_stats / is_superuser with prepared-2PC guard, 7-day fallback) and rejects to > settled_cutoff with a descriptive error telling the operator the maximum allowed to. Same trustworthy stats gating as the exporter's floor. ✅
  • Re-enable anchor clobbering: EE ref bumped to 6b191b7, which adds a monotonic guard (WHERE (value->>'last_xmin')::bigint <= eff_cur_xmin) to the checkpoint write per the author's comment. The OSS migration's trigger is already monotonic (WHERE last_xmin < EXCLUDED.last_xmin). ✅

The per-object heartbeat fix from e0fcd0d is also present — session.update() (which calls background_task::update_state, setting updated_at = NOW()) fires after every object PUT, keeping the gap between heartbeats safely under STALE_HEARTBEAT_SECS = 120.

What was reviewed

All 12 changed files across ~646 added / ~10 removed lines: the new audit_logs_s3_backfill module (349 lines), migration SQL (up + down), updated audit_logs_s3 status, route registration, OpenAPI spec, frontend description string, EE ref bump, and 4 sqlx offline cache entries.

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from < to and that to is settled (no in-flight transactions with earlier xact_start); claims atomically via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays stale-heartbeat detection from background_task_state. ✅

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly validated, placed alongside audit_logs_s3 in windmill-api-settings. Lease-handling is panic-safe (catch_unwind → always release). Heartbeat per object prevents stale-lease races. ✅

Migration: up.sql re-anchors via monotonic ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin; one-time fixup for legacy epoch sentinels is gated on last_ts <= 'epoch'. down.sql restores the old trigger. Naming follows cargo sqlx migrate add -r convention. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. Migration SQL is static.
  • Lease safety: spawn_backfill wraps in catch_unwind → always release, with stale heartbeat (120s) as a second safety net.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start ensures the first row at exactly from is included; ORDER BY timestamp, id within timestamp >= $1 AND timestamp < $2 advances strictly.
  • Deterministic object keys: audit_backfill_{min_id}.ndjson — collision-free (min_id advances per page), reproducible on re-run.
  • Feature gates: Module, route handlers, and #![cfg(feature = "parquet")] are consistent.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Consistent with the re-anchor design.

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module has no OSS-side unit tests. The PR description states tests live in the EE companion repo (audit_export_end_to_end updated, new audit_export_backlog_drain_no_floor_loss regression test with a test-only MAX_XID_INTERVAL override), which is the standard pattern for EE-gated features. The migration trigger behavior and backfill page query plan were verified directly against Postgres.
  • Migration: sqlx offline cache regenerated (3 new + 1 modified query json).
  • Frontend: Single-line description-string update only; no component tests expected.
  • OpenAPI: Spec updated for both new endpoints. No generated client changes surfaced.

Manual verification needed

No in-app UI to exercise (the frontend change is a settings description string that only renders under an EE license). The backfill endpoints and re-anchor behavior should be validated against a real Postgres instance with the EE companion PR's exporter: start a historical backfill via POST /settings/audit_logs_s3_backfill {from, to} where to is well in the past, confirm GET /settings/audit_logs_s3_backfill_status shows progress and final row/object counts, then verify the .ndjson objects in the bucket contain the expected rows. For the re-anchor path, disable and re-enable the export setting during an active export run and confirm the cursor row's last_xmin is monotonic (never regresses).

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel

Should address issues before merging: legacy epoch fixup can mark old post-enable audit rows processed without exporting them

Found 1 issue:

  1. [P1] backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:63 - The legacy epoch-checkpoint fixup stamps last_oldest_inflight_ts to now() - 7 days while leaving the old last_xmin in place. On an instance that enabled export on the old code more than 7 days ago and got stuck before the first successful batch, the next exporter run uses that synthetic timestamp floor with the old xmin interval. Rows committed after enable but older than 7 days are filtered out by timestamp >= ts_floor, while the exporter still advances last_xmin over the interval, so those rows are never written by steady-state export and status does not surface the gap. This also contradicts the settings text's “no audit log committed after enabling is ever skipped” guarantee. The migration should either preserve an explicit backfill-required state/re-anchor instead of pretending the drain continues, or avoid advancing the cursor past rows pruned only by the synthetic floor.

Test coverage

Backend coverage should include a legacy checkpoint regression: last_ts = epoch, an old last_xmin, and post-enable audit rows older than 7 days. Verify the first post-migration export either writes those rows or does not silently advance the cursor without surfacing that a historical backfill is required.

No frontend automated tests are expected for the single settings-description string. Manual verification should include upgrading an instance with a stale epoch audit-export checkpoint, running the exporter, and confirming the stored cursor/status plus object-store contents make the old post-enable gap explicit rather than silently skipping it.

…loor

Address review (P1): the legacy-checkpoint fixup stamped last_oldest_inflight_ts
to now()-7d while leaving the old last_xmin in place. On an instance that
enabled export on the old code >7 days ago and got stuck before the first
successful batch, the next run would filter post-enable rows older than 7 days
out via `timestamp >= ts_floor` while still advancing last_xmin over the
interval — silently dropping them (the same floor-vs-cursor loss class fixed
elsewhere in this PR), and contradicting the "nothing committed after enabling
is skipped" guarantee.

A stuck epoch-sentinel checkpoint cannot be safely resumed (its backlog can be
arbitrarily old, so any recent floor prunes rows the cursor then skips, and an
epoch floor reintroduces the full scan). Re-anchor it to the migration's current
snapshot xmin instead — exactly like a fresh enable — so the export resumes
cleanly from ~now and the never-exported pre-upgrade window is recovered via the
opt-in backfill rather than silently dropped. Reword the setting description so
it no longer implies the disabled/legacy window is covered by the cursor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[P1] Legacy epoch-checkpoint fixup could silently drop post-enable rows — fixed in 3c6f556.

You're right: stamping now()-7d over the old last_xmin prunes post-enable rows older than 7 days via timestamp >= ts_floor while the cursor still advances over the interval — the same floor-vs-cursor loss class as the capped-drain bug, and it contradicts the settings guarantee.

A stuck epoch-sentinel checkpoint can't be safely resumed (its backlog is arbitrarily old → any recent floor prunes rows the cursor skips, and an epoch floor reintroduces the full scan). The fixup now re-anchors such a checkpoint to the migration's current snapshot xmin (with a recent floor), exactly like a fresh enable: the export resumes cleanly from ~now, and the never-exported pre-upgrade window is recovered via the opt-in backfill rather than silently dropped while the cursor advances. I also reworded the setting description so it no longer implies the disabled/legacy window is covered by the steady-state cursor.

@github-actions

Copy link
Copy Markdown
Contributor

I've now reviewed all changed files and the prior discussion history. All three P1 findings from earlier review rounds have been addressed:

  • Unstable-window validationtry_start now queries oldest in-flight xact_start and rejects unsafe to bounds ✅
  • Re-enable anchor clobbering — EE ref bumped to 6b191b7 with monotonic guard on checkpoint write ✅
  • Legacy epoch fixup data loss — now re-anchors fully (reset to current snapshot) instead of silently dropping rows ✅

No new issues found.

Pi Review

Good to merge.

Summary

This PR fixes a production incident where the S3 audit export's age(xmin) scan with an epoch floor caused repeated full-table scans, statement timeouts, and DB saturation. The fix re-anchors the cursor on enable/re-enable, fixes a silent data-loss bug in backlog draining, and adds an opt-in historical backfill API. All prior P1 findings from previous review rounds have been addressed in the latest commits.

What was reviewed

All 12 changed files across ~657 added / ~10 removed lines: the new audit_logs_s3_backfill module (349 lines), migration SQL (up + down, 79+19 lines), updated audit_logs_s3 status, route registration, OpenAPI spec, frontend description string, EE ref bump, and 4 sqlx offline cache entries.

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from < to and to settled; claims atomically via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays stale-heartbeat detection. ✅

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly input-validated, placed alongside audit_logs_s3 in windmill-api-settings. Lease-handling panic-safe (catch_unwind → always release). Heartbeat refreshes after every object PUT (not just per page). ✅

Migration: up.sql re-anchors via monotonic ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin; one-time fixup fully re-anchors legacy epoch checkpoints (new snapshot xmin + recent floor, not merely a floor overlay). down.sql restores the old trigger. Naming follows cargo sqlx migrate add -r convention. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. Migration SQL is static.
  • Lease safety: spawn_backfill wraps task in catch_unwind, always calls release. Stale heartbeat (120s) as second safety net. Heartbeat refreshes per-object to stay well under the threshold.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within the timestamp window advances strictly.
  • Deterministic object keys: audit_backfill_{min_id}.ndjson — collision-free and reproducible.
  • Monotonic checkpoint guard: EE ref 6b191b7 adds WHERE (value->>'last_xmin')::bigint <= eff_cur_xmin to the checkpoint write, matching the trigger's own WHERE last_xmin < EXCLUDED.last_xmin guard. Both paths are now monotonic.
  • Stability validation: try_start rejects to that isn't fully settled using the same trustworthy stats gating as the exporter floor (restricted role / prepared-2PC ⇒ 7-day fallback).
  • Legacy fixup safety: One-time UPDATE fully re-anchors (new xmin + recent floor), so the export resumes cleanly from ~now — the pre-upgrade window is recovered via the opt-in backfill, not silently skipped.
  • Frontend change: Only a description string update; consistent with the backend contract.

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module has no OSS-side unit tests. The PR description states tests live in the EE companion repo (audit_export_end_to_end updated, new audit_export_backlog_drain_no_floor_loss regression test with test-only MAX_XID_INTERVAL override), which is the standard pattern for EE-gated features. The migration trigger behavior and backfill page query plan were verified directly against Postgres.
  • Migration: sqlx offline cache regenerated (3 new + 1 modified query json).
  • Frontend: Single-line description-string update only; no component tests expected.
  • EE ref bump: ee-repo-ref.txt updated to 6b191b77aabcf77658ad4f9031576e0d7b66bf89 — the companion EE PR containing the monotonic checkpoint guard.

Manual verification needed

  • Enable audit log export to object storage on a test instance, confirm the anchor is recent (not epoch), and verify the steady-state export runs without full-table scans.
  • Initiate a historical backfill via POST /settings/audit_logs_s3_backfill with a settled window (e.g. a day from last week), monitor via GET /settings/audit_logs_s3_backfill_status, and confirm objects land under logs/audit/dt=<day>/audit_backfill_<min_id>.ndjson with the expected row counts.
  • Disable and re-enable export during an active run, verify the cursor remains at the re-enable snapshot (not regressed to the stale in-flight value).

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel

Should address issues before merging: backfill skips legacy audit rows

Found 1 issue:

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:279 - The new backfill scans only audit_partitioned, but existing audit history still includes the legacy audit table: audit list/get paths read audit_partitioned UNION ALL audit, and cleanup removes legacy audit rows only after the configured retention period. In EE the default retention is 365 days, so upgraded instances can still have pre-partition audit rows. Since the settings text now tells operators to use this API for “Pre-existing history” and a chosen historical range, a request whose [from, to) overlaps legacy rows will report completion while omitting those audit logs from object storage. The endpoint should either include the legacy table in the backfill or reject/document windows before the partitioned-only boundary.

Test coverage

Backend coverage should include a backfill window containing one row in legacy audit and one row in audit_partitioned, verifying both are written or that the request is rejected with an explicit unsupported-range error. The existing EE coverage described in the PR covers the cursor re-anchor/drain behavior, but not this historical-table boundary.

No frontend automated tests are expected for the single settings-description string. Manual verification should exercise the API on an upgraded database that still has legacy audit rows and confirm the object-store output or error matches the documented range semantics.

The backfill previously had only SQL-level/EXPLAIN validation. Add real
integration tests (in-memory object store, sqlx::test) exercising the public
path:

- backfill_exports_window_in_pages: with the page size forced to 2 rows, a
  settled 3-day window is exported across multiple keyset pages; asserts every
  in-window row lands exactly once, rows outside [from,to) are excluded, a day
  that straddles a page boundary yields more than one object, progress counts
  match, and a re-run is idempotent (deterministic keys overwritten, no dupes).
- backfill_rejects_unstable_window: a future/live `to` is rejected as unstable,
  a window safely in the past is accepted.

Adds a test-only PAGE_ROWS override so multi-page behaviour is exercised with a
handful of rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql">

<violation number="1" location="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:36">
P2: `min(xact_start)` is computed without restricting to sessions that actually hold an xid. A long read-only transaction can make the floor very old and reintroduce large partition scans/timeouts on enable.</violation>
</file>

<file name="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs">

<violation number="1" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:165">
P2: No-active-transaction case is treated as unsafe fallback, incorrectly forcing a 7-day cutoff. This can reject valid backfill requests even when everything is already settled.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/windmill-api-settings/src/audit_logs_s3_backfill.rs Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

All P1 findings from the three prior review rounds have been addressed. The two new tests in 2dddb7ced (end-to-end keyset pagination + unstable-window rejection) close the OSS-side test-coverage gap the prior rounds noted. No new issues introduced.

What changed since the last reviews

Commit-by-commit fix verification against the prior discussion:

  • e0fcd0d: per-object heartbeat (session.update() fires after every PUT, not just once per page). With STALE_HEARTBEAT_SECS = 120, the gap stays well under the threshold even for multi-day pages with slow uploads. ✅
  • e8ba447: try_start rejects to > settled_cutoff (oldest in-flight xact_start, gated identically to the exporter's floor: restricted stats role / prepared-2PC → 7-day fallback). Error tells the operator the max allowed to. EE ref bumped to 6b191b7 which adds WHERE (value->>'last_xmin')::bigint <= eff_cur_xmin to the checkpoint write, making it monotonic. ✅
  • 3c6f556: Legacy epoch-sentinel fixup now fully re-anchors (new snapshot xmin + recent floor) instead of stamping a synthetic floor over the old cursor. The export resumes cleanly from ~now; the pre-upgrade window is recovered via the opt-in backfill. ✅
  • 2dddb7c: Two integration tests — backfill_exports_window_in_pages (covers multi-page keyset paging, cross-day page boundaries, out-of-window exclusion, idempotent re-run) and backfill_rejects_unstable_window (covers the settled-cutoff rejection). ✅

Legacy audit table concern (not a PR-introduced issue)

The prior Codex review flagged that the backfill scans only audit_partitioned, not the legacy audit table. This is a pre-existing scope limitation of the audit export system: the steady-state exporter in ee.rs also scans only audit_partitioned (the Claude review confirmed the subquery is identical at ee.rs:475). The backfill is intentionally scope-consistent with the exporter it complements. The review policy says to report only issues introduced by this PR, so this is not a finding.

Public-surface checklist

New API endpoints — both properly gated:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; from >= to and unstable to rejected; atomic lease via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays background_task::get stale-heartbeat detection. ✅

New public functions (audit_logs_s3_backfill.rs) — try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly validated, placed alongside audit_logs_s3 in windmill-api-settings (matches the module's purpose). Lease-handling panic-safe (catch_unwind → always release). ✅

Migrationup.sql uses monotonic ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin. Legacy fixup gates on last_ts <= 'epoch'::timestamptz and re-anchors to current snapshot. down.sql restores the prior trigger. ✅

Frontend — Single-line description string update only; no component change. ✅

No issues found

  • SQL injection: All queries parameterized; migration SQL static.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within timestamp >= $1 AND timestamp < $2 advances strictly.
  • Deterministic object keys: audit_backfill_{min_id}.ndjson — collision-free and reproducible on re-run.
  • Feature gates: Module (audit_logs_s3_backfill), route handlers, and audit_logs_s3 all carry #[cfg(feature = "parquet")]. Route registration gated accordingly.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Consistent with the re-anchor design.
  • EE workflow: [ee] prefix, ee-repo-ref.txt bumped to 6b191b7, companion PR windmill-labs/windmill-ee-private#634. Follows docs/enterprise.md. ✅

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module now has two integration tests (backfill_exports_window_in_pages, backfill_rejects_unstable_window) covering the keyset pagination, page-boundary day splitting, out-of-window exclusion, idempotent re-run, and unstable-window rejection. The migration trigger behavior, legacy fixup, and EE-side regression test (audit_export_backlog_drain_no_floor_loss) are described as verified against Postgres in the PR body. sqlx offline cache regenerated (3 new + 1 modified query JSON).
  • Frontend: Single-line description string only; no component tests expected.
  • OpenAPI: New endpoints documented in openapi.yaml.

Manual verification

No in-app UI to exercise (the instanceSettings.ts change is a description string only, and the backfill endpoints have no corresponding UI). Manual verification should hit the API directly on an instance with audit data: POST a safe historical [from, to) window (where to is comfortably in the settled past), poll status via GET, then confirm the expected ndjson objects appear under logs/audit/dt=<day>/audit_backfill_<min_id>.ndjson with correct row counts. For the re-enable trigger behavior, disable and re-enable the export toggle and confirm the background_task_state checkpoint advances to the current snapshot xmin.

Make explicit that, like the steady-state export, the backfill reads only
audit_partitioned; the pre-partitioning `audit` table is intentionally out of
scope (not a missed case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel

Should address issues before merging: backfill skips legacy audit rows

Found 1 issue:

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:299 - The new backfill scans only audit_partitioned, but existing audit history still includes the legacy audit table. Existing audit read paths query audit_partitioned UNION ALL audit, and cleanup only removes legacy audit rows after the configured retention period, which defaults to 365 days in EE. Since the settings text directs operators to use this API for “Pre-existing history” and chosen historical ranges, a backfill window overlapping legacy rows can report completion while omitting those audit logs from object storage. The endpoint should include legacy rows or explicitly reject/document unsupported windows before the partitioned-only boundary.

Test coverage

Backend coverage should include a backfill window containing one row in legacy audit and one row in audit_partitioned, verifying both are written or that the request is rejected with an explicit unsupported-range error.

No frontend automated tests are expected for the single settings-description string. Manual verification should exercise the API on an upgraded database that still has legacy audit rows and confirm the object-store output or error matches the documented range semantics.

@github-actions

Copy link
Copy Markdown
Contributor

I have now completed a thorough review of all 6 commits across all 12 changed files. Here is the final review:

Pi Review

Good to merge.

Summary

All three P1 findings from the prior three review rounds have been resolved:

  • Unstable-window validation (e8ba447): try_start queries the oldest in-flight xact_start (with trustworthy stats gating) and rejects to bounds that aren't settled. ✅
  • Re-enable anchor clobbering (e8ba447): EE ref 6b191b7 adds WHERE (value->>'last_xmin')::bigint <= $4 to the checkpoint write, matching the trigger's own monotonic WHERE last_xmin < EXCLUDED.last_xmin guard. ✅
  • Legacy epoch-fixup data loss (3c6f556): The one-time UPDATE now fully re-anchors (new snapshot xmin + recent floor), so the export resumes cleanly from ~now — the pre-upgrade window is recovered via the backfill API, not silently skipped. ✅

Plus e0fcd0d5a heartbeat-per-object and 2dddb7ced integration tests (two well-structured tests covering keyset paging, day-splitting, idempotent re-runs, and unstable-window rejection). The latest commit 691394712 adds a doc comment explicitly scoping the backfill to audit_partitioned only (consistent with the steady-state exporter's pre-existing scope).

What was reviewed

All 12 changed files (~860 added, ~10 removed lines): audit_logs_s3_backfill.rs (552 lines with tests), migration SQL (up 79 + down 19), updated audit_logs_s3.rs (9 lines), lib.rs route registration (35 lines), openapi.yaml (86 lines), instanceSettings.ts (2-line description update), ee-repo-ref.txt (1 line), and 4 sqlx offline cache entries.

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from < to and to settled; claims atomically via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays stale-heartbeat detection from background_task::get. ✅

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly validated, placed alongside audit_logs_s3 in windmill-api-settings (matches the module's purpose). Lease-handling panic-safe (catch_unwind → always release). Heartbeat refreshed per-object (not just per-page) to stay under STALE_HEARTBEAT_SECS = 120. ✅

Migration: up.sql uses monotonic ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin. Legacy fixup gated on last_ts <= 'epoch'::timestamptz and fully re-anchors. down.sql restores the prior trigger. Naming follows cargo sqlx migrate add -r convention. ✅

Frontend: Single-line description-string update only; consistent with the new backend contract. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. Migration SQL is static.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within the timestamp window advances strictly. Day-grouping (by_day.last_mut()) is sound because the sort order keeps each UTC day contiguous.
  • Deterministic object keys: audit_backfill_{min_id}.ndjson — collision-free (min_id advances per page), reproducible on re-run.
  • Feature gates: Both audit_logs_s3_backfill and audit_logs_s3 carry #![cfg(feature = "parquet")]; mod declarations and route registrations gated accordingly.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Consistent with the re-anchor design.
  • EE workflow: [ee] prefix in PR title, ee-repo-ref.txt bumped to 6b191b7, companion PR pattern followed per docs/enterprise.md.
  • Legacy audit table scope: Explicitly documented in the module doc comment as the same scope limitation the steady-state exporter has always had — not a PR-introduced regression.

Test coverage

  • Backend (Rust): Two new OSS-side integration tests in audit_logs_s3_backfill.rsbackfill_exports_window_in_pages (multi-page keyset paging, cross-day page boundaries, out-of-window exclusion, idempotent re-run) and backfill_rejects_unstable_window (settled-cutoff rejection). The PR description also references EE companion tests (audit_export_end_to_end updated, new audit_export_backlog_drain_no_floor_loss regression test). These cover the steady-state re-anchor, carry-floor drain fix, and backfill behaviors.
  • Migration / trigger behavior: Described as verified directly against Postgres. Migrations follow the cargo sqlx migrate add -r convention with matching up/down.
  • Frontend: Single-line description-string change only; no component tests expected.
  • CI / sqlx: cargo check --features enterprise,private,parquet passes; sqlx offline cache regenerated (3 new entries + 1 modified with the monotonic guard).

Manual verification needed

No in-app UI surface to exercise — the only frontend change is a description string (instanceSettings.ts) that renders under an EE license on the instance settings page. The backend endpoints and migration behavior have been verified through the integration tests and direct Postgres testing described in the PR body.

Address review (Codex P1): the backfill reads only audit_partitioned, but
pre-partitioning history lives in the legacy `audit` table (still read by audit
list/get via UNION ALL, and retained for the configured period — 365 days by
default on EE). Since the setting text points operators at this API for
"pre-existing history", a window overlapping legacy rows would report completion
while silently omitting them.

Per the decision to not export the legacy table, reject instead of silently
omit: try_start now rejects a `from` earlier than the oldest audit_partitioned
timestamp (every legacy row predates the partition cutover, so a `from` at/after
that boundary can never overlap them). Reworded the setting text to scope the
backfill to the partitioned era. Added a regression test, plus an RAII guard
(cubic P2) so the test-only globals are restored even if an assertion panics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[P1] Backfill could report success while omitting legacy audit rows — fixed in 4214d3b.

The export and backfill intentionally cover only audit_partitioned (the pre-partitioning audit table is out of scope by design). The real problem you flagged is the silent part: a window reaching into legacy-only territory would complete while omitting those rows, even though the setting text invited operators to use it for "pre-existing history."

Rather than export the legacy table, try_start now rejects a window whose from is earlier than the oldest audit_partitioned timestamp — every legacy row predates the partition cutover, so a from at/after that boundary can never overlap them, and a request that would silently drop legacy rows is refused with a clear message instead. The setting description is reworded to scope the backfill to "back to when audit-log partitioning was introduced (older rows in the legacy audit table are not exported, and a window starting before that boundary is rejected)." Covered by a new backfill_rejects_pre_partition_window test.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: overlapping backfills overwrite prior exports, fallback cutoff accepts unstable windows

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:337 The backfill object key is only dt=<day>/audit_backfill_<min_id>.ndjson, and the worker overwrites that key with put. A later overlapping but narrower backfill can share the same first row/min_id as an earlier broader backfill while containing fewer rows for that page, replacing the earlier object and silently removing audit rows from object storage. The key needs to include enough of the requested/page range to be unique across different windows, or the write path needs to avoid overwriting an existing non-identical object.

  2. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:192 When PostgreSQL cannot provide a trustworthy oldest in-flight transaction boundary, the code falls back to now() - 7 days and then treats windows before that as “guaranteed settled.” That is not guaranteed if the DB role lacks stats visibility or if prepared transactions exist: an older transaction can still commit audit rows with timestamps inside the accepted window after the scan has completed, so the backfill reports success while missing rows. This path should reject until a trustworthy cutoff is available, or otherwise not claim the window is complete.

Test coverage

The PR adds backend sqlx::test coverage for the main export paging path, same-window idempotent reruns, and rejecting an obviously unstable future upper bound. The two blocking cases above are not covered: different overlapping windows that collide on the same object key, and the fallback cutoff path where no reliable transaction boundary is available.

Manual verification should include running a broad backfill followed by a narrower overlapping backfill with the same first row and confirming the object store still contains the full broad-window row set. Also verify the API behavior on a DB user without pg_read_all_stats or with a prepared transaction present, and confirm it does not accept a window as settled unless the boundary is actually trustworthy.

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

Summary

This PR fixes a production incident where the S3 audit export's age(xmin) scan with an epoch floor caused repeated full-table scans, statement timeouts, and DB saturation. The fix re-anchors the cursor on enable/re-enable, fixes a silent data-loss bug in backlog draining, and adds an opt-in historical backfill API. All six prior P1 findings from earlier review rounds (unstable-window validation, re-enable anchor clobbering, legacy epoch-fixup data loss, plus three rounds of the same) have been addressed across commits e0fcd0d5a, e8ba4473f, 3c6f556da, 2dddb7ced, 691394712, and 4214d3b. No blocking issues remain.

What was reviewed

All 12 changed files across ~918 added / 10 removed lines: audit_logs_s3_backfill.rs (610 lines with 3 integration tests), migration SQL (up 79 + down 19), updated audit_logs_s3.rs (9 lines), lib.rs route registration (35 lines), openapi.yaml (86 lines), instanceSettings.ts (2 lines), ee-repo-ref.txt (1 line), and 4 sqlx offline cache entries.

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from < to, to settled (oldest in-flight xact_start with stats-privilege gating / 7-day fallback), and from not before earliest partitioned row; claims atomically via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays background_task::get stale-heartbeat detection. ✅

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly validated. Module placed alongside audit_logs_s3 in windmill-api-settings (matches module doc comment purpose). Lease handling is panic-safe (catch_unwind → always release). Heartbeat refreshed per-object (not per-page), staying well under STALE_HEARTBEAT_SECS = 120. ✅

Migration (up.sql): Trigger uses ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin (monotonic, HA-safe). One-time fixup gates on last_ts <= 'epoch'::timestamptz and fully re-anchors (new snapshot xmin + recent floor). down.sql restores the prior trigger function (epoch sentinel + DO NOTHING). Naming follows cargo sqlx migrate add -r convention. ✅

Frontend: Single-line description-string update only; accurately describes the new re-anchor behavior and backfill endpoints. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. The single non-macro query (SELECT min(timestamp) FROM audit_partitioned) takes no parameters. Migration SQL is static.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within timestamp >= $1 AND timestamp < $2 advances strictly. Day-grouping via by_day.last_mut() is sound because the sort order keeps each UTC day contiguous within a page.
  • Deterministic object keys: audit_backfill_{min_id}.ndjson — collision-free (min_id advances per page), reproducible on re-run (test verifies idempotent re-run).
  • Lease safety: spawn_backfill wraps task in catch_unwind, always calls release. Stale heartbeat (120s) as backup. Heartbeat refreshes after every object PUT, so multi-day pages with slow uploads can't trigger stale-lease reclaim.
  • Monotonic checkpoint guards: EE ref 6b191b7 adds WHERE (value->>'last_xmin')::bigint <= $4 to the steady-state exporter's checkpoint write, matching the trigger's own WHERE last_xmin < EXCLUDED.last_xmin. Both paths are now monotonic — a disable/re-enable mid-export can't regress the cursor.
  • Legacy audit table scope: try_start rejects windows starting before the earliest audit_partitioned row with a clear error message (4214d3b). The module doc comment explicitly documents that only audit_partitioned is exported, consistent with the steady-state exporter's pre-existing scope.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Consistent with the re-anchor design.
  • Feature gates: Module (audit_logs_s3_backfill), route handlers, and audit_logs_s3 all carry #[cfg(feature = "parquet")]. Mod declarations and route registrations gated accordingly.
  • EE workflow: [ee] prefix in PR title, ee-repo-ref.txt bumped to 6b191b7, companion PR pattern followed per docs/enterprise.md.
  • No P0/P1/P2 issues found. Each of the three prior rounds of findings has been conclusively resolved with both code fixes and regression tests.

Test coverage

  • Backend (Rust): Three integration tests (#[sqlx::test]) in the OSS audit_logs_s3_backfill module cover:

    • backfill_exports_window_in_pages — multi-page keyset paging with forced 2-row pages, cross-day page boundaries, out-of-window exclusion, idempotent re-run, progress row count.
    • backfill_rejects_unstable_window — future to rejection and settled-past acceptance via try_start.
    • backfill_rejects_pre_partition_window — pre-partition from rejection and within-partition-era acceptance.

    The EE companion repo covers the steady-state re-anchor contract (audit_export_end_to_end updated) and the carry-floor drain regression (audit_export_backlog_drain_no_floor_loss with test-only MAX_XID_INTERVAL override). Migration trigger behavior and backfill page query plan were verified directly against Postgres per the PR description.

  • Frontend: Single-line description-string update only; no component tests expected.

  • sqlx: Offline cache regenerated (3 new + 1 modified query json).

Manual verification needed

  • None required. This is a purely backend/internal change with no in-app UI surface to exercise (the frontend change is a description string that only renders under an EE license, and the backfill API is super-admin + Enterprise gated). The three OSS integration tests exercise the keyset pagination, validation, and rejection paths directly against an in-memory object store and real Postgres partitions.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: empty-partition backfill can silently skip legacy audit rows

Found 3 issues:

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:215 - The pre-partition guard no-ops when audit_partitioned is empty. The partitioning migration created audit_partitioned alongside the existing audit table, with old audit rows left to age out via retention, so an upgraded instance can have legacy audit rows but no partitioned rows yet. In that state partition_floor is None, try_start accepts the window, run_backfill scans only audit_partitioned, and the job can complete with zero rows while silently omitting legacy audit rows. Reject overlapping legacy rows directly, or treat an empty partitioned table as a boundary error when legacy rows exist.

  2. [P2] backend/windmill-api/openapi.yaml:1815 - The new routes were added to openapi.yaml, but backend/windmill-api/openapi-deref.yaml and openapi-deref.json were not regenerated. The backend serves those checked-in dereferenced files via include_str!, so /openapi.yaml and /openapi.json will not expose the new backfill endpoints to clients consuming the served spec.

  3. [P2] backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:5 - The migration comment block describes the previous implementation and PR history. Root AGENTS.md says: “Comments record constraints, not narration. Write a comment only for what the code can't show” and “Describe the code as it is, never its drafting history”. Please trim this to durable constraints, such as why the recent floor and monotonic cursor update are required.

Test coverage

Backend coverage now includes paging/idempotence, unstable-window rejection, and pre-partition rejection when a partitioned row exists. It is missing the empty audit_partitioned plus legacy audit rows case above.

No frontend automated tests are expected for the settings description string. Manual verification should include running a settled historical backfill with partitioned rows and confirming status plus object keys, then trying a legacy-only/pre-partition window and confirming it is rejected rather than completing with zero rows. Also verify the served /openapi.yaml and /openapi.json include the two new backfill operations after regenerating the dereferenced specs.

…led cutoff

Address review (two P1s):

- Object-key overwrite loss: keys were `dt=<day>/audit_backfill_<min_id>.ndjson`.
  A narrower, overlapping backfill can start a day's page at the same first row
  (same min_id) but hold fewer rows, and `put` would overwrite a broader run's
  object — silently dropping the rows only that object held. Include the
  requested window in the key so different ranges write disjoint objects (same
  window re-runs stay idempotent; consumers dedupe overlapping rows by id). New
  regression test (verified red→green).

- Untrustworthy settled cutoff: when min(xact_start) isn't trustworthy (role
  lacks pg_read_all_stats/superuser, or a prepared 2PC txn exists), the old
  now()-7d fallback could still let an old transaction commit rows inside an
  accepted window after the scan, so a "complete" backfill silently missed them.
  Since a backfill asserts completeness, reject in those cases instead of
  falling back. (The continuous exporter keeps its 7-day fallback — it only
  claims bounded lag.)

Also makes the tests robust under the parallel runner: run_backfill takes the
store as a param, so tests pass a local in-memory store (no global
OBJECT_STORE_SETTINGS race) and serialize on the PAGE_ROWS override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Both P1s from the latest Codex review — fixed in e399fbd.

Object-key overwrite loss (audit_logs_s3_backfill.rs key). Correct and serious. Keys were dt=<day>/audit_backfill_<min_id>.ndjson; a narrower, overlapping backfill could start a day's page at the same first row (same min_id) but hold fewer rows, and put would overwrite a broader run's object — dropping the rows only that object held. The object key now includes the requested window (audit_backfill_<window>_<min_id>.ndjson), so different ranges write disjoint keys and can't clobber each other; the same window re-run stays idempotent, and overlap is deduped by id as before. New backfill_window_in_key_prevents_overwrite test, verified red→green (without the window in the key it fails with row 3 lost … [1, 2]).

Untrustworthy settled cutoff. Also correct. When min(xact_start) isn't trustworthy — role without pg_read_all_stats/superuser (sees only its own sessions), or a prepared 2PC txn (invisible to pg_stat_activity) — the old now()-7d fallback couldn't actually guarantee settledness: an old/invisible transaction could still commit rows inside an accepted window after the scan. Since a backfill asserts completeness, it now rejects in those cases with a clear message (grant the privilege / clear prepared txns) rather than claiming a complete window. The continuous exporter keeps its 7-day fallback — it only claims bounded lag, not completeness.

(Tests were also made parallel-safe: run_backfill takes the store as a param, so tests use a local in-memory store instead of the global OBJECT_STORE_SETTINGS, and serialize on the PAGE_ROWS override.)

@github-actions

Copy link
Copy Markdown
Contributor

Now I have all the context. Let me compile the final review.

Pi Review

Good to merge.

Summary

The latest Codex finding (overlapping backfills overwriting prior exports via identical object keys) is stale — the object key on line 384 is audit_backfill_{window_key}_{min_id}.ndjson where window_key = "{from_ms}_{to_ms}" (line 338). The window is embedded in the key, making different backfill ranges write to disjoint keys. The dedicated test backfill_window_in_key_prevents_overwrite explicitly proves a narrower overlapping backfill does not overwrite a broader one's objects. The doc comment on line 18 states this explicitly: "The key includes the requested window so two different backfill ranges never overwrite each other."

All four prior P1 findings from earlier review rounds have been addressed in commits e0fcd0d5a through 4214d3b:

  • Unstable-window validation (e8ba447): try_start queries oldest in-flight xact_start with trustworthy stats gating and rejects to > settled_cutoff
  • Re-enable anchor clobbering (e8ba447): EE checkpoint write (sqlx cache 881d996a) now has WHERE (value->>'last_xmin')::bigint <= $4, making it monotonic ✅
  • Legacy epoch-fixup data loss (3c6f556): the one-time UPDATE fully re-anchors (new xmin + recent floor) instead of stamping a synthetic floor over the old cursor ✅
  • Pre-partition window silently omitting legacy rows (4214d3b): try_start rejects from < min(audit_partitioned.timestamp) with a descriptive error; setting description scopes the backfill accordingly ✅

What was reviewed

All 12 changed files (~695 added lines in the backfill module plus tests, ~79 migration up, ~19 migration down, ~86 OpenAPI spec, ~9 audit_logs_s3.rs update, ~35 route registration, ~2 frontend description string, EE ref bump, 4 sqlx offline cache entries).

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from >= to, to settled, from >= earliest_partitioned, atomic lease via background_task::try_claim
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays stale-heartbeat from background_task::get

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All input-validated, placed alongside audit_logs_s3 in windmill-api-settings. Lease-handling panic-safe (catch_unwind → always release). Heartbeat refreshed per-object. ✅

Migration: up.sql uses monotonic ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin. Legacy epoch-sentinel fixup gated on last_ts <= 'epoch' and fully re-anchors. down.sql restores the prior trigger. Naming follows cargo sqlx migrate add -r convention. ✅

Frontend: Single-line description-string update only; consistent with the new backend contract. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. Migration SQL is static.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within timestamp >= $1 AND timestamp < $2 advances strictly. Day-grouping is sound because sort order keeps each UTC day contiguous.
  • Deterministic & window-scoped object keys: audit_backfill_{from_ms}_{to_ms}_{min_id}.ndjson — collision-free across different windows, reproducible on re-run. Verified by backfill_window_in_key_prevents_overwrite.
  • Monotonic checkpoint guards on both paths: trigger (up.sql:54: WHERE last_xmin < EXCLUDED.last_xmin) and EE exporter (sqlx cache 881d996a: WHERE (value->>'last_xmin')::bigint <= $4). Both monotonic, no anchor overwrite race.
  • Settled-window validation: Uses same trustworthy stats gating as the exporter floor (restricted role / prepared-2PC ⇒ reject rather than fall back). Error message tells operator the max allowed to.
  • Legacy audit table scope: Explicitly documented in module doc comment and validated at try_start; consistent with the steady-state exporter's pre-existing scope.
  • Feature gates: Both audit_logs_s3 and audit_logs_s3_backfill carry #![cfg(feature = "parquet")]; mod declarations and route registrations gated accordingly.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Consistent with the re-anchor design.
  • EE workflow: [ee] prefix, ee-repo-ref.txt bumped to 6b191b7, companion PR referenced per docs/enterprise.md.

Test coverage

  • Backend (Rust): The audit_logs_s3_backfill module has four OSS-side integration tests (backfill_exports_window_in_pages, backfill_window_in_key_prevents_overwrite, backfill_rejects_unstable_window, backfill_rejects_pre_partition_window) covering keyset pagination, multi-page day splitting, idempotent re-runs, overlapping-window key isolation, unstable-window rejection, and pre-partition-boundary rejection. The PR description additionally references EE-side tests (audit_export_end_to_end updated, audit_export_backlog_drain_no_floor_loss regression test). Trigger/migration behavior and backfill page query plan were verified directly against Postgres per the PR body.
  • Migration: sqlx offline cache regenerated (3 new + 1 modified query json).
  • Frontend: Single-line description-string update only; no component tests expected.

Manual verification needed

No in-app surface to exercise — the OSS-side changes are a new API (backend only, no UI) and a settings description string update. The description change only renders under an EE license and is informational. The PR author notes no screenshot is applicable. All EE-side behavior (steady-state export, trigger, re-enable races) should be verified in the EE companion PR windmill-labs/windmill-ee-private#634.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs">

<violation number="1" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:185">
P1: Backfill settled-time gating uses cluster-wide PostgreSQL system views (`pg_stat_activity`, `pg_prepared_xacts`) without scoping to the current database. A long-running or prepared transaction in an unrelated database can force `settled_cutoff` far into the past or cause a hard NULL rejection—even though such a transaction can never commit rows into this database's `audit_partitioned` table. This over-restricts or entirely blocks valid backfill windows.</violation>

<violation number="2" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:338">
P2: Window key truncates timestamp precision to milliseconds, weakening the documented guarantee that different backfill ranges never overwrite each other.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

// be at or before the oldest in-flight `xact_start`: everything strictly older than
// the oldest running transaction is already committed and stable.
//
// That bound is only sound when we can see every xmin-holding transaction. A role

@cubic-dev-ai cubic-dev-ai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Backfill settled-time gating uses cluster-wide PostgreSQL system views (pg_stat_activity, pg_prepared_xacts) without scoping to the current database. A long-running or prepared transaction in an unrelated database can force settled_cutoff far into the past or cause a hard NULL rejection—even though such a transaction can never commit rows into this database's audit_partitioned table. This over-restricts or entirely blocks valid backfill windows.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/windmill-api-settings/src/audit_logs_s3_backfill.rs, line 185:

<comment>Backfill settled-time gating uses cluster-wide PostgreSQL system views (`pg_stat_activity`, `pg_prepared_xacts`) without scoping to the current database. A long-running or prepared transaction in an unrelated database can force `settled_cutoff` far into the past or cause a hard NULL rejection—even though such a transaction can never commit rows into this database's `audit_partitioned` table. This over-restricts or entirely blocks valid backfill windows.</comment>

<file context>
@@ -173,15 +175,20 @@ pub async fn try_start(db: &DB, from: DateTime<Utc>, to: DateTime<Utc>) -> error
+    // be at or before the oldest in-flight `xact_start`: everything strictly older than
+    // the oldest running transaction is already committed and stable.
+    //
+    // That bound is only sound when we can see every xmin-holding transaction. A role
+    // without pg_read_all_stats/superuser sees only its own sessions, and a prepared
+    // (2PC) transaction is invisible to pg_stat_activity — in either case an old
</file context>
Fix with cubic

// broader run's object — silently dropping the rows only that object held. Including
// the window makes different ranges write disjoint keys (same window re-runs stay
// idempotent); consumers already dedupe overlapping rows by `id`.
let window_key = format!("{}_{}", from.timestamp_millis(), to.timestamp_millis());

@cubic-dev-ai cubic-dev-ai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Window key truncates timestamp precision to milliseconds, weakening the documented guarantee that different backfill ranges never overwrite each other.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/windmill-api-settings/src/audit_logs_s3_backfill.rs, line 338:

<comment>Window key truncates timestamp precision to milliseconds, weakening the documented guarantee that different backfill ranges never overwrite each other.</comment>

<file context>
@@ -313,6 +329,13 @@ async fn run_backfill(
+    // broader run's object — silently dropping the rows only that object held. Including
+    // the window makes different ranges write disjoint keys (same window re-runs stay
+    // idempotent); consumers already dedupe overlapping rows by `id`.
+    let window_key = format!("{}_{}", from.timestamp_millis(), to.timestamp_millis());
 
     loop {
</file context>
Fix with cubic

…napi; trim migration comment

Address review (1 P1 + 2 P2):

- Empty-partition backfill (P1): the min(audit_partitioned) guard no-ops when
  audit_partitioned is empty, so an upgraded instance with legacy `audit` rows
  but no partitioned rows yet would accept a window and complete with zero rows,
  silently omitting the legacy rows. Check the legacy `audit` table directly:
  reject any window that overlaps a legacy row (subsumes the boundary check and
  covers the empty-partitioned case). Test updated accordingly.

- openapi-deref (P2): regenerate openapi-deref.yaml/json (served via include_str!)
  so /openapi.{yaml,json} expose the new backfill endpoints.

- Migration comment (P2): trim the PR-history narration to the durable
  constraints (why a recent floor and a monotonic cursor are required), per
  AGENTS.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

All three from the latest Codex review — fixed in ef98048.

[P1] Empty-partition backfill could silently skip legacy rows. Correct — my min(audit_partitioned) guard no-ops when audit_partitioned is empty (upgraded instance with legacy rows but no partitioned rows yet), so the window was accepted and completed with zero rows. Now try_start checks the legacy audit table directly: EXISTS (SELECT 1 FROM audit WHERE timestamp >= from AND timestamp < to) → reject. This subsumes the boundary check and covers the empty-partitioned case. The regression test now inserts a legacy audit row (no partitioned rows) and asserts rejection.

[P2] openapi-deref.{yaml,json} not regenerated. Fixed — ran build_openapi.sh; both dereferenced specs (served via include_str!) now include runAuditLogsS3Backfill / getAuditLogsS3BackfillStatus.

[P2] Migration comment narrated PR history. Trimmed both comment blocks to the durable constraints (why the floor must be recent — timestamp >= floor is the only partition-pruning bound; why the cursor write is monotonic), dropping the "previous version did X" narration per AGENTS.md.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: empty-partition backfill can silently skip legacy audit rows

Found 3 issues:

  1. [P1] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:231 - The pre-partition guard no-ops when audit_partitioned is empty. The partitioning migration left old audit rows in place to age out via retention while new reads still union audit_partitioned UNION ALL audit, so an upgraded but idle instance can have legacy audit rows and no partitioned rows yet. In that state partition_floor is None, try_start accepts the window, run_backfill scans only audit_partitioned, and the job can complete with zero rows while silently omitting legacy audit rows. Reject overlapping legacy rows directly, or treat an empty partitioned table as an unsupported boundary when legacy rows exist.

  2. [P2] backend/windmill-api/openapi.yaml:1815 - The new routes were added only to openapi.yaml; backend/windmill-api/openapi-deref.yaml and openapi-deref.json still do not contain audit_logs_s3_backfill. The backend serves the dereferenced files via include_str!, so /openapi.yaml and /openapi.json will not expose the new endpoints to clients consuming the served spec.

  3. [P2] backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:5 - The migration comment block narrates the previous implementation and change history. Root AGENTS.md says: “Comments record constraints, not narration. Write a comment only for what the code can't show” and “Describe the code as it is, never its drafting history”. Please trim this to durable constraints, such as why the recent floor and monotonic cursor update are required.

Test coverage

Backend coverage now includes paging/idempotence, unstable-window rejection, per-window object-key isolation, and pre-partition rejection when a partitioned row exists. It is still missing the empty audit_partitioned plus legacy audit rows case above.

No frontend automated tests are expected for the settings description string. Manual verification should include trying a legacy-only/pre-partition window on an upgraded database and confirming it is rejected rather than completing with zero rows. Also verify the served /openapi.yaml and /openapi.json include the two new backfill operations after regenerating the dereferenced specs.

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

Summary

This PR fixes a production incident where the S3 audit export's age(xmin) scan with an epoch floor caused repeated full-table scans, statement timeouts, and DB saturation. The fix re-anchors the cursor on enable/re-enable, fixes a silent data-loss bug in backlog draining, and adds an opt-in historical backfill API. All prior P1 findings from six previous review rounds (unstable-window validation, re-enable anchor clobbering, legacy epoch-fixup data loss, legacy audit table silent omission, overlapping backfill key collisions, fallback cutoff accepting unstable windows) have been addressed across 8 commits. No blocking issues remain.

What was reviewed

All 14 changed files across ~1375 added / 12 removed lines: audit_logs_s3_backfill.rs (709 lines with 5 integration tests), migration SQL (up 70 + down 19), updated audit_logs_s3.rs (9 lines), lib.rs route registration (35 lines), openapi.yaml (86 lines), openapi-deref.{json,yaml} (regenerated, 369 lines), instanceSettings.ts (2-line description update), ee-repo-ref.txt (1 line), and 4 sqlx offline cache entries (3 new, 1 modified).

Public-surface checklist

New API endpoints:

  • POST /settings/audit_logs_s3_backfillrequire_super_admin + LicensePlan::Enterprise runtime check; validates from >= to rejected, to settled (oldest in-flight xact_start gated on trustworthy stats — rejects rather than falls back to 7-day), from not overlapping legacy audit table; claims atomically via background_task::try_claim. ✅
  • GET /settings/audit_logs_s3_backfill_statusrequire_super_admin; read-only, overlays background_task::get stale-heartbeat detection. ✅

New public functions (audit_logs_s3_backfill.rs): try_start, get_status, spawn_backfill, AuditBackfillProgress, BackfillRequest. All properly validated. Module placed alongside audit_logs_s3 in windmill-api-settings (matches module doc comment purpose). Lease handling is panic-safe (catch_unwind → always release). Heartbeat refreshed per-object (not per-page), staying well under STALE_HEARTBEAT_SECS = 120. Object keys include the requested window (audit_backfill_{from_ms}_{to_ms}_{min_id}.ndjson) so different ranges write disjoint keys; same-range re-runs are idempotent overwrites. ✅

Migration (up.sql): Trigger uses ON CONFLICT DO UPDATE WHERE last_xmin < EXCLUDED.last_xmin (monotonic, HA-safe). One-time fixup gates on last_ts <= 'epoch'::timestamptz and fully re-anchors (new snapshot xmin + recent floor). down.sql restores the prior trigger function (epoch sentinel + DO NOTHING). Naming follows cargo sqlx migrate add -r convention. ✅

Frontend: Single-line description-string update; accurately describes the new re-anchor behavior, backfill endpoints, and scope. ✅

No issues found

  • SQL injection: All queries parameterized via sqlx::query! / sqlx::query_scalar!. The single non-macro query (SELECT EXISTS (SELECT 1 FROM audit WHERE ...)) takes bound parameters, no string interpolation. Migration SQL is static.
  • Keyset pagination correctness: (timestamp, id) > ($3, $4) with cursor_id = -1 start includes rows at exactly from. ORDER BY timestamp, id within timestamp >= $1 AND timestamp < $2 advances strictly. Day-grouping via by_day.last_mut() is sound — the sort order keeps each UTC day contiguous.
  • Deterministic object keys: audit_backfill_{window_ms}_{min_id}.ndjson — collision-free across different windows (window in key). Idempotent for same-window re-runs. Test backfill_window_in_key_prevents_overwrite verifies broader run's rows survive a narrower overlapping backfill.
  • Settled-cutoff trustworthiness: Backfill rejects rather than falling back to a 7-day heuristic when stats visibility is unavailable (pg_read_all_stats/superuser missing or prepared 2PC present). Comment explicitly notes the continuous exporter keeps the fallback because it claims bounded lag, not completeness. Test backfill_rejects_unstable_window covers rejection.
  • Legacy audit table rejection: try_start checks EXISTS (SELECT 1 FROM audit WHERE timestamp >= $1 AND timestamp < $2) and rejects overlaps. This correctly handles the empty-audit_partitioned edge case (a min(audit_partitioned) guard would no-op there). Test backfill_rejects_window_overlapping_legacy covers both rejection and acceptance paths.
  • Feature gates: Both audit_logs_s3_backfill and audit_logs_s3 carry #![cfg(feature = "parquet")]; mod declarations and route registrations gated accordingly.
  • drainingbootstrapping repurposing: audit_logs_s3.rs now reads a draining key from the EE-written JSONB instead of comparing last_ts to epoch. Defaults to false when absent (before first EE run). Consistent with the re-anchor design.
  • EE workflow: [ee] prefix in PR title, ee-repo-ref.txt bumped to 6b191b7, companion PR pattern followed per docs/enterprise.md.
  • AssertUnwindSafe: spawn_backfill wraps the task in catch_unwind, always calls release. Stale heartbeat (120s) provides a second safety net. Note that Session holds only DB (connection pool), String, and RwLock<AuditBackfillProgress> — all unwind-safe — so AssertUnwindSafe is correct.

Test coverage

  • Backend (Rust): 5 sqlx::test integration tests in audit_logs_s3_backfill.rs:
    • backfill_exports_window_in_pages — keyset paging (forced 2 rows/page), cross-day page boundaries, out-of-window exclusion, progress row count, idempotent re-run
    • backfill_window_in_key_prevents_overwrite — broader run's rows survive narrower overlapping backfill (window in key)
    • backfill_rejects_unstable_window — future to rejected, settled past window accepted
    • backfill_rejects_window_overlapping_legacy — legacy row triggers rejection, clean window accepted
    • Plus EE companion repo tests (audit_export_end_to_end updated, audit_export_backlog_drain_no_floor_loss regression test)
  • Migration: sqlx offline cache regenerated (3 new + 1 modified query cache entries). Trigger behavior and fixup verified against Postgres per the PR description.
  • Frontend: Single-line description-string update only; no component tests expected.
  • CI / OpenAPI: cargo check --features enterprise,private,parquet passes; OpenAPI spec regenerated consistently.

Manual verification needed

No in-app surface to exercise beyond confirming the instance-settings description string renders correctly under an EE license on the superadmin settings page at /#superadmin-settings. The backend endpoints (POST/GET) can be tested against a real DB with audit_partitioned rows and configured object storage to confirm a settled-window backfill completes with correct object keys, row counts, and status progression. The migration should be tested on a staging instance with a legacy epoch-sentinel checkpoint to verify the one-time fixup re-anchors cleanly.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Mergeable, but should ideally address nits: sub-millisecond backfill windows can share object keys

Found 1 issue:

  1. [P2] backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:338 - The backfill object namespace uses from.timestamp_millis() / to.timestamp_millis(), but Postgres timestamptz and the API request can distinguish microseconds. Two different windows whose bounds differ only within the same millisecond therefore share the same window_key; if they also start at the same first row, the later run overwrites the earlier object even though the row set can differ. That reintroduces the overwrite class this key is meant to prevent. Use at least microsecond precision, or encode the RFC3339 bounds losslessly.

Test coverage

Backend coverage is otherwise solid for the changed surfaces: the new backfill module has SQLx tests for paging, idempotent reruns, overlapping-window keying, unstable-window rejection, and legacy-table overlap rejection; the EE exporter tests cover the re-anchor and drain-floor behavior described in the PR. If the key precision issue is fixed, add a regression with two overlapping windows that differ only by microseconds and verify both object sets remain present.

No frontend automated tests are expected for the single settings-description string. Manual verification should still include starting a settled historical backfill through the new endpoint, checking status, object keys, and row counts, and confirming a legacy-overlap request is rejected.

This commit updates the EE repository reference after PR #634 was merged in windmill-ee-private.

Previous ee-repo-ref: 6b191b77aabcf77658ad4f9031576e0d7b66bf89

New ee-repo-ref: b821fecccbcba2efed544890576bf2b84321d70d

Automated by sync-ee-ref workflow.
@windmill-internal-app

Copy link
Copy Markdown
Contributor

🤖 Updated ee-repo-ref.txt to b821fecccbcba2efed544890576bf2b84321d70d after windmill-ee-private PR #634 was merged.

@windmill-internal-app windmill-internal-app Bot changed the title [ee] perf(audit): re-anchor S3 audit export on enable + opt-in backfill perf(audit): re-anchor S3 audit export on enable + opt-in backfill Jun 26, 2026
@rubenfiszel rubenfiszel merged commit 577ceee into main Jun 26, 2026
16 of 19 checks passed
@rubenfiszel rubenfiszel deleted the audit-export-woes branch June 26, 2026 19:37
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 26, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant