perf(audit): re-anchor S3 audit export on enable + opt-in backfill #9818
The S3/GCS audit-log export's steady-state query filters by `age(xmin)`
(unindexable), so the only scan bound is the timestamp floor. On a fresh
enable the floor was epoch, and on a re-enable the cursor resumed from its
pre-disable position — either way the first run scanned the whole
`audit_partitioned` table. Under a `statement_timeout` (e.g. Aiven) that scan
never completes: the cursor never advances, nothing is exported, and the
repeated full scans saturate the database.
Re-anchor on enable (EE companion, windmill-ee-private#634):
- New trigger migration records a recent timestamp floor instead of the epoch
sentinel and `DO UPDATE`s the cursor to the current snapshot xmin on
re-enable, so the export always resumes from ~now and never rescans history.
Includes a one-time fixup for legacy epoch-sentinel checkpoints on upgrade.
Opt-in historical backfill (new `audit_logs_s3_backfill` module + endpoints):
- Exports a chosen `[from, to)` window on demand, scanning strictly by
`timestamp` (the partition key) in bounded keyset pages — each query is an
index scan capped at one page (verified via EXPLAIN: later partitions
`never executed`, ~11ms/page), so it stays well under any statement timeout
regardless of window size. Writes alongside the steady-state objects under
logs/audit/, without touching the xmin cursor.
- POST /settings/audit_logs_s3_backfill {from,to} (super-admin + Enterprise),
GET /settings/audit_logs_s3_backfill_status.
Also repurposes the status endpoint's `bootstrapping` flag to mean "draining a
backlog" (the cursor is capped and catching up), and updates the setting
description to point operators at the backfill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
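A minimal in-memory sketch of the bounded keyset paging this commit message describes, assuming rows are already ordered by `(timestamp, id)`. `Row`, `fetch_page`, and `backfill` are hypothetical names, and the SQL quoted in the comment is an assumed shape rather than the PR's actual query.

```rust
/// One audit row, reduced to the two columns the keyset cursor needs.
/// Field order matters: the derived ordering compares `ts` first, then `id`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Row {
    pub ts: i64, // `timestamp`, as an integer (assumed representation)
    pub id: i64,
}

/// Return at most `limit` rows with `from <= ts < to` that sort strictly after
/// `cursor`. This mirrors, in memory, a query of the assumed shape
/// `WHERE timestamp >= $from AND timestamp < $to AND (timestamp, id) > ($ts, $id)
///  ORDER BY timestamp, id LIMIT $limit`.
/// `sorted` must already be ordered by `(ts, id)`.
pub fn fetch_page(
    sorted: &[Row],
    from: i64,
    to: i64,
    cursor: Option<Row>,
    limit: usize,
) -> Vec<Row> {
    sorted
        .iter()
        .copied()
        .filter(|r| r.ts >= from && r.ts < to)
        .filter(|r| cursor.map_or(true, |c| *r > c))
        .take(limit)
        .collect()
}

/// Drain the window one bounded page at a time, advancing the cursor to the
/// last row of each page. Stops when a page comes back empty.
pub fn backfill(sorted: &[Row], from: i64, to: i64, limit: usize) -> Vec<Vec<Row>> {
    let mut pages = Vec::new();
    let mut cursor = None;
    loop {
        let page = fetch_page(sorted, from, to, cursor, limit);
        match page.last() {
            None => break,
            Some(last) => cursor = Some(*last),
        }
        pages.push(page);
    }
    pages
}
```

Because the cursor is the full `(ts, id)` pair, rows sharing a timestamp across a page boundary are neither skipped nor repeated, and each call does a bounded amount of work regardless of window size.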
Claude finished @rubenfiszel's task in 3m 16s
Good to merge. Reviewed the OSS-side diff (the EE
What I verified
Backfill module (
Migration (
Status flag repurposing (
Minor (non-blocking)
Test coverage
Now I have all the context needed. Let me compile the review.
Pi Review
Good to merge.
Summary
This PR fixes a production incident where S3 audit export's steady-state query, scanning by unindexable
What was reviewed
All 10 changed files across ~592 added / 7 removed lines:
Public-surface checklist
New public functions (
New API endpoints:
Auth/authorization: Both endpoints gate on super-admin. The POST additionally gates on
Input validation:
Module placement:
No issues found
Test coverage
Manual verification needed
Codex Review
cc @rubenfiszel
Test coverage
Backend coverage is needed for the two edge cases above: a backfill request against a non-stable window, and a re-enable while an export run is in progress where the final checkpoint must not move backwards. The PR notes cover the steady-state re-anchor and backlog-drain floor regression, but those do not exercise these races. No frontend automated tests are expected for the one settings-description string.
Manual verification should include starting a safe historical backfill and confirming status plus object keys/row counts, then disabling/re-enabling audit export during an active export and confirming the stored cursor remains anchored at the re-enable snapshot.
2 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql">
<violation number="1" location="backend/migrations/20260626132251_audit_logs_s3_reanchor_on_enable.up.sql:36">
P2: `min(xact_start)` is computed without restricting to sessions that actually hold an xid. A long read-only transaction can make the floor very old and reintroduce large partition scans/timeouts on enable.</violation>
</file>
Address review (cubic): persist progress (refreshing the lease heartbeat) after every object PUT in the backfill page loop, not only once per page, so the gap between heartbeats stays well under STALE_HEARTBEAT_SECS even on slow uploads and another replica can't re-claim mid-page and run a concurrent backfill. Bumps ee-repo-ref.txt to pull in the EE test-race fix (folding the backlog-drain regression into the single audit e2e test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pi Review
Good to merge.
Summary
This PR fixes a production incident where the S3 audit export's
Public-surface checklist
New API endpoints:
New public functions (
Migration:
Two P2 nitpicks (non-blocking)
P2: Backfill accepts unstable windows with no guard (
Codex Review
cc @rubenfiszel
Test coverage
Backend coverage should include the two edge cases above: rejecting or safely handling a backfill whose upper bound is not visibility-stable, and a disable/re-enable while an export run is in progress where the final checkpoint must not move backwards. The PR notes cover the steady-state re-anchor and backlog-drain floor regression, but not these races. No frontend automated tests are expected for the one settings-description string.
Manual verification should include starting a safe historical backfill and confirming status plus object keys/row counts, then disabling/re-enabling audit export during an active export and confirming the stored cursor remains anchored at the re-enable snapshot.
Address review (P1): the backfill keyset-pages over rows visible at scan time and declares completion when the scan runs dry, but a row's `timestamp` is its inserting transaction's `xact_start`. A window whose upper bound is recent or in the future could silently omit a transaction that started inside `[from, to)` but commits after the scan passed that timestamp. `try_start` now rejects any `to` newer than the oldest in-flight `xact_start` (everything strictly older than the oldest running transaction is committed and stable), using the same trustworthy stats gating as the exporter's floor (restricted role / 2PC → a 7-day-old cutoff). Bumps ee-repo-ref.txt for the EE monotonic-checkpoint fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
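A small sketch of the stability check this commit message describes, under stated assumptions: timestamps are plain integers in seconds, `None` stands for "in-flight stats are not trustworthy" (restricted role or 2PC), and the empty-window rejection is added for illustration only. `settled_cutoff`, `validate_window`, and `WindowError` are hypothetical names, not the PR's `try_start` code.

```rust
/// Seven days in seconds: the fallback cutoff age mentioned in the commit message.
pub const FALLBACK_SECS: i64 = 7 * 24 * 60 * 60;

/// Upper bound below which rows are treated as committed and stable.
/// `trusted_oldest_inflight` is `Some(xact_start)` only when the in-flight
/// transaction stats can be trusted; `None` falls back to a 7-day-old cutoff.
pub fn settled_cutoff(now: i64, trusted_oldest_inflight: Option<i64>) -> i64 {
    trusted_oldest_inflight.unwrap_or(now - FALLBACK_SECS)
}

#[derive(Debug, PartialEq, Eq)]
pub enum WindowError {
    /// `from >= to` (an assumption of this sketch, not stated in the thread).
    Empty,
    /// `to` is newer than the settled cutoff, so late commits could be missed.
    Unstable { cutoff: i64 },
}

/// Accept a `[from, to)` window only if its upper bound is at or before the cutoff.
pub fn validate_window(from: i64, to: i64, cutoff: i64) -> Result<(), WindowError> {
    if from >= to {
        return Err(WindowError::Empty);
    }
    if to > cutoff {
        return Err(WindowError::Unstable { cutoff });
    }
    Ok(())
}
```

A later commit in this thread replaces the 7-day fallback with an outright rejection for the backfill; in this sketch that would correspond to returning an error instead of `now - FALLBACK_SECS` when the input is `None`.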
Addressed both P1s.
[P1] Backfill accepted unstable windows: fixed in e8ba447.
[P1] Re-enable anchor clobbered by an in-flight exporter: fixed in EE 6b191b7 (ref bumped in e8ba447). The export's advisory lock excludes other exporters but not the enable trigger, so a disable/re-enable mid-upload could re-anchor the checkpoint and then be overwritten by the exporter's stale
1 issue found across 4 files (changes from recent commits).
<file name="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs">
<violation number="1" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:165">
P2: No-active-transaction case is treated as unsafe fallback, incorrectly forcing a 7-day cutoff. This can reject valid backfill requests even when everything is already settled.</violation>
</file>
Pi Review
Good to merge. Both P1 findings from the prior review round have been addressed in commit e8ba447:
The per-object heartbeat fix from e0fcd0d is also present
What was reviewed
All 12 changed files across ~646 added / ~10 removed lines: the new
Public-surface checklist
New API endpoints:
New public functions (
Migration:
No issues found
Test coverage
Manual verification needed
No in-app UI to exercise (the frontend change is a settings description string that only renders under an EE license). The backfill endpoints and re-anchor behavior should be validated against a real Postgres instance with the EE companion PR's exporter: start a historical backfill via
Codex Review
cc @rubenfiszel
Should address issues before merging: legacy epoch fixup can mark old post-enable audit rows processed without exporting them
Found 1 issue:
Test coverage
Backend coverage should include a legacy checkpoint regression:
No frontend automated tests are expected for the single settings-description string.
Manual verification should include upgrading an instance with a stale epoch audit-export checkpoint, running the exporter, and confirming the stored cursor/status plus object-store contents make the old post-enable gap explicit rather than silently skipping it.
…loor Address review (P1): the legacy-checkpoint fixup stamped last_oldest_inflight_ts to now()-7d while leaving the old last_xmin in place. On an instance that enabled export on the old code >7 days ago and got stuck before the first successful batch, the next run would filter post-enable rows older than 7 days out via `timestamp >= ts_floor` while still advancing last_xmin over the interval — silently dropping them (the same floor-vs-cursor loss class fixed elsewhere in this PR), and contradicting the "nothing committed after enabling is skipped" guarantee. A stuck epoch-sentinel checkpoint cannot be safely resumed (its backlog can be arbitrarily old, so any recent floor prunes rows the cursor then skips, and an epoch floor reintroduces the full scan). Re-anchor it to the migration's current snapshot xmin instead — exactly like a fresh enable — so the export resumes cleanly from ~now and the never-exported pre-upgrade window is recovered via the opt-in backfill rather than silently dropped. Reword the setting description so it no longer implies the disabled/legacy window is covered by the cursor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[P1] Legacy epoch-checkpoint fixup could silently drop post-enable rows: fixed in 3c6f556.
You're right: stamping
A stuck epoch-sentinel checkpoint can't be safely resumed (its backlog is arbitrarily old → any recent floor prunes rows the cursor skips, and an epoch floor reintroduces the full scan). The fixup now re-anchors such a checkpoint to the migration's current snapshot xmin (with a recent floor), exactly like a fresh enable: the export resumes cleanly from ~now, and the never-exported pre-upgrade window is recovered via the opt-in backfill rather than silently dropped while the cursor advances.
I also reworded the setting description so it no longer implies the disabled/legacy window is covered by the steady-state cursor.
I've now reviewed all changed files and the prior discussion history. All three P1 findings from earlier review rounds have been addressed:
No new issues found.
Pi Review
Good to merge.
Summary
This PR fixes a production incident where the S3 audit export's
What was reviewed
All 12 changed files across ~657 added / ~10 removed lines: the new
Public-surface checklist
New API endpoints:
New public functions (
Migration:
No issues found
Test coverage
Manual verification needed
Codex Review
cc @rubenfiszel
Should address issues before merging: backfill skips legacy
Found 1 issue:
Test coverage
Backend coverage should include a backfill window containing one row in legacy
No frontend automated tests are expected for the single settings-description string.
Manual verification should exercise the API on an upgraded database that still has legacy
The backfill previously had only SQL-level/EXPLAIN validation. Add real integration tests (in-memory object store, sqlx::test) exercising the public path:
- backfill_exports_window_in_pages: with the page size forced to 2 rows, a settled 3-day window is exported across multiple keyset pages; asserts every in-window row lands exactly once, rows outside [from,to) are excluded, a day that straddles a page boundary yields more than one object, progress counts match, and a re-run is idempotent (deterministic keys overwritten, no dupes).
- backfill_rejects_unstable_window: a future/live `to` is rejected as unstable, a window safely in the past is accepted.
Adds a test-only PAGE_ROWS override so multi-page behaviour is exercised with a handful of rows.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 issue found across 1 file (changes from recent commits).
Pi Review
Good to merge. All P1 findings from the three prior review rounds have been addressed. The two new tests in
What changed since the last reviews
Commit-by-commit fix verification against the prior discussion:
Legacy
Make explicit that, like the steady-state export, the backfill reads only audit_partitioned; the pre-partitioning `audit` table is intentionally out of scope (not a missed case). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I have now completed a thorough review of all 6 commits across all 12 changed files. Here is the final review:
Pi Review
Good to merge.
Summary
All three P1 findings from the prior three review rounds have been resolved:
Plus
What was reviewed
All 12 changed files (~860 added, ~10 removed lines):
Public-surface checklist
New API endpoints:
New public functions (
Migration:
Frontend: Single-line description-string update only; consistent with the new backend contract. ✅
No issues found
Test coverage
Manual verification needed
No in-app UI surface to exercise: the only frontend change is a description string (
Address review (Codex P1): the backfill reads only audit_partitioned, but pre-partitioning history lives in the legacy `audit` table (still read by audit list/get via UNION ALL, and retained for the configured period — 365 days by default on EE). Since the setting text points operators at this API for "pre-existing history", a window overlapping legacy rows would report completion while silently omitting them. Per the decision to not export the legacy table, reject instead of silently omit: try_start now rejects a `from` earlier than the oldest audit_partitioned timestamp (every legacy row predates the partition cutover, so a `from` at/after that boundary can never overlap them). Reworded the setting text to scope the backfill to the partitioned era. Added a regression test, plus an RAII guard (cubic P2) so the test-only globals are restored even if an assertion panics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[P1] Backfill could report success while omitting legacy
The export and backfill intentionally cover only
Rather than export the legacy table,
Codex Review
cc @rubenfiszel
Test coverage
The PR adds backend
Manual verification should include running a broad backfill followed by a narrower overlapping backfill with the same first row and confirming the object store still contains the full broad-window row set. Also verify the API behavior on a DB user without
Pi Review
Good to merge.
Summary
This PR fixes a production incident where the S3 audit export's
What was reviewed
All 12 changed files across ~918 added / 10 removed lines:
Public-surface checklist
New API endpoints:
New public functions (
Migration (
Frontend: Single-line description-string update only; accurately describes the new re-anchor behavior and backfill endpoints. ✅
No issues found
Test coverage
Manual verification needed
Codex Review
cc @rubenfiszel
Found 3 issues:
Test coverage
Backend coverage now includes paging/idempotence, unstable-window rejection, and pre-partition rejection when a partitioned row exists. It is missing the empty
No frontend automated tests are expected for the settings description string.
Manual verification should include running a settled historical backfill with partitioned rows and confirming status plus object keys, then trying a legacy-only/pre-partition window and confirming it is rejected rather than completing with zero rows. Also verify the served
…led cutoff
Address review (two P1s):
- Object-key overwrite loss: keys were `dt=<day>/audit_backfill_<min_id>.ndjson`. A narrower, overlapping backfill can start a day's page at the same first row (same min_id) but hold fewer rows, and `put` would overwrite a broader run's object, silently dropping the rows only that object held. Include the requested window in the key so different ranges write disjoint objects (same window re-runs stay idempotent; consumers dedupe overlapping rows by id). New regression test (verified red→green).
- Untrustworthy settled cutoff: when min(xact_start) isn't trustworthy (role lacks pg_read_all_stats/superuser, or a prepared 2PC txn exists), the old now()-7d fallback could still let an old transaction commit rows inside an accepted window after the scan, so a "complete" backfill silently missed them. Since a backfill asserts completeness, reject in those cases instead of falling back. (The continuous exporter keeps its 7-day fallback; it only claims bounded lag.)
Also makes the tests robust under the parallel runner: run_backfill takes the store as a param, so tests pass a local in-memory store (no global OBJECT_STORE_SETTINGS race) and serialize on the PAGE_ROWS override.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
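The object-key change in this commit can be sketched as follows. The layout follows the `dt=<day>/audit_backfill_<min_id>.ndjson` form quoted above plus the `logs/audit/` prefix mentioned elsewhere in the thread; the `_` separator between the window bounds and the function names `old_key` and `backfill_key` are assumptions made for illustration.

```rust
/// Pre-fix key: only the day and the first row id of the page. Two runs over
/// different windows that start a day's page at the same row produce the same key.
pub fn old_key(day: &str, min_id: i64) -> String {
    format!("logs/audit/dt={day}/audit_backfill_{min_id}.ndjson")
}

/// Post-fix key: the requested window is part of the key, so different
/// `[from, to)` ranges write disjoint objects while a re-run of the same
/// window reproduces the same keys (idempotent overwrite).
pub fn backfill_key(day: &str, from_ms: i64, to_ms: i64, min_id: i64) -> String {
    format!("logs/audit/dt={day}/audit_backfill_{from_ms}_{to_ms}_{min_id}.ndjson")
}
```

With the window in the key, a narrower overlapping run no longer replaces the broader run's object; overlapping rows then exist in both objects, which is why the commit message relies on consumers deduplicating by `id`.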
Both P1s from the latest Codex review: fixed in e399fbd.
Object-key overwrite loss (
Untrustworthy settled cutoff. Also correct. When
(Tests were also made parallel-safe:
Now I have all the context. Let me compile the final review.
Pi Review
Good to merge.
Summary
The latest Codex finding (overlapping backfills overwriting prior exports via identical object keys) is stale: the object key on line 384 is
All four prior P1 findings from earlier review rounds have been addressed in commits
What was reviewed
All 12 changed files (~695 added lines in the backfill module plus tests, ~79 migration up, ~19 migration down, ~86 OpenAPI spec, ~9
Public-surface checklist
New API endpoints:
New public functions (
Migration:
Frontend: Single-line description-string update only; consistent with the new backend contract. ✅
No issues found
Test coverage
Manual verification needed
No in-app surface to exercise: the OSS-side changes are a new API (backend only, no UI) and a settings description string update. The description change only renders under an EE license and is informational. The PR author notes no screenshot is applicable. All EE-side behavior (steady-state export, trigger, re-enable races) should be verified in the EE companion PR
2 issues found across 1 file (changes from recent commits).
<file name="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs">
<violation number="1" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:185">
P1: Backfill settled-time gating uses cluster-wide PostgreSQL system views (`pg_stat_activity`, `pg_prepared_xacts`) without scoping to the current database. A long-running or prepared transaction in an unrelated database can force `settled_cutoff` far into the past or cause a hard NULL rejection—even though such a transaction can never commit rows into this database's `audit_partitioned` table. This over-restricts or entirely blocks valid backfill windows.</violation>
<violation number="2" location="backend/windmill-api-settings/src/audit_logs_s3_backfill.rs:338">
P2: Window key truncates timestamp precision to milliseconds, weakening the documented guarantee that different backfill ranges never overwrite each other.</violation>
</file>
Check if this issue is valid — if so, understand the root cause and fix it. At backend/windmill-api-settings/src/audit_logs_s3_backfill.rs, line 185:
<comment>Backfill settled-time gating uses cluster-wide PostgreSQL system views (`pg_stat_activity`, `pg_prepared_xacts`) without scoping to the current database. A long-running or prepared transaction in an unrelated database can force `settled_cutoff` far into the past or cause a hard NULL rejection—even though such a transaction can never commit rows into this database's `audit_partitioned` table. This over-restricts or entirely blocks valid backfill windows.</comment>
<file context>
@@ -173,15 +175,20 @@ pub async fn try_start(db: &DB, from: DateTime<Utc>, to: DateTime<Utc>) -> error
+ // be at or before the oldest in-flight `xact_start`: everything strictly older than
+ // the oldest running transaction is already committed and stable.
+ //
+ // That bound is only sound when we can see every xmin-holding transaction. A role
+ // without pg_read_all_stats/superuser sees only its own sessions, and a prepared
+ // (2PC) transaction is invisible to pg_stat_activity — in either case an old
</file context>
Check if this issue is valid — if so, understand the root cause and fix it. At backend/windmill-api-settings/src/audit_logs_s3_backfill.rs, line 338:
<comment>Window key truncates timestamp precision to milliseconds, weakening the documented guarantee that different backfill ranges never overwrite each other.</comment>
<file context>
@@ -313,6 +329,13 @@ async fn run_backfill(
+ // broader run's object — silently dropping the rows only that object held. Including
+ // the window makes different ranges write disjoint keys (same window re-runs stay
+ // idempotent); consumers already dedupe overlapping rows by `id`.
+ let window_key = format!("{}_{}", from.timestamp_millis(), to.timestamp_millis());
loop {
</file context>
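The precision concern raised in this comment can be illustrated with a short sketch. It assumes window bounds are held as microseconds since the epoch and that millisecond conversion floors toward negative infinity; `window_key_millis` and `window_key_micros` are hypothetical helpers, not code from the PR.

```rust
/// Window key at millisecond precision, in the same `{from}_{to}` shape as the
/// reviewed line. Sub-millisecond differences between windows are lost.
pub fn window_key_millis(from_us: i64, to_us: i64) -> String {
    format!("{}_{}", from_us.div_euclid(1_000), to_us.div_euclid(1_000))
}

/// Window key at microsecond precision: windows that differ by at least one
/// microsecond in either bound get distinct keys.
pub fn window_key_micros(from_us: i64, to_us: i64) -> String {
    format!("{}_{}", from_us, to_us)
}
```

Two windows whose `from` values differ only within the same millisecond produce the same millisecond key but different microsecond keys, which is the collision the comment describes.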
…napi; trim migration comment
Address review (1 P1 + 2 P2):
- Empty-partition backfill (P1): the min(audit_partitioned) guard no-ops when
audit_partitioned is empty, so an upgraded instance with legacy `audit` rows
but no partitioned rows yet would accept a window and complete with zero rows,
silently omitting the legacy rows. Check the legacy `audit` table directly:
reject any window that overlaps a legacy row (subsumes the boundary check and
covers the empty-partitioned case). Test updated accordingly.
- openapi-deref (P2): regenerate openapi-deref.yaml/json (served via include_str!)
so /openapi.{yaml,json} expose the new backfill endpoints.
- Migration comment (P2): trim the PR-history narration to the durable
constraints (why a recent floor and a monotonic cursor are required), per
AGENTS.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
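The difference between the two guards in the first bullet of this commit message can be sketched in memory, with timestamps as plain integers and slices standing in for the two tables. Both function names are hypothetical, and the real checks run as SQL against `audit_partitioned` and `audit`.

```rust
/// Old guard: reject a `from` earlier than the oldest partitioned timestamp.
/// With an empty partitioned table there is no minimum, so nothing is rejected.
pub fn rejected_by_boundary_guard(partitioned_ts: &[i64], from: i64) -> bool {
    partitioned_ts
        .iter()
        .min()
        .map_or(false, |&oldest| from < oldest)
}

/// New guard: reject whenever any legacy row falls inside `[from, to)`,
/// independent of what the partitioned table contains.
pub fn rejected_by_legacy_overlap(legacy_ts: &[i64], from: i64, to: i64) -> bool {
    legacy_ts.iter().any(|&ts| ts >= from && ts < to)
}
```

On an upgraded instance with legacy rows and an empty partitioned table, the old guard accepts every window while the new one rejects any window that actually overlaps legacy data.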
All three from the latest Codex review: fixed in ef98048.
[P1] Empty-partition backfill could silently skip legacy rows. Correct, my
[P2]
[P2] Migration comment narrated PR history. Trimmed both comment blocks to the durable constraints (why the floor must be recent
Codex Review
cc @rubenfiszel
Found 3 issues:
Test coverage
Backend coverage now includes paging/idempotence, unstable-window rejection, per-window object-key isolation, and pre-partition rejection when a partitioned row exists. It is still missing the empty
No frontend automated tests are expected for the settings description string.
Manual verification should include trying a legacy-only/pre-partition window on an upgraded database and confirming it is rejected rather than completing with zero rows. Also verify the served
Pi Review
Good to merge.
Summary
This PR fixes a production incident where the S3 audit export's
What was reviewed
All 14 changed files across ~1375 added / 12 removed lines:
Public-surface checklist
New API endpoints:
New public functions (
Migration (
Frontend: Single-line description-string update; accurately describes the new re-anchor behavior, backfill endpoints, and scope. ✅
No issues found
Test coverage
Manual verification needed
No in-app surface to exercise beyond confirming the instance-settings description string renders correctly under an EE license on the superadmin settings page at
Codex Review
cc @rubenfiszel
Found 1 issue:
Test coverage
Backend coverage is otherwise solid for the changed surfaces: the new backfill module has SQLx tests for paging, idempotent reruns, overlapping-window keying, unstable-window rejection, and legacy-table overlap rejection; the EE exporter tests cover the re-anchor and drain-floor behavior described in the PR. If the key precision issue is fixed, add a regression with two overlapping windows that differ only by microseconds and verify both object sets remain present.
No frontend automated tests are expected for the single settings-description string.
Manual verification should still include starting a settled historical backfill through the new endpoint, checking status, object keys, and row counts, and confirming a legacy-overlap request is rejected.
This commit updates the EE repository reference after PR #634 was merged in windmill-ee-private.
Previous ee-repo-ref: 6b191b77aabcf77658ad4f9031576e0d7b66bf89
New ee-repo-ref: b821fecccbcba2efed544890576bf2b84321d70d
Automated by sync-ee-ref workflow.
🤖 Updated |
EE companion: windmill-labs/windmill-ee-private#634
Problem (production incident)
The S3/GCS audit-log export's steady-state query selects rows by `age(xmin)` (a system column that cannot be indexed), so the only thing bounding the scan is the timestamp floor. Two paths made that floor sit far in the past:
- Fresh enable: the floor was the epoch sentinel.
- Re-enable: the cursor resumed from its pre-disable position (`ON CONFLICT DO NOTHING`) and tried to backfill the entire disabled window.
Either way the first run sequentially scans the whole `audit_partitioned` table. Under a `statement_timeout` (e.g. Aiven) the scan is cancelled before a single batch finishes (`canceling statement due to statement timeout`), so the cursor never advances, nothing reaches the bucket, and the repeated full scans saturate the DB (100% load; the `v2_job`/`job_logs` vacuum then also times out as collateral). The `pg_read_all_stats` grant can't help because the adaptive-floor branch is never reached.
Fix
1. Re-anchor on enable (EE `ee.rs` + trigger migration)
- The trigger records a recent timestamp floor (the oldest in-flight `xact_start` when the stats privilege is available, else a bounded 7-day window) instead of epoch, so the first run prunes to recent partitions.
- `ON CONFLICT DO UPDATE`s the cursor to the current snapshot xmin on re-enable, so the export always resumes from ~now and never rescans history. One-time fixup migrates legacy epoch-sentinel checkpoints on upgrade.
2. Data-loss fix (EE `ee.rs`)
- While draining a backlog (cursor capped by `MAX_XID_INTERVAL`), the floor stored for the next run is now carried forward, not advanced to the recent in-flight value. Advancing it pruned the still-un-drained old rows and let the xid cursor skip them: silent audit-row loss. New regression test verified red→green.
3. Opt-in historical backfill (new `audit_logs_s3_backfill` module + endpoints)
- Exports a chosen `[from, to)` window on demand, scanning strictly by `timestamp` (the partition key) in bounded keyset pages. Each query is an index scan capped at one page and short-circuits at the `LIMIT`, so it stays well under any statement timeout regardless of window size.
- `POST /settings/audit_logs_s3_backfill {from,to}` (super-admin + Enterprise), `GET /settings/audit_logs_s3_backfill_status`. Writes alongside the steady-state objects under `logs/audit/`, without touching the xmin cursor.
EXPLAIN of the page query (20k rows / 5 daily partitions, `LIMIT 10000`): later partitions `never executed`, ~11ms/page.
Tests / validation
- `cargo check --features enterprise,private,parquet` ✅
- `audit_export_end_to_end` updated for the re-anchor contract (recent floor; re-enable advances the cursor) ✅
- `audit_export_backlog_drain_no_floor_loss` regression test (test-only `MAX_XID_INTERVAL` override): fails without the carry-floor fix, passes with it ✅
Notes
- The `frontend/` change is a one-line settings description string (the EE-gated "Store audit logs in object storage" toggle) pointing operators at the backfill endpoint. There is no component/layout change, and it only renders under an EE license, so no screenshot.
- `job_perms`/`v2_job_queue` table bloat still needs `VACUUM FULL`/`pg_repack` in a maintenance window.
🤖 Generated with Claude Code
Summary by cubic
Re-anchored the S3/GCS audit export to start from “now” on enable/re-enable to prevent full-table scans/timeouts, and added a safe, opt-in historical backfill that exports a chosen window via bounded, index-backed scans.
Bug Fixes
- On enable/re-enable, `DO UPDATE` the cursor to the current snapshot xmin (monotonic). One-time re-anchor of legacy epoch-sentinel checkpoints to the current snapshot xmin.
- `GET /settings/audit_logs_s3_status`: `bootstrapping` now means "draining backlog".
- Backfill only accepts settled windows: `to` must be ≤ oldest in-flight `xact_start`, and the DB role must have `pg_read_all_stats`/superuser with no prepared (2PC) txns.
- Backfill rejects windows that overlap the legacy `audit` table (covers empty `audit_partitioned` cases).
- Backfill object keys include the requested window (`logs/audit/dt=<day>/audit_backfill_<from,to>_<min_id>.ndjson`) to prevent narrower runs from overwriting broader ones.
New Features
- Opt-in historical backfill: `POST /settings/audit_logs_s3_backfill {from,to}` and `GET /settings/audit_logs_s3_backfill_status`. Scans by `(timestamp,id)` with `LIMIT 10000`, writes alongside steady-state objects, and does not touch the xmin cursor. OpenAPI updated to include these endpoints.
Written for commit 8e1af7f. Summary will update on new commits.