Storage mapping connection test passes without delete permission, then shard pruner fails later #3073

Description

@jwhartley

Observation

When a tenant configures their own cloud storage bucket, the storage mapping connection test validates connectivity but does not exercise a delete. If the bucket policy is missing the delete action (S3 s3:DeleteObject, GCS storage.objects.delete, Azure blob delete), the test still passes.

The cost shows up later. A storage mapping holds two kinds of data under distinct prefixes in the same bucket: collection data under collection-data/, and per-task recovery logs under recovery/. Recovery-log fragments are pruned continuously as shards checkpoint, so the data plane issues deletes under recovery/ constantly. When the bucket policy lacks delete, that pruning fails with AccessDenied on every run, which fires recurring alerts in #alerts and leaves superseded recovery fragments accumulating in the customer's bucket.

Collection-data deletes are different: they only happen when a finite retention is set on the collection. With no retention set, collection fragments are retained indefinitely, so collection-data/ deletes may never occur for an own-bucket tenant. recovery/ is the prefix where deletes are unconditional.

We currently have two tenants in this state (one on a public data plane with its own bucket, one BYOC). The required IAM policy already lists delete in the UI (awsHooks.ts:65), and the generated policy grants it bucket-wide, so this is a validation gap rather than a docs gap: setups that predate the delete action, or that applied a hand-scoped policy, sail through the connection test and only break later.

Why it matters

  • The problem is invisible at config time, the one moment the customer is positioned to fix it, and only surfaces later as failed background jobs and growing storage.
  • It generates ongoing internal failure alerts that each require manual outreach to the tenant to resolve.

Where it lives (for context, defer to owners)

  • Health-check orchestration, in crates/control-plane-api/src/server/public/graphql/storage_mappings.rs:
    • check_store_health (line 86) calls the Gazette fragment_store_health RPC (line 90)
    • run_all_health_checks (line 103)
    • enforced on create_storage_mapping (line 165) and update_storage_mapping (line 285)
    • exposed via the test_connection_health mutation (line 421)
  • The actual probe is the Gazette FragmentStoreHealth RPC (go.gazette.dev/core broker), so verifying delete likely needs a change there rather than in flow.

Question for the team

Could the fragment store health probe also exercise a delete, by writing a throwaway probe object and then deleting it, so a missing delete permission is caught at config time?

The probe should target the recovery/ prefix specifically. That is where the data plane deletes unconditionally (recovery-log pruning), and it is the prefix failing in the wild. A probe placed under the collection-data/ prefix (where put/get is presumably already exercised) would pass even when delete on recovery/ is missing, which is exactly the failure mode here. The probe must use a throwaway object and never touch real recovery/ fragments.

If a destructive probe is undesirable, is there a lighter signal we could surface instead?

Separately: when a tenant brings their own bucket, should delete on recovery/ be treated as a hard requirement at setup? (Raised by Will: if a tenant uses their own bucket for recovery-log data, we need delete to manage it via pruning.)

If we do add a delete check, the GCS (storage.objects.delete) and Azure (blob delete) paths need the same treatment.
