Observation
When a tenant configures their own cloud storage bucket, the storage mapping connection test validates connectivity but does not exercise a delete. If the bucket policy is missing the delete action (AWS s3:DeleteObject, GCS storage.objects.delete, Azure blob delete), the test still passes.
The cost shows up later. A storage mapping holds two kinds of data under distinct prefixes in the same bucket: collection data under collection-data/, and per-task recovery logs under recovery/. Recovery-log fragments are pruned continuously as shards checkpoint, so the data plane issues deletes under recovery/ constantly. When the bucket policy lacks delete, that pruning fails with AccessDenied on every run, which fires recurring alerts in #alerts and leaves superseded recovery fragments accumulating in the customer's bucket.
Collection-data deletes are different: they only happen when a finite retention is set on the collection. With no retention set, collection fragments are retained indefinitely, so collection-data/ deletes may never occur for an own-bucket tenant. recovery/ is the prefix where deletes are unconditional.
We currently have two tenants in this state (one on a public data plane with its own bucket, one BYOC). The required IAM policy already lists delete in the UI (awsHooks.ts:65), and the generated policy grants it bucket-wide, so this is a validation gap rather than a docs gap: setups that predate the delete action, or that applied a hand-scoped policy, sail through the connection test and only break later.
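The gap can be illustrated with a toy model. This is a sketch only: ToyBucket, connectivity_only_check, and prune_recovery_fragment are hypothetical names, the policy is a plain set of allowed actions rather than a real IAM evaluation, and the assumption that the current test does a put followed by a get is ours, not confirmed from the Gazette source.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when the toy policy does not allow an action."""


@dataclass
class ToyBucket:
    # Hypothetical stand-in for a tenant bucket. `allowed` is the set of
    # actions the bucket policy grants; it is not a real IAM evaluation.
    allowed: set
    objects: dict = field(default_factory=dict)

    def _check(self, action):
        if action not in self.allowed:
            raise AccessDenied(action)

    def put(self, key, body):
        self._check("put")
        self.objects[key] = body

    def get(self, key):
        self._check("get")
        return self.objects[key]

    def delete(self, key):
        self._check("delete")
        del self.objects[key]


def connectivity_only_check(bucket):
    """Models a test that writes and reads back but never deletes (assumed)."""
    try:
        bucket.put("collection-data/health-check", b"ok")
        return bucket.get("collection-data/health-check") == b"ok"
    except AccessDenied:
        return False


def prune_recovery_fragment(bucket, key):
    """Models recovery-log pruning: a single delete under recovery/."""
    try:
        bucket.delete(key)
        return True
    except AccessDenied:
        return False
```

With a policy that grants only put and get, connectivity_only_check returns True while prune_recovery_fragment returns False and the fragment stays in the bucket, which mirrors the state the two tenants are in.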
Why it matters
- The problem is invisible at the one moment the customer is set up to fix it (config time), and only surfaces as failed background jobs and growing storage.
- It generates ongoing internal failure alerts that each require manual outreach to the tenant to resolve.
Where it lives (for context, defer to owners)
- Health-check orchestration, in crates/control-plane-api/src/server/public/graphql/storage_mappings.rs:
  - check_store_health (line 86) calls the Gazette fragment_store_health RPC (line 90)
  - run_all_health_checks (line 103)
  - enforced on create_storage_mapping (line 165) and update_storage_mapping (line 285)
  - exposed via the test_connection_health mutation (line 421)
- The actual probe is the Gazette FragmentStoreHealth RPC (go.gazette.dev/core broker), so verifying delete likely needs a change there rather than in flow.
Question for the team
Could the fragment store health probe also exercise a delete, by writing a throwaway probe object and then deleting it, so a missing delete permission is caught at config time?
The probe should target the recovery/ prefix specifically. That is where the data plane deletes unconditionally (recovery-log pruning), and it is the prefix failing in the wild. A probe placed under the data prefix (where put/get is presumably already exercised) would pass even when delete on recovery/ is missing, which is exactly the failure mode here. It must use a throwaway object and never touch real recovery/ fragments.
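As a rough illustration of the shape such a probe could take, here is a minimal sketch. It is written in Python for brevity even though the real change would likely land in the Go broker. Everything in it is hypothetical: the ObjectStore interface, StoreError, probe_delete, the .health-probe/ key layout, and the InMemoryStore fake are stand-ins, not Gazette or cloud SDK APIs.

```python
import uuid
from typing import Protocol, Tuple


class StoreError(Exception):
    """Hypothetical error type standing in for provider errors like AccessDenied."""


class ObjectStore(Protocol):
    # Hypothetical minimal interface; real S3, GCS, and Azure clients differ.
    def put(self, key: str, body: bytes) -> None: ...
    def delete(self, key: str) -> None: ...


def probe_delete(store: ObjectStore, prefix: str = "recovery/") -> Tuple[bool, str]:
    """Write a uniquely named throwaway object under `prefix`, then delete it.

    Returns (ok, detail). The key is freshly generated on every call, so the
    probe never targets an existing recovery fragment.
    """
    key = f"{prefix}.health-probe/{uuid.uuid4().hex}"
    try:
        store.put(key, b"")
    except StoreError as err:
        return False, f"put failed on {key}: {err}"
    try:
        store.delete(key)
    except StoreError as err:
        return False, f"delete failed on {key}: {err}"
    return True, f"put and delete succeeded on {key}"


class InMemoryStore:
    """Fake store for exercising the probe; delete can be switched off."""

    def __init__(self, allow_delete: bool):
        self.allow_delete = allow_delete
        self.objects = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def delete(self, key: str) -> None:
        if not self.allow_delete:
            raise StoreError("AccessDenied")
        self.objects.pop(key, None)
```

One design consequence worth noting: when delete is denied, the probe object itself is left behind under recovery/, so repeated failed tests would add a small amount of clutter that someone has to clean up once the policy is fixed.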
If a destructive probe is undesirable, is there a lighter signal we could surface instead?
Separately: when a tenant brings their own bucket, should delete on recovery/ be treated as a hard requirement at setup? (Raised by Will: if a tenant uses their own bucket for recovery-log data, we need delete to manage it via pruning.)
If we do add a delete check, the GCS (storage.objects.delete) and Azure (blob delete) paths want the same treatment.