Storage mapping connection test passes without delete permission, then shard pruner fails later #3073

## Observation

When a tenant configures their own cloud storage bucket, the storage mapping connection test validates connectivity but does not exercise a delete. If the bucket policy is missing the delete action (`s3:DeleteObject`, GCS `storage.objects.delete`, Azure blob delete), the test still passes.

The cost shows up later. A storage mapping holds two kinds of data under distinct prefixes in the same bucket: collection data under `collection-data/`, and per-task recovery logs under `recovery/`. Recovery-log fragments are pruned continuously as shards checkpoint, so the data plane issues deletes under `recovery/` constantly. When the bucket policy lacks delete, that pruning fails with `AccessDenied` on every run, which fires recurring alerts in `#alerts` and leaves superseded recovery fragments accumulating in the customer's bucket.

Collection-data deletes are different: they only happen when a finite `retention` is set on the collection. With no retention set, collection fragments are retained indefinitely, so `collection-data/` deletes may never occur for an own-bucket tenant. `recovery/` is the prefix where deletes are unconditional.
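The asymmetry between the two prefixes can be stated as a small rule. This is only an illustrative sketch: `prefixes_needing_delete` and `retention_seconds` are hypothetical names, not anything in the flow or Gazette codebases.

```python
def prefixes_needing_delete(retention_seconds=None):
    """Prefixes where the data plane is expected to issue deletes.

    `retention_seconds` is a hypothetical stand-in for a collection's
    `retention` setting; None means no retention is set.
    """
    # Recovery logs are pruned as shards checkpoint, regardless of retention.
    prefixes = ["recovery/"]
    # Collection fragments are only deleted when a finite retention is set.
    if retention_seconds is not None:
        prefixes.append("collection-data/")
    return prefixes
```

With no retention set, only `recovery/` appears in the result, which is why a missing delete permission surfaces there first.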

We currently have two tenants in this state (one on a public data plane with its own bucket, one BYOC). The required IAM policy already lists delete in the UI ([`awsHooks.ts:65`](https://github.com/estuary/ui/blob/main/src/components/admin/Settings/StorageMappings/Dialog/ConnectionTest/instructions/awsHooks.ts#L65)), and the generated policy grants it bucket-wide, so this is a validation gap rather than a docs gap: setups that predate the delete action, or that applied a hand-scoped policy, sail through the connection test and only break later.
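To make the "hand-scoped policy" case concrete, here is a rough sketch of a policy that would pass a put/get check but still break pruning. The bucket name `example-bucket`, the policy contents, and the `allows` helper are all made up for illustration. The matcher is deliberately simplified: it ignores `Deny` statements, `Condition`, `NotAction`, and `NotResource`, and matches action names case-sensitively, none of which is true of real IAM evaluation.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    # IAM allows a single string or a list for Action and Resource.
    return value if isinstance(value, list) else [value]


def allows(policy, action, resource_arn):
    """Return True if any Allow statement matches the action and resource.

    Simplified on purpose; see the caveats in the paragraph above.
    """
    for stmt in _as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if any(fnmatchcase(action, a) for a in actions) and any(
            fnmatchcase(resource_arn, r) for r in resources
        ):
            return True
    return False


# Hypothetical hand-scoped policy: delete is granted on collection data only.
hand_scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::example-bucket/collection-data/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-bucket/recovery/*",
        },
    ],
}

recovery_key = "arn:aws:s3:::example-bucket/recovery/some-fragment"
can_put = allows(hand_scoped, "s3:PutObject", recovery_key)
can_delete = allows(hand_scoped, "s3:DeleteObject", recovery_key)
```

Under this policy `can_put` is true and `can_delete` is false, which is the shape of setup that passes today's connection test and fails once pruning runs.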

## Why it matters

- The problem is invisible at the one moment the customer is set up to fix it (config time), and only surfaces as failed background jobs and growing storage.
- It generates ongoing internal failure alerts that each require manual outreach to the tenant to resolve.

## Where it lives (for context, defer to owners)

- Health-check orchestration, in `crates/control-plane-api/src/server/public/graphql/storage_mappings.rs`:
  - `check_store_health` (line 86) calls the Gazette `fragment_store_health` RPC (line 90)
  - `run_all_health_checks` (line 103)
  - enforced on `create_storage_mapping` (line 165) and `update_storage_mapping` (line 285)
  - exposed via the `test_connection_health` mutation (line 421)
- The actual probe is the Gazette `FragmentStoreHealth` RPC (go.gazette.dev/core broker), so verifying delete likely needs a change there rather than in flow.

## Question for the team

Could the fragment store health probe also exercise a delete, by writing a throwaway probe object and then deleting it, so a missing delete permission is caught at config time?

The probe should target the `recovery/` prefix specifically. That is where the data plane deletes unconditionally (recovery-log pruning), and it is the prefix failing in the wild. A probe placed under the `collection-data/` prefix (where put/get is presumably already exercised) would pass even when delete on `recovery/` is missing, which is exactly the failure mode here. It must use a throwaway object and never touch real `recovery/` fragments.
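A minimal sketch of the write-then-delete idea, to show the shape of the check rather than propose an implementation. It is written in Python against an in-memory fake, not against the Gazette store interface. `InMemoryStore`, `AccessDenied`, `probe_delete`, and the `.health-probe-` key naming are all hypothetical.

```python
import uuid


class AccessDenied(Exception):
    """Raised by a store when the caller lacks permission for an operation."""


class InMemoryStore:
    """Hypothetical stand-in for a bucket client; not the Gazette store API."""

    def __init__(self, deny_delete_prefixes=()):
        self.objects = {}
        self.deny_delete_prefixes = tuple(deny_delete_prefixes)

    def put(self, key, body):
        self.objects[key] = body

    def delete(self, key):
        # Simulate a bucket policy that omits delete for some prefixes.
        if key.startswith(self.deny_delete_prefixes):
            raise AccessDenied(f"delete denied for {key}")
        self.objects.pop(key, None)


def probe_delete(store, prefix="recovery/"):
    """Write a throwaway object under `prefix`, then delete it.

    Returns (ok, detail). The key is unique per call, so the probe never
    overwrites or removes a real fragment.
    """
    key = f"{prefix}.health-probe-{uuid.uuid4().hex}"
    try:
        store.put(key, b"probe")
    except AccessDenied as err:
        return False, f"put failed: {err}"
    try:
        store.delete(key)
    except AccessDenied as err:
        return False, f"delete failed: {err}"
    return True, "put and delete succeeded"


# A bucket whose policy lacks delete on recovery/ fails the probe there,
# but would pass the same probe under collection-data/.
store = InMemoryStore(deny_delete_prefixes=("recovery/",))
recovery_result = probe_delete(store)
data_result = probe_delete(store, prefix="collection-data/")
```

One caveat with this shape: when delete is denied, the probe object itself is left behind. A real implementation might prefer a fixed probe key so that repeated failed checks overwrite one object instead of accumulating new ones.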

If a destructive probe is undesirable, is there a lighter signal we could surface instead?

Separately: when a tenant brings their own bucket, should delete on `recovery/` be treated as a hard requirement at setup? (Raised by Will: if a tenant uses their own bucket for recovery-log data, we need delete to manage it via pruning.)

If we do add a delete check, the GCS (`storage.objects.delete`) and Azure (blob delete) paths need the same treatment.

