Auto-discover removes manual bindings on filesource captures despite addNewBindings: false #3032

Description

@jwhartley

Priority: Low (control-plane triage, 2026-06-18)

Very few captures use this pattern today, and the multiple-file-source-bindings guide has been updated to steer users away from the broken setup (#3037).


Summary

addNewBindings: false is documented and intended to stop auto-discover from changing a capture's binding set, but it does not prevent binding removal. The discover-merge reconciler rewrites the binding list to exactly what discovery returns and drops any existing binding whose resource path isn't in that output, regardless of addNewBindings. The flag only controls whether newly discovered bindings are added enabled vs disabled (update_only).

This breaks acutely on filesource captures (source-s3, source-gcs, source-azure-blob-storage, source-google-drive), where discovery always returns exactly one binding: the bucket+prefix root. Manually splitting one bucket into multiple sub-prefix bindings is the workflow in the official guide, which explicitly tells users to set addNewBindings: false to protect those bindings. On each ~2h auto-discover run, every such manual binding is treated as unmatched and removed, and a single disabled root binding is re-added. The pipeline silently stops feeding the user's collections until they restore the bindings, and the next run reverts them again. A real capture configured exactly per that guide was clobbered repeatedly.

Guide: https://docs.estuary.dev/guides/flowctl/multiple-file-source-bindings/

Reproduction

  1. Create an S3 capture over a bucket with sub-folders (e.g. Hive-style key=value/ partitions).
  2. Manually configure two bindings, each with a sub-prefix stream (e.g. bucket/a=x/ → collection A, bucket/a=y/ → collection B), set autoDiscover.addNewBindings: false, and delete the auto-created root binding.
  3. Wait for the next auto-discover run. Both sub-prefix bindings are removed and the root is re-added disabled. History shows auto-discover changes (0 added, 0 modified, 2 removed, 1 added (disabled)).

Root cause

  • The removal pass is unconditional and not gated by update_only: crates/control-plane-api/src/discovers/specs.rs:213-227 (model.bindings = next_bindings drops any existing binding whose resource path isn't in the discovery output). addNewBindings: false sets update_only = true (crates/agent/src/controllers/capture/auto_discover.rs:251), which is consumed only at specs.rs:181 to disable newly-added bindings.
  • Filesource discovery emits a single binding (DiscoverRoot() = bucket+prefix) and never enumerates sub-prefixes: estuary/connectors filesource/filesource.go:200-226. So every hand-authored sub-prefix binding is unmatched on every run.
  • Auto-discover is on by default for new captures: crates/agent/src/discovers.rs:296-299. The controller runs whenever the field is present: crates/agent/src/controllers/capture.rs:37.
  • The same merge backs manual "Refresh bindings" and flowctl discover (crates/control-plane-api/src/discovers/mod.rs:160 → build_merged_catalog:219), so those propose the same removal. They are review-gated (the user sees the diff before Save+Publish), so they are out of scope here.
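
To make the merge behavior concrete, here is a minimal sketch of the logic described above. It is not the code in specs.rs: the `Binding` struct, the `merge_bindings` function, and the placeholder collection name are all hypothetical stand-ins. It only illustrates that `update_only` affects how new bindings are added, while unmatched existing bindings are dropped either way.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical stand-in for a capture binding.
#[derive(Debug, Clone, PartialEq)]
pub struct Binding {
    pub resource_path: String,
    pub target: String,
    pub disable: bool,
}

/// Returns (next_bindings, removed_bindings).
/// The next binding list contains exactly the discovered paths.
pub fn merge_bindings(
    existing: &[Binding],
    discovered: &[String],
    update_only: bool,
) -> (Vec<Binding>, Vec<Binding>) {
    // Index existing bindings by resource path.
    let mut by_path: BTreeMap<&str, &Binding> = existing
        .iter()
        .map(|b| (b.resource_path.as_str(), b))
        .collect();

    let mut next = Vec::new();
    for path in discovered {
        match by_path.remove(path.as_str()) {
            // Matched: keep the existing binding as the user configured it.
            Some(b) => next.push(b.clone()),
            // New: update_only only decides whether it is added disabled.
            None => next.push(Binding {
                resource_path: path.clone(),
                target: format!("example/{path}"), // placeholder collection name
                disable: update_only,
            }),
        }
    }

    // Whatever is left was not discovered. It is dropped unconditionally;
    // update_only is never consulted here.
    let removed: Vec<Binding> = by_path.into_values().cloned().collect();
    (next, removed)
}
```

With two manual sub-prefix bindings and a discovery result containing only the root, this sketch yields one disabled root binding and two removals, which matches the history entry in the reproduction.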

Impact

Silent, recurring breakage of any filesource capture that uses more than the auto-discovered root binding, including captures set up per the official multi-binding guide. Auto-discover provides no value for filesource: it can't find new streams (one bucket = one root) and contributes nothing to schema (schema comes from the parser plus continuous read-schema inference, which is independent of auto-discover). The only observable effect of periodic auto-discover on filesource is the clobber.

Proposed fix

Stop running periodic auto-discover for filesource captures. This must be server-side, since captures are published via flowctl as well as the dashboard.

  • Gate the controller (capture.rs:37) to skip auto-discover for filesource connectors, e.g. if model.auto_discover.is_some() && !is_filesource(&model.endpoint) && !model.shards.disable. This heals existing captures at runtime (the stored auto_discover field is simply ignored for these connectors), needs no migration, and covers both UI and flowctl.
  • Optionally, stop defaulting auto_discover on at creation for filesource (discovers.rs:296) so new captures are born clean.
  • Detect filesource by connector image (the four images above). Hardcoding the list is a minor maintenance cost but keeps the change entirely in flow control-plane with no protocol or connector work.
  • Add a test for the now-skipped path. It is currently untested: test_auto_discovers_update_only only covers the all-matched case where nothing is removed.
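
A rough sketch of what image-based detection and the gate could look like. The function names, the plain `&str` image argument, and the image-name parsing are assumptions for illustration; the real is_filesource would take the endpoint model, and the exact image strings should be checked against what captures actually store.

```rust
/// The four filesource connectors named in this issue, matched by image name.
const FILESOURCE_CONNECTORS: [&str; 4] = [
    "source-s3",
    "source-gcs",
    "source-azure-blob-storage",
    "source-google-drive",
];

/// Hypothetical helper: true if the image's final path segment, with any
/// tag or digest stripped, is one of the filesource connectors.
pub fn is_filesource_image(image: &str) -> bool {
    // Drop an "@sha256:..." digest if present.
    let no_digest = image.split('@').next().unwrap_or(image);
    // Take the last path segment, so a registry host:port is ignored.
    let name = no_digest.rsplit('/').next().unwrap_or(no_digest);
    // Drop a ":tag" suffix if present.
    let name = name.split(':').next().unwrap_or(name);
    // Exact match, so similarly prefixed names do not qualify.
    FILESOURCE_CONNECTORS.contains(&name)
}

/// Hypothetical form of the controller gate from the first bullet.
pub fn should_auto_discover(
    has_auto_discover: bool,
    image: &str,
    shards_disabled: bool,
) -> bool {
    has_auto_discover && !is_filesource_image(image) && !shards_disabled
}
```

Matching on the exact image name rather than a prefix keeps the list from accidentally catching unrelated connectors whose names merely start with one of these strings.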

Rejected alternative

Gating the removal pass on update_only globally (so addNewBindings: false would mean "don't add and don't remove" for all connectors). Rejected: too many existing CDC captures rely on auto-removing a binding when its source table is dropped; this would change their behavior and could leave erroring bindings for dropped tables.
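
For illustration only, a sketch of the rejected semantics, reduced to resource paths. `merge_paths_gated` is a hypothetical function, not proposed code; it shows how gating removal on update_only would keep a binding for a table that no longer exists.

```rust
/// Hypothetical variant of the merge in which `update_only` also
/// suppresses removal of bindings that discovery no longer returns.
pub fn merge_paths_gated(
    existing: &[&str],
    discovered: &[&str],
    update_only: bool,
) -> Vec<String> {
    // Start from the discovered paths, as the current merge does.
    let mut next: Vec<String> = discovered.iter().map(|p| p.to_string()).collect();

    if update_only {
        // Rejected behavior: retain existing bindings that were not discovered.
        for path in existing {
            if !discovered.contains(path) {
                next.push(path.to_string());
            }
        }
    }
    next
}
```

For a CDC capture with existing bindings public.users and public.orders, where public.orders has been dropped at the source, calling this with update_only = true keeps public.orders in the result, which is the erroring-binding case described above.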

Optional future work

A connector-declared "non-enumerating" capability in the Spec response, consumed by the reconciler to skip the removal pass for such connectors. This would also protect the manual "Refresh bindings" / flowctl discover path (not just the periodic controller) and avoid the hardcoded image list, but it's more work (protocol + connector + control-plane) and isn't required to resolve this issue.

Docs correction

The multiple-file-source-bindings guide is wrong on two points (corrected in #3037):

  1. It tells users to set addNewBindings: false to prevent auto-discover overwriting manual bindings. That never prevented removal. The correct guidance for filesource is to disable auto-discover entirely (autoDiscover: null), which this fix makes the default.
  2. It claims keeping evolveIncompatibleCollections: true "preserves schema inference." That flag only controls collection reset on incompatible key change (reset_on_key_change); it has nothing to do with inference. Schema inference is governed by the collection read schema (the flow://inferred-schema reference) and is unaffected by auto-discover settings.

Related

  • estuary/flow#1291 introduced the discover-merge; its description already warns it "can easily result in removing bindings where it wasn't intended."
  • estuary/connectors#520 sketches per-binding prefix/matchKeys config (never built), the path to real filesource enumeration.
  • estuary/connectors#2945 / #3351 were recent multi-binding correctness fixes (runtime / prefix filtering), not discovery or auto-discover changes.
