fix(connectors): Add unwrap_envelope transform and envelope detection to fix source-sink format mismatch#3197

Open
atharvalade wants to merge 2 commits into apache:master from atharvalade:fix/shared-source-sink-contract

@atharvalade
Contributor

Which issue does this PR close?

Closes #3174

Rationale

Sources (e.g. Postgres) wrap row data in a DatabaseRecord envelope while sinks (e.g. Iceberg) expect flat JSON matching the target table schema — no shared contract exists, producing silent null failures.

What changed?

The Postgres source emits {table_name, operation_type, timestamp, data: {...}, old_data} envelopes, but the Iceberg sink's Arrow JSON reader maps these nested structures to top-level fields as null, silently violating non-nullable constraints.

This adds a reusable unwrap_envelope transform to the connector SDK that extracts a nested field (e.g. data) and promotes it as the top-level payload, plus explicit envelope detection in the Iceberg sink that errors with an actionable message instead of failing silently.
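
The intended behavior of the transform can be sketched without the SDK. This is only an illustration: `Json` is a hypothetical stand-in for the SDK's payload type, and promoting whatever value sits under the field (object or not) is an assumption, not something the PR description confirms.

```rust
use std::collections::BTreeMap;

/// Hypothetical, minimal JSON model used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Str(String),
    Num(f64),
    Object(BTreeMap<String, Json>),
}

/// Promote the value stored under `field` to be the whole payload.
/// Payloads that are not objects, or that lack `field`, pass through unchanged.
pub fn unwrap_envelope(payload: Json, field: &str) -> Json {
    match payload {
        Json::Object(mut map) => match map.remove(field) {
            // Envelope case: the inner value replaces the entire payload.
            Some(inner) => inner,
            // Field missing: hand the original object back untouched.
            None => Json::Object(map),
        },
        other => other,
    }
}
```

With an envelope such as `{table_name, data: {id: 1}}` and `field = "data"`, the result is `{id: 1}`, which is the flat shape the sink expects.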

Local Execution

  • Passed
  • Pre-commit hooks ran (fmt, clippy, license-headers, trailing-whitespace, trailing-newline all clean; 119 tests pass across SDK + integration suites)

AI Usage

  1. Opus 4.6
  2. used for codebase exploration and following existing transform patterns
  3. All 8 new unit tests pass locally, clippy/fmt clean, existing 111 tests unaffected
  4. Yes, all code can be explained

@codecov

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 91.27907% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.58%. Comparing base (1ae123f) to head (7c92361).
⚠️ Report is 12 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nectors/sdk/src/transforms/json/unwrap_envelope.rs | 96.52% | 5 Missing ⚠️ |
| ...re/connectors/sinks/iceberg_sink/src/router/mod.rs | 60.00% | 3 Missing and 1 partial ⚠️ |
| core/connectors/sdk/src/transforms/mod.rs | 0.00% | 3 Missing ⚠️ |
| ...e/connectors/sdk/src/transforms/unwrap_envelope.rs | 80.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3197       +/-   ##
=============================================
- Coverage     73.78%   56.58%   -17.20%     
  Complexity      943      943               
=============================================
  Files          1200     1201        +1     
  Lines        109094    97051    -12043     
  Branches      85994    73951    -12043     
=============================================
- Hits          80492    54916    -25576     
- Misses        25866    39544    +13678     
+ Partials       2736     2591      -145     
| Components | Coverage Δ |
|---|---|
| Rust Core | 52.24% <91.27%> (-22.68%) ⬇️ |
| Java SDK | 58.44% <ø> (ø) |
| C# SDK | 69.47% <ø> (ø) |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.44% <ø> (ø) |
| Go SDK | 39.80% <ø> (ø) |
| Files with missing lines | Coverage Δ |
|---|---|
| core/connectors/sdk/src/transforms/json/mod.rs | 53.84% <ø> (ø) |
| core/connectors/sdk/src/transforms/mod.rs | 26.66% <0.00%> (-2.97%) ⬇️ |
| ...e/connectors/sdk/src/transforms/unwrap_envelope.rs | 80.00% <80.00%> (ø) |
| ...re/connectors/sinks/iceberg_sink/src/router/mod.rs | 40.71% <60.00%> (+1.48%) ⬆️ |
| ...nectors/sdk/src/transforms/json/unwrap_envelope.rs | 96.52% <96.52%> (ø) |

... and 275 files with indirect coverage changes

@atharvalade atharvalade changed the title Add unwrap_envelope transform and envelope detection to fix source-sink format mismatch fix(connectors): Add unwrap_envelope transform and envelope detection to fix source-sink format mismatch Apr 29, 2026
@github-actions

github-actions Bot commented May 7, 2026

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs.

If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR.

Thank you for your contribution!

@github-actions github-actions Bot added S-stale Inactive issue or pull request and removed S-stale Inactive issue or pull request labels May 7, 2026
@hubcio
Contributor

hubcio commented May 14, 2026

@atharvalade please rebase this PR

/author

@github-actions github-actions Bot added the S-waiting-on-author PR is waiting on author response label May 14, 2026
@atharvalade atharvalade force-pushed the fix/shared-source-sink-contract branch from 8defa26 to 7c92361 Compare May 17, 2026 06:32
@atharvalade
Contributor Author

> @atharvalade please rebase this PR
>
> /author

done

@atharvalade
Contributor Author

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-waiting-on-author PR is waiting on author response labels May 17, 2026
Contributor

@hubcio hubcio left a comment

review notes (not posted as line comments because targets are out-of-diff or already tracked):

  • the FFI (consume) return code is still discarded at core/connectors/runtime/src/sink.rs:642-650 (ConsumeCallback returns i32, sdk/src/sink.rs:28-36) and AutoCommit::When(AutoCommitWhen::PollingMessages) at core/connectors/runtime/src/sink.rs:467 commits offsets before the sink runs. this PR widens the blast radius: any envelope-detected batch now returns Err(Error::InvalidRecord) from the sink, gets logged, and the offsets have already advanced. tracked in #2927 and #2928 but this PR adds a brand new fail path triggerable purely by source/sink misconfiguration, so they should land together.

  • core/connectors/sinks/delta_sink/src/sink.rs:106 has the same envelope-vs-flat-JSON mismatch and no detection or unwrap guidance. fixing one sink and leaving the other broken means #3174 is still reachable through the postgres → delta path. either replicate the detection (with the tighter shape suggested below) or, better, document unwrap_envelope as the single fix and drop sink-side sniffing entirely.

  • neither core/connectors/sdk/README.md nor core/connectors/sinks/iceberg_sink/README.md mentions the new unwrap_envelope transform or the source-compatibility requirement for postgres_source → iceberg. discoverability is the whole point of this fix; please add at least a short entry to both.

  • architecturally, sink-side envelope sniffing is the wrong layer. transform discipline + docs is the real fix. if sink-side detection stays, it should be opt-in via config, not a hardcoded postgres-shaped heuristic.
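
to make the first bullet concrete, here is a rough sketch of the ordering being asked for: run the sink, check the return code, and only then advance the offset. everything in it is hypothetical (the callback signature, the convention that 0 means success, the function name); it is not the runtime's actual code.

```rust
/// Hypothetical stand-in for the FFI consume callback.
/// Assumption for this sketch: a return value of 0 means success.
pub type ConsumeCallback = fn(&[u64]) -> i32;

/// Runs the sink first and advances the committed offset only on success.
/// On failure the non-zero code is surfaced and the offset stays where it was,
/// so the batch can be retried instead of being silently skipped.
pub fn consume_then_commit(
    committed: u64,
    batch_offsets: &[u64],
    consume: ConsumeCallback,
) -> Result<u64, i32> {
    let code = consume(batch_offsets);
    if code != 0 {
        return Err(code);
    }
    // Commit up to the highest offset in the batch; an empty batch changes nothing.
    Ok(batch_offsets
        .iter()
        .copied()
        .max()
        .map_or(committed, |last| last.max(committed)))
}
```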

overall, i'm not a fan of this PR.

        return false;
    };
    obj.contains_key("table_name") && obj.contains_key("data")
}
Contributor

this heuristic checks 2 of the 5 keys on DatabaseRecord (table_name, operation_type, timestamp, data, old_data - see core/connectors/sources/postgres_source/src/lib.rs:110-116).

false positives: any legit iceberg table modeling audit logs, catalogs, or CDC metadata that happens to have table_name + data columns gets the whole batch rejected.

false negatives: Debezium / Kafka-Connect envelopes use before / after / op / source / payload and slip right through into JsonArrowReader, which then writes nulls - the exact bug #3174 is supposed to fix.

also it bakes a postgres-source shape into a generic iceberg sink. options: tighten to the full 4-or-5 key shape, or move detection behind an opt-in detect_envelope config flag, or drop sink-side sniffing entirely and rely on unwrap_envelope + docs.
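
a sketch of the first two options combined: a tighter shape check gated behind an opt-in flag. assumptions here: the check works on the set of top-level keys rather than the real payload type, `old_data` is treated as optional (the "4-or-5" shape), and `detect_envelope` is the hypothetical config flag named above.

```rust
use std::collections::BTreeSet;

/// Keys treated as mandatory for a Postgres-source envelope in this sketch.
/// `old_data` is assumed to be optional and is therefore not required.
const REQUIRED_ENVELOPE_KEYS: [&str; 4] =
    ["table_name", "operation_type", "timestamp", "data"];

/// True only when every required envelope key is present at the top level.
pub fn looks_like_envelope(top_level_keys: &BTreeSet<&str>) -> bool {
    REQUIRED_ENVELOPE_KEYS
        .iter()
        .all(|key| top_level_keys.contains(key))
}

/// Detection runs only when the (hypothetical) `detect_envelope` flag is on.
pub fn should_reject(detect_envelope: bool, top_level_keys: &BTreeSet<&str>) -> bool {
    detect_envelope && looks_like_envelope(top_level_keys)
}
```

a table that merely has `table_name` and `data` columns no longer trips the check; Debezium-style envelopes still pass through, so this narrows false positives without addressing false negatives.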

        .collect();

    if let Some(first) = msgs.first()
        && looks_like_envelope(first)
Contributor

the sniff only looks at msgs.first(). mixed batch where the first message is flat and the rest are envelopes (or vice versa) skips detection entirely - envelope rows then hit JsonArrowReader and write nulls, which is the silent-corruption mode #3174 was opened for. envelope-first drops the whole batch including any valid flat rows.

either scan all messages, or document and enforce a batch-homogeneity invariant somewhere upstream.
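
scanning the whole batch could look roughly like this. it is a sketch only: `BatchShape` and `classify_batch` are made-up names, and the per-message predicate is passed in so the snippet does not depend on the sink's message type.

```rust
/// Result of inspecting every message in a batch (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum BatchShape {
    Empty,
    AllFlat,
    AllEnvelopes,
    Mixed { envelopes: usize, flat: usize },
}

/// Classifies a batch by applying `is_envelope` to every message,
/// not just the first one.
pub fn classify_batch<T, F>(msgs: &[T], is_envelope: F) -> BatchShape
where
    F: Fn(&T) -> bool,
{
    let total = msgs.len();
    let envelopes = msgs.iter().filter(|m| is_envelope(*m)).count();
    if total == 0 {
        BatchShape::Empty
    } else if envelopes == 0 {
        BatchShape::AllFlat
    } else if envelopes == total {
        BatchShape::AllEnvelopes
    } else {
        BatchShape::Mixed { envelopes, flat: total - envelopes }
    }
}
```

the `Mixed` case is the one the first-message sniff misses in both directions; what the sink does with it (reject, split, or log) is a separate decision.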

             'unwrap_envelope' transform with field = \"data\" to your \
             connector config to extract the inner payload."
        );
        return Err(Error::InvalidRecord);
Contributor

Error::InvalidRecord is a unit variant; the actionable hint ("add an unwrap_envelope transform with field = data") lives only in the error! log and is lost the moment the caller receives the error. Error::InvalidRecordValue(String) already exists (sdk/src/lib.rs:389) and is used in delta_sink and influxdb_source - switch to it and carry the hint in the variant.

this is also a concrete instance of #3176 (overloaded InvalidRecord).
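
roughly what carrying the hint in the variant buys the caller. the `Error` enum below is a cut-down stand-in with only the two variants discussed here, and the `Display` wording is invented for the sketch.

```rust
use std::fmt;

/// Cut-down stand-in for the SDK error type; only the two variants
/// discussed above are modeled.
#[derive(Debug, PartialEq)]
pub enum Error {
    InvalidRecord,
    InvalidRecordValue(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            // Unit variant: nothing actionable survives past the log line.
            Error::InvalidRecord => write!(f, "invalid record"),
            // String variant: the hint travels with the error itself.
            Error::InvalidRecordValue(detail) => write!(f, "invalid record: {}", detail),
        }
    }
}

/// Builds the rejection error with the hint attached instead of only logging it.
pub fn reject_envelope() -> Error {
    Error::InvalidRecordValue(
        "payload looks like a source envelope; add an 'unwrap_envelope' \
         transform with field = \"data\""
            .to_string(),
    )
}
```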

             (detected 'table_name' + 'data' fields). The Iceberg sink \
             expects flat JSON matching the target table schema. Add an \
             'unwrap_envelope' transform with field = \"data\" to your \
             connector config to extract the inner payload."
Contributor

two issues with this error message:

  1. it hardcodes field = "data", which only matches the postgres envelope shape. if/when detection broadens to Debezium (after) or other shapes, the suggestion is wrong.

  2. no connector id / plugin id in the log line, so in a multi-tenant deployment with several iceberg sinks you cannot tell which one rejected the batch.
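
a sketch addressing both points: the suggested field is derived from the detected shape, and the connector id is part of the message. `EnvelopeShape`, the function name, and the message wording are all hypothetical; the `data` and `after` field names come from the shapes mentioned in this review.

```rust
/// Envelope shapes a detector might distinguish (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EnvelopeShape {
    PostgresSource,
    Debezium,
}

impl EnvelopeShape {
    /// The field that holds the row payload for each shape.
    pub fn payload_field(&self) -> &'static str {
        match self {
            EnvelopeShape::PostgresSource => "data",
            EnvelopeShape::Debezium => "after",
        }
    }
}

/// Builds a rejection message that names the connector and suggests
/// the field matching the detected shape.
pub fn envelope_rejection_message(connector_id: &str, shape: EnvelopeShape) -> String {
    format!(
        "connector '{}': payload looks like a source envelope; add an \
         'unwrap_envelope' transform with field = \"{}\"",
        connector_id,
        shape.payload_field()
    )
}
```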

    warn!(
        "unwrap_envelope: field '{}' not found in payload, passing through unchanged",
        self.field
    );
Contributor

this is a per-message warn! on a hot path. with the default batch_length = 1000, a single misconfigured connector floods stdout with 1000 identical warnings per poll cycle, masking real errors.

fix options: downgrade to debug!, rate-limit (warn-once per (stream, topic, config)), or count occurrences and emit a single summary per batch.
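
the third option (count, then one summary per batch) could be sketched like this. the function name and the idea of feeding it a per-message "field present" flag are assumptions; the caller would emit the returned line through a single `warn!`.

```rust
/// Returns at most one summary line for a batch, given whether each
/// message contained the configured field. `None` means nothing to report.
pub fn missing_field_summary(field: &str, has_field: &[bool]) -> Option<String> {
    let missing = has_field.iter().filter(|present| !**present).count();
    if missing == 0 {
        return None;
    }
    Some(format!(
        "unwrap_envelope: field '{}' not found in {} of {} messages, passed through unchanged",
        field,
        missing,
        has_field.len()
    ))
}
```

with a fully misconfigured batch of 1000 messages this yields one line instead of 1000.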

/// entire payload with the contents of `data`.
#[derive(Debug, Serialize, Deserialize)]
pub struct UnwrapEnvelopeConfig {
    pub field: String,
Contributor

UnwrapEnvelopeConfig.field: String has no validation. an empty string deserializes fine through from_config at transforms/mod.rs:137-141, then every message hits the missing-field branch and warn-spams at message rate (compounds the log flood on the json side).

reject empty field either via a custom Deserialize impl, or inline in UnwrapEnvelope::new, or in the from_config arm.
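
the inline-in-constructor variant might look like the sketch below. the constructor signature and `ConfigError` are invented for illustration (the real `UnwrapEnvelope::new` may take the config struct instead), and rejecting whitespace-only names goes slightly beyond what is asked above.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    EmptyField,
}

/// Stand-in for the transform; only the validated field name is modeled.
#[derive(Debug, PartialEq)]
pub struct UnwrapEnvelope {
    field: String,
}

impl UnwrapEnvelope {
    /// Rejects empty (and, as an extra assumption, whitespace-only) field
    /// names up front so the missing-field branch is never hit per message.
    pub fn new(field: &str) -> Result<Self, ConfigError> {
        if field.trim().is_empty() {
            return Err(ConfigError::EmptyField);
        }
        Ok(Self { field: field.to_string() })
    }

    pub fn field(&self) -> &str {
        &self.field
    }
}
```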

    ) -> Result<Option<DecodedMessage>, Error> {
        let Payload::Json(OwnedValue::Object(ref mut map)) = message.payload else {
            return Ok(Some(message));
        };
Contributor

non-Object JSON (Array, scalar, null) silently passes through with no metric and no log. this is an operational blindspot - if a misconfigured source starts emitting arrays where the sink expects an unwrapped object, there's nothing in the logs that says the transform was a no-op.

at minimum a debug! with the actual payload variant, or a counter, so operators can see when the transform isn't doing anything.
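
a sketch of the counter-plus-variant idea. `Payload` here is a simplified stand-in, not the SDK's type, and `PassThroughStats` is a made-up name; the returned line is what would go to `debug!`.

```rust
/// Simplified stand-in for a decoded JSON payload.
#[derive(Debug)]
pub enum Payload {
    Null,
    Bool(bool),
    Number(f64),
    Text(String),
    Array(Vec<Payload>),
    Object(Vec<(String, Payload)>),
}

/// Human-readable variant name for log lines.
pub fn variant_name(payload: &Payload) -> &'static str {
    match payload {
        Payload::Null => "null",
        Payload::Bool(_) => "bool",
        Payload::Number(_) => "number",
        Payload::Text(_) => "string",
        Payload::Array(_) => "array",
        Payload::Object(_) => "object",
    }
}

/// Counts messages the transform skipped because they were not objects.
#[derive(Debug, Default)]
pub struct PassThroughStats {
    pub non_object: u64,
}

impl PassThroughStats {
    /// Records one message. Returns a debug-level line when the transform
    /// would be a no-op for it, and `None` for objects.
    pub fn observe(&mut self, payload: &Payload) -> Option<String> {
        if let Payload::Object(_) = payload {
            return None;
        }
        self.non_object += 1;
        Some(format!(
            "unwrap_envelope: skipping non-object payload (variant = {})",
            variant_name(payload)
        ))
    }
}
```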

            }
            TransformType::UnwrapEnvelope => {
                let cfg: UnwrapEnvelopeConfig =
                    serde_json::from_value(raw.clone()).map_err(|_| Error::InvalidConfig)?;
Contributor

.map_err(|_| Error::InvalidConfig) throws away the serde error detail. for a user with a typo in field or a wrong type, all they see is InvalidConfig - no line number, no field name, no expected-vs-got. this is a pre-existing pattern in every arm of from_config, but this PR adds another instance.

worth either fixing all arms in a follow-up, or at least propagating the serde error through Error::InvalidConfigDetail(String) or similar.
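
the pattern in question, sketched without serde so it stays standalone: the closure keeps the underlying error instead of discarding it. `InvalidConfigDetail` is the hypothetical variant floated above, and a plain integer parse stands in for `serde_json::from_value`.

```rust
/// Stand-in error type; `InvalidConfigDetail` is hypothetical.
#[derive(Debug, PartialEq)]
pub enum Error {
    InvalidConfig,
    InvalidConfigDetail(String),
}

/// Parses a numeric setting and, on failure, keeps the key, the underlying
/// parser message, and the offending input in the error.
pub fn parse_u32_setting(key: &str, raw: &str) -> Result<u32, Error> {
    raw.trim().parse::<u32>().map_err(|e| {
        // `|e|` instead of `|_|`: the source error is folded into the detail.
        Error::InvalidConfigDetail(format!("{}: {} (got {:?})", key, e, raw))
    })
}
```

the same shape applies to the serde arms: replace `|_| Error::InvalidConfig` with a closure that formats the serde error into the detail string.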

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels May 20, 2026
Labels

S-waiting-on-author PR is waiting on author response

Development

Successfully merging this pull request may close these issues.

No shared contract between source output and sink input format

2 participants