Fixes #29141: bound source_api_calls metric cardinality and guard ingestion_pipeline index #29149
…and guard ingestion_pipeline index
Source connectors that embed opaque IDs deep in API paths (e.g. Sigma's
/workbooks/{id}/lineage/elements/{elementId}) produced a distinct
source_api_calls metric key per entity. This exploded operationMetrics into
thousands of dynamic fields, causing SearchIndexing to fail with
'Limit of total fields [1000] has been exceeded'.
- Extend API path normalization to collapse opaque identifier segments (tokens
containing a digit, dashless UUIDs, longer hex) while preserving path words
and API version segments.
- Set dynamic:false on pipelineStatuses in the ingestion_pipeline index mapping
(all languages) so arbitrary telemetry keys never expand the mapping.
Signed-off-by: Thiago Costa <thiago.costa@betfanatics.com>
❌ PR checklist incomplete
This PR cannot be merged until the following are addressed on its linked issue:
The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit); issue/project changes do not re-trigger it automatically. Maintainers can bypass this check by adding the
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help!
# Opaque identifier: an id-charset token of reasonable length that contains at least one
# digit. Plain path words ("workbooks", "elements", "lineage", ...) have no digit and so
# are preserved; version segments are excluded above.
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")
💡 Quality: Opaque-ID regex may collapse legitimate path words containing digits
_OPAQUE_ID_RE = ^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$ treats any segment of at least 4 characters that contains at least one digit as an entity ID. While this fixes the cardinality explosion (the goal), it will also collapse static, non-identifier path words that happen to contain a digit, e.g. oauth2, utf8, log4j, s3api, or version-like words such as v1beta1 (the _VERSION_RE guard only matches the exact form ^v\d+$). Segments shorter than 4 characters, such as ec2, fail the length lookahead and are not matched by this pattern. The affected endpoints will all be rewritten to {id} and merged into the same metric key, reducing the usefulness/granularity of source_api_calls for affected connectors.
This does not cause incorrect behavior or the indexing failure (change 2 fully protects the index), so it is minor and largely an accepted trade-off. If finer metric granularity matters, consider tightening the heuristic — e.g. only collapse when the digit ratio/length suggests a token, or exclude segments that are purely [A-Za-z]+\d+-style versioned resource names — or maintain a small allowlist of known static segments. At minimum, the docstring/comment could note that digit-bearing static words are also collapsed.
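One way to read the allowlist suggestion is sketched below. `_OPAQUE_ID_RE` and the `^v\d+$` version form are taken from the diff and the comment above; `STATIC_SEGMENTS` and `is_opaque_id` are hypothetical names introduced only for illustration, and the allowlist contents are an assumption rather than anything in the PR.

```python
import re

# Copied from the diff under review.
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")
# The exact version form the review says is already guarded.
_VERSION_RE = re.compile(r"^v\d+$")

# Hypothetical allowlist of static, digit-bearing path words (assumption).
STATIC_SEGMENTS = frozenset({"oauth2", "utf8", "log4j", "s3api", "v1beta1"})


def is_opaque_id(segment: str) -> bool:
    """Return True when a path segment should be collapsed to {id}."""
    if _VERSION_RE.match(segment) or segment.lower() in STATIC_SEGMENTS:
        return False
    return bool(_OPAQUE_ID_RE.match(segment))


for seg in ("oauth2", "_8j4kP9x", "workbooks", "v2"):
    print(seg, is_opaque_id(seg))
```

The trade-off is maintenance: any digit-bearing static word missing from the allowlist is still collapsed, so this narrows the false positives without eliminating them.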
Code Review 👍 Approved with suggestions (0 resolved / 1 findings)
Bounds source_api_calls metric cardinality through path normalization and prevents index field explosion by setting dynamic mapping to false for pipelineStatuses. Consider refining the opaque-ID regex to ensure legitimate path words containing digits are not inadvertently collapsed.
💡 Quality: Opaque-ID regex may collapse legitimate path words containing digits
📄 ingestion/src/metadata/ingestion/connections/source_api_client.py:49
📄 ingestion/src/metadata/ingestion/connections/source_api_client.py:52-61
_OPAQUE_ID_RE false-positives on common path words that contain a digit
Any path segment ≥ 4 characters that contains at least one digit will be collapsed to {id}. This catches Sigma-style tokens correctly, but it also catches common API path words: oauth2 → {id}, api2 → {id}, v1.2 → {id}. A path like /oauth2/token would be normalized to /{id}/token and merged with unrelated routes in the recorded metrics.
_VERSION_RE already excludes the common v\d+ form but does not cover compound version or protocol names. Since dynamic: false is the real safety net, this only affects metric usefulness, not correctness — but a test case for oauth2 (or similar) would pin the known behavior and prevent a future "fix" from silently making things worse.
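A pinning test along the lines suggested above might look like the sketch below. The two regexes come from the diff and the review text; `collapses_to_id` is a hypothetical stand-in that applies only the opaque-ID rule and the version guard, not the PR's actual `_is_id_segment()` helper.

```python
import re

# Patterns as quoted in the diff and review comments.
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")
_VERSION_RE = re.compile(r"^v\d+$")


def collapses_to_id(segment: str) -> bool:
    """Hypothetical stand-in: opaque-ID rule plus the version guard only."""
    return not _VERSION_RE.match(segment) and bool(_OPAQUE_ID_RE.match(segment))


def test_digit_bearing_words_are_collapsed():
    # Known, accepted behaviour: these static words are treated as IDs.
    for word in ("oauth2", "api2", "v1.2"):
        assert collapses_to_id(word), word


def test_plain_words_and_versions_are_preserved():
    # No digit, or an exact v<digits> version segment: left untouched.
    for word in ("token", "workbooks", "v1", "v2"):
        assert not collapses_to_id(word), word
```

Pinning the false positives as expected behaviour means a later change to the heuristic has to update the test deliberately instead of shifting metric keys silently.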
The Python checkstyle failed. Please run You can install the pre-commit hooks with
🟡 Playwright Results: all passed (17 flaky)
✅ 4267 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped
🟡 17 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Fixes #29141
Problem
Source-side ingestion metrics (#24828) record operationMetrics.source_api_calls keyed by HTTP method + API path. For connectors that embed opaque entity IDs deep in the path, notably the Sigma connector, those nested IDs are not normalized out of the key, as in Sigma's /workbooks/{id}/lineage/elements/{elementId} paths.
Each workbook element / page ID becomes a distinct literal key, so source_api_calls grows one new key per source entity per run. On a Sigma service with a few hundred workbooks this produces several thousand unique keys, which OpenSearch dynamically maps as thousands of fields in the ingestion_pipeline index. The default index.mapping.total_fields.limit (1000) is exceeded and the document fails to index with 'Limit of total fields [1000] has been exceeded'.
The affected ingestion-pipeline document is dropped from search (and, while indexes are inconsistent, queries can surface as search_phase_execution_exception: all shards failed).
Root cause: TrackedREST._extract_api_path() only normalized strict UUIDs, pure-numeric, and 24-char-hex segments. Sigma's page_id / element_id / node_id are opaque alphanumeric tokens, so they passed through unchanged.
Changes
1. Bound metric-key cardinality at the source (ingestion/.../connections/source_api_client.py)
- Adds normalize_api_path() / _is_id_segment() helpers.
- Collapses opaque identifier segments (tokens containing a digit, dashless UUIDs, longer hex). Plain path words (workbooks, lineage, elements, pages, …) contain no digit and are preserved; API version segments (v1/v2) are explicitly excluded.
- _extract_api_path() now delegates to the shared helper (behavior preserved, logic unit-testable).
2. Make the index immune to field explosion (ingestion_pipeline_index_mapping.json, all languages)
- Sets "dynamic": false on the pipelineStatuses object, matching the existing votes / descriptionSources convention in the same mapping.
- Arbitrary telemetry keys (such as operationMetrics.source_api_calls.*) are stored in _source but never expand the mapping, so this class of failure cannot recur for any connector.
Tests
New ingestion/tests/unit/metadata/ingestion/connections/test_source_api_client.py: parametrized coverage for identifier vs. non-identifier segments and full path normalization, including the exact Sigma paths from the issue.
Known limitation
The ingestion-side normalization (change 1) won't catch an opaque ID that contains no digit at all. Change 2 protects the search index regardless, so such an ID cannot cause the reported failure; change 1 simply keeps the recorded metrics meaningful for the common case.
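The effect of change 1 on key cardinality can be illustrated with a minimal sketch. This is not the PR's implementation: `_OPAQUE_ID_RE` and the `^v\d+$` version form are quoted from the diff and review, while the UUID, numeric and hex patterns, the function bodies, the `/v2/...` path shape and the synthetic IDs are assumptions made for the example.

```python
import re

# Assumed patterns for the ID shapes the description lists.
_UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_NUMERIC_RE = re.compile(r"^\d+$")
_HEX_RE = re.compile(r"^[0-9a-fA-F]{24,}$")  # 24-char hex, dashless UUIDs, longer hex
# Quoted from the diff and review comments.
_VERSION_RE = re.compile(r"^v\d+$")
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")


def is_id_segment(segment: str) -> bool:
    """Sketch of an ID check: version segments are never treated as IDs."""
    if _VERSION_RE.match(segment):
        return False
    patterns = (_UUID_RE, _NUMERIC_RE, _HEX_RE, _OPAQUE_ID_RE)
    return any(rx.match(segment) for rx in patterns)


def normalize_api_path(path: str) -> str:
    """Replace every ID-like segment with the literal placeholder {id}."""
    return "/".join("{id}" if is_id_segment(s) else s for s in path.split("/"))


# 300 synthetic workbook/element pairs (made-up IDs, hypothetical path shape).
paths = [
    f"/v2/workbooks/{i:08d}-0000-4000-8000-000000000000/lineage/elements/el{i}X9k"
    for i in range(300)
]
raw_keys = {f"GET:{p}" for p in paths}
normalized_keys = {f"GET:{normalize_api_path(p)}" for p in paths}
print(len(raw_keys), len(normalized_keys))  # prints: 300 1
```

Every synthetic path collapses to `GET:/v2/workbooks/{id}/lineage/elements/{id}`, so the metric gains one key per endpoint shape instead of one per entity.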
Greptile Summary
This PR fixes a field-explosion bug (#29141) where Sigma connector paths containing opaque entity IDs caused source_api_calls metric keys to grow unboundedly, eventually exceeding OpenSearch's index.mapping.total_fields.limit of 1000 and dropping ingestion-pipeline documents.
- source_api_client.py: Lifts ID-detection regexes to module level, adds a normalize_api_path() helper that also handles dashless UUIDs, long hex tokens, and opaque alphanumeric ID segments (must be ≥ 4 chars and contain at least one digit). _extract_api_path() now delegates to this helper, keeping the method thin and the logic independently testable.
- ingestion_pipeline_index_mapping.json (all languages): Adds "dynamic": false to pipelineStatuses, matching the existing votes / descriptionSources convention. This is the hard safety net: any telemetry key that slips through normalization is stored in _source only and cannot expand the mapping again.
Confidence Score: 4/5
Safe to merge. The two-layer fix (path normalization + dynamic: false mapping) reliably prevents the reported index failure regardless of normalization edge cases.
The dynamic: false mapping change is an unambiguous, convention-consistent fix for the root problem. The normalization heuristic in _OPAQUE_ID_RE has a known false-positive for common path words that contain digits (e.g. oauth2), which would silently produce slightly misleading metric keys. No data loss or functional breakage results, but a pinning test for the known boundary case would make the tradeoff explicit.
source_api_client.py: the _OPAQUE_ID_RE heuristic deserves a quick look to confirm acceptable false-positive behavior on any connectors that use OAuth2 or similar digit-bearing path segments.
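For orientation, the mapping change can be pictured as the fragment below, written as a Python dict so it can be serialized to JSON. Only the `"dynamic": false` flag on `pipelineStatuses` comes from the PR; the surrounding structure, the `"type": "object"` entry and the variable name are assumptions, and the real mapping's other properties are omitted.

```python
import json

# Hypothetical excerpt of the ingestion_pipeline index mapping.
# Only "dynamic": False reflects the PR; everything else is assumed.
ingestion_pipeline_mapping_excerpt = {
    "mappings": {
        "properties": {
            "pipelineStatuses": {
                "type": "object",   # assumption: field type is not shown in the PR
                "dynamic": False,   # unmapped sub-fields stay in _source only
            }
        }
    }
}

print(json.dumps(ingestion_pipeline_mapping_excerpt, indent=2))
```

With `dynamic` set to `false` on an object, OpenSearch keeps unknown sub-fields in `_source` but does not add them to the mapping, which is why they no longer count against `index.mapping.total_fields.limit`.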
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Connector as Sigma Connector
    participant TrackedREST
    participant normalize as normalize_api_path()
    participant Metrics as OperationMetricsState
    participant OS as OpenSearch
    Connector->>TrackedREST: GET /workbooks/3f2a.../lineage/elements/_8j4kP9x
    TrackedREST->>normalize: normalize_api_path("/workbooks/3f2a.../lineage/elements/_8j4kP9x")
    Note over normalize: _is_id_segment("3f2a...") → True (UUID)<br/>_is_id_segment("_8j4kP9x") → True (opaque ID)
    normalize-->>TrackedREST: "/workbooks/{id}/lineage/elements/{id}"
    TrackedREST->>Metrics: "record_operation("source_api_calls", "GET:/workbooks/{id}/lineage/elements/{id}")"
    Metrics->>OS: index pipelineStatus doc
    Note over OS: pipelineStatuses.dynamic=false<br/>operationMetrics.source_api_calls.*<br/>stored in _source only, no field explosion
Reviews (1): Last reviewed commit: "Fixes #29141: bound source_api_calls met..."