Skip to content

Fixes #29141: bound source_api_calls metric cardinality and guard ingestion_pipeline index#29149

Open
thiago-costa-fanatic wants to merge 1 commit into
open-metadata:mainfrom
thiago-costa-fanatic:thiago/fix-29141-source-api-metric-cardinality
Open

Fixes #29141: bound source_api_calls metric cardinality and guard ingestion_pipeline index#29149
thiago-costa-fanatic wants to merge 1 commit into
open-metadata:mainfrom
thiago-costa-fanatic:thiago/fix-29141-source-api-metric-cardinality

Conversation

@thiago-costa-fanatic

@thiago-costa-fanatic thiago-costa-fanatic commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Fixes #29141

Problem

Source-side ingestion metrics (#24828) record operationMetrics.source_api_calls keyed by HTTP method + API path. For connectors that embed opaque entity IDs deep in the path — notably the Sigma connector — those nested IDs are not normalized out of the key:

GET:/workbooks/{id}/lineage/elements/<rawElementId>
GET:/workbooks/{id}/pages/<rawPageId>/elements

Each workbook element / page ID becomes a distinct literal key, so source_api_calls grows one new key per source entity per run. On a Sigma service with a few hundred workbooks this produces several thousand unique keys, which OpenSearch dynamically maps as thousands of fields in the ingestion_pipeline index. The default index.mapping.total_fields.limit (1000) is exceeded and the document fails to index:

Limit of total fields [1000] has been exceeded

The affected ingestion-pipeline document is dropped from search (and, while indexes are inconsistent, queries can surface as search_phase_execution_exception: all shards failed).

Root cause: TrackedREST._extract_api_path() only normalized strict UUIDs, pure-numeric, and 24-char-hex segments. Sigma's page_id / element_id / node_id are opaque alphanumeric tokens, so they passed through unchanged.

Changes

1. Bound metric-key cardinality at the source (ingestion/.../connections/source_api_client.py)

  • Lift the ID regexes to module level and add normalize_api_path() / _is_id_segment() helpers.
  • Normalization now also collapses opaque identifier segments (id-charset tokens containing a digit), dashless UUIDs, and longer hex tokens — in addition to the original cases.
  • Static path words (workbooks, lineage, elements, pages, …) contain no digit and are preserved; API version segments (v1/v2) are explicitly excluded.
  • _extract_api_path() now delegates to the shared helper (behavior preserved, logic unit-testable).

2. Make the index immune to field explosion (ingestion_pipeline_index_mapping.json, all languages)

  • Set "dynamic": false on the pipelineStatuses object, matching the existing votes / descriptionSources convention in the same mapping.
  • Explicitly mapped fields stay searchable; arbitrary telemetry keys (operationMetrics.source_api_calls.*) are stored in _source but never expand the mapping — so this class of failure cannot recur for any connector.

Tests

New ingestion/tests/unit/metadata/ingestion/connections/test_source_api_client.py — parametrized coverage for identifier vs. non-identifier segments and full path normalization, including the exact Sigma paths from the issue.

Known limitation

The ingestion-side normalization (change 1) won't catch an opaque ID that contains no digit at all. Change 2 protects the search index regardless, so it cannot cause the reported failure; change 1 simply keeps the recorded metrics meaningful for the common case.

Greptile Summary

This PR fixes a field-explosion bug (#29141) where Sigma connector paths containing opaque entity IDs caused source_api_calls metric keys to grow unboundedly, eventually exceeding OpenSearch's index.mapping.total_fields.limit of 1000 and dropping ingestion-pipeline documents.

  • source_api_client.py: Lifts ID-detection regexes to module level, adds a normalize_api_path() helper that also handles dashless UUIDs, long hex tokens, and opaque alphanumeric ID segments (must be ≥ 4 chars and contain at least one digit). _extract_api_path() now delegates to this helper, keeping the method thin and the logic independently testable.
  • Index mapping files (all 4 locales): Adds "dynamic": false to pipelineStatuses, matching the existing votes/descriptionSources convention. This is the hard safety net — any telemetry key that slips through normalization is stored in _source only and cannot expand the mapping again.
  • New test file: Parametrized unit tests cover identifier vs. non-identifier segments and the exact Sigma paths from the issue.

Confidence Score: 4/5

Safe to merge. The two-layer fix (path normalization + dynamic: false mapping) reliably prevents the reported index failure regardless of normalization edge cases.

The dynamic: false mapping change is an unambiguous, convention-consistent fix for the root problem. The normalization heuristic in _OPAQUE_ID_RE has a known false-positive for common path words that contain digits (e.g. oauth2), which would silently produce slightly misleading metric keys. No data loss or functional breakage results, but a pinning test for the known boundary case would make the tradeoff explicit.

source_api_client.py — the _OPAQUE_ID_RE heuristic deserves a quick look to confirm acceptable false-positive behavior on any connectors that use OAuth2 or similar digit-bearing path segments.

Important Files Changed

Filename Overview
ingestion/src/metadata/ingestion/connections/source_api_client.py Lifts ID-detection regexes to module level and introduces normalize_api_path(). Logic is correct for the primary use-case; _OPAQUE_ID_RE has a known false-positive risk for path words containing digits (e.g. oauth2).
ingestion/tests/unit/metadata/ingestion/connections/test_source_api_client.py Good parametrized coverage for identifier vs. non-identifier segments and the exact Sigma paths. Missing a test case that documents the known false-positive for segments like oauth2.
openmetadata-spec/src/main/resources/elasticsearch/en/ingestion_pipeline_index_mapping.json Adds dynamic: false to pipelineStatuses, consistent with the existing votes/descriptionSources convention. Explicit mapped fields remain searchable; unbounded telemetry keys are stored in _source only.
openmetadata-spec/src/main/resources/elasticsearch/jp/ingestion_pipeline_index_mapping.json Same dynamic: false guard applied to the Japanese locale mapping, consistent with the EN change.
openmetadata-spec/src/main/resources/elasticsearch/ru/ingestion_pipeline_index_mapping.json Same dynamic: false guard applied to the Russian locale mapping, consistent with the EN change.
openmetadata-spec/src/main/resources/elasticsearch/zh/ingestion_pipeline_index_mapping.json Same dynamic: false guard applied to the Chinese locale mapping, consistent with the EN change.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Connector as Sigma Connector
    participant TrackedREST
    participant normalize as normalize_api_path()
    participant Metrics as OperationMetricsState
    participant OS as OpenSearch

    Connector->>TrackedREST: GET /workbooks/3f2a.../lineage/elements/_8j4kP9x
    TrackedREST->>normalize: normalize_api_path("/workbooks/3f2a.../lineage/elements/_8j4kP9x")
    Note over normalize: _is_id_segment("3f2a...") → True (UUID)<br/>_is_id_segment("_8j4kP9x") → True (opaque ID)
    normalize-->>TrackedREST: "/workbooks/{id}/lineage/elements/{id}"
    TrackedREST->>Metrics: "record_operation("source_api_calls", "GET:/workbooks/{id}/lineage/elements/{id}")"
    Metrics->>OS: index pipelineStatus doc
    Note over OS: pipelineStatuses.dynamic=false<br/>operationMetrics.source_api_calls.*<br/>stored in _source only — no field explosion
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Connector as Sigma Connector
    participant TrackedREST
    participant normalize as normalize_api_path()
    participant Metrics as OperationMetricsState
    participant OS as OpenSearch

    Connector->>TrackedREST: GET /workbooks/3f2a.../lineage/elements/_8j4kP9x
    TrackedREST->>normalize: normalize_api_path("/workbooks/3f2a.../lineage/elements/_8j4kP9x")
    Note over normalize: _is_id_segment("3f2a...") → True (UUID)<br/>_is_id_segment("_8j4kP9x") → True (opaque ID)
    normalize-->>TrackedREST: "/workbooks/{id}/lineage/elements/{id}"
    TrackedREST->>Metrics: "record_operation("source_api_calls", "GET:/workbooks/{id}/lineage/elements/{id}")"
    Metrics->>OS: index pipelineStatus doc
    Note over OS: pipelineStatuses.dynamic=false<br/>operationMetrics.source_api_calls.*<br/>stored in _source only — no field explosion
Loading

Reviews (1): Last reviewed commit: "Fixes #29141: bound source_api_calls met..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

…and guard ingestion_pipeline index

Source connectors that embed opaque IDs deep in API paths (e.g. Sigma's
/workbooks/{id}/lineage/elements/{elementId}) produced a distinct
source_api_calls metric key per entity. This exploded operationMetrics into
thousands of dynamic fields, causing SearchIndexing to fail with
'Limit of total fields [1000] has been exceeded'.

- Extend API path normalization to collapse opaque identifier segments (tokens
  containing a digit, dashless UUIDs, longer hex) while preserving path words
  and API version segments.
- Set dynamic:false on pipelineStatuses in the ingestion_pipeline index mapping
  (all languages) so arbitrary telemetry keys never expand the mapping.

Signed-off-by: Thiago Costa <thiago.costa@betfanatics.com>
@thiago-costa-fanatic thiago-costa-fanatic requested a review from a team as a code owner June 17, 2026 22:47
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

@github-actions

Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

# Opaque identifier: an id-charset token of reasonable length that contains at least one
# digit. Plain path words ("workbooks", "elements", "lineage", ...) have no digit and so
# are preserved; version segments are excluded above.
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Opaque-ID regex may collapse legitimate path words containing digits

_OPAQUE_ID_RE = ^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$ treats any segment of length >=4 that contains at least one digit as an entity ID. While this fixes the cardinality explosion (the goal), it will also collapse static, non-identifier path words that happen to contain a digit — e.g. oauth2, utf8, log4j, s3api, ec2, or version-like words such as v1beta1 (the _VERSION_RE guard only matches the exact form ^v\d+$). Those distinct endpoints will all be rewritten to {id} and merged into the same metric key, reducing the usefulness/granularity of source_api_calls for affected connectors.

This does not cause incorrect behavior or the indexing failure (change 2 fully protects the index), so it is minor and largely an accepted trade-off. If finer metric granularity matters, consider tightening the heuristic — e.g. only collapse when the digit ratio/length suggests a token, or exclude segments that are purely [A-Za-z]+\d+-style versioned resource names — or maintain a small allowlist of known static segments. At minimum, the docstring/comment could note that digit-bearing static words are also collapsed.

Was this helpful? React with 👍 / 👎

@gitar-bot

gitar-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown
Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Bounds source_api_calls metric cardinality through path normalization and prevents index field explosion by setting dynamic mapping to false for pipelineStatuses. Consider refining the opaque-ID regex to ensure legitimate path words containing digits are not inadvertently collapsed.

💡 Quality: Opaque-ID regex may collapse legitimate path words containing digits

📄 ingestion/src/metadata/ingestion/connections/source_api_client.py:49 📄 ingestion/src/metadata/ingestion/connections/source_api_client.py:52-61

_OPAQUE_ID_RE = ^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$ treats any segment of length >=4 that contains at least one digit as an entity ID. While this fixes the cardinality explosion (the goal), it will also collapse static, non-identifier path words that happen to contain a digit — e.g. oauth2, utf8, log4j, s3api, ec2, or version-like words such as v1beta1 (the _VERSION_RE guard only matches the exact form ^v\d+$). Those distinct endpoints will all be rewritten to {id} and merged into the same metric key, reducing the usefulness/granularity of source_api_calls for affected connectors.

This does not cause incorrect behavior or the indexing failure (change 2 fully protects the index), so it is minor and largely an accepted trade-off. If finer metric granularity matters, consider tightening the heuristic — e.g. only collapse when the digit ratio/length suggests a token, or exclude segments that are purely [A-Za-z]+\d+-style versioned resource names — or maintain a small allowlist of known static segments. At minimum, the docstring/comment could note that digit-bearing static words are also collapsed.

🤖 Prompt for agents
Code Review: Bounds source_api_calls metric cardinality through path normalization and prevents index field explosion by setting dynamic mapping to false for pipelineStatuses. Consider refining the opaque-ID regex to ensure legitimate path words containing digits are not inadvertently collapsed.

1. 💡 Quality: Opaque-ID regex may collapse legitimate path words containing digits
   Files: ingestion/src/metadata/ingestion/connections/source_api_client.py:49, ingestion/src/metadata/ingestion/connections/source_api_client.py:52-61

   `_OPAQUE_ID_RE = ^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$` treats any segment of length >=4 that contains at least one digit as an entity ID. While this fixes the cardinality explosion (the goal), it will also collapse static, non-identifier path words that happen to contain a digit — e.g. `oauth2`, `utf8`, `log4j`, `s3api`, `ec2`, or version-like words such as `v1beta1` (the `_VERSION_RE` guard only matches the exact form `^v\d+$`). Those distinct endpoints will all be rewritten to `{id}` and merged into the same metric key, reducing the usefulness/granularity of `source_api_calls` for affected connectors.
   
   This does not cause incorrect behavior or the indexing failure (change 2 fully protects the index), so it is minor and largely an accepted trade-off. If finer metric granularity matters, consider tightening the heuristic — e.g. only collapse when the digit ratio/length suggests a token, or exclude segments that are purely `[A-Za-z]+\d+`-style versioned resource names — or maintain a small allowlist of known static segments. At minimum, the docstring/comment could note that digit-bearing static words are also collapsed.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

# Opaque identifier: an id-charset token of reasonable length that contains at least one
# digit. Plain path words ("workbooks", "elements", "lineage", ...) have no digit and so
# are preserved; version segments are excluded above.
_OPAQUE_ID_RE = re.compile(r"^(?=.{4,}$)(?=.*\d)[A-Za-z0-9._~-]+$")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _OPAQUE_ID_RE false-positives on common path words that contain a digit

Any path segment ≥ 4 characters that contains at least one digit will be collapsed to {id}. This catches Sigma-style tokens correctly, but it also catches common API path words: oauth2{id}, api2{id}, v1.2{id}. A path like /oauth2/token would be normalized to /{id}/token and merged with unrelated routes in the recorded metrics.

_VERSION_RE already excludes the common v\d+ form but does not cover compound version or protocol names. Since dynamic: false is the real safety net, this only affects metric usefulness, not correctness — but a test case for oauth2 (or similar) would pin the known behavior and prevent a future "fix" from silently making things worse.

@PubChimps PubChimps added the safe to test Add this label to run secure Github workflows on PRs label Jun 17, 2026
@github-actions

Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sonarqubecloud

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (17 flaky)

✅ 4267 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 301 0 0 4
🟡 Shard 2 802 0 4 9
🟡 Shard 3 805 0 3 8
🟡 Shard 4 840 0 3 12
🟡 Shard 5 720 0 1 47
🟡 Shard 6 799 0 6 8
🟡 17 flaky test(s) (passed on retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values To Be Not In Set (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › EditAll User: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Resolving incident & re-run pipeline (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/FrequentlyJoined.spec.ts › should display frequently joined columns (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on apiCollection (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Integer (shard 4, 1 retry)
  • Pages/DomainDataProductsRightPanel.spec.ts › Should edit owners for data product from domain context (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Announcement create, edit & delete (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Delete Messaging Service (shard 6, 1 retry)
  • Pages/Users.spec.ts › Create and Delete user (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

2 participants