Skip to content

Fix flaky "Entity not found: <service>" on list (concurrent cascade delete)#29150

Closed
harshach wants to merge 1 commit into
mainfrom
harshach/flaky-entity-not-found-service
Closed

Fix flaky "Entity not found: <service>" on list (concurrent cascade delete)#29150
harshach wants to merge 1 commit into
mainfrom
harshach/flaky-entity-not-found-service

Conversation

@harshach

@harshach harshach commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Describe your changes:

Fixes the flaky Entity not found: storageService/mlmodelService <id> failures on the list path (seen in CI on ContainerResourceIT/MlModelResourceIT). A list resolves each row's parent service in bulk, in a statement separate from the relationship lookup; when a sibling test cascade-hard-deletes that service mid-list, the CONTAINS relationship row is still visible while the service entity row is already gone, and the non-lenient bulk resolver threw and failed the whole list instead of the one affected row. PR #29093 fixed this read-side TOCTOU for the single-GET path; this extends the same tolerance to the LIST batch path. I added Entity.getEntityReferenceByIdOrNull() (catches EntityNotFoundException → null, mirroring getFromEntityRef) and used it in the 8 non-lenient batch service resolvers (Container, MlModel, Topic, Pipeline, SearchIndex, LLMModel, IngestionPipeline, Directory) plus Container's parent/children resolvers, with null-guarded puts.

Type of change:

  • Bug fix

High-level design:

The failure is read-side: a list loads rows, then resolves each row's CONTAINS parent service in a separate statement. The single-GET path (getContainer/getFromEntityRef) already tolerates the parent being concurrently hard-deleted (returns null); the per-repo batchFetch* loops did not — they called the throwing Entity.getEntityReferenceById(). The fix adds a lenient sibling and swaps it in only where the bulk path was still throwing. Repos that already resolve via the batch getEntityReferencesByIds (Table, Database, DatabaseSchema, Dashboard, Chart, APICollection, APIEndpoint) were already lenient and are untouched. No schema, API, or behavior change beyond returning a null service reference (instead of a 404/500) when the parent was concurrently deleted.

Tests:

Use cases covered

  • Listing any service-scoped entity while its parent service is concurrently hard-deleted returns 200 with the service tolerated as null, instead of failing the whole page.

Backend integration tests

  • Added BaseEntityIT#list_toleratesConcurrentlyHardDeletedService — a generic, deterministic regression test that runs for every entity extending BaseEntityIT. It deletes only the service entity row + evicts the entity cache (the exact concurrent-delete window: relationship row present, entity row gone), then asserts the scoped list succeeds. It is namespace-ownership-gated (only deletes a service this test created) and skips entities with no owned service parent.
  • Files: openmetadata-integration-tests/.../BaseEntityIT.java

Manual testing performed

  • Ran the test on the mysql-elasticsearch profile: verified RED without the fix (ApiException (404): Entity not found: storageService <id> at the list call) and GREEN with it, for both ContainerResourceIT (storageService) and TopicResourceIT (messagingService). mvn spotless:check passes.

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added tests (integration) and listed them above.
  • I have added a test that covers the exact scenario we are fixing.

🤖 Generated with Claude Code

Greptile Summary

Extends the read-side TOCTOU tolerance (introduced in #29093 for single-GET) to the LIST batch path by adding Entity.getEntityReferenceByIdOrNull and swapping it into 8 non-lenient batchFetch* service resolvers with null-guarded map puts. A new generic regression test in BaseEntityIT reproduces the race window deterministically for every entity type.

  • Entity.getEntityReferenceByIdOrNull catches EntityNotFoundException and returns null, mirroring the existing getFromEntityRef and getEntityReferencesByIds tolerance, with a DEBUG log to keep the signal without noise.
  • 8 repository batch resolvers (Container, Directory, MlModel, Topic, Pipeline, SearchIndex, LLMModel, IngestionPipeline) are switched to the lenient variant; Container's parent/child resolvers are also covered.
  • list_toleratesConcurrentlyHardDeletedService in BaseEntityIT deletes only the service entity row and evicts the cache (the exact concurrent-delete window), then asserts the scoped list returns 200 instead of throwing a 404.

Confidence Score: 4/5

Safe to merge — the production fix is minimal and well-scoped; the only note is that the deduplication map in ContainerRepository and DirectoryRepository won't cache null results, so a full page of containers under the same concurrently-deleted service will each hit the DB separately.

The core change is a targeted catch-and-null around getEntityReferenceById in batch resolvers, with careful null-guards before every map put. The pattern is consistent across all 8 repositories. The one thing to watch is computeIfAbsent not caching null in ContainerRepository and DirectoryRepository — the deduplication advertised in the comment only works for live services; deleted services incur N lookups instead of 1 per page.

ContainerRepository.java and DirectoryRepository.java — both use computeIfAbsent for deduplication but that map won't retain null results, so the optimisation doesn't apply to the concurrent-delete case.

Important Files Changed

Filename Overview
openmetadata-service/src/main/java/org/openmetadata/service/Entity.java Adds getEntityReferenceByIdOrNull — a lenient wrapper that catches EntityNotFoundException and returns null, mirrors existing single-entity and batch-resolver tolerance.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java Switches three batch resolvers (parent containers, child containers, storage service) to the lenient variant with null-guards; computeIfAbsent won't cache null so the deduplication map doesn't protect against repeated DB hits for the same deleted service.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/DirectoryRepository.java Switches batchFetchFromByType to the lenient variant; same computeIfAbsent-null deduplication gap as ContainerRepository. Non-batch paths keep the throwing variant correctly.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/MlModelRepository.java Switches batchFetchMlModelService to the lenient variant with a null-guard; straightforward and correct.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TopicRepository.java Switches messaging-service batch resolver to the lenient variant with a null-guard inside a forEach lambda; clean.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java Switches pipeline-service batch resolver to the lenient variant with a null-guard; straightforward.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SearchIndexRepository.java Switches search-service batch resolver to the lenient variant with a null-guard; straightforward.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/IngestionPipelineRepository.java Switches ingestion-pipeline service batch resolver to the lenient variant with a null-guard; straightforward.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/LLMModelRepository.java Switches LLM-service batch resolver to the lenient variant with a null-guard; straightforward.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/BaseEntityIT.java Adds list_toleratesConcurrentlyHardDeletedService regression test that deterministically reproduces the TOCTOU window (entity row deleted, relationship row intact) and asserts the list succeeds; isServiceEntity does a needless full DB fetch for the type-check.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant ListEndpoint
    participant RelationshipDAO
    participant BatchResolver
    participant EntityDAO

    Client->>ListEndpoint: "GET /containers?service=svc"
    ListEndpoint->>RelationshipDAO: findFromBatch(containerIds, CONTAINS)
    Note over RelationshipDAO: Returns rows — service row still present
    RelationshipDAO-->>ListEndpoint: "[{serviceId, containerId}, ...]"

    Note over EntityDAO: Concurrent hard-delete removes service entity row

    ListEndpoint->>BatchResolver: resolve service references
    alt Before fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException
        BatchResolver-->>Client: 404 / 500
    else After fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException returns null
        BatchResolver-->>ListEndpoint: "serviceRef = null (skipped in map)"
        ListEndpoint-->>Client: 200 OK (service field null for affected rows)
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Client
    participant ListEndpoint
    participant RelationshipDAO
    participant BatchResolver
    participant EntityDAO

    Client->>ListEndpoint: "GET /containers?service=svc"
    ListEndpoint->>RelationshipDAO: findFromBatch(containerIds, CONTAINS)
    Note over RelationshipDAO: Returns rows — service row still present
    RelationshipDAO-->>ListEndpoint: "[{serviceId, containerId}, ...]"

    Note over EntityDAO: Concurrent hard-delete removes service entity row

    ListEndpoint->>BatchResolver: resolve service references
    alt Before fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException
        BatchResolver-->>Client: 404 / 500
    else After fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException returns null
        BatchResolver-->>ListEndpoint: "serviceRef = null (skipped in map)"
        ListEndpoint-->>Client: 200 OK (service field null for affected rows)
    end
Loading

Reviews (1): Last reviewed commit: "Fix flaky "Entity not found: <service>" ..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

…ade delete)

The list batch path resolved each row's parent service via the singular
Entity.getEntityReferenceById(), which throws EntityNotFoundException when a
sibling test cascade-hard-deletes the service mid-list — failing the whole list
with "Entity not found: storageService/mlmodelService <id>" (flaky
ContainerResourceIT, MlModelResourceIT). PR #29093 fixed the single-GET path;
this extends the same read-side tolerance to the LIST batch path.

- Add Entity.getEntityReferenceByIdOrNull() (catches EntityNotFoundException ->
  null), mirroring getFromEntityRef's existing single-entity tolerance.
- Use it in the 8 non-lenient batch service resolvers (Container, MlModel, Topic,
  Pipeline, SearchIndex, LLMModel, IngestionPipeline, Directory) and Container's
  parent/children resolvers, with null-guarded puts.
- Add a deterministic regression test in BaseEntityIT that runs for every
  service-scoped entity: delete only the service row + evict cache (the exact
  TOCTOU state), then assert the scoped list tolerates it instead of failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

  • No GitHub issue is linked. Link an issue in the Development section of the PR (or add Fixes #12345 to the description). For a same-org cross-repo issue, add Fixes open-metadata/<repo>#123 to the description.

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 17, 2026
@harshach harshach added the skip-pr-checks Bypass PR metadata validation check label Jun 17, 2026
Comment on lines +1089 to +1093
org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
params.setService(serviceRef.getFullyQualifiedName());
params.setLimit(1000);

org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Fully-qualified names used instead of imports in BaseEntityIT

In the new list_toleratesConcurrentlyHardDeletedService test, ListParams and ListResponse are referenced via fully-qualified names rather than imports:

org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
...
org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);

The project's code standards explicitly forbid fully-qualified names ("No fully qualified names"). Add import org.openmetadata.sdk.models.ListParams; and import org.openmetadata.sdk.models.ListResponse; and use the simple names. (Note ListResponse/ListParams may already be imported elsewhere in this file given listEntities is used; verify to avoid duplicate-import.)

Was this helpful? React with 👍 / 👎

@gitar-bot

gitar-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown
Code Review 👍 Approved with suggestions 0 resolved / 2 findings

Adds lenient service reference resolution to batch list paths to prevent failures during concurrent cascade deletes. Consider cleaning up fully-qualified names in the new integration test and optimizing the cache pattern in batchFetchServices to avoid re-resolving deleted services.

💡 Quality: Fully-qualified names used instead of imports in BaseEntityIT

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/BaseEntityIT.java:1089-1093

In the new list_toleratesConcurrentlyHardDeletedService test, ListParams and ListResponse are referenced via fully-qualified names rather than imports:

org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
...
org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);

The project's code standards explicitly forbid fully-qualified names ("No fully qualified names"). Add import org.openmetadata.sdk.models.ListParams; and import org.openmetadata.sdk.models.ListResponse; and use the simple names. (Note ListResponse/ListParams may already be imported elsewhere in this file given listEntities is used; verify to avoid duplicate-import.)

💡 Performance: computeIfAbsent does not cache null, re-resolving deleted service per row

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:271-284

In batchFetchServices, the de-dupe cache was added specifically to resolve each unique storage service id only once (see the explanatory comment at lines 266-270). With the switch to getEntityReferenceByIdOrNull, when the service was concurrently hard-deleted the lambda returns null. Map.computeIfAbsent does NOT store a mapping when the mapping function returns null, so for every subsequent row sharing that same (deleted) service id the lambda runs again — re-attempting the lookup and re-throwing/catching EntityNotFoundException each time.

In the exact scenario this PR targets (a large page of children all under one now-deleted service, limit up to 1000), this defeats the de-dupe optimization and incurs N failed lookups + N exception throw/catch cycles instead of one. This is the rare race-window path only, hence minor, but it undercuts the optimization the comment describes. Consider resolving the unique service ids once outside the row loop (e.g. into a Map<UUID, Optional<EntityReference>> or a set of known-missing ids) so deleted ids are not re-resolved per row.

🤖 Prompt for agents
Code Review: Adds lenient service reference resolution to batch list paths to prevent failures during concurrent cascade deletes. Consider cleaning up fully-qualified names in the new integration test and optimizing the cache pattern in `batchFetchServices` to avoid re-resolving deleted services.

1. 💡 Quality: Fully-qualified names used instead of imports in BaseEntityIT
   Files: openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/BaseEntityIT.java:1089-1093

   In the new `list_toleratesConcurrentlyHardDeletedService` test, `ListParams` and `ListResponse` are referenced via fully-qualified names rather than imports:
   
   ```java
   org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
   ...
   org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);
   ```
   
   The project's code standards explicitly forbid fully-qualified names ("No fully qualified names"). Add `import org.openmetadata.sdk.models.ListParams;` and `import org.openmetadata.sdk.models.ListResponse;` and use the simple names. (Note `ListResponse`/`ListParams` may already be imported elsewhere in this file given `listEntities` is used; verify to avoid duplicate-import.)

2. 💡 Performance: computeIfAbsent does not cache null, re-resolving deleted service per row
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:271-284

   In `batchFetchServices`, the de-dupe cache was added specifically to resolve each unique storage service id only once (see the explanatory comment at lines 266-270). With the switch to `getEntityReferenceByIdOrNull`, when the service was concurrently hard-deleted the lambda returns `null`. `Map.computeIfAbsent` does NOT store a mapping when the mapping function returns `null`, so for every subsequent row sharing that same (deleted) service id the lambda runs again — re-attempting the lookup and re-throwing/catching `EntityNotFoundException` each time.
   
   In the exact scenario this PR targets (a large page of children all under one now-deleted service, limit up to 1000), this defeats the de-dupe optimization and incurs N failed lookups + N exception throw/catch cycles instead of one. This is the rare race-window path only, hence minor, but it undercuts the optimization the comment describes. Consider resolving the unique service ids once outside the row loop (e.g. into a `Map<UUID, Optional<EntityReference>>` or a set of known-missing ids) so deleted ids are not re-resolved per row.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Comment on lines 278 to +283
EntityReference serviceRef =
serviceRefById.computeIfAbsent(
serviceId, id -> getEntityReferenceById(STORAGE_SERVICE, id, NON_DELETED));
serviceMap.put(containerId, serviceRef);
serviceId, id -> getEntityReferenceByIdOrNull(STORAGE_SERVICE, id, NON_DELETED));
if (serviceRef != null) {
serviceMap.put(containerId, serviceRef);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 computeIfAbsent silently drops null results — the deduplication comment doesn't hold for the concurrent-delete case.

When getEntityReferenceByIdOrNull returns null, Map.computeIfAbsent does not store a null mapping in serviceRefById (per Java spec: the entry is only inserted when the mapping function returns non-null). On the next iteration for the same deleted serviceId, the key is still absent, so the mapping function is called again. For a page of N containers all under the same concurrently-deleted service you'll perform N DB round-trips instead of one — each hitting the catch block in getEntityReferenceByIdOrNull. This is correct but defeats the deduplication the comment advertises. A sentinel Optional.empty() or a computeIfAbsent-then-check approach can fix it: serviceRefById.computeIfAbsent(serviceId, id -> Optional.ofNullable(getEntityReferenceByIdOrNull(...))), then unwrap with orElse(null). The same pattern applies to DirectoryRepository.batchFetchFromByType's refById map.

Comment on lines +1123 to +1127
private boolean isServiceEntity(String entityType, UUID id) {
EntityInterface candidate =
Entity.getEntity(new EntityReference().withId(id).withType(entityType), "", Include.ALL);
return candidate instanceof ServiceEntityInterface;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 isServiceEntity performs a full entity fetch just to type-check.

Entity.getEntity(ref, "", Include.ALL) loads and deserialises the entire entity to check instanceof ServiceEntityInterface. For the service-type check you could instead call Entity.getEntityRepository(entityType) and inspect the repository's entity class, or check against the known service type string constants — both avoid the DB round-trip.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@sonarqubecloud

Copy link
Copy Markdown

@harshach harshach closed this Jun 18, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (9 flaky)

✅ 4302 passed · ❌ 0 failed · 🟡 9 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 302 0 0 4
🟡 Shard 2 811 0 1 9
🟡 Shard 3 811 0 2 8
🟡 Shard 4 855 0 2 12
🟡 Shard 5 732 0 1 47
🟡 Shard 6 791 0 3 8
🟡 9 flaky test(s) (passed on retry)
  • Features/ColumnBulkOperations.spec.ts › should update display name and propagate to all occurrences (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Verify Recently Viewed widget (shard 3, 1 retry)
  • Features/Tasks/DomainFiltering.spec.ts › selecting All Domains removes domain filter from feed API call (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Entity Reference List (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs skip-pr-checks Bypass PR metadata validation check

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant