Fix flaky "Entity not found: <service>" on list (concurrent cascade delete)#29150
harshach wants to merge 1 commit into
Conversation
…ade delete)

The list batch path resolved each row's parent service via the singular Entity.getEntityReferenceById(), which throws EntityNotFoundException when a sibling test cascade-hard-deletes the service mid-list, failing the whole list with "Entity not found: storageService/mlmodelService <id>" (flaky ContainerResourceIT, MlModelResourceIT). PR #29093 fixed the single-GET path; this extends the same read-side tolerance to the LIST batch path.

- Add Entity.getEntityReferenceByIdOrNull() (catches EntityNotFoundException -> null), mirroring getFromEntityRef's existing single-entity tolerance.
- Use it in the 8 non-lenient batch service resolvers (Container, MlModel, Topic, Pipeline, SearchIndex, LLMModel, IngestionPipeline, Directory) and Container's parent/children resolvers, with null-guarded puts.
- Add a deterministic regression test in BaseEntityIT that runs for every service-scoped entity: delete only the service row + evict cache (the exact TOCTOU state), then assert the scoped list tolerates it instead of failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ PR checklist incomplete
This PR cannot be merged until the following are addressed on its linked issue:
The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically. Maintainers can bypass this check by adding the |
org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
params.setService(serviceRef.getFullyQualifiedName());
params.setLimit(1000);

org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);
💡 Quality: Fully-qualified names used instead of imports in BaseEntityIT
In the new list_toleratesConcurrentlyHardDeletedService test, ListParams and ListResponse are referenced via fully-qualified names rather than imports:
org.openmetadata.sdk.models.ListParams params = new org.openmetadata.sdk.models.ListParams();
...
org.openmetadata.sdk.models.ListResponse<T> response = listEntities(params);
The project's code standards explicitly forbid fully-qualified names ("No fully qualified names"). Add import org.openmetadata.sdk.models.ListParams; and import org.openmetadata.sdk.models.ListResponse; and use the simple names. (Note: ListResponse/ListParams may already be imported elsewhere in this file, given that listEntities is used; verify to avoid a duplicate import.)
Code Review 👍 Approved with suggestions · 0 resolved / 2 findings
Adds lenient service reference resolution to batch list paths to prevent failures during concurrent cascade deletes. Consider cleaning up fully-qualified names in the new integration test and optimizing the cache pattern in …
💡 Quality: Fully-qualified names used instead of imports in BaseEntityIT
📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/BaseEntityIT.java:1089-1093
💡 Performance: computeIfAbsent does not cache null, re-resolving deleted service per row
📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java:271-284
In the exact scenario this PR targets (a large page of children all under one now-deleted service, limit up to 1000), this defeats the de-dupe optimization and incurs N failed lookups + N exception throw/catch cycles instead of one. This is the rare race-window path only, hence minor, but it undercuts the optimization the comment describes. Consider resolving the unique service ids once outside the row loop (e.g. into a …
  EntityReference serviceRef =
      serviceRefById.computeIfAbsent(
-         serviceId, id -> getEntityReferenceById(STORAGE_SERVICE, id, NON_DELETED));
- serviceMap.put(containerId, serviceRef);
+         serviceId, id -> getEntityReferenceByIdOrNull(STORAGE_SERVICE, id, NON_DELETED));
+ if (serviceRef != null) {
+   serviceMap.put(containerId, serviceRef);
+ }
computeIfAbsent silently drops null results — the deduplication comment doesn't hold for the concurrent-delete case.
When getEntityReferenceByIdOrNull returns null, Map.computeIfAbsent does not store a null mapping in serviceRefById (per Java spec: the entry is only inserted when the mapping function returns non-null). On the next iteration for the same deleted serviceId, the key is still absent, so the mapping function is called again. For a page of N containers all under the same concurrently-deleted service you'll perform N DB round-trips instead of one — each hitting the catch block in getEntityReferenceByIdOrNull. This is correct but defeats the deduplication the comment advertises. A sentinel Optional.empty() or a computeIfAbsent-then-check approach can fix it: serviceRefById.computeIfAbsent(serviceId, id -> Optional.ofNullable(getEntityReferenceByIdOrNull(...))), then unwrap with orElse(null). The same pattern applies to DirectoryRepository.batchFetchFromByType's refById map.
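The behaviour described in the comment above can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `lookupOrNull` is a hypothetical stand-in for `getEntityReferenceByIdOrNull` that always reports the service as deleted and counts how often it is called, and plain strings stand in for `EntityReference`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

class NullCachingSketch {
  // Counts how many times the (hypothetical) lenient lookup runs.
  static final AtomicInteger lookups = new AtomicInteger();

  // Hypothetical stand-in for a lenient resolver: the service is always "gone".
  static String lookupOrNull(String serviceId) {
    lookups.incrementAndGet();
    return null;
  }

  // Plain computeIfAbsent: a null result is not stored, so each row repeats the lookup.
  static int lookupsWithPlainMap(int rows) {
    lookups.set(0);
    Map<String, String> refById = new HashMap<>();
    for (int i = 0; i < rows; i++) {
      refById.computeIfAbsent("svc-1", NullCachingSketch::lookupOrNull);
    }
    return lookups.get();
  }

  // Optional sentinel: Optional.empty() is a non-null value, so the miss is cached.
  static int lookupsWithOptionalSentinel(int rows) {
    lookups.set(0);
    Map<String, Optional<String>> refById = new HashMap<>();
    for (int i = 0; i < rows; i++) {
      String ref =
          refById
              .computeIfAbsent("svc-1", id -> Optional.ofNullable(lookupOrNull(id)))
              .orElse(null);
      if (ref != null) {
        // A real resolver would populate the per-row result map here.
      }
    }
    return lookups.get();
  }

  public static void main(String[] args) {
    System.out.println("plain map lookups: " + lookupsWithPlainMap(5));
    System.out.println("sentinel lookups:  " + lookupsWithOptionalSentinel(5));
  }
}
```

With five rows under the same deleted service, the plain map runs the lookup five times and the sentinel variant runs it once. The cost of the sentinel is one extra `Optional` allocation per distinct service id.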
private boolean isServiceEntity(String entityType, UUID id) {
  EntityInterface candidate =
      Entity.getEntity(new EntityReference().withId(id).withType(entityType), "", Include.ALL);
  return candidate instanceof ServiceEntityInterface;
}
isServiceEntity performs a full entity fetch just to type-check.
Entity.getEntity(ref, "", Include.ALL) loads and deserialises the entire entity to check instanceof ServiceEntityInterface. For the service-type check you could instead call Entity.getEntityRepository(entityType) and inspect the repository's entity class, or check against the known service type string constants — both avoid the DB round-trip.
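One of the two alternatives mentioned above, checking against known service type strings, could look roughly like the sketch below. The set is a hypothetical, deliberately incomplete illustration: it lists only the three service type names that appear in this PR's discussion, and the class and method names are made up for the example rather than taken from the codebase.

```java
import java.util.Set;

class ServiceTypeCheckSketch {
  // Illustrative subset only (assumption): type names quoted in this PR's discussion.
  static final Set<String> KNOWN_SERVICE_TYPES =
      Set.of("storageService", "mlmodelService", "messagingService");

  // Pure string check: no entity is loaded or deserialised.
  static boolean isServiceType(String entityType) {
    // Set.of(...) rejects null in contains(), so guard explicitly.
    return entityType != null && KNOWN_SERVICE_TYPES.contains(entityType);
  }

  public static void main(String[] args) {
    System.out.println(isServiceType("storageService"));
    System.out.println(isServiceType("container"));
  }
}
```

A hard-coded set has to be kept in sync with the real list of service types, which is the trade-off against inspecting the repository's entity class.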
🟡 Playwright Results: all passed (9 flaky)
✅ 4302 passed · ❌ 0 failed · 🟡 9 flaky · ⏭️ 88 skipped
🟡 9 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace



Describe your changes:
Fixes the flaky "Entity not found: storageService/mlmodelService <id>" failures on the list path (seen in CI on ContainerResourceIT/MlModelResourceIT). A list resolves each row's parent service in bulk, in a statement separate from the relationship lookup; when a sibling test cascade-hard-deletes that service mid-list, the CONTAINS relationship row is still visible while the service entity row is already gone, and the non-lenient bulk resolver threw and failed the whole list instead of the one affected row. PR #29093 fixed this read-side TOCTOU for the single-GET path; this extends the same tolerance to the LIST batch path. I added Entity.getEntityReferenceByIdOrNull() (catches EntityNotFoundException → null, mirroring getFromEntityRef) and used it in the 8 non-lenient batch service resolvers (Container, MlModel, Topic, Pipeline, SearchIndex, LLMModel, IngestionPipeline, Directory) plus Container's parent/children resolvers, with null-guarded puts.
Type of change:
High-level design:
The failure is read-side: a list loads rows, then resolves each row's CONTAINS parent service in a separate statement. The single-GET path (getContainer/getFromEntityRef) already tolerates the parent being concurrently hard-deleted (returns null); the per-repo batchFetch* loops did not: they called the throwing Entity.getEntityReferenceById(). The fix adds a lenient sibling and swaps it in only where the bulk path was still throwing. Repos that already resolve via the batch getEntityReferencesByIds (Table, Database, DatabaseSchema, Dashboard, Chart, APICollection, APIEndpoint) were already lenient and are untouched. No schema, API, or behavior change beyond returning a null service reference (instead of a 404/500) when the parent was concurrently deleted.
Tests:
Use cases covered
… null, instead of failing the whole page.
Backend integration tests
BaseEntityIT#list_toleratesConcurrentlyHardDeletedService: a generic, deterministic regression test that runs for every entity extending BaseEntityIT. It deletes only the service entity row + evicts the entity cache (the exact concurrent-delete window: relationship row present, entity row gone), then asserts the scoped list succeeds. It is namespace-ownership-gated (only deletes a service this test created) and skips entities with no owned service parent.
openmetadata-integration-tests/.../BaseEntityIT.java
Manual testing performed
mysql-elasticsearch profile: verified RED without the fix (ApiException (404): Entity not found: storageService <id> at the list call) and GREEN with it, for both ContainerResourceIT (storageService) and TopicResourceIT (messagingService). mvn spotless:check passes.
UI screen recording / screenshots:
Not applicable.
Checklist:
🤖 Generated with Claude Code
Greptile Summary
Extends the read-side TOCTOU tolerance (introduced in #29093 for single-GET) to the LIST batch path by adding Entity.getEntityReferenceByIdOrNull and swapping it into 8 non-lenient batchFetch* service resolvers with null-guarded map puts. A new generic regression test in BaseEntityIT reproduces the race window deterministically for every entity type.
- Entity.getEntityReferenceByIdOrNull catches EntityNotFoundException and returns null, mirroring the existing getFromEntityRef and getEntityReferencesByIds tolerance, with a DEBUG log to keep the signal without noise.
- list_toleratesConcurrentlyHardDeletedService in BaseEntityIT deletes only the service entity row and evicts the cache (the exact concurrent-delete window), then asserts the scoped list returns 200 instead of throwing a 404.
Confidence Score: 4/5
Safe to merge — the production fix is minimal and well-scoped; the only note is that the deduplication map in ContainerRepository and DirectoryRepository won't cache null results, so a full page of containers under the same concurrently-deleted service will each hit the DB separately.
The core change is a targeted catch-and-null around getEntityReferenceById in batch resolvers, with careful null-guards before every map put. The pattern is consistent across all 8 repositories. The one thing to watch is computeIfAbsent not caching null in ContainerRepository and DirectoryRepository — the deduplication advertised in the comment only works for live services; deleted services incur N lookups instead of 1 per page.
ContainerRepository.java and DirectoryRepository.java — both use computeIfAbsent for deduplication but that map won't retain null results, so the optimisation doesn't apply to the concurrent-delete case.
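The catch-and-null pattern with a null-guarded put, as summarised above, can be sketched in a self-contained form. This is an illustration under stated assumptions, not the PR's implementation: `NotFoundException`, the in-memory `serviceStore`, and the string-typed references are all hypothetical stand-ins for `EntityNotFoundException`, the entity DAO, and `EntityReference`.

```java
import java.util.HashMap;
import java.util.Map;

class LenientResolverSketch {
  // Hypothetical stand-in for EntityNotFoundException.
  static class NotFoundException extends RuntimeException {
    NotFoundException(String message) {
      super(message);
    }
  }

  // Throwing lookup: fails when the service row is gone.
  static String getReferenceById(Map<String, String> serviceStore, String id) {
    String ref = serviceStore.get(id);
    if (ref == null) {
      throw new NotFoundException("Entity not found: " + id);
    }
    return ref;
  }

  // Lenient sibling: converts "not found" into null instead of propagating.
  static String getReferenceByIdOrNull(Map<String, String> serviceStore, String id) {
    try {
      return getReferenceById(serviceStore, id);
    } catch (NotFoundException e) {
      return null;
    }
  }

  // Batch resolver: rows whose service vanished are skipped, the rest resolve normally.
  static Map<String, String> batchResolve(
      Map<String, String> serviceStore, Map<String, String> serviceIdByChildId) {
    Map<String, String> serviceRefByChildId = new HashMap<>();
    for (Map.Entry<String, String> row : serviceIdByChildId.entrySet()) {
      String ref = getReferenceByIdOrNull(serviceStore, row.getValue());
      if (ref != null) {
        serviceRefByChildId.put(row.getKey(), ref); // null-guarded put
      }
    }
    return serviceRefByChildId;
  }

  public static void main(String[] args) {
    Map<String, String> store = new HashMap<>();
    store.put("svc-live", "ref:svc-live");
    Map<String, String> rows = new HashMap<>();
    rows.put("container-1", "svc-live");
    rows.put("container-2", "svc-deleted"); // relationship row present, entity row gone
    System.out.println(batchResolve(store, rows));
  }
}
```

Only the affected row ends up without a service reference; with the throwing lookup in the loop, the same input would abort the whole batch.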
Important Files Changed
- … getEntityReferenceByIdOrNull: a lenient wrapper that catches EntityNotFoundException and returns null, mirrors existing single-entity and batch-resolver tolerance.
- … computeIfAbsent won't cache null so the deduplication map doesn't protect against repeated DB hits for the same deleted service.
- … batchFetchFromByType to the lenient variant; same computeIfAbsent-null deduplication gap as ContainerRepository. Non-batch paths keep the throwing variant correctly.
- … batchFetchMlModelService to the lenient variant with a null-guard; straightforward and correct.
- … list_toleratesConcurrentlyHardDeletedService regression test that deterministically reproduces the TOCTOU window (entity row deleted, relationship row intact) and asserts the list succeeds; isServiceEntity does a needless full DB fetch for the type-check.
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant ListEndpoint
    participant RelationshipDAO
    participant BatchResolver
    participant EntityDAO
    Client->>ListEndpoint: "GET /containers?service=svc"
    ListEndpoint->>RelationshipDAO: findFromBatch(containerIds, CONTAINS)
    Note over RelationshipDAO: Returns rows (service row still present)
    RelationshipDAO-->>ListEndpoint: "[{serviceId, containerId}, ...]"
    Note over EntityDAO: Concurrent hard-delete removes service entity row
    ListEndpoint->>BatchResolver: resolve service references
    alt Before fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException
        BatchResolver-->>Client: 404 / 500
    else After fix
        BatchResolver->>EntityDAO: fetch service by ID
        EntityDAO-->>BatchResolver: EntityNotFoundException returns null
        BatchResolver-->>ListEndpoint: "serviceRef = null (skipped in map)"
        ListEndpoint-->>Client: 200 OK (service field null for affected rows)
    end
Reviews (1): Last reviewed commit: "Fix flaky "Entity not found: <service>" ..."