Skip to content

feat(rdf): SPARQL playground, MCP knowledge-graph tools, insights endpoints#28042

Open
harshach wants to merge 25 commits into
mainfrom
harshach/rdf-fidelity
Open

feat(rdf): SPARQL playground, MCP knowledge-graph tools, insights endpoints#28042
harshach wants to merge 25 commits into
mainfrom
harshach/rdf-fidelity

Conversation

@harshach

@harshach harshach commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Describe your changes

This PR closes the RDF fidelity gaps in the knowledge-graph stack and removes the experimental R2RML row-materialization feature (the schema-level concept-graph is the strategy going forward; row triples don't scale and were the wrong target). It adds: a SPARQL playground UI, five MCP knowledge-graph tools (SparqlQueryTool, EntityNeighborhoodTool, FindByTagTool, OntologyDescribeTool, ShaclValidateTool), insights endpoints (PageRank-based importance, Louvain communities, shortest-path explain-lineage, tag/glossary co-occurrence, recommendations), an inference-rule registry with a starter pack (transitive lineage, schema-tag/domain inheritance, PII-propagation-via-lineage), federated-SPARQL allowlist, expanded SHACL shapes + REST validator, custom-ontology upload/extension, PROV-O activity mapper, DQV quality mapper, usage mapper, and JSON-LD context coverage for AI/Automation/Governance entities. Deletes the R2RML mapping schema, validator, registry, applier, materializer workflow, OntologyEmitter sink, Linked-Data UI mode, and emitOntologyTriples profiler flag. Adds the OpenMetadata ontology TTL coverage (skos, dcat, prov, dprod, foaf) for the new entity types and a CHANGELOG.

Type of change

  • Improvement
  • New feature
  • Documentation

Tests

  • Unit + ITs added for every new component: SparqlQueryToolTest, EntityNeighborhoodToolTest, FindByTagToolTest, OntologyDescribeToolTest, ShaclValidateToolTest, CustomOntologyValidatorTest, SparqlFederationGuardTest, InferenceRuleValidatorTest, CentralityComputationTest, CommunityComputationTest, LouvainTest, PageRankTest, LineagePathBuilderTest, LineagePathFinderTest, ImportanceQueryBuilderTest, CoOccurrenceQueryBuilderTest, RecommendationsQueryBuilderTest, RdfShaclValidatorTest, OntologyDocumentTest, RdfUsageMapperTest, RdfJsonLdContextTest, expanded RdfPropertyMapperTest, and RdfResourceIT. SPARQL playground covered by SparqlPlayground.test.tsx.

UI screen recording / screenshots

SPARQL playground + KnowledgeGraph view: screenshot/recording to be attached on PR.

🤖 Generated with Claude Code


Summary by Gitar

  • Configuration updates:
    • Renamed remoteEndpoint to RDF_REMOTE_ENDPOINT in openmetadata.yaml for RDF configuration.
  • Migration handling:
    • Updated MigrationWorkflow.java to surface and list specific pending migration versions in error messages.
  • UI localization:
    • Added panel-background-color key across all supported language files in openmetadata-ui.
  • Generated Types:
    • Added rdfConfiguration.ts to define schema types for RDF-related configuration, supporting the new knowledge-graph infrastructure.

This will update automatically on new commits.

Greptile Summary

This PR significantly expands the RDF/knowledge-graph stack by adding a SPARQL playground UI, five MCP knowledge-graph tools, a suite of graph-analytics insights endpoints (PageRank centrality, Louvain communities, BFS lineage paths, tag co-occurrence, recommendations), an inference-rule registry, federated-SPARQL allowlist enforcement, expanded SHACL validation, custom-ontology upload, and broader JSON-LD context coverage — while removing the experimental R2RML row-materialization path.

  • Security hardenings are solid: SPARQL UPDATE validation now uses UpdateFactory (Jena parser) instead of fragile keyword matching; the federation guard uses ElementWalker to catch SERVICE clauses structurally; RdfIriValidator centralises IRI sanitisation for SPARQL interpolation; all new endpoints are authorizeAdmin-gated.
  • New MCP tools (SparqlQueryTool, EntityNeighborhoodTool, FindByTagTool, OntologyDescribeTool, ShaclValidateTool) are read-only, enforce the federation allowlist, and cap response sizes.
  • CustomOntologyValidator.hasCycle has a correctness bug: onStack is not reset between outer-loop DFS calls, so nodes that merely reference nodes in a confirmed cycle are falsely reported as cyclic themselves, causing valid ontology extensions to be rejected at the admin validation endpoint.

Confidence Score: 3/5

The SPARQL injection hardenings and federation allowlist are well-implemented, but the cycle-detection bug in CustomOntologyValidator will falsely reject valid ontology extensions once any extension with a true cycle has been validated in the same request.

The bulk of the change — federation guard, IRI validator, read-only MCP tools, insights endpoints, SHACL validation — is carefully implemented and admin-gated. The one correctness defect in hasCycle produces incorrect validation results for a real admin workflow (custom ontology registration) with no easy workaround; it needs a one-line fix before this path is relied upon.

openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java — the detectClassHierarchyCycles / hasCycle pair needs onStack.clear() between outer iterations.

Important Files Changed

Filename Overview
openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java New admin-only validator for custom ontology extensions; contains a P1 bug in hasCycle where onStack is not reset between outer DFS iterations, producing false-positive cycle errors for nodes that merely reference cyclic nodes without being part of a cycle themselves.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/federation/SparqlFederationGuard.java New SPARQL SERVICE-clause allowlist guard; correctly uses Jena's ElementWalker to parse real SERVICE elements (not text matching), handles subqueries and variable endpoints, and is applied at both the REST and MCP tool SPARQL paths.
openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfResource.java Major expansion adding SHACL validation, inference-rule CRUD, insights endpoints (importance, centrality, communities, lineage path, recommendations, co-occurrence), and ontology extension management — all guarded by authorizeAdmin. The private executeSparqlQuery helper correctly enforces the federation guard on every query path including internally-built insight queries.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SparqlQueryTool.java New MCP SPARQL read-only tool; correctly rejects UPDATE/INSERT/DELETE via Jena parser (not text matching), enforces the federation allowlist, caps response size at configurable maxBytes with safe UTF-8 split logic, and restricts accepted formats to a known set.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/FindByTagTool.java New MCP tag-search tool; SPARQL is built by string interpolation but the FQN is validated to block quotes, backslash, and newlines before insertion into a double-quoted SPARQL string literal, preventing injection.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/LineagePathFinder.java BFS lineage path-finder; uses per-frontier SPARQL round-trips with a visited set to prevent cycles and a parents map for path reconstruction. Inputs are validated by LineagePathBuilder.validateNodeUri.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/PageRank.java Clean iterative weighted PageRank with dangling-node mass redistribution, convergence tolerance, and post-normalization. Handles disconnected components and isolated sinks correctly.
openmetadata-ui/src/main/resources/ui/src/pages/SparqlPlayground/SparqlPlayground.component.tsx New admin SPARQL playground UI; uses window.prompt() for saved-query naming (P2 UX issue). Format options, inference selection, sample queries, and result rendering are well structured.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfIriValidator.java New shared IRI validator for SPARQL interpolation; blocks angle-brackets, quotes, backticks, and control characters, enforces http/https scheme, and caps length at 2048. Not yet reused by LineagePathBuilder.validateNodeUri.
openmetadata-service/src/main/java/org/openmetadata/service/migration/api/MigrationWorkflow.java Minor improvement: pending migration versions are now surfaced in the error message, making startup failures easier to diagnose.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant UI as SparqlPlayground UI
    participant REST as RdfResource (JAX-RS)
    participant Guard as SparqlFederationGuard
    participant Repo as RdfRepository
    participant MCP as MCP SparqlQueryTool

    UI->>REST: "POST /v1/rdf/sparql {query, format, inference}"
    REST->>Guard: enforce(query)
    alt SERVICE clause blocked
        Guard-->>REST: FederationDisallowedException
        REST-->>UI: 403 Forbidden
    else allowed
        Guard-->>REST: ok
        REST->>Repo: executeSparqlQuery(query, mimeType)
        Repo-->>REST: results
        REST-->>UI: 200 + result body
    end

    MCP->>MCP: SparqlQueryTool.execute(params)
    MCP->>MCP: QueryFactory.create(query) — reject UPDATE
    MCP->>Guard: new SparqlFederationGuard(config).enforce(query)
    alt SERVICE blocked
        Guard-->>MCP: FederationDisallowedException
        MCP-->>MCP: return error map
    else allowed
        MCP->>Repo: executeSparqlQuery(query, mimeType)
        Repo-->>MCP: results (capped at maxBytes)
        MCP-->>MCP: return result map
    end

    REST->>REST: "GET /v1/rdf/insights/path?from=..&to=.."
    REST->>REST: LineagePathBuilder.validateNodeUri(from, to)
    REST->>Repo: BFS frontier SPARQL queries (per hop)
    Repo-->>REST: neighbor rows
    REST-->>REST: reconstruct path + decorate with rdf:type
    REST-->>UI: 200 Path JSON
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant UI as SparqlPlayground UI
    participant REST as RdfResource (JAX-RS)
    participant Guard as SparqlFederationGuard
    participant Repo as RdfRepository
    participant MCP as MCP SparqlQueryTool

    UI->>REST: "POST /v1/rdf/sparql {query, format, inference}"
    REST->>Guard: enforce(query)
    alt SERVICE clause blocked
        Guard-->>REST: FederationDisallowedException
        REST-->>UI: 403 Forbidden
    else allowed
        Guard-->>REST: ok
        REST->>Repo: executeSparqlQuery(query, mimeType)
        Repo-->>REST: results
        REST-->>UI: 200 + result body
    end

    MCP->>MCP: SparqlQueryTool.execute(params)
    MCP->>MCP: QueryFactory.create(query) — reject UPDATE
    MCP->>Guard: new SparqlFederationGuard(config).enforce(query)
    alt SERVICE blocked
        Guard-->>MCP: FederationDisallowedException
        MCP-->>MCP: return error map
    else allowed
        MCP->>Repo: executeSparqlQuery(query, mimeType)
        Repo-->>MCP: results (capped at maxBytes)
        MCP-->>MCP: return result map
    end

    REST->>REST: "GET /v1/rdf/insights/path?from=..&to=.."
    REST->>REST: LineagePathBuilder.validateNodeUri(from, to)
    REST->>Repo: BFS frontier SPARQL queries (per hop)
    Repo-->>REST: neighbor rows
    REST-->>REST: reconstruct path + decorate with rdf:type
    REST-->>UI: 200 Path JSON
Loading

Reviews (1): Last reviewed commit: "Fix Java checkstyle" | Re-trigger Greptile

Greptile also left 4 inline comments on this PR.

Copilot AI review requested due to automatic review settings May 11, 2026 18:29
@harshach harshach requested review from a team, akash-jain-10 and tutte as code owners May 11, 2026 18:29
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 11, 2026
@github-actions

Copy link
Copy Markdown
Contributor

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions

Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands OpenMetadata’s RDF/knowledge-graph capabilities end-to-end: UI support for running SPARQL, spec-level ontology/context updates, and service/MCP tooling for querying, validation (SHACL), federation-guarding, inference rules, and “insights” SPARQL query builders.

Changes:

  • Adds SPARQL playground client support + routing, plus supporting UI enums/interfaces and locale keys.
  • Introduces/extends RDF specs (ontology changelog, JSON-LD contexts, RDF configuration schema) and adds inference-rule + federation configuration shapes.
  • Adds service-side RDF utilities (ontology document serving, SHACL validator, usage/quality/activity mappers, insights query builders) and MCP tools that expose read-only SPARQL + graph utilities.

Reviewed changes

Copilot reviewed 97 out of 100 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
openmetadata-ui/src/main/resources/ui/src/rest/rdfAPI.ts Adds SPARQL playground REST helper (runSparqlQuery) and result typing.
openmetadata-ui/src/main/resources/ui/src/pages/SparqlPlayground/SparqlPlayground.interface.ts Adds local types/constants for saved queries and sample queries.
openmetadata-ui/src/main/resources/ui/src/locale/languages/en-us.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/zh-tw.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/zh-cn.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/tr-tr.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/th-th.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/ru-ru.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/pt-pt.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/pt-br.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/pr-pr.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/nl-nl.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/mr-in.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/ko-kr.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/ja-jp.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/he-he.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/gl-es.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/fr-fr.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/es-es.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/de-de.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/locale/languages/ar-sa.json Adds SPARQL playground-related labels/messages; removes one label key.
openmetadata-ui/src/main/resources/ui/src/enums/codemirror.enum.ts Adds CodeMirror mode value for SPARQL.
openmetadata-ui/src/main/resources/ui/src/constants/constants.ts Adds route constant for SPARQL playground.
openmetadata-ui/src/main/resources/ui/src/components/KnowledgeGraph/KnowledgeGraph.tsx Refactors metadata-mode body rendering; removes a header label.
openmetadata-ui/src/main/resources/ui/src/components/AppRouter/AuthenticatedAppRouter.tsx Registers lazy-loaded SPARQL playground route.
openmetadata-spec/src/main/resources/rdf/ontology/openmetadata-prov.ttl Adds missing RDF prefix in PROV extension TTL.
openmetadata-spec/src/main/resources/rdf/ontology/CHANGELOG.md Introduces ontology changelog documenting fidelity changes.
openmetadata-spec/src/main/resources/rdf/contexts/lineage.jsonld Adjusts lineage JSON-LD mappings to avoid predicate collisions.
openmetadata-spec/src/main/resources/rdf/contexts/governance.jsonld Aligns governance mappings with SKOS predicates and adds new fields.
openmetadata-spec/src/main/resources/rdf/contexts/automation.jsonld Adds JSON-LD context for automation/workflow entities.
openmetadata-spec/src/main/resources/rdf/contexts/ai.jsonld Adds JSON-LD context for AI/MCP-related entities.
openmetadata-spec/src/main/resources/json/schema/api/configuration/rdfConfiguration.json Adds federation config block to RDF configuration schema.
openmetadata-spec/src/main/resources/json/schema/api/configuration/rdf/inferenceRule.json Adds schema for inference rule objects (CONSTRUCT/RDFS placeholder).
openmetadata-spec/src/main/resources/json/schema/api/configuration/rdf/customOntology.json Adds schema for custom ontology extension definitions.
openmetadata-service/src/test/java/org/openmetadata/service/resources/rdf/RdfShaclValidatorTest.java Adds SHACL validation tests for key shape constraints.
openmetadata-service/src/test/java/org/openmetadata/service/resources/rdf/OntologyDocumentTest.java Adds tests for ontology serving endpoint serialization.
openmetadata-service/src/test/java/org/openmetadata/service/rdf/translator/RdfUsageMapperTest.java Adds tests for RDF usage summary mapping.
openmetadata-service/src/test/java/org/openmetadata/service/rdf/insights/RecommendationsQueryBuilderTest.java Adds tests for recommendations SPARQL builder validation/shape.
openmetadata-service/src/test/java/org/openmetadata/service/rdf/insights/CoOccurrenceQueryBuilderTest.java Adds tests for co-occurrence/popularity/reach SPARQL builders.
openmetadata-service/src/main/resources/rdf/inference-rules/transitive-lineage-closure.json Adds starter inference rule JSON.
openmetadata-service/src/main/resources/rdf/inference-rules/schema-tag-inheritance.json Adds starter inference rule JSON.
openmetadata-service/src/main/resources/rdf/inference-rules/pii-propagation-via-lineage.json Adds starter inference rule JSON.
openmetadata-service/src/main/resources/rdf/inference-rules/domain-membership-inheritance.json Adds starter inference rule JSON.
openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfShaclValidator.java Adds SHACL shapes loader + validator helper.
openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/OntologyDocument.java Adds merged ontology document loader + serializer for multiple formats.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/translator/RdfUsageMapper.java Adds mapper emitting usage-summary triples.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/translator/RdfQualityMapper.java Adds mapper emitting DQV quality measurement resources.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/translator/RdfActivityMapper.java Adds PROV activity mapping for pipeline runs.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/translator/JsonLdTranslator.java Loads new JSON-LD contexts and assigns column IDs for named column resources.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfUtils.java Adds RDF types for new entities and a helper to mint column URIs.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/RecommendationsQueryBuilder.java Adds recommendations SPARQL builder.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/ImportanceQueryBuilder.java Adds importance ranking SPARQL builder.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CoOccurrenceQueryBuilder.java Adds co-occurrence/popularity/reach SPARQL builders.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/inference/InferenceRuleValidator.java Adds strict validation for inference rule bodies.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/inference/InferenceRuleRegistry.java Adds in-memory inference rule registry + starter pack loader.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/federation/SparqlFederationGuard.java Adds SERVICE allowlist enforcement for federated SPARQL.
openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyRegistry.java Adds in-memory registry for custom ontology extensions.
openmetadata-mcp/src/test/java/org/openmetadata/mcp/tools/OntologyDescribeToolTest.java Adds tests for ontology describe MCP tool.
openmetadata-mcp/src/test/java/org/openmetadata/mcp/tools/FindByTagToolTest.java Adds tests for find-by-tag MCP tool.
openmetadata-mcp/src/test/java/org/openmetadata/mcp/tools/EntityNeighborhoodToolTest.java Adds tests for entity neighborhood MCP tool.
openmetadata-mcp/src/main/resources/json/data/mcp/tools.json Registers new MCP tool schemas for KG/SPARQL features.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SparqlQueryTool.java Adds read-only SPARQL MCP tool with federation guard + truncation.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/ShaclValidateTool.java Adds SHACL validation MCP tool (entity-scoped or full graph).
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/OntologyDescribeTool.java Adds ontology describe MCP tool (full ontology or DESCRIBE).
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/FindByTagTool.java Adds MCP tool to find entities by tag/glossary FQN.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/DefaultToolContext.java Wires new MCP tools into tool dispatcher.
ingestion/src/metadata/workflow/profiler.py Adds optional ontology emission step to profiler workflow.
docker/development/docker-compose-postgres-fuseki.yml Adds RDF endpoint env var for Fuseki-based dev stack.
Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfUtils.java:20

  • New RDF types were added for workflowinstance, agentexecution, mcpexecution, automation, etc., but PROV_ACTIVITY_TYPES was not updated. Since JsonLdTranslator.toRdf() relies on getProvType() to add prov:Activity typing (when the primary rdf:type is not already a PROV class), these new execution-like entities will miss rdf:type prov:Activity, reducing PROV-O compatibility and potentially breaking queries that filter by prov:Activity. Please extend the PROV type sets to cover the new entity types.
  private static final Set<String> PROV_ACTIVITY_TYPES =
      Set.of(
          "pipeline",
          "ingestionpipeline",
          "storedprocedure",
          "dbtpipeline",
          "workflow",
          "pipelinerun");

Comment thread ingestion/src/metadata/workflow/profiler.py Outdated
@github-actions

Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

harshach added a commit that referenced this pull request May 11, 2026
…ntic correctness

Addresses the open review comments on PR #28042. Each fix is independent.

P0 — correctness / security
- ingestion/profiler.py: drop residual OntologyEmitter import (module removed
  in the R2RML pivot). Every pytest that imports profiler.py was failing with
  ModuleNotFoundError; reverts file to match main.
- RdfResource.validateGraph + ShaclValidateTool: replace `entityUri.replace(">", "")`
  with strict absolute-http(s)-IRI validation. Reject control chars, whitespace,
  quotes, and angle brackets up front. Closes the SPARQL-injection vector via
  newlines / # comments in the DESCRIBE template.
- EntityNeighborhoodTool.buildConstructQuery: rewrite to emit each path edge in
  its own UNION arm with the correct subject. The previous BIND(<entity> AS ?s)
  applied across all arms collapsed every multi-hop edge onto the start node.

P1 — semantic / robustness
- CommunityComputation.parseGraph: stop emitting both directions; the FILTER in
  the SPARQL canonicalises pairs and Louvain.addAllEdges symmetrises the
  adjacency internally. Double symmetrisation was doubling every edge weight.
- RdfActivityMapper: emit `prov:wasInformedBy` (Activity → Activity) instead of
  `prov:wasGeneratedBy` (Entity → Activity) for pipeline-run → pipeline. Closes
  the PROV-O domain/range inversion.
- RdfQualityMapper.measurementUri: deterministic URI built from subject + metric
  + timestamp. Random UUID was leaking orphan QualityMeasurement nodes on every
  re-emit because deletion only follows subject/object on the entity itself.
- SparqlQueryTool: byte-aware truncation. Previous substring-by-char cap could
  exceed maxBytes for multi-byte UTF-8 and never enforced a real byte limit.
- ShaclValidateTool: require explicit `fullGraph=true` to validate the entire
  triplestore; reject otherwise. Prevents accidental OOM on multi-GB graphs.
- RdfResource.getFederationGuard: synchronise the lazy-init path; the volatile
  field was double-checked without a lock.

P2 — code quality
- RdfResource.validateGraph: replace 14 inline FQN class names with proper
  imports (Model / ModelFactory / Lang / RDFDataMgr / RDFFormat / ValidationReport).
- RdfShaclValidator + OntologyDocument: catch RuntimeException (covers
  RiotException) in the static-init shapes/ontology load. A corrupt classpath
  resource now degrades to an empty model instead of failing class init and
  taking down /rdf/* and MCP describe.
- i18n: restore `label.view-mode` (still referenced by
  OntologyExplorer/FilterToolbar.tsx) and route the SparqlPlayground sample-query
  names through translation keys (`label.sparql-sample-*`). All 17 locales
  synced via `yarn i18n`.

Tests updated where the fix changes observable behaviour:
- EntityNeighborhoodToolTest: assert the new multi-hop subject preservation
  and the new depth-3 chain variable name.
- CommunityComputationTest: assert the directed-single-edge behaviour
  (Louvain symmetrises internally).
- RdfPropertyMapperTest: assert `prov:wasInformedBy` instead of
  `prov:wasGeneratedBy` for the pipeline activity.
- SparqlPlayground.test.tsx: drive the sample button by `nameKey`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 19:48
Comment thread openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/ShaclValidateTool.java Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 95 out of 98 changed files in this pull request and generated 4 comments.

Comment thread openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/FindByTagTool.java Outdated
pmbrull and others added 15 commits May 20, 2026 14:15
#28289)

The ContextMemory schema (#28224) ships MemoryVisibility (PRIVATE/SHARED/
ENTITY) and shareConfig.sharedWith, but the CRUD resource didn't enforce
them — any non-admin user could read another user's PRIVATE memory via
GET /v1/contextCenter/memories/{id} or pick them out of /memories list
results. Wire visibility into list/get/getByName via a small
ContextMemoryVisibility helper:

- list: drop non-visible memories from the response data
- get / getByName: 403 (ForbiddenException) when not visible to the caller
- Visibility rule: admin → all, owner → own memory, ENTITY → everyone,
  SHARED → only principals listed in sharedWith (with team/domain
  expansion), PRIVATE / null shareConfig → owner only
…to databases (#28288)

* fix(ui): return 404 for unknown service category instead of fallback to databases

* fix ui checkstyle
…7 with retry + add workflow_dispatch (#28292)

* ci(airflow-apis-tests): retry Sonar PR scan on JRE-provisioning flake

Mirror the py-tests pattern: migrate from the deprecated
sonarsource/sonarcloud-github-action@master to
SonarSource/sonarqube-scan-action@v7, mark the PR scan
continue-on-error, and add a sleep+retry step so a transient
'Failed to query JRE metadata' from Sonar's JRE-provisioning
endpoint no longer fails the job on first attempt. Hoist the
shared sonar args into a workflow-level SONAR_OPTS env.

* ci(airflow-apis-tests): allow workflow_dispatch + run Sonar step on it

Add workflow_dispatch trigger so the Sonar retry path can be
exercised from the Actions UI without opening a PR, and extend
the Sonar PR step (plus its wait+retry siblings) to run on the
dispatch event.

* ci(airflow-apis-tests): scope Sonar steps to pull_request_target only

Drop workflow_dispatch from the Sonar PR/retry step conditions so
manual runs don't fire the scanner with empty -Dsonar.pullrequest.*
flags (would create a branch entry in SonarCloud, per gitar-bot
review). Dispatch trigger stays for re-running the build/test
surface; Sonar will only fire on a real PR where the pull-request
context exists.
…28273)

* fix(logging): synchronous shutdown captures full streamable log tail

V1's close() returned before the daemon sender delivered /close, leaving
the server-side multipart upload unfinalized — workflow tail (Execution
Time Summary, Workflow Summary, "Workflow finished in time") never
reached S3, only partial.txt did.

Redesign:
- Single Queue + single daemon worker; threading.Event for control
  plane instead of a sentinel on the data queue.
- flush(timeout) / shutdown(timeout) are synchronous: caller blocks
  until queued records are delivered or deadline.
- atexit hook covers callers that forget to shut down.
- Shutdown ships a final metrics line + synchronous /close POST.
  On join timeout, force-stops the worker and retries on a fresh REST
  so /close is never starved by an in-flight POST.
- log_server_version() decoupled from validate_versions() so the
  "client running" line is captured by the handler instead of being
  emitted before setup.

Counters (shipped, failed, dropped, timed_out) are logged at shutdown
so silent loss becomes loud.

* fix(logging): close flush() TOCTOU between dequeue and POST start

flush() polled (buffer.empty() AND not _post_in_flight). The worker
set _post_in_flight inside _post_batch, which left a microsecond gap
after buffer.get() returned (queue empty) but before _post_batch ran
(flag still clear). A flush() that landed in that gap could conclude
"drained" while the worker held a batch about to ship.

Not a data-loss bug in our caller (shutdown() always follows flush()
with worker.join()), but flush() is public and the contract should
hold standalone.

Set _post_in_flight inside _collect_batch immediately after the first
successful get(); the gap shrinks to one Python line. The flag stays
clear while the worker idles in buffer.get(timeout=...), so flush()
still returns fast when there's no work.

* refactor(logging): name magic timeouts in streamable handler

Hardcoded numbers in flush() and shutdown() were not self-documenting
and tied callers to undocumented invariants. Promoted to named class
constants:

- FLUSH_DEFAULT_SEC (5.0): flush() default deadline.
- FLUSH_POLL_SEC (0.05): how often flush() rechecks state.
- FORCE_STOP_JOIN_SEC (2.0): secondary worker join after force-stop.

flush(timeout) now uses the Optional[None] sentinel pattern matching
shutdown(timeout). Added a comment to the poll-sleep explaining the
two-conditions-AND requirement that motivates it.

* fix(logging): close worker/main _client race + mock log_server_version

Two issues surfaced by the PR review:

1. Workflow.__init__ now calls metadata.log_server_version(), which hits
   GET /system/version when not mocked. conftest pre-mocks the sibling
   validate_versions; add log_server_version so unit tests don't
   blow up on collection.

2. After force-stop, shutdown() reassigned self._client = REST(...).
   The worker's finally block (which closes self._client) could then
   close the freshly-created client under the main thread's metrics
   and /close POSTs. Use a local 'post_close_client' instead so the
   worker's session and the main thread's never alias.

Also document on shutdown(timeout=...) that the post-stop metrics and
/close POSTs carry their own HTTP timeouts, so total wall-time can be
roughly timeout + 32s in pathological cases.
…8260)

* feat(mysql): support custom queryHistoryTable for usage & lineage

MySQL lineage/usage extraction reads query history from one of two
hardcoded tables: `mysql.general_log` (default) and `mysql.slow_log`
(when `useSlowLogs=true`). Many deployments don't grant ingestion users
access to either of these system tables and instead replicate query
history into a custom table or view, leaving the connector with no way
to point at the replacement.

Add an optional `queryHistoryTable` field to MysqlConnection. When set,
it overrides the table read by the lineage/usage SQL and by the
test-connection probe — for both the general-log and slow-log paths.
Column expectations (`argument` vs `sql_text`, `event_time` vs
`start_time`) still follow `useSlowLogs`, so the custom table must
expose columns compatible with the selected path.

Mirrors the existing `queryHistoryTable` pattern in the Trino connector
(parameterized `{query_history_table}` placeholder in the SQL template,
resolved at format time from the service connection).

Fixes #28089

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update generated TypeScript types

* fix(mysql): add queryHistoryTable to SDK builder example

The exhaustive MysqlConnection construction in the SDK builder example
enumerates every connection field; basedpyright (run with
--baselinemode=discard) flagged the newly added queryHistoryTable as a
missing constructor argument. Pass it explicitly as None to keep the
example exhaustive and the static-checks job green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…27987)

* fix(alerts): make matchAnyEntityFqn match FQNs literally, not as regex

* fix(alerts): strict literal matching across alert filter functions

* fix(policy): AST-based SpEL validation so string-literal content is not parsed as syntax

* fix(data-insight): evaluate chart formulas outside the policy/alert validator

* test(playwright): ObservabilityAlerts must pick filter options by FQN, not bare name

* test(playwright): focus combobox before fill in alert action input loop

* test(playwright): extend inputValueId pattern to alert action inputs

* refactor(alerts): move UUID regex to constants and use size variable
* fix(ometa): use requests instead of httpx for SSE transport

In some enterprise environments httpx's HTTP/1.1 wire pattern is reset
by network security middleboxes that pass requests/wget through. The
symptom is httpx.ReadError: [Errno 104] Connection reset by peer at
_receive_response_headers, with no server-side access log line because
the connection is reset before any HTTP response is committed. requests
and wget succeed against the same endpoint with the same JWT and body.

The SDK's REST client already uses requests, so this aligns the SSE
path with the proven transport. Behavior preserved: same retry loop,
same SSE parser, same Last-Event-ID resume on retry, same log lines.

Drive-by fix: the previous implementation built opts = { ...,
allow_redirects, verify, cookies, ... } but never passed those keys to
httpx.stream(...) - a latent bug. requests.Session.request now wires
them up properly (defaults: verify=True, allow_redirects=True when
unset).

No new dependency: requests is already used by the SDK's REST client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(ometa): drop noisy comment, expose timeout in SSE stream()

- Remove the fix-rationale comment from the request_kwargs block; that
  context lives in the PR description and would rot in-tree.
- Add an optional timeout parameter to SSEClient.stream(), forwarded to
  requests. Defaults to None (no timeout, prior behavior).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(ometa): apply ruff formatting to sse_client

Output of `make py_format` after the previous commits in this branch:
ruff collapses a few multi-line expressions onto one line where they
fit within the line width. No functional change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ometa): address PR review on SSE client

Three follow-ups from Copilot's review on PR #28293:

1. Forward `ClientConfig.cert` to the SSE request so mTLS / client-cert
   setups work for SSE the same way they do for the REST client. Without
   this, deployments that authenticate the REST path via mutual TLS would
   silently fail on `/agents/dynamic/run` and similar SSE endpoints.

2. Document `data` and the new `timeout` parameter in the `stream()`
   docstring, including the SSE-specific note that the default of `None`
   disables timeouts because SSE streams can have long idle periods.

3. Migrate unit tests off `httpx` mocks:
     - `tests/unit/metadata/ingestion/ometa/test_sse_client.py` and
       `tests/unit/test_user_agent.py` now mock `requests.Session.request`
       instead of `httpx.Client.stream`.
     - `MockHTTPXClient` is replaced by `MockRequestsSession` that
       accepts the SDK's full kwargs set (method/url/headers/json/params/
       stream/timeout/verify/allow_redirects/cookies/cert).
     - `MockSSEResponse.iter_lines` now accepts `decode_unicode` to
       match the call shape of the new code.
     - HTTP error / connection error fixtures use `requests.exceptions`.

Not adopting Copilot's suggestion to fall back to `self.config.timeout`
when the stream() arg is `None`: the REST client's default is `(10, 300)`,
which would silently abort SSE streams that have legitimate idle gaps
longer than 5 minutes (LLM "thinking" pauses, sparse update events).
Keeping the SSE default at `None` (no timeout) preserves prior behavior
and matches the working-everywhere baseline; callers can opt in.

All 56 tests in the two updated files pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…1.12.9 (#28238)

* fix(migration): preempt PDTS duplicates and recover invalid index in 1.12.9

CREATE UNIQUE INDEX CONCURRENTLY aborts when it hits existing duplicate
keys but leaves an invalid index behind. On migration retry, IF NOT EXISTS
no-ops successfully and gets checksum-logged, after which ADD CONSTRAINT
USING INDEX fails permanently with "index ... is not valid". Hit at a
customer with two duplicate table.systemProfile rows on a 10M-row PDTS.

Adds two idempotent statements before the existing constraint build:

- DO block: drops the invalid index and clears its migration-log entry
  when indisvalid=false. No-op on fresh DBs and on already-migrated
  environments (where the index is valid and owned by the constraint).

- DELETE: collapses duplicate rows via single hash aggregate on the
  4-column key + targeted self-join. Reads only key columns (no json
  scan), only touches rows in actual duplicate groups, no-op on clean DBs.
  Efficient on multi-million-row PDTS tables.

Existing CREATE INDEX / ALTER TABLE / ANALYZE statements byte-identical
so checksum-matched skips for already-migrated environments still apply.

* fix(migration): self-heal in one pass + schema-scoped invalid-index probe

Addresses Copilot review on #28238:

1. One-pass self-heal. MigrationFile.parseSQLFiles filters already-logged
   statements at parse time (MigrationFile.java:83). Clearing the CREATE
   log entry from inside a DO block doesn't bring CREATE back into the
   current pass's execution list — it would only re-run on the next
   migration cycle, leaving the same-pass ALTER to fail again. Replace
   the "DROP + clear log" pattern with "DROP + rebuild inline" so a
   valid index exists before ALTER runs in the same pass.

   Inline rebuild uses non-concurrent CREATE UNIQUE INDEX, which takes a
   brief ACCESS EXCLUSIVE lock on the table. Acceptable because this
   path fires only when the environment is already in a degraded state.
   Normal-path customers go through the CONCURRENTLY build below.

2. Schema-scoped invalid-index probe. pg_class.relname is not
   schema-unique. Anchor the lookup via
   i.indrelid = 'profiler_data_time_series'::regclass and DROP by index
   OID (invalid_idx::regclass), so an invalid index with the same name
   in another schema cannot accidentally trigger this branch.

Existing CREATE INDEX / ALTER TABLE / ANALYZE statements byte-identical
to before this PR, so checksum-matched skips still apply for
already-migrated environments. Test gap (Copilot's third comment) for
the recovery scenario tracked as follow-up — existing migration tests
in MigrationWorkflowReprocessingTest are mock-based; verifying recovery
end-to-end needs Postgres integration infrastructure.

* chore(migration): trim verbose comments in 1.12.9/postgres/schemaChanges.sql

Statement bodies unchanged — checksums identical. Detailed mechanism
write-ups live in the commit log / PR description; the file keeps just
the load-bearing intent comment above each statement.

* fix(migration): scope PDTS dedup to operation IS NOT NULL

Addresses Copilot review on PR #28238 discussion r3264066840.

Postgres UNIQUE treats NULLs as DISTINCT by default, so the constraint
on (entityFQNHash, extension, operation, timestamp) permits multiple
rows where operation IS NULL — i.e. table.tableProfile and
table.columnProfile rows that share the other key columns.

The previous dedup used GROUP BY which treats NULLs as equal, so it
would have collapsed retry-induced tableProfile / columnProfile pairs
that the restored constraint never actually blocked. Restricting the
subquery to operation IS NOT NULL (plus a defensive entityFQNHash IS
NOT NULL) aligns dedup with constraint semantics.

DMG's customer rows were all table.systemProfile (operation = INSERT),
so this still removes the customer dupes correctly. tableProfile /
columnProfile retry duplicates — if they exist — stay as-is, which is
the same outcome the unique constraint would produce on its own.

* perf(migration): boost work_mem / maintenance_work_mem for PDTS dedup at scale

Mirrors the tuning pattern from 1.9.9/postgres/postDataMigrationSQLScript.sql
(same table, same operation class). On 50M-row PDTS the dedup DELETE's hash
aggregate spills to disk with default work_mem=4MB, adding ~30-60s of disk
I/O. Bumping work_mem to 256MB keeps the aggregate in memory;
maintenance_work_mem=512MB lets CREATE UNIQUE INDEX CONCURRENTLY sort in
memory too.

Session-level (not SET LOCAL) because schemaChanges runs in autocommit
(CREATE INDEX CONCURRENTLY requires it) — SET LOCAL would reset between
statements. RESET at the end of the file restores defaults before the
connection returns to the Hikari pool.

Expected runtime impact at customer scale:
  20M rows:  ~30s tuned vs ~40s default
  50M rows:  ~40s tuned vs ~90s default (avoids spill)

* chore(migration): trim comments on PDTS dedup additions

* chore(migration): drop 1.9.9 reference from mem comment
…tityStats (#28264)

* fix(data-insights): resolve owner by id in processTeam to handle bare historical refs

* refactor(data-insights): isolate enricher step failures + per-step EntityStats

* test(data-insights): add enricher behavior IT + rename equivalence IT

* refactor(data-insights): extract VersionResolver + SnapshotMaterializer

* refactor(data-insights): extract enricher concerns into step classes + bounded owner cache + rate-limited loss warns

* refactor(data-insights): apply review feedback — tighten enricher docs, drop dead code

* docs(data-insights): drop unsupported 'intentional' claim from TierStep prefix-match docstring

* fix(data-insights): guard VersionResolver against null updatedAt on historical version rows

* refactor(data-insights): tighten VersionResolver walk — consume isFirst at top, drop historyDone flag
* feat(ui): allow admins to set default landing-page panel color

Adds an optional panelBackgroundColor field to ThemeConfiguration and a
matching color picker on the admin Theme settings page. The landing page
welcome panel now falls back to this admin-configured color when no
per-user or per-persona override exists, preserving existing customization.
…28228)

* fix(dq-dashboard): tier filter queries wrong ES field, showing 0 in testCase/result/resolutionStatus widgets

Tier is indexed in a dedicated `tier.tagFQN` field (or `testCase.tier.tagFQN`
for derived indices), not in the `tags[]` array. Several fetch functions were
merging tier FQNs into the tags array and routing them through
`buildMustEsFilterForTags`, which emits a `nested: { path: 'tags' }` query
that never matches a tier-tagged doc.

- Add `buildMustEsFilterForTier(tiers, isTestCaseResult)` in DataQualityUtils —
  emits a flat `term: { 'tier.tagFQN' }` or `term: { 'testCase.tier.tagFQN' }`
- Fix `buildDataQualityDashboardFilters` !isTableApi branch to route tier and
  tags through separate filters
- Fix all four bypass functions in dataQualityDashboardAPI.ts:
  `fetchTestCaseSummaryByNoDimension`, `fetchCountOfIncidentStatusTypeByDays`,
  `fetchIncidentTimeMetrics`, `fetchTestCaseStatusMetricsByDays`
- Fix TagPage data_observability tab: Tier tags now pass `tier` in initialFilters
  and hide the tier dropdown, instead of incorrectly using the tags filter

Fixes open-metadata/openmetadata-collate#4086

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix failing unit tests

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…versions

conf/openmetadata.yaml line 720 was reading ${RDF_ENDPOINT}, but every
docker-compose stack and deployment doc has always exported
RDF_REMOTE_ENDPOINT. The mismatch was silently falling back to the
localhost default, so any non-local Fuseki configuration was ignored and
the server crashed trying to push the ontology to http://localhost:3030.
Realign the placeholder with the long-standing env var name.

Also extend MigrationWorkflow.validateMigrationsForServer to include the
list of pending versions in the IllegalStateException message. The bare
"there are pending migrations" wording forced operators to add their own
logging to find out which versions were actually outstanding; listing
them inline cuts straight to the failing migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sonarqubecloud

Copy link
Copy Markdown

# Conflicts:
#	.github/workflows/airflow-apis-tests.yml
#	conf/openmetadata.yaml
#	ingestion/src/metadata/ingestion/source/database/mysql/connection.py
#	ingestion/src/metadata/utils/streamable_logger.py
#	ingestion/src/metadata/workflow/base.py
#	ingestion/tests/unit/utils/test_streamable_logger.py
#	openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/AlertsRuleEvaluatorResourceIT.java
#	openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/DefaultToolContext.java
#	openmetadata-service/src/main/java/org/openmetadata/service/events/subscription/AlertsRuleEvaluator.java
#	openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java
#	openmetadata-service/src/main/java/org/openmetadata/service/resources/context/ContextMemoryResource.java
#	openmetadata-service/src/main/java/org/openmetadata/service/security/policyevaluator/ExpressionValidator.java
#	openmetadata-ui/src/main/resources/ui/playwright/e2e/Features/DataQuality/DataQualityDashboard.spec.ts
#	openmetadata-ui/src/main/resources/ui/playwright/e2e/Flow/ObservabilityAlerts.spec.ts
#	openmetadata-ui/src/main/resources/ui/playwright/utils/dataQuality.ts
#	openmetadata-ui/src/main/resources/ui/src/components/MyData/CustomizableComponents/CustomiseLandingPageHeader/CustomiseLandingPageHeader.test.tsx
#	openmetadata-ui/src/main/resources/ui/src/components/OntologyExplorer/hooks/useGraphData.ts
#	openmetadata-ui/src/main/resources/ui/src/components/OntologyExplorer/hooks/useOntologyGraph.ts
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/ar-sa.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/de-de.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/en-us.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/es-es.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/fr-fr.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/gl-es.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/he-he.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/ja-jp.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/ko-kr.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/mr-in.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/nl-nl.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/pr-pr.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/pt-br.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/pt-pt.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/ru-ru.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/th-th.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/tr-tr.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/zh-cn.json
#	openmetadata-ui/src/main/resources/ui/src/locale/languages/zh-tw.json
#	openmetadata-ui/src/main/resources/ui/src/rest/dataQualityDashboardAPI.test.ts
#	openmetadata-ui/src/main/resources/ui/src/rest/dataQualityDashboardAPI.ts
#	openmetadata-ui/src/main/resources/ui/src/utils/Alerts/AlertsUtil.tsx
#	openmetadata-ui/src/main/resources/ui/src/utils/CustomizePage/CustomizePageUtils.test.ts
#	openmetadata-ui/src/main/resources/ui/src/utils/DataQuality/DataQualityUtils.tsx
#	openmetadata-ui/src/main/resources/ui/src/utils/StringUtils.test.ts
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

  • No GitHub issue is linked. Link an issue in the Development section of the PR (or add Fixes #12345 to the description). For a same-org cross-repo issue, add Fixes open-metadata/<repo>#123 to the description.

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

@github-actions

Copy link
Copy Markdown
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown
Code Review ⚠️ Changes requested 7 resolved / 9 findings

Implements extensive knowledge-graph features including a SPARQL playground, MCP tools, and inference registry while deprecating legacy R2RML materialization. Resolves multiple injection and validation vulnerabilities, though the CustomOntologyValidator requires a fix for class forward-references and the SPARQL UPDATE logic lacks proper URI sanitization.

⚠️ Bug: CustomOntologyValidator rejects valid forward-references between classes

📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:106-109 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:151 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:157-158

The validateClass method checks whether a parent class URI exists in seenUris (line 158), but seenUris is populated incrementally as classes are processed in list order (line 151). If class B declares class A as its subClassOf parent, but A appears after B in the classes list, the validator will incorrectly report "references unknown parent class" — even though both classes are declared in the same extension.

This makes validation order-dependent and will reject valid ontology extensions where the author didn't happen to declare classes in topological order. The cycle-detection method (detectClassHierarchyCycles) correctly builds the full graph before traversing, so it's not affected — only the parent-existence check is broken.

Pre-collect all class URIs before validating parent references, so declaration order doesn't matter. Change the duplicate-URI detection inside validateClass to use a separate seen-set or check before adding.
// First pass: collect all declared class URIs
Set<String> classUris = new HashSet<>();
for (CustomOntologyClass cls : classes) {
  if (cls != null && cls.getUri() != null && !cls.getUri().isBlank()) {
    classUris.add(cls.getUri());
  }
}
// Second pass: validate each class against the full set
for (CustomOntologyClass cls : classes) {
  validateClass(cls, classUris, errors);
}
💡 Security: SPARQL UPDATE built via string concat without URI validation

📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CentralityComputation.java:146-157 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CommunityComputation.java:214-218

In CentralityComputation.persistScores and CommunityComputation.persistCommunities, entity URIs obtained from SPARQL SELECT results are concatenated directly into SPARQL UPDATE strings using <uri> syntax without any IRI validation. While valid RDF IRIs cannot contain > (so Jena-stored URIs should be safe), this is a defense-in-depth gap: if a malformed URI ever enters the triplestore (e.g., via a custom ontology upload or a bug in another component), it could break out of the <…> brackets and inject arbitrary SPARQL UPDATE operations.

The codebase already provides RdfIriValidator.validateEntityIri() for this purpose but it's not used in the insights package.

Add a basic IRI sanity check (or use RdfIriValidator) before concatenating URIs into the UPDATE string. Apply similarly in CommunityComputation.persistCommunities.
// In CentralityComputation.persistScores, before appending:
for (Map.Entry<String, Double> e : sorted) {
  String uri = e.getKey();
  if (uri == null || uri.contains(">") || uri.contains(" ") || uri.contains("
")) {
    continue; // skip malformed URIs
  }
  update.append("  <").append(uri).append("> ...");
}
✅ 7 resolved
Security: SPARQL injection via insufficient URI sanitization in validateGraph

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfResource.java:197-200 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CommunityComputation.java:202-205 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfResource.java:199 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CommunityComputation.java:203
The validateGraph endpoint (line 199) sanitizes user-supplied entityUri by only stripping > characters before interpolating into a SPARQL query: String.format("DESCRIBE <%s>", entityUri.replace(">", "")). This is insufficient — an attacker can inject SPARQL via newlines, # comments, or other control characters. For example, entityUri = "http://x> . } DELETE WHERE { ?s ?p ?o } #" would close the angle bracket context on a newline. Since this is an admin-only endpoint the blast radius is somewhat limited, but it's still a direct injection vector against the triplestore.

The same pattern appears in CommunityComputation.persistCommunities (line 203) where member URIs from SPARQL results are directly interpolated into SPARQL UPDATE statements — if a malicious triple exists in the graph, it could inject arbitrary updates.

Bug: CONSTRUCT query flattens multi-hop neighborhood onto single subject

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/EntityNeighborhoodTool.java:102-116
In EntityNeighborhoodTool.buildConstructQuery (line 102-121), the WHERE clause captures ?p ?o triples from 1-3 hops away, but then does BIND(<entityUri> AS ?s) and the CONSTRUCT template is { ?s ?p ?o }. This means ALL matched triples — even those 2-3 hops from the start entity — are emitted with the start entity as subject, producing an incorrect/flattened graph. For depth≥2, triples from intermediate nodes lose their actual subject and get attributed to the root entity.

The tool's purpose is to return the neighborhood subgraph, but consumers will see a star graph rather than the actual topology.

Performance: Double symmetrization quadruples edge weights in Louvain input

📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CommunityComputation.java:161-162 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/Louvain.java:128-129
CommunityComputation.parseGraph (lines 161-162) already symmetrizes the adjacency list by adding weight in both directions for every SPARQL result row. Then Louvain.addAllEdges (lines 128-129) symmetrizes again by doing adj.get(src).merge(dst, w, Double::sum) AND adj.get(dst).merge(src, w, Double::sum). Since the input already has both A→B and B→A entries, the result is each edge getting 4× its original weight.

While this may not change community assignments (the ratio is preserved in the modularity gain formula), it inflates totalWeight and degree[] arrays by 4×, making the absolute modularity score meaningless for comparison with other runs or standard benchmarks. It also doubles memory usage for the adjacency structure.

Quality: 14 fully-qualified class names in validateGraph method

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfResource.java:203-217
The validateGraph method uses inline fully-qualified names (org.apache.jena.rdf.model.Model, org.apache.jena.riot.RDFDataMgr, java.io.ByteArrayOutputStream, etc.) instead of imports. This violates the project's coding standards ('No fully qualified names') and makes the method hard to read. There are 14 FQN usages in a single 70-line method.

Bug: SparqlFederationGuard race condition on lazy init

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/rdf/RdfResource.java:113-120
In RdfResource.java lines 113-120, getFederationGuard() has a lazy-init pattern without synchronization. The field is volatile, but the check-then-assign pattern is not atomic — two threads could both see null and both create a new instance. While the consequences here are benign (both would be functionally equivalent), the field is already initialized in initialize() (line 110), so the lazy fallback is only for tests. The inconsistency with the double-checked locking pattern used for getSemanticSearchEngine() is confusing.

...and 2 more resolved from earlier reviews

🤖 Prompt for agents
Code Review: Implements extensive knowledge-graph features including a SPARQL playground, MCP tools, and inference registry while deprecating legacy R2RML materialization. Resolves multiple injection and validation vulnerabilities, though the CustomOntologyValidator requires a fix for class forward-references and the SPARQL UPDATE logic lacks proper URI sanitization.

1. ⚠️ Bug: CustomOntologyValidator rejects valid forward-references between classes
   Files: openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:106-109, openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:151, openmetadata-service/src/main/java/org/openmetadata/service/rdf/extension/CustomOntologyValidator.java:157-158

   The `validateClass` method checks whether a parent class URI exists in `seenUris` (line 158), but `seenUris` is populated incrementally as classes are processed in list order (line 151). If class B declares class A as its `subClassOf` parent, but A appears *after* B in the `classes` list, the validator will incorrectly report "references unknown parent class" — even though both classes are declared in the same extension.
   
   This makes validation order-dependent and will reject valid ontology extensions where the author didn't happen to declare classes in topological order. The cycle-detection method (`detectClassHierarchyCycles`) correctly builds the full graph before traversing, so it's not affected — only the parent-existence check is broken.

   Fix (Pre-collect all class URIs before validating parent references, so declaration order doesn't matter. Change the duplicate-URI detection inside validateClass to use a separate seen-set or check before adding.):
   // First pass: collect all declared class URIs
   Set<String> classUris = new HashSet<>();
   for (CustomOntologyClass cls : classes) {
     if (cls != null && cls.getUri() != null && !cls.getUri().isBlank()) {
       classUris.add(cls.getUri());
     }
   }
   // Second pass: validate each class against the full set
   for (CustomOntologyClass cls : classes) {
     validateClass(cls, classUris, errors);
   }

2. 💡 Security: SPARQL UPDATE built via string concat without URI validation
   Files: openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CentralityComputation.java:146-157, openmetadata-service/src/main/java/org/openmetadata/service/rdf/insights/CommunityComputation.java:214-218

   In `CentralityComputation.persistScores` and `CommunityComputation.persistCommunities`, entity URIs obtained from SPARQL SELECT results are concatenated directly into SPARQL UPDATE strings using `<uri>` syntax without any IRI validation. While valid RDF IRIs cannot contain `>` (so Jena-stored URIs should be safe), this is a defense-in-depth gap: if a malformed URI ever enters the triplestore (e.g., via a custom ontology upload or a bug in another component), it could break out of the `<…>` brackets and inject arbitrary SPARQL UPDATE operations.
   
   The codebase already provides `RdfIriValidator.validateEntityIri()` for this purpose but it's not used in the insights package.

   Fix (Add a basic IRI sanity check (or use RdfIriValidator) before concatenating URIs into the UPDATE string. Apply similarly in CommunityComputation.persistCommunities.):
   // In CentralityComputation.persistScores, before appending:
   for (Map.Entry<String, Double> e : sorted) {
     String uri = e.getKey();
     if (uri == null || uri.contains(">") || uri.contains(" ") || uri.contains("
   ")) {
       continue; // skip malformed URIs
     }
     update.append("  <").append(uri).append("> ...");
   }

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Comment on lines +259 to +279
return uri != null && uri.startsWith(EXTENSION_NS);
}

/**
* @return true if the URI references either a canonical om: class on the allowlist or a class
* declared inside the current extension.
*/
private static boolean isKnownClassReference(String uri, Set<String> extensionClassUris) {
if (uri == null) return false;
if (extensionClassUris.contains(uri)) return true;
if (uri.startsWith(CANONICAL_NS)) {
String localName = uri.substring(CANONICAL_NS.length());
return KNOWN_CANONICAL_CLASSES.contains(localName);
}
// Allow short-form prefixed names like "om:Table".
if (uri.startsWith("om:")) {
return KNOWN_CANONICAL_CLASSES.contains(uri.substring(3));
}
return false;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 hasCycle leaves onStack dirty, causing false-positive cycle errors

The onStack set is shared across all outer-loop iterations but is never cleared when a cycle is detected early. When hasCycle(A) finds a cycle (A → B → A) and returns true, it unwinds the stack without calling onStack.remove() for any node along the path. Nodes A and B remain in onStack. If a subsequent outer-loop call processes a node D that has an edge to B (D → B), hasCycle(B) returns true immediately because B is still in onStack — even though D is not part of any cycle. This falsely rejects valid ontology extensions with "class hierarchy contains a cycle through D".

The fix is to clear onStack before each outer-loop call (the visited set correctly prevents re-processing, so only onStack needs resetting).

Comment on lines +176 to +180
const handleSaveCurrent = useCallback(() => {
const name = window.prompt(t('label.sparql-save-prompt'));
if (!name || !name.trim()) {
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 window.prompt() is a native blocking dialog that is suppressed in most embedded or iframe contexts, returns null when dismissed without input, and is not accessible on mobile. Consider replacing with an inline rename input or a modal form consistent with the rest of the UI; it would also allow pre-validation of the name before saving.

Suggested change
const handleSaveCurrent = useCallback(() => {
const name = window.prompt(t('label.sparql-save-prompt'));
if (!name || !name.trim()) {
return;
}
const [saveDialogOpen, setSaveDialogOpen] = useState(false);
const handleSaveCurrent = useCallback(() => {
setSaveDialogOpen(true);
}, []);
// Replace the prompt with a controlled modal/input in the JSX and wire
// the confirmed name through a separate callback.
const commitSave = useCallback((name: string) => {
if (!name || !name.trim()) {
return;
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +68 to +89
throw new IllegalArgumentException(label + " contains illegal characters");
}
URI parsed;
try {
parsed = new URI(trimmed);
} catch (URISyntaxException e) {
throw new IllegalArgumentException(label + " is not a valid URI: " + e.getMessage());
}
if (!parsed.isAbsolute()) {
throw new IllegalArgumentException(label + " must be an absolute URI");
}
String scheme = parsed.getScheme();
if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
throw new IllegalArgumentException(
label + " must use http or https scheme (got: " + scheme + ")");
}
return trimmed;
}

/** Clamp maxHops to [1, HARD_MAX_HOPS]; null or < 1 falls back to DEFAULT_MAX_HOPS. */
public static int clampMaxHops(Integer requested) {
if (requested == null || requested < 1) return DEFAULT_MAX_HOPS;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 validateNodeUri doesn't reuse the shared RdfIriValidator

RdfIriValidator.validateEntityIri was introduced in this PR specifically to provide a single, reusable IRI validator. validateNodeUri reimplements the same checks but misses a few characters that RdfIriValidator also blocks: ", ', `, and control characters below 0x20. For angle-bracket SPARQL IRI literals (<uri>), only > must be blocked to prevent injection, so the current behaviour is safe — but keeping two divergent validators risks them drifting apart over time. Consider delegating to RdfIriValidator.validateEntityIri and translating null to an IllegalArgumentException.

Comment on lines +68 to +70
if (errors.size() > 0) {
return errors;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Minor style: errors.size() > 0 is less idiomatic than !errors.isEmpty(). The latter is also marginally more efficient for certain List implementations.

Suggested change
if (errors.size() > 0) {
return errors;
}
if (!errors.isEmpty()) {
return errors;
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.