[Issue]: Define a GA4GH LinkML transformation model for cross-schema mappings #84

@zykonda

Issue Title

Define a GA4GH LinkML transformation model for cross-schema mappings

Issue Type

Schema Alignment

Problem Statement

GA4GH has active work on schema alignment, identifier governance, data model best practices, and interoperability across products. However, there does not yet appear to be a harmonized way to represent transformation rules between non-GA4GH schemas and GA4GH standards as machine-readable, governed artifacts.

This gap becomes operationally significant when an external schema must be mapped into GA4GH VRS. In a PXF protobuf to VRS 1.3 workflow, implementers must make explicit decisions about:

  • which PXF elements correspond to which VRS elements
  • whether a mapping is a direct replacement, transformation, normalization, computation, copy, or no-mapping case
  • how coordinate systems are converted
  • what validation rules apply
  • how cardinality mismatches are handled
  • how unmapped concepts are represented
  • what acceptance and error states should be recorded
  • how mapping behavior is versioned as either schema evolves
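As one illustration of the coordinate-conversion decision listed above, the following is a minimal sketch rather than a definitive rule. It assumes the source field uses 1-based, fully closed positions (an assumption that must be confirmed per source element) and converts to the 0-based inter-residue convention used by VRS. The names Interval and one_based_closed_to_interresidue are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open, 0-based inter-residue interval [start, end)."""
    start: int
    end: int


def one_based_closed_to_interresidue(start: int, end: int) -> Interval:
    """Convert a 1-based, fully closed range to inter-residue coordinates.

    Assumption: the source uses 1-based closed positions; the actual
    source convention has to be confirmed for each mapped field.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed range: {start}..{end}")
    # Only the start shifts; the closed end equals the half-open end.
    return Interval(start=start - 1, end=end)


# A single base at 1-based position 100 spans inter-residue 99 to 100.
print(one_based_closed_to_interresidue(100, 100))
```

Recording this rule in a transformation record, instead of only in code, is what would let two adopters confirm they apply the same offset.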

Without a shared artifact model, those decisions are typically embedded in custom code, local tables, or prose documents. That leads to several technical risks:

  • semantic drift across adopters mapping the same source schema into VRS
  • inconsistent identifier handling and provenance
  • inconsistent normalization logic, especially for coordinates and intervals
  • lack of a machine-readable contract for validation and testing
  • inability to compare, diff, review, or reuse transformation rules across teams
  • no common way to publish transformation metadata for downstream tooling or governance

The desired state is for GA4GH to define a reusable, machine-readable transformation model that captures source-to-target mapping semantics in a standard form. Success would mean that transformations from external schemas into VRS can be treated as explicit interoperability artifacts rather than opaque implementation details.

Scope Validation

✅ Harmonization Impact:
This issue directly supports harmonization by proposing a common way to represent the semantic mapping layer between external schemas and GA4GH standards. It would make transformation behavior inspectable and comparable across adopters.

✅ Barrier Reduction:
It reduces barriers caused by duplicated mapping design, unclear transformation semantics, and poor discoverability of prior work. It also supports repeatable validation and conformance testing.

✅ Alignment Challenges:
This issue addresses concrete alignment challenges involving:

  • identifiers and canonical references to schema elements
  • field-level correspondence between models
  • coordinate normalization
  • type and cardinality mismatch handling
  • versioning of schema-to-schema mappings
  • publication and governance of transformation artifacts

✅ Cross-Work Stream:
Yes. The pilot target is VRS within GKS, but the transformation model is broadly relevant to schema alignment across GA4GH. It intersects with DaMaSC best practices, schema registry work, interoperability touchpoints, and implementation guidance.

Proposed Solution(s)

I recommend that TASC evaluate and potentially define a minimal GA4GH transformation model using LinkML, with PXF to VRS as the first pilot use case.
The purpose of using LinkML here is to standardize the structure and semantics of transformation artifacts, not to execute them. LinkML would define what a valid mapping record looks like; a separate compiler or runtime would still execute the transformation logic.

Recommended Model

Minimally define the following classes:

  • TransformationRecord
  • SourceElement
  • TargetElement
  • MappingAction
  • CompilerMetadata

A transformation record should include:

Source metadata:

  • source element name
  • canonical source element URI
  • source schema identifier and version
  • source type
  • source path
  • cardinality

Target metadata:

  • target standard identifier
  • target version
  • target class or object type
  • canonical target element URI or schema reference
  • structural path such as JSON Pointer

Transformation metadata:

  • action type such as REPLACE, TRANSFORM, NORMALIZE, COMPUTE, COPY, CONCAT, or NONE
  • human-readable description
  • machine-readable expression or rule reference
  • validation rules
  • normalization guidance
  • fallback behavior

Quality and governance metadata:

  • acceptance status such as OK, WARN, FAIL
  • error code taxonomy
  • provenance
  • compiler version
  • timestamp
  • notes
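To make the shape of the proposed classes concrete, here is a rough sketch using Python dataclasses as a stand-in for the eventual LinkML schema. All field names, the placement of governance fields on TransformationRecord, and the rule that a target is required unless the action is NONE are assumptions for illustration only. The example values are placeholders, not real PXF or VRS identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ActionType(str, Enum):
    REPLACE = "REPLACE"
    TRANSFORM = "TRANSFORM"
    NORMALIZE = "NORMALIZE"
    COMPUTE = "COMPUTE"
    COPY = "COPY"
    CONCAT = "CONCAT"
    NONE = "NONE"


class AcceptanceStatus(str, Enum):
    OK = "OK"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class SourceElement:
    name: str
    uri: str
    schema_id: str
    schema_version: str
    element_type: str
    path: str
    cardinality: str


@dataclass
class TargetElement:
    standard_id: str
    version: str
    target_class: str
    uri: str
    path: str  # structural path, e.g. a JSON Pointer


@dataclass
class MappingAction:
    action_type: ActionType
    description: str
    expression: Optional[str] = None
    validation_rules: List[str] = field(default_factory=list)
    normalization: Optional[str] = None
    fallback: Optional[str] = None


@dataclass
class CompilerMetadata:
    compiler_version: str
    timestamp: str
    provenance: str


@dataclass
class TransformationRecord:
    source: SourceElement
    action: MappingAction
    compiler: CompilerMetadata
    acceptance: AcceptanceStatus
    target: Optional[TargetElement] = None
    error_code: Optional[str] = None
    notes: str = ""

    def __post_init__(self) -> None:
        # Assumed rule: every action except NONE needs a target element.
        if self.action.action_type is not ActionType.NONE and self.target is None:
            raise ValueError("a target element is required unless action is NONE")


# Placeholder no-mapping record; none of these values are real identifiers.
record = TransformationRecord(
    source=SourceElement(
        name="example_field",
        uri="https://example.org/source-schema/example_field",
        schema_id="example-source", schema_version="0.0.0",
        element_type="string", path="/example_field", cardinality="0..1",
    ),
    action=MappingAction(ActionType.NONE, "no target equivalent"),
    compiler=CompilerMetadata("0.0.0", "1970-01-01T00:00:00Z", "hand-written example"),
    acceptance=AcceptanceStatus.WARN,
)
print(record.action.action_type.value, record.acceptance.value)
```

In a LinkML schema the same constraints would be expressed declaratively (classes, slots, enums), so that validation does not depend on any one implementation language.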

A deterministic record_key may also be useful as an identifier for indexing, comparison, and governance. Its derivation should be specified in the model so that independent implementers compute the same key for the same mapping.
Attached please find three YAML files with three complete LinkML instance examples.
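One possible way to derive a deterministic record_key is sketched below. It assumes the key is a SHA-256 digest over a canonical JSON serialization of the identifying fields. The choice of fields, the hash function, and the function signature are all assumptions, and the URIs are placeholders.

```python
import hashlib
import json


def record_key(source_uri: str, source_version: str,
               target_uri: str, target_version: str,
               action_type: str) -> str:
    """Hash the identifying fields of a mapping into a stable hex key."""
    identity = {
        "source_uri": source_uri,
        "source_version": source_version,
        "target_uri": target_uri,
        "target_version": target_version,
        "action_type": action_type,
    }
    # Sorted keys and fixed separators keep the serialization stable
    # across implementations, so the digest is reproducible.
    canonical = json.dumps(identity, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key = record_key(
    "https://example.org/source-schema/example_field", "0.0.0",
    "https://example.org/target-schema/example_class", "0.0.0",
    "REPLACE",
)
print(key)
```

Including both schema versions in the hashed identity means a mapping gets a new key whenever either schema version changes. Whether that is desirable is a governance question for the model to settle.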

Pilot Use Case

The PXF to VRS proposal can serve as an initial pilot. It already contains representative examples covering:

  • direct field replacement
  • normalized coordinate transformation
  • no-mapping and error cases
  • acceptance codes and error taxonomy
  • versioned transformation metadata
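The cases listed above (direct replacement, normalization, no-mapping) could be exercised by a small runtime along the following lines. This is only a sketch of how a separate executor might interpret an action type. The error codes, the treatment of COPY as a pass-through like REPLACE, and the choice of WARN for a no-mapping case are hypothetical, not taken from the pilot proposal.

```python
from typing import Any, Callable, Optional, Tuple

# (output value, acceptance status, error code)
Result = Tuple[Optional[Any], str, Optional[str]]


def apply_action(action: str, value: Any,
                 normalize: Optional[Callable[[Any], Any]] = None) -> Result:
    """Apply one mapping action to a source value."""
    if action in ("REPLACE", "COPY"):
        # Assumed: both pass the source value through unchanged.
        return value, "OK", None
    if action == "NORMALIZE":
        if normalize is None:
            return None, "FAIL", "E_MISSING_NORMALIZER"
        try:
            return normalize(value), "OK", None
        except ValueError:
            return None, "FAIL", "E_NORMALIZATION_FAILED"
    if action == "NONE":
        # No target equivalent: record the case instead of dropping it.
        return None, "WARN", "W_NO_TARGET_EQUIVALENT"
    return None, "FAIL", "E_UNSUPPORTED_ACTION"


print(apply_action("REPLACE", "example-value"))
print(apply_action("NORMALIZE", 100, normalize=lambda pos: pos - 1))
print(apply_action("NONE", "free text"))
```

Every branch returns an explicit status and code, so unmapped or failed elements remain visible to downstream validation.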

Representative VRS targets in the pilot include:

  • Allele
  • SequenceLocation
  • Haplotype
  • TextVariation
  • VariationSet
  • CopyNumberChange
  • SequenceReference

Allele example.yaml
No-mapping example.yaml
SequenceLocation example.yaml

Estimated Effort Level

Medium (3-6 months, moderate resources)

Success Criteria

Measurable Outcomes:

  • TASC agrees on whether LinkML-based transformation artifacts are in scope
  • a requirements memo is produced
  • a minimum LinkML transformation schema is defined
  • at least one pilot PXF to VRS mapping set is published as valid LinkML instances
  • the pilot demonstrates direct mapping, normalized transformation, and no-mapping cases
  • guidance is produced on publication, versioning, and governance

Key Metrics:

  • multiple reviewers can independently understand and evaluate a mapping without reading implementation code
  • at least one mapping artifact can be version-diffed and reused by another implementer
  • transformation records support validation of representative cases involving identifiers, intervals, and cardinality
  • the pilot model is judged reusable beyond a single PXF use case
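The version-diff metric above could be checked with something as simple as the following sketch, which compares two versions of a mapping record loaded as nested dictionaries. The record layout and the values are hypothetical, and the slash-separated paths are only loosely modeled on JSON Pointer.

```python
from typing import Any, Dict, Tuple


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into {"/a/b": value} pairs."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_records(old: Dict[str, Any],
                 new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {path: (old_value, new_value)} for every changed path."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


# Hypothetical record versions, for illustration only.
old = {"target": {"version": "1.3", "class": "Allele"},
       "action": {"type": "REPLACE"}}
new = {"target": {"version": "2.0", "class": "Allele"},
       "action": {"type": "TRANSFORM"}}
print(diff_records(old, new))
```

A field-level diff like this is only possible when mappings are published as structured records. The same comparison against custom code or prose documents would require manual review.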

Timeline:

  • short term: scope decision and requirements memo
  • medium term: LinkML schema and pilot mappings
  • later: governance recommendation and registry integration path

How will this issue aid GA4GH harmonization?

This issue would create a shared, machine-readable representation for the semantic layer between schemas. That matters because interoperability failures often occur not only at the level of object models, but at the level of translation between them.

A LinkML transformation model would help GA4GH harmonization by:

  • making mapping decisions explicit
  • enabling review of alignment assumptions
  • reducing divergent local implementations
  • supporting reuse of mapping artifacts
  • creating a foundation for testing, validation, and conformance around transformations
  • aligning transformation artifacts with broader data-model and schema-governance practices across GA4GH

In that sense, it complements existing TASC work on schema registry, data model best practices, identifier alignment, and interoperability touchpoints rather than duplicating it.

Additional context

Relevant TASC issues include:

  • interoperability points across products, which explicitly mentions harmonized common data types and libraries of transformations
  • schema registry, which highlights the need for discoverable, versioned, governed artifacts across the ecosystem
  • data model best practices, which emphasizes machine-readable and human-readable artifacts for approved models
  • namespace policy and identifier harmonization, which show that shared semantics and reusable identifiers are already treated as GA4GH-wide concerns

Work Streams Raising This Issue

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups Raising This Issue

No response

Work Streams That Will Be Impacted

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups That Will Be Impacted

No response

Key Stakeholders to Consult

Organizations/Communities:

  • TASC leadership
  • GKS and VRS contributors
  • DaMaSC contributors
  • driver projects with cross-standard interoperability needs
  • external communities responsible for adjacent variation schemas

Technical Experts:

  • VRS editors and implementers
  • contributors to interoperability, schema registry, and identifier-governance work
  • schema registry and metadata-infrastructure experts
  • implementers with production mapping and normalization experience

Decision Makers:

  • TASC co-leads
  • PSC stakeholders as needed
  • maintainers of any future registry or TASC-managed artifact repository

Products affected

  • VRS
  • transformation tooling targeting VRS
  • future schema registry or metadata discovery infrastructure
  • conformance and validation tooling that depends on explicit mapping semantics

Additional Context

The initial pilot proposal already includes examples of direct replacement, coordinate normalization, and no-equivalent error handling. These can be represented as LinkML instances conforming to a shared transformation schema.

LinkML: https://github.com/linkml

Priority Level

Medium (should be addressed within 3-6 months)

Additional Tags

  • Documentation
  • API
  • Schema
  • Security
  • Performance
  • Interoperability
  • Compliance
  • User Experience
  • Infrastructure
  • Testing
