feat: GSoC 2026 scaffold - connector package, design docs, benchmark, and config layer#2

Open
Vishmayraj wants to merge 27 commits into istSOS:main from Vishmayraj:feature/gsoc-2026

@Vishmayraj Vishmayraj commented Jun 13, 2026

This PR establishes the complete project scaffold for the istSOS Metadata Connector (GSoC 2026, Idea 1). It removes the 2024 GSoC working files that were in the upstream branch and replaces them with the week 1-3 deliverables for this project: the connector package skeleton, three design reference documents, the configuration layer, a benchmark script with verified results, and a temporary API used to validate the STAC output in a live STAC browser.

This is a scaffold PR. The connector module files (harvester.py, cache.py, stac_transformer.py, dcat_transformer.py) are present as stubs with docstrings and public interface signatures. Implementations follow in subsequent PRs per the staging plan.


What this PR includes

Cleared from upstream

  • Removed the 2024 GSoC STAC project working files that were present in the base branch. This project builds on the 2024 work conceptually but is a clean implementation with a different architecture and dual STAC + DCAT-AP output.

Connector package scaffold (connector/)

  • config.py - pydantic-settings configuration layer, fully implemented. Single source of truth for all connector settings. get_settings() returns a cached singleton via lru_cache. Field-level validation includes trailing-slash stripping on STA_BASE_URL, and a has_mandatory_dcat_fields property drives startup warnings.
  • harvester.py - stub with HarvestedCatalog, HarvestedThing dataclasses and public interface signatures. Full design spec in docs/Harvesting-Layer-Reference.md.
  • stac_transformer.py - stub with public interface signatures.
  • dcat_transformer.py - stub with public interface signatures.
  • cache.py - stub with public interface signatures.
  • exceptions.py - HarvesterError hierarchy fully defined.
  • utils.py - stub.
  • __init__.py - package init.
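
The config.py pattern described above (cached singleton, trailing-slash stripping, mandatory-field check) can be sketched roughly as follows. This is a stdlib-only stand-in, not the actual pydantic-settings implementation: the DCAT_PUBLISHER and DCAT_TITLE variables, the default URL, and the choice of which fields count as mandatory are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Stand-in for the pydantic-settings model; only STA_BASE_URL is from the PR."""

    sta_base_url: str
    dcat_publisher: str = ""  # hypothetical field
    dcat_title: str = ""      # hypothetical field

    def __post_init__(self):
        # Strip trailing slashes so later URL joins never produce "//".
        object.__setattr__(self, "sta_base_url", self.sta_base_url.rstrip("/"))

    @property
    def has_mandatory_dcat_fields(self) -> bool:
        # True only when every catalog-level field assumed mandatory is set.
        return bool(self.dcat_publisher and self.dcat_title)


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build Settings once from the environment and reuse the same instance."""
    return Settings(
        # The default URL is a placeholder, not a real deployment.
        sta_base_url=os.environ.get("STA_BASE_URL", "http://localhost:8080/v1.1/"),
        dcat_publisher=os.environ.get("DCAT_PUBLISHER", ""),
        dcat_title=os.environ.get("DCAT_TITLE", ""),
    )
```

Because the instance is cached, tests that change environment variables would need to call get_settings.cache_clear() before reading the settings again.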

Design reference documentation (docs/)

  • Harvesting-Layer-Reference.md - complete design specification for harvester.py and cache.py. Covers the STA query strategy, internal data model with all nested dict shapes, the 12-point transformer contract that downstream transformers can rely on, and the public interface. Written before implementation to lock down the contract first.
  • STA-STAC-Transformation-Layer-Reference.md - field-by-field STA to STAC 1.0 mapping specification. Covers Thing to Collection, Datastream to Item, spatial fallback chain, temporal fallback chain, bbox derivation, collection extent computation, asset construction, and all link relations.
  • STA-DCAT-AP-Transformation-Layer-Reference.md - field-by-field STA to DCAT-AP 3.0 mapping specification. Covers Datastream to dcat:Dataset, Thing to dcat:DatasetSeries, catalog-level mandatory field gap strategy, Distribution construction for JSON/CSV/MQTT access points, JSON-LD and Turtle serialization, and the Datastream.properties key convention for operator-supplied DCAT fields.
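
As an illustration of the bbox derivation and collection extent computation mentioned for the STAC mapping, a minimal sketch might look like the following. The helper names are hypothetical, the input is assumed to be a non-empty GeoJSON geometry, and timestamps are assumed to be ISO 8601 strings in one uniform UTC format so that string comparison orders them correctly. The spatial and temporal fallback chains specified in the reference document are not reproduced here.

```python
from typing import Iterator, List, Sequence, Tuple


def _positions(coords) -> Iterator[Sequence[float]]:
    """Yield every [lon, lat, ...] position in a nested GeoJSON coordinate array."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from _positions(part)


def bbox_from_geometry(geometry: dict) -> List[float]:
    """Return [west, south, east, north] for a GeoJSON geometry."""
    lons, lats = zip(*((p[0], p[1]) for p in _positions(geometry["coordinates"])))
    return [min(lons), min(lats), max(lons), max(lats)]


def collection_extent(
    bboxes: List[List[float]], intervals: List[Tuple[str, str]]
) -> dict:
    """Union the Item bboxes and time ranges into a STAC Collection extent."""
    union = [
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    ]
    start = min(s for s, _ in intervals)
    end = max(e for _, e in intervals)
    return {
        "spatial": {"bbox": [union]},
        "temporal": {"interval": [[start, end]]},
    }
```

A Point geometry collapses to a degenerate bbox whose west/east and south/north values are equal, which is a common convention for point sensors.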

Benchmark script and results (benchmark_stac.py)

  • Standalone benchmark that fetches from a live STA instance, runs the STAC transformation, and times fetch and transformation separately.
  • Run against two deployments:
    • Local dev instance (5 Things, 20 Datastreams): fetch 63ms, transformation 7.34ms averaged over 100 iterations.
    • Fraunhofer FROST production instance (5,610 Things, 22,941 Datastreams): fetch 42.4 seconds across 57 sequential paginated requests, transformation 1,203ms.
  • These results motivate the architecture proposal (direct Postgres + Redis) being discussed with mentors this week. The benchmark script remains in the repo as a reference for the decision rationale.
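
The measurement approach described above, timing the fetch once and averaging the transformation over repeated runs, could be sketched like this. It is not the contents of benchmark_stac.py: fake_fetch and transform are hypothetical stand-ins for the real STA request and the STAC transformation.

```python
import statistics
import time


def time_once(fn, *args):
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def time_mean(fn, *args, iterations=100):
    """Run fn repeatedly and return the mean elapsed time in milliseconds."""
    samples = []
    for _ in range(iterations):
        _, elapsed_ms = time_once(fn, *args)
        samples.append(elapsed_ms)
    return statistics.mean(samples)


def fake_fetch(count):
    # Hypothetical stand-in for the HTTP fetch from a live STA instance.
    return [{"@iot.id": i, "name": f"Datastream {i}"} for i in range(count)]


def transform(datastreams):
    # Hypothetical stand-in for the STA to STAC transformation.
    return [
        {"type": "Feature", "id": str(d["@iot.id"]), "properties": {"title": d["name"]}}
        for d in datastreams
    ]


if __name__ == "__main__":
    datastreams, fetch_ms = time_once(fake_fetch, 20)
    transform_ms = time_mean(transform, datastreams, iterations=100)
    print(f"fetch {fetch_ms:.2f}ms, transformation {transform_ms:.2f}ms (mean of 100)")
```

Keeping the two timings separate matters because the fetch is network-bound while the transformation is CPU-bound, so they point to different optimisations.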

Temporary validation API (api.py)

  • Minimal FastAPI app that serves the benchmark output as a live STAC endpoint, used to verify that the transformed output is valid STAC by pointing the Radiant Earth STAC browser at it. Verified working. This file is not part of the connector module and will be removed once the integrated connector endpoints are live.

Project files

  • README.md - updated to reflect the current project structure and architecture.
  • requirements.txt and requirements-test.txt - dependency manifests.
  • .env.example - annotated environment variable template.
  • .gitignore - excludes .env, benchmark outputs, and cache files.

Architecture note

The connector was originally scoped as a standalone microservice harvesting via STA HTTP pagination. Benchmark results showed that at production scale (22,941 Datastreams) this approach requires 57 sequential HTTP round trips taking 42.4 seconds. A revised architecture using direct asyncpg queries against the istSOS4 Postgres database and Redis with LISTEN/NOTIFY cache invalidation is being proposed to mentors this week. Pending their confirmation, week 4 implementation will follow the integrated approach. The harvesting layer reference in this PR already reflects the proposed integrated design.
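
To illustrate why the paginated harvest is inherently sequential, here is a minimal sketch of following SensorThings `@iot.nextLink` pointers. Each page's URL is only known once the previous response has arrived, so the requests cannot be issued in parallel. At the reported figures, 42.4 seconds over 57 requests works out to roughly 0.74 seconds per round trip. The harvest_all function and the in-memory fake server are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple


def harvest_all(
    first_url: str, fetch_page: Callable[[str], dict]
) -> Tuple[List[dict], int]:
    """Follow @iot.nextLink until exhausted; return (entities, round trips)."""
    entities: List[dict] = []
    round_trips = 0
    url = first_url
    while url:
        page = fetch_page(url)  # one blocking HTTP round trip in a real harvester
        round_trips += 1
        entities.extend(page.get("value", []))
        # The next URL is only available inside the response just received.
        url = page.get("@iot.nextLink")
    return entities, round_trips


# In-memory stand-in for a paginated STA endpoint (URLs are placeholders).
FAKE_PAGES: Dict[str, dict] = {
    "page1": {"value": [{"@iot.id": 1}, {"@iot.id": 2}], "@iot.nextLink": "page2"},
    "page2": {"value": [{"@iot.id": 3}, {"@iot.id": 4}], "@iot.nextLink": "page3"},
    "page3": {"value": [{"@iot.id": 5}]},
}


if __name__ == "__main__":
    things, trips = harvest_all("page1", FAKE_PAGES.__getitem__)
    print(len(things), "entities in", trips, "round trips")
```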

