DataSpoke

Note: This project is currently under active development and has not been officially released. APIs, features, and documentation are subject to change without notice.

AI-powered sidecar extension for DataHub, built API-first.

DataSpoke is a loosely coupled sidecar to DataHub. DataHub stores metadata (the Hub); DataSpoke extends it with five baseline features (the Spokes): Ingestion Control, Validation, Ontology Generation, Metadata Generation, and Governance. Both UI and API are organised by feature — one function namespace each under /spoke/.

This repository delivers two artifacts:

Baseline Product — A foundational data catalog implementation of the five MANIFESTO features. The API contract in spec/API.md is the canonical surface; the frontend is a thin reference UI that consumes those routes verbatim.
Productized Scaffold — An AI Scaffold (Claude Code conventions, generator/evaluator subagents, PRauto) plus a Development Scaffold (scripted Kubernetes dev environment) that together let teams fork this repo and build custom Spokes with AI coding agents.

Fork or copy this repository to create a data catalog for your organization.

Usage Guide

Prerequisites

kubectl + Helm v3 installed and configured
A Kubernetes cluster with appropriate capacity
A separate DataHub instance — DataSpoke connects to DataHub as an external dependency

Deploy to Production

DataSpoke ships as an umbrella Helm chart at helm-charts/dataspoke/. The production profile (values.yaml) enables the application components (frontend, API) and infrastructure (PostgreSQL with pgvector + Apache AGE, Redis, Airflow). The optional event-consumer subchart is shipped disabled — baseline UC1–UC5 are schedule-driven via Airflow rather than event-driven.

Build and push images: run docker build -t <registry>/dataspoke/api:latest -f docker-images/api/Dockerfile . from the repository root, then docker push <registry>/dataspoke/api:latest. The frontend image is TBD, and the event-consumer is disabled by default.
Configure: Copy helm-charts/dataspoke/values.yaml and customize — container images, ingress hosts/TLS, DataHub connection (config.datahub.gmsUrl), and secrets (PostgreSQL, Redis, JWT, LLM API key). For production secrets management, consider External Secrets Operator.

Install:

helm dependency build ./helm-charts/dataspoke
helm upgrade --install dataspoke ./helm-charts/dataspoke \
  --namespace dataspoke --create-namespace \
  --values ./your-values.yaml
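
The dotted path config.datahub.gmsUrl named in the Configure step corresponds to nested keys in the values file. The following is a minimal Python sketch of that mapping, not part of the chart: the helper name nest and the GMS URL are hypothetical, and the output is printed as JSON, which YAML parsers also accept.

```python
import json
from typing import Any


def nest(dotted_key: str, value: Any) -> dict[str, Any]:
    """Expand a dotted key such as 'a.b.c' into {'a': {'b': {'c': value}}}."""
    result: Any = value
    # Wrap the value from the innermost key outwards.
    for part in reversed(dotted_key.split(".")):
        result = {part: result}
    return result


# Hypothetical DataHub GMS endpoint; replace it with your own instance.
overrides = nest("config.datahub.gmsUrl", "http://datahub-gms.example.internal:8080")

print(json.dumps(overrides, indent=2))
```

The printed structure covers only this one key; container images, ingress, and secrets still need their own entries in your values file.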

Resource sizing: Production defaults total about 5 CPU in requests and 10 CPU in limits, with about 9.5 Gi of memory in requests and 22 Gi in limits, excluding the opt-in event-consumer. See spec/feature/HELM_CHART.md for the full chart reference.
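
As a rough capacity check, the totals above can be compared against what a cluster has available. This is a sketch only: it assumes the Gi figures refer to memory, treats the quoted totals as approximate, and uses a hypothetical cluster size; the names Budget and fits are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    cpu: float        # CPU cores
    memory_gi: float  # memory in Gi


# Approximate production totals quoted in this README (event-consumer excluded).
REQUESTS = Budget(cpu=5.0, memory_gi=9.5)
LIMITS = Budget(cpu=10.0, memory_gi=22.0)


def fits(allocatable: Budget, needed: Budget) -> bool:
    """Return True when both CPU and memory are covered by the allocatable budget."""
    return allocatable.cpu >= needed.cpu and allocatable.memory_gi >= needed.memory_gi


# Hypothetical cluster with 8 allocatable CPUs and 16 Gi of memory.
cluster = Budget(cpu=8.0, memory_gi=16.0)
print("requests fit:", fits(cluster, REQUESTS))
print("limits fit:", fits(cluster, LIMITS))
```

With these example numbers the requests fit but the limits do not. Kubernetes schedules pods on requests, so the workloads would be placed, but the cluster would be overcommitted if every component reached its limit at once.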

Development Guide

Prerequisites

kubectl + Helm v3 installed and configured
A Kubernetes cluster (GKE Autopilot recommended; Docker Desktop, minikube, or kind also work) with 8+ CPUs / 24 GB RAM / 150 GB storage
Python 3.13 and uv
Node.js 18+ (TBD — frontend not yet implemented)

Dev Environment Setup

The dev profile installs infrastructure (DataHub, PostgreSQL with pgvector + Apache AGE, Redis, Airflow, self-hosted Langfuse for LLM observability, example data sources) into a Kubernetes cluster via the umbrella Helm chart plus dev peripherals. The API runs in-cluster alongside Airflow (for workflow callbacks); frontend runs on the host.

cp helm-charts/.env.example helm-charts/.env       # Set your Kubernetes context
./helm-charts/bin/install.sh --profile dev          # ~5-10 min first run

Using Claude Code? Run /k8s-deploy install for guided setup.

After install, verify all services are reachable:

./helm-charts/bin/health-check.sh                   # Verify all services respond via nginx-ingress

Services are accessed via nginx-ingress endpoints — HTTP services use virtual-host routing (http://<service>.<INGRESS_IP>.nip.io/) and TCP services use dedicated ports on the ingress IP. See helm-charts/README.md for the full endpoint table, credentials, lock service, namespace architecture, resource budgets, and troubleshooting.
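
The virtual-host pattern above can be assembled mechanically once the ingress IP is known. The helper below is a minimal sketch, not part of the repository: the function name service_url is hypothetical and the IP address is a documentation-only placeholder.

```python
def service_url(service: str, ingress_ip: str, path: str = "/") -> str:
    """Build http://<service>.<INGRESS_IP>.nip.io<path> for an HTTP service."""
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{service}.{ingress_ip}.nip.io{path}"


# 203.0.113.10 is reserved for documentation (RFC 5737); substitute your ingress IP.
print(service_url("api", "203.0.113.10", "/api/v1/"))
```

nip.io resolves any hostname ending in <IP>.nip.io to that IP, which is why no DNS setup is needed for the dev cluster.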

Uninstall

./helm-charts/bin/uninstall.sh --profile dev

Running DataSpoke

uv sync                                                                # Install dependencies
./helm-charts/bin/install.sh --profile dev --components api            # Rebuild + redeploy the API
kubectl scale deployment/dataspoke-api --replicas=0 \
  -n "${DATASPOKE_KUBE_DATASPOKE_NAMESPACE}"                           # Scale down in-cluster API

The API is accessible via nginx-ingress at http://api.<INGRESS_IP>.nip.io/api/v1/. See spec/TESTING.md for testing modes.

Implementation Status

| Component | Status | Location |
|---|---|---|
| API layer (FastAPI) | Done | `src/api/` |
| Backend services | Done | `src/backend/`, `src/shared/` |
| Airflow DAGs | Done | `src/workflows/dags/` |
| Database migrations | Done | `migrations/` |
| Docker image (API) | Done | `docker-images/api/` |
| Helm charts | Done | `helm-charts/dataspoke/` |
| Tests (unit + integration) | Done | `tests/` |
| Frontend (Next.js) | TBD | `src/frontend/` |

Testing

uv run pytest tests/unit/                      # Unit tests (no infra needed)
uv run pytest tests/integration/               # Integration tests (requires dev environment with ingress)
uv run python -m tests.integration.util --reset-seed  # Seed dummy data (Imazon use-case)

See spec/TESTING.md for conventions, the three-group execution sequence, and the integration test lock protocol.

Implementation Workflow

Use the plan -> approve -> generate -> evaluate workflow:

Read the relevant spec in spec/feature/
Plan (built-in Plan mode) -> human reviews and approves
backend -> reviewer -> [fix pass if needed]
workflow -> reviewer -> [fix pass if needed]
test -- write and run tests
frontend -> reviewer -> [fix pass if needed]
k8s-helm -- containerize and deploy

See spec/AI_SCAFFOLD.md for the full scaffold reference.
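
The "stage -> reviewer -> [fix pass if needed]" pattern in the steps above can be pictured as a small loop. The sketch below is an illustrative model only: the real stages are Claude Code subagents, not Python functions, and every name here (run_stage, fake_backend, fake_reviewer) is hypothetical.

```python
from typing import Callable, Optional

# A generator produces an artifact, optionally guided by reviewer feedback.
Generator = Callable[[Optional[str]], str]
# A reviewer returns None to approve, or feedback text to request a fix pass.
Reviewer = Callable[[str], Optional[str]]


def run_stage(generate: Generator, review: Reviewer,
              max_fix_passes: int = 1) -> tuple[str, bool]:
    """Generate, review, and run at most max_fix_passes fix passes.

    Returns the final artifact and whether the reviewer approved it.
    """
    artifact = generate(None)
    feedback = review(artifact)
    passes = 0
    while feedback is not None and passes < max_fix_passes:
        artifact = generate(feedback)
        feedback = review(artifact)
        passes += 1
    return artifact, feedback is None


# Stand-in callables for illustration only.
def fake_backend(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "draft + fix"


def fake_reviewer(artifact: str) -> Optional[str]:
    return None if "fix" in artifact else "missing error handling"


print(run_stage(fake_backend, fake_reviewer))
```

In this model a stage that still fails review after its fix pass is returned unapproved, leaving the decision to a human; how the scaffold actually handles that case is described in spec/AI_SCAFFOLD.md.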

Building a Custom Spoke

Fork this repository and adapt:

Revise spec/MANIFESTO_*.md -- redefine features and product identity
Run /spec-write -- update architecture and author feature specs
Run /k8s-deploy install -- bring up the local environment
Use the implementation workflow above

Key Specs

| Document | Purpose |
|---|---|
| spec/MANIFESTO_en.md | Golden: product identity, five baseline features |
| spec/API.md | Golden: route catalogue, auth, middleware, error catalogue |
| spec/USE_CASE_en.md | Golden: five UC scenarios on the Imazon test estate |
| spec/ARCHITECTURE.md | System architecture, tech stack, deployment |
| spec/DATAHUB_INTEGRATION.md | DataHub SDK/API patterns |
| spec/API_DESIGN_PRINCIPLE_en.md | REST API conventions |
| spec/AI_SCAFFOLD.md | Claude Code scaffold: skills, subagents, hooks |
| spec/AI_PRAUTO.md | PRauto autonomous PR worker: lifecycle labels, heartbeat, phase state machine |
| spec/TESTING.md | Testing conventions and integration test protocol |
| spec/feature/ | Feature specs (BACKEND, BACKEND_LLM, BACKEND_SCHEMA, VALIDATION, SECRET_RESOLUTION, FRONTEND_*, HELM_CHART) |

License

Apache License 2.0
