Note: This project is currently under active development and has not been officially released. APIs, features, and documentation are subject to change without notice.
AI-powered sidecar extension for DataHub, built API-first.
DataSpoke is a loosely coupled sidecar to DataHub. DataHub stores metadata (the Hub); DataSpoke extends it with five baseline features (the Spokes): Ingestion Control, Validation, Ontology Generation, Metadata Generation, and Governance. Both UI and API are organised by feature — one function namespace each under /spoke/.
This repository delivers two artifacts:
- Baseline Product: A foundational data catalog implementation of the five MANIFESTO features. The API contract in spec/API.md is the canonical surface; the frontend is a thin reference UI that consumes those routes verbatim.
- Productized Scaffold: An AI Scaffold (Claude Code conventions, generator/evaluator subagents, PRauto) plus a Development Scaffold (scripted Kubernetes dev environment) that together let teams fork this repo and build custom Spokes with AI coding agents.
Fork or copy this repository to create a data catalog for your organization.
- kubectl + Helm v3 installed and configured
- A Kubernetes cluster with appropriate capacity
- A separate DataHub instance — DataSpoke connects to DataHub as an external dependency
DataSpoke ships as an umbrella Helm chart at helm-charts/dataspoke/. The production profile (values.yaml) enables the application components (frontend, API) and infrastructure (PostgreSQL with pgvector + Apache AGE, Redis, Airflow). The optional event-consumer subchart is shipped disabled — baseline UC1–UC5 are schedule-driven via Airflow rather than event-driven.
- Build and push images:

      docker build -t <registry>/dataspoke/api:latest -f docker-images/api/Dockerfile .

  (Frontend image TBD; event-consumer is disabled by default)
- Configure: Copy helm-charts/dataspoke/values.yaml and customize container images, ingress hosts/TLS, DataHub connection (config.datahub.gmsUrl), and secrets (PostgreSQL, Redis, JWT, LLM API key). For production secrets management, consider External Secrets Operator.
- Install:

      helm dependency build ./helm-charts/dataspoke
      helm upgrade --install dataspoke ./helm-charts/dataspoke \
        --namespace dataspoke --create-namespace \
        --values ./your-values.yaml
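The install step can also be wrapped in a small script. The following is a minimal sketch, not part of the repository: the function names `run` and `install_dataspoke` and the `DRY_RUN` switch are hypothetical, and it assumes you want to override config.datahub.gmsUrl on the command line with Helm's `--set` flag (which takes precedence over `--values`) rather than editing the values file.

```shell
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the documented install commands.
# Usage: install_dataspoke <gms-url> [values-file]
install_dataspoke() {
  local gms_url="$1"
  local values_file="${2:-./your-values.yaml}"

  # Resolve the umbrella chart's subchart dependencies first.
  run helm dependency build ./helm-charts/dataspoke

  # Install or upgrade the release, overriding the DataHub GMS URL.
  run helm upgrade --install dataspoke ./helm-charts/dataspoke \
    --namespace dataspoke --create-namespace \
    --values "$values_file" \
    --set "config.datahub.gmsUrl=${gms_url}"
}
```

Setting `DRY_RUN=1` before calling `install_dataspoke <your-gms-url>` prints both Helm commands without touching the cluster, which is a cheap way to review the final flags before a real install.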
Resource sizing: Production defaults total roughly 5 CPU and 9.5 Gi in requests, and roughly 10 CPU and 22 Gi in limits, excluding the opt-in event-consumer. See spec/feature/HELM_CHART.md for the full chart reference.
- kubectl + Helm v3 installed and configured
- A Kubernetes cluster (GKE Autopilot recommended; Docker Desktop, minikube, or kind also work) with 8+ CPUs / 24 GB RAM / 150 GB storage
- Python 3.13 and uv
- Node.js 18+ (TBD: frontend not yet implemented)
The dev profile installs infrastructure (DataHub, PostgreSQL with pgvector + Apache AGE, Redis, Airflow, self-hosted Langfuse for LLM observability, example data sources) into a Kubernetes cluster via the umbrella Helm chart plus dev peripherals. The API runs in-cluster alongside Airflow (for workflow callbacks); the frontend runs on the host.
    cp helm-charts/.env.example helm-charts/.env   # Set your Kubernetes context
    ./helm-charts/bin/install.sh --profile dev     # ~5-10 min first run

Using Claude Code? Run /k8s-deploy install for guided setup.
After install, verify all services are reachable:
    ./helm-charts/bin/health-check.sh   # Verify all services respond via nginx-ingress

Services are accessed via nginx-ingress endpoints: HTTP services use virtual-host routing (http://<service>.<INGRESS_IP>.nip.io/) and TCP services use dedicated ports on the ingress IP. See helm-charts/README.md for the full endpoint table, credentials, lock service, namespace architecture, resource budgets, and troubleshooting.
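As an illustration of the virtual-host pattern, the sketch below builds an endpoint URL from a service name and the ingress IP. The function name `service_url` is hypothetical and not part of the repository, and 10.0.0.1 is a placeholder IP. It relies on nip.io resolving any hostname of the form `<name>.<ip>.nip.io` to `<ip>`.

```shell
# Hypothetical helper: build the nip.io virtual-host URL for a service.
# Usage: service_url <service> <ingress-ip> [path]
service_url() {
  local service="$1"
  local ingress_ip="$2"
  local path="${3:-}"
  # Strip one leading slash from the path so the URL never contains "//".
  printf 'http://%s.%s.nip.io/%s\n' "$service" "$ingress_ip" "${path#/}"
}
```

For example, `service_url api 10.0.0.1 api/v1/` prints `http://api.10.0.0.1.nip.io/api/v1/`. The real service hostnames are listed in the endpoint table in helm-charts/README.md.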
    ./helm-charts/bin/uninstall.sh --profile dev

    uv sync                                                       # Install dependencies
    ./helm-charts/bin/install.sh --profile dev --components api   # Rebuild + redeploy the API
    kubectl scale deployment/dataspoke-api --replicas=0 \
      -n "${DATASPOKE_KUBE_DATASPOKE_NAMESPACE}"                  # Scale down in-cluster API

The API is accessible via nginx-ingress at http://api.<INGRESS_IP>.nip.io/api/v1/. See spec/TESTING.md for testing modes.
| Component | Status | Location |
|---|---|---|
| API layer (FastAPI) | Done | src/api/ |
| Backend services | Done | src/backend/, src/shared/ |
| Airflow DAGs | Done | src/workflows/dags/ |
| Database migrations | Done | migrations/ |
| Docker image (API) | Done | docker-images/api/ |
| Helm charts | Done | helm-charts/dataspoke/ |
| Tests (unit + integration) | Done | tests/ |
| Frontend (Next.js) | TBD | src/frontend/ |
    uv run pytest tests/unit/                              # Unit tests (no infra needed)
    uv run pytest tests/integration/                       # Integration tests (requires dev environment with ingress)
    uv run python -m tests.integration.util --reset-seed   # Seed dummy data (Imazon use-case)

See spec/TESTING.md for conventions, the three-group execution sequence, and the integration test lock protocol.
Use the plan -> approve -> generate -> evaluate workflow:
- Read the relevant spec in spec/feature/
- Plan (built-in Plan mode) -> human reviews and approves
- backend -> reviewer -> [fix pass if needed]
- workflow -> reviewer -> [fix pass if needed]
- test -- write and run tests
- frontend -> reviewer -> [fix pass if needed]
- k8s-helm -- containerize and deploy
See spec/AI_SCAFFOLD.md for the full scaffold reference.
Fork this repository and adapt:
- Revise spec/MANIFESTO_*.md -- redefine features and product identity
- Run /spec-write -- update architecture and author feature specs
- Run /k8s-deploy install -- bring up the local environment
- Use the implementation workflow above
| Document | Purpose |
|---|---|
| spec/MANIFESTO_en.md | Golden — product identity, five baseline features |
| spec/API.md | Golden — route catalogue, auth, middleware, error catalogue |
| spec/USE_CASE_en.md | Golden — five UC scenarios on the Imazon test estate |
| spec/ARCHITECTURE.md | System architecture, tech stack, deployment |
| spec/DATAHUB_INTEGRATION.md | DataHub SDK/API patterns |
| spec/API_DESIGN_PRINCIPLE_en.md | REST API conventions |
| spec/AI_SCAFFOLD.md | Claude Code scaffold: skills, subagents, hooks |
| spec/AI_PRAUTO.md | PRauto autonomous PR worker: lifecycle labels, heartbeat, phase state machine |
| spec/TESTING.md | Testing conventions and integration test protocol |
| spec/feature/ | Feature specs (BACKEND, BACKEND_LLM, BACKEND_SCHEMA, VALIDATION, SECRET_RESOLUTION, FRONTEND_*, HELM_CHART) |