AI Data Foundation is an open-source reference implementation for building governed data access layers for MCP and AI agent applications.
It focuses on a common problem: AI agents should be able to retrieve business data, but must not bypass tenant isolation, object-level permissions, field masking, audit trails, or source-of-truth data pipelines.
Many AI / MCP / agent demos stop at “the model can query data”.
This project focuses on the harder part:
- how external data is ingested without skipping Raw / event / outbox boundaries
- how source-facing and canonical models stay separated
- how OAuth, API keys, service accounts, permission enforcement, masking, and audit are applied before business data is returned
- how retrieval can support agent use cases without turning into an unsafe direct-database shortcut
Key capabilities in this repository:
- Raw -> Source Change Event -> Outbox ingestion pipeline
- Source and canonical customer / order models
- OAuth 2.1 Authorization Code + PKCE demo authorization server
- API Key and service-account access
- Permission enforcement, masking, and audit logs
- Candidate-only retrieval followed by authorized canonical backfill
- Admin Console for governance workflows
[Screenshot: landing-style overview of the governed access model]
[Screenshot: example governance UI for MCP tools, rollout rules, masking, and audit controls]
This repository is guided by a few non-negotiable principles:
- Raw first: source data should land in Raw before downstream normalization or retrieval
- Event boundaries matter: ingestion and downstream publication should respect Change Envelope and outbox boundaries
- Source and canonical models are different layers: source models preserve source semantics, while canonical models represent unified business meaning
- Permissions are enforced server-side: MCP / API / agent callers must not define their own tenant, user, or permission scope
- Retrieval must stay safe: candidate recall is not authorization, and all final business data must flow through authorized canonical backfill
- Audit is part of the product: governed data access is incomplete if access, masking, and decision paths are not traceable
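The retrieval and permission principles above can be illustrated with a small sketch. It is not the repository's implementation: every type and method name here (CallerContext, Candidate, backfill, maskPhone, the "admin" role) is hypothetical, and the in-memory map stands in for the canonical store. It only shows the shape of the rule that search returns candidate ids, while business data is released by a server-side backfill step that checks tenant, applies masking, and records an audit entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; all names in this class are hypothetical. */
public class GovernedRetrievalSketch {

    /** Caller context resolved server-side from a verified token, never from request parameters. */
    record CallerContext(String tenantId, Set<String> roles) {}

    /** A canonical customer row; phone is treated as the sensitive field in this sketch. */
    record Customer(String id, String tenantId, String name, String phone) {}

    /** Search output carries ids and scores only, so recall by itself exposes no business data. */
    record Candidate(String customerId, double score) {}

    /** Keeps the last four characters and masks the rest. */
    static String maskPhone(String phone) {
        if (phone == null || phone.length() <= 4) {
            return "****";
        }
        return "*".repeat(phone.length() - 4) + phone.substring(phone.length() - 4);
    }

    /** Authorized backfill: tenant check, field masking, and one audit entry per candidate. */
    static List<Customer> backfill(CallerContext caller, List<Candidate> candidates,
                                   Map<String, Customer> canonicalStore, List<String> auditLog) {
        List<Customer> result = new ArrayList<>();
        for (Candidate candidate : candidates) {
            Customer row = canonicalStore.get(candidate.customerId());
            boolean allowed = row != null && row.tenantId().equals(caller.tenantId());
            auditLog.add(caller.tenantId() + " " + candidate.customerId() + " "
                    + (allowed ? "ALLOW" : "DENY"));
            if (!allowed) {
                continue; // a recalled candidate is not an authorization decision
            }
            boolean canSeePhone = caller.roles().contains("admin"); // assumed role name
            result.add(canSeePhone
                    ? row
                    : new Customer(row.id(), row.tenantId(), row.name(), maskPhone(row.phone())));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Customer> store = Map.of(
                "c1", new Customer("c1", "t1", "Ada", "13800001234"),
                "c2", new Customer("c2", "t2", "Bob", "13900005678"));
        List<String> audit = new ArrayList<>();
        CallerContext sales = new CallerContext("t1", Set.of("sales"));
        List<Candidate> recalled = List.of(new Candidate("c1", 0.9), new Candidate("c2", 0.8));
        System.out.println(backfill(sales, recalled, store, audit));
        System.out.println(audit);
    }
}
```

In this sketch the cross-tenant candidate is dropped during backfill rather than filtered in the search layer, which keeps the authorization decision and its audit entry in one place.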
Prerequisites:
- Java 21
- Maven 3.9+
- Docker

Start the local infrastructure and the application:
docker compose up -d postgres redpanda
mvn spring-boot:run

Main local endpoints:
- Admin Console: http://localhost:8080/admin
- OAuth discovery: http://localhost:8080/.well-known/openid-configuration
- Customer API example: http://localhost:8080/api/customers?limit=10

Demo accounts (username / password):
- admin / admin123
- sales001 / sales123
The default runnable setup uses realistic mock data rather than a live JKYun production connection.
Current mock scale includes:
- 1,000 JKYun customers
- 3,600 JKYun orders
- 12,573 order lines
- 520 refund records
- external-crm and retail-pos overlap samples
Current Phase 0 / Phase 1 implementation includes:
- mock customer / trade ingestion through a connector-driven pipeline
- Raw object persistence plus raw_records, source_change_events, and source_event_outbox
- customer and order normalizers
- source_customers, source_trades, canonical_customers, canonical_orders, and identity_map
- canonical_change_events
- customer and order query APIs with Phase 1 permission enforcement
- local Search / Vector-like retrieval for governed candidate recall
- customer knowledge candidate-only serving API
- OAuth 2.1 demo authorization server with PKCE, JWKS, introspection, and revoke
- Admin Console with 6 governance workspaces, ECharts topology/trend views, replay / backfill, status mapping, MDM, reconcile, permission simulation, MCP sessions, and audit foundations
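For readers new to the PKCE part of the OAuth flow listed above, the client-side step can be sketched as follows. This is a generic illustration of the S256 method from RFC 7636, not code taken from this repository; the class and method names are hypothetical. The client generates a random code_verifier, sends its SHA-256 based code_challenge with the authorization request, and later presents the verifier at the token endpoint.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

/** Illustrative sketch of RFC 7636 S256; names are hypothetical. */
public class PkceSketch {

    /** 32 random bytes encode to a 43-character verifier, the minimum length RFC 7636 allows. */
    static String newCodeVerifier() {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding. */
    static String codeChallengeS256(String codeVerifier) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(codeVerifier.getBytes(StandardCharsets.US_ASCII));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String verifier = newCodeVerifier();
        System.out.println("code_verifier=" + verifier);
        System.out.println("code_challenge=" + codeChallengeS256(verifier));
        System.out.println("code_challenge_method=S256");
    }
}
```

The authorization server recomputes the challenge from the verifier it receives at the token endpoint and rejects the exchange if the two values differ.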
Current status: active early-stage reference implementation.
What is stable enough to explore today:
- governed ingestion and normalization with realistic mock business data
- customer / order serving paths with permission enforcement and audit
- OAuth PKCE demo authorization flow for MCP / agent-facing access
- Admin Console workflows for governance, replay, MDM, and operational inspection
What is still intentionally incomplete:
- live JKYun production data integration
- full production deployment posture
- full standalone Permission Service rollout
- production search / vector / analytics infrastructure
This means the repository is suitable for:
- architecture review
- OSS evaluation
- local demos
- governed MCP / agent integration experiments
It should not yet be presented as:
- a finished production data platform
- a live production JKYun connector
- a complete enterprise identity and authorization product
This repository is a good fit for:
- engineers building MCP / AI agent integrations that need governed business-data access
- teams evaluating how to combine ingestion, canonical modeling, OAuth, permission checks, masking, and audit in one reference stack
- architects who want a concrete example of Raw -> event -> canonical -> authorized retrieval boundaries
- product or platform teams exploring safe retrieval patterns before connecting real production systems
This repository is not a good fit if you need:
- a drop-in production JKYun connector with real tenant credentials already integrated
- a finished enterprise permission platform with full production policy lifecycle and organization sync
- a production-ready OpenSearch / Qdrant / ClickHouse stack out of the box
- a minimal toy MCP demo that ignores audit, masking, tenant isolation, and data-governance boundaries
High-level repository structure:
- src/main/java/: application code for ingestion, normalization, serving, auth, permission, audit, search, MDM, replay, and admin flows
- src/main/resources/: application config, Flyway migrations, and Admin Console static assets
- src/test/java/: unit and integration tests
- docs/: architecture, execution contract, implementation status, MCP auth design, and product-completion planning
- mock-data/: generated mock customer / order / refund / overlap data used for runnable demos
- scripts/: helper scripts such as OAuth PKCE local flow checks
- docker-compose.yml: local infrastructure bootstrap
- docs/05-local-runbook.md: detailed local runbook and command reference
Recommended reading order:
- README.md — what the project is, why it exists, and how to run it quickly
- AGENTS.md — repository-wide execution constraints and architecture guardrails
- docs/00-llm-wiki-index.md — knowledge map for the rest of the repo
- docs/01-architecture-execution-contract.md — implementation contract derived from the architecture
- docs/02-implementation-status.md — what is already implemented vs. still out of scope
Additional key documents:
- docs/ai-data-foundation-architecture.md
- docs/mcp-ai-application-auth-design.md
- docs/03-commercial-product-completion-plan.md
- docs/05-local-runbook.md
- docs/05-oss-reviewer-checklist.md
- CONTRIBUTING.md
- SECURITY.md
If you are reviewing this repository as an OSS evaluator, the fastest path is:
- read the top sections of this README for scope, status, and non-goals
- check docs/02-implementation-status.md to see what is implemented versus still intentionally incomplete
- inspect CHANGELOG.md for the current tagged capability snapshot
- run docker compose up -d postgres redpanda and mvn spring-boot:run
- verify the main endpoints:
  - http://localhost:8080/admin
  - http://localhost:8080/.well-known/openid-configuration
  - http://localhost:8080/api/customers?limit=10
- run mvn test
Things to keep in mind while reviewing:
- this repository is intentionally honest about current boundaries
- mock data is part of the runnable demo flow
- real JKYun production integration is not yet complete
- the project is designed as a governed reference implementation, not as a minimal MCP toy example
Near-term priorities:
- connect real JKYun business data through the existing governed ingestion boundaries
- harden OAuth / SSO integration beyond the current demo authorization-server slice
- continue pushing MCP-safe permission, masking, audit, and retrieval patterns
- validate deployment, operations, and observability for more production-like environments
Medium-term priorities:
- evolve local retrieval into production-grade search / vector infrastructure
- expand business-domain coverage beyond the current customer / order-centered slice
- separate and harden Permission Service responsibilities where needed
- improve production readiness for backup, alerting, and multi-environment rollout
This repository does not currently claim:
- live JKYun production API integration
- production-grade multi-node deployment
- full enterprise permission-service rollout
- real OpenSearch / Qdrant / external LLM RAG production integration
- final business-domain coverage for inventory, suppliers, refunds, and export execution
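Before opening a PR or making architecture-sensitive changes, read the guardrail and contribution documents listed among the key documents in this README, such as AGENTS.md and CONTRIBUTING.md.
Repository governance files such as SECURITY.md are listed there as well.
If you want to contribute code or docs, please also check the issue templates and PR template under .github/.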
The current implementation focuses on a minimum runnable end-to-end slice:
mock JKYun customer / order data
-> raw object temporary storage
-> PostgreSQL transaction writes raw_records + source_change_events + source_event_outbox
-> raw object linked after transaction commit
-> source_event_publisher sends Change Envelope to Redpanda
-> customer / trade normalizer
-> source_customers/source_trades + identity_map + canonical_customers/canonical_orders
-> canonical_change_events
-> customer query API with basic tenant / role checks
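The transactional step in the flow above (raw_records + source_change_events + source_event_outbox written together, then published) follows the general outbox pattern. The sketch below illustrates that pattern with in-memory lists standing in for the three PostgreSQL tables and a plain Consumer standing in for the Redpanda producer. The record fields, the "evt-" id scheme, and the method names are assumptions made for illustration, and the raw object storage and linking steps are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative in-memory outbox sketch; field and method names are hypothetical. */
public class OutboxSketch {

    record RawRecord(String rawId, String payload) {}
    record SourceChangeEvent(String eventId, String rawId, String entityType) {}
    record OutboxRow(String eventId, boolean published) {}

    // In-memory stand-ins for raw_records, source_change_events, and source_event_outbox.
    final List<RawRecord> rawRecords = new ArrayList<>();
    final List<SourceChangeEvent> sourceChangeEvents = new ArrayList<>();
    final List<OutboxRow> sourceEventOutbox = new ArrayList<>();

    /** Simulates one transaction: either all three rows are written or none are. */
    synchronized void ingest(String rawId, String entityType, String payload) {
        int rawMark = rawRecords.size();
        int eventMark = sourceChangeEvents.size();
        int outboxMark = sourceEventOutbox.size();
        try {
            rawRecords.add(new RawRecord(rawId, payload));
            if (payload.isBlank()) {
                throw new IllegalArgumentException("empty payload");
            }
            String eventId = "evt-" + rawId; // assumed id scheme
            sourceChangeEvents.add(new SourceChangeEvent(eventId, rawId, entityType));
            sourceEventOutbox.add(new OutboxRow(eventId, false));
        } catch (RuntimeException failure) {
            // Roll back partial writes so downstream stages never see a half-ingested record.
            rawRecords.subList(rawMark, rawRecords.size()).clear();
            sourceChangeEvents.subList(eventMark, sourceChangeEvents.size()).clear();
            sourceEventOutbox.subList(outboxMark, sourceEventOutbox.size()).clear();
            throw failure;
        }
    }

    /** Publishes unpublished outbox rows; a row is marked only after the broker call returns. */
    synchronized int publishPending(Consumer<SourceChangeEvent> broker) {
        int sent = 0;
        for (int i = 0; i < sourceEventOutbox.size(); i++) {
            OutboxRow row = sourceEventOutbox.get(i);
            if (row.published()) {
                continue;
            }
            SourceChangeEvent event = sourceChangeEvents.stream()
                    .filter(candidate -> candidate.eventId().equals(row.eventId()))
                    .findFirst()
                    .orElseThrow();
            broker.accept(event); // a message-broker producer would sit here in a real deployment
            sourceEventOutbox.set(i, new OutboxRow(row.eventId(), true));
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.ingest("r1", "customer", "{\"name\":\"Ada\"}");
        int sent = sketch.publishPending(event -> System.out.println("publish " + event));
        System.out.println("published rows: " + sent);
    }
}
```

Because a row is marked as published only after the broker call returns, a failed publish leaves it pending for the next run. That gives at-least-once delivery, so downstream normalizers in such a design need to tolerate duplicate events.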
For the full command reference and the longer local walkthrough, see docs/05-local-runbook.md.
The shortest practical local validation loop is:
- start infrastructure: docker compose up -d postgres redpanda
- start the application: mvn spring-boot:run
- verify the main endpoints:
  - http://localhost:8080/admin
  - http://localhost:8080/.well-known/openid-configuration
  - http://localhost:8080/api/customers?limit=10
- for Admin Console browser checks and responsive screenshots after the application is running, run npx playwright install chromium and then node scripts/admin-console-verify.mjs
- run tests: mvn test
- if you want the full runnable data flow, use the detailed runbook to:
  - ingest mock customers and trades
  - normalize customer and order records
  - build local Search / Vector-like indexes
  - exercise Admin Console, MDM, replay, and backfill flows
For the most accurate statement of implemented boundaries, rely on docs/02-implementation-status.md.

