HuangHaohang/ai-data-foundation

AI Data Foundation

AI Data Foundation is an open-source reference implementation for building governed data access layers for MCP and AI agent applications.

It focuses on a common problem: AI agents should be able to retrieve business data, but must not bypass tenant isolation, object-level permissions, field masking, audit trails, or source-of-truth data pipelines.

[Figure: architecture overview]

Why it matters

Many AI / MCP / agent demos stop at “the model can query data”.

This project focuses on the harder part:

  • how external data is ingested without skipping Raw / event / outbox boundaries
  • how source-facing and canonical models stay separated
  • how OAuth, API keys, service accounts, permission enforcement, masking, and audit are applied before business data is returned
  • how retrieval can support agent use cases without turning into an unsafe direct-database shortcut

What it provides

  • Raw -> Source Change Event -> Outbox ingestion pipeline
  • Source and canonical customer / order models
  • OAuth 2.1 Authorization Code + PKCE demo authorization server
  • API Key and service-account access
  • Permission enforcement, masking, and audit logs
  • Candidate-only retrieval followed by authorized canonical backfill
  • Admin Console for governance workflows
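The masking and audit items above can be illustrated with a small sketch. It is not taken from the repository: `MaskingSketch`, `AuditEntry`, `serve`, and the "keep the last four characters" rule are hypothetical names and choices made only for illustration. The idea it tries to show is default-deny masking (any field not explicitly cleared for the caller is masked) with an audit entry written for every read.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; class, record, and method names are assumptions, not repository code.
final class MaskingSketch {

    /** One audit row per served record: who read it and which fields were masked. */
    record AuditEntry(String userId, String recordId, List<String> maskedFields) {}

    final List<AuditEntry> auditLog = new ArrayList<>();

    /** Keep the last four characters; values of four characters or fewer are fully masked. */
    static String mask(String value) {
        if (value.length() <= 4) {
            return "*".repeat(value.length());
        }
        return "*".repeat(value.length() - 4) + value.substring(value.length() - 4);
    }

    /** Default-deny: every field not listed in clearFields is masked before it leaves the server. */
    Map<String, String> serve(String userId, String recordId,
                              Map<String, String> row, Set<String> clearFields) {
        Map<String, String> out = new LinkedHashMap<>();
        List<String> masked = new ArrayList<>();
        for (Map.Entry<String, String> field : row.entrySet()) {
            if (clearFields.contains(field.getKey())) {
                out.put(field.getKey(), field.getValue());
            } else {
                out.put(field.getKey(), mask(field.getValue()));
                masked.add(field.getKey());
            }
        }
        // The audit entry is written on every call, including calls where nothing was masked.
        auditLog.add(new AuditEntry(userId, recordId, List.copyOf(masked)));
        return out;
    }

    public static void main(String[] args) {
        MaskingSketch sketch = new MaskingSketch();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "Alice");
        row.put("phone", "13800001234");
        System.out.println(sketch.serve("sales001", "c1", row, Set.of("name")));
        System.out.println(sketch.auditLog);
    }
}
```

In a real deployment the set of clear fields would come from server-side policy for the authenticated caller, not from the request.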

Product preview

Landing-style overview of the governed access model:

AI Data Foundation landing page

Example governance UI for MCP tools, rollout rules, masking, and audit controls:

MCP tool governance console

Project principles

This repository is guided by a few non-negotiable principles:

  • Raw first: source data should land in Raw before downstream normalization or retrieval
  • Event boundaries matter: ingestion and downstream publication should respect Change Envelope and outbox boundaries
  • Source and canonical models are different layers: source models preserve source semantics, while canonical models represent unified business meaning
  • Permissions are enforced server-side: MCP / API / agent callers must not define their own tenant, user, or permission scope
  • Retrieval must stay safe: candidate recall is not authorization, and all final business data must flow through authorized canonical backfill
  • Audit is part of the product: governed data access is incomplete if access, masking, and decision paths are not traceable
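The "candidate recall is not authorization" principle can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `GovernedRetrievalSketch`, `CallerScope`, `Candidate`, `CanonicalCustomer`, and the ownership-based check are all hypothetical. Recall returns identifiers only, the caller scope is resolved on the server, and business fields are read from the canonical store only for candidates that pass the permission check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these type or method names come from the repository.
final class GovernedRetrievalSketch {

    /** Scope resolved on the server from the authenticated token, never from request parameters. */
    record CallerScope(String tenantId, String userId) {}

    /** A recall hit carries only an identifier and a score, no business fields. */
    record Candidate(String canonicalId, double score) {}

    record CanonicalCustomer(String id, String tenantId, String ownerUserId, String name) {}

    private final Map<String, CanonicalCustomer> canonicalStore;

    GovernedRetrievalSketch(Map<String, CanonicalCustomer> canonicalStore) {
        this.canonicalStore = canonicalStore;
    }

    /** Object-level check used by this sketch: same tenant, and the record is owned by the caller. */
    private boolean isAuthorized(CallerScope scope, CanonicalCustomer customer) {
        return customer.tenantId().equals(scope.tenantId())
                && customer.ownerUserId().equals(scope.userId());
    }

    /** Backfill: re-read each candidate from the canonical store and drop anything not authorized. */
    List<CanonicalCustomer> backfill(CallerScope scope, List<Candidate> candidates) {
        List<CanonicalCustomer> result = new ArrayList<>();
        for (Candidate candidate : candidates) {
            Optional.ofNullable(canonicalStore.get(candidate.canonicalId()))
                    .filter(customer -> isAuthorized(scope, customer))
                    .ifPresent(result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        var sketch = new GovernedRetrievalSketch(Map.of(
                "c1", new CanonicalCustomer("c1", "tenant-a", "sales001", "Alice"),
                "c2", new CanonicalCustomer("c2", "tenant-b", "sales002", "Bob")));
        var visible = sketch.backfill(new CallerScope("tenant-a", "sales001"),
                List.of(new Candidate("c1", 0.92), new Candidate("c2", 0.88)));
        System.out.println(visible); // only the tenant-a record owned by sales001 remains
    }
}
```

The design point is that a high recall score never grants access: a candidate outside the caller's scope is dropped even if it ranks first.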

5-minute quick start

Requirements

  • Java 21
  • Maven 3.9+
  • Docker

Start infrastructure

docker compose up -d postgres redpanda

Start the application

mvn spring-boot:run

Open the main endpoints

  • Admin Console: http://localhost:8080/admin
  • OAuth discovery: http://localhost:8080/.well-known/openid-configuration
  • Customer API example: http://localhost:8080/api/customers?limit=10
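For callers that prefer code over a browser, the customer endpoint above could be requested with the JDK's built-in HTTP client. The sketch below only builds the request. The `Accept` header and the use of a Bearer token are assumptions; this README does not state which credential the endpoint expects, so adjust the header to whatever the running instance requires. `CustomerApiRequestSketch` and `listCustomers` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; the Bearer header is an assumption about how the API is authenticated.
final class CustomerApiRequestSketch {

    /** Builds GET {baseUrl}/api/customers?limit={limit} without sending it. */
    static HttpRequest listCustomers(String baseUrl, int limit, String bearerToken) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/customers?limit=" + limit))
                .header("Accept", "application/json")
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = listCustomers("http://localhost:8080", 10, "replace-with-a-real-token");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

With the application running, the request can be sent using `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.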

Demo login (username / password)

  • admin / admin123
  • sales001 / sales123

Current runnable data shape

The default runnable setup uses realistic mock data rather than a live JKYun production connection.

Current mock scale includes:

  • 1,000 JKYun customers
  • 3,600 JKYun orders
  • 12,573 order lines
  • 520 refund records
  • external-crm and retail-pos overlap samples

Current capabilities

Current Phase 0 / Phase 1 implementation includes:

  • mock customer / trade ingestion through a connector-driven pipeline
  • Raw object persistence plus raw_records, source_change_events, and source_event_outbox
  • customer and order normalizers
  • source_customers, source_trades, canonical_customers, canonical_orders, and identity_map
  • canonical_change_events
  • customer and order query APIs with Phase 1 permission enforcement
  • local Search / Vector-like retrieval for governed candidate recall
  • customer knowledge candidate-only serving API
  • OAuth 2.1 demo authorization server with PKCE, JWKS, introspection, and revoke
  • Admin Console with 6 governance workspaces, ECharts topology/trend views, replay / backfill, status mapping, MDM, reconcile, permission simulation, MCP sessions, and audit foundations
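For the PKCE part of the OAuth flow listed above, a client has to generate a code verifier and derive the S256 code challenge from it, as defined in RFC 7636. The sketch below shows that derivation with JDK classes only. `PkceSketch` and its method names are hypothetical, and nothing here is specific to the demo authorization server in this repository.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical client-side helper following RFC 7636; not repository code.
final class PkceSketch {

    /** 32 random bytes, base64url-encoded without padding, which yields a 43-character verifier. */
    static String newCodeVerifier() {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** S256 challenge: BASE64URL(SHA-256(ASCII(code_verifier))), without padding. */
    static String codeChallengeS256(String codeVerifier) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(codeVerifier.getBytes(StandardCharsets.US_ASCII));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String verifier = newCodeVerifier();
        System.out.println("code_verifier=" + verifier);
        System.out.println("code_challenge=" + codeChallengeS256(verifier));
    }
}
```

The client sends `code_challenge` with `code_challenge_method=S256` on the authorization request and the original `code_verifier` on the token request. The example in RFC 7636 Appendix B can be used to check the derivation.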

Project status

Current status: active early-stage reference implementation.

What is stable enough to explore today:

  • governed ingestion and normalization with realistic mock business data
  • customer / order serving paths with permission enforcement and audit
  • OAuth PKCE demo authorization flow for MCP / agent-facing access
  • Admin Console workflows for governance, replay, MDM, and operational inspection

What is still intentionally incomplete:

  • live JKYun production data integration
  • full production deployment posture
  • full standalone Permission Service rollout
  • production search / vector / analytics infrastructure

This means the repository is suitable for:

  • architecture review
  • OSS evaluation
  • local demos
  • governed MCP / agent integration experiments

It should not yet be presented as:

  • a finished production data platform
  • a live production JKYun connector
  • a complete enterprise identity and authorization product

Who this is for

This repository is a good fit for:

  • engineers building MCP / AI agent integrations that need governed business-data access
  • teams evaluating how to combine ingestion, canonical modeling, OAuth, permission checks, masking, and audit in one reference stack
  • architects who want a concrete example of Raw -> event -> canonical -> authorized retrieval boundaries
  • product or platform teams exploring safe retrieval patterns before connecting real production systems

Who this is not for

This repository is not a good fit if you need:

  • a drop-in production JKYun connector with real tenant credentials already integrated
  • a finished enterprise permission platform with full production policy lifecycle and organization sync
  • a production-ready OpenSearch / Qdrant / ClickHouse stack out of the box
  • a minimal toy MCP demo that ignores audit, masking, tenant isolation, and data-governance boundaries

Repository layout

High-level repository structure:

  • src/main/java/ — application code for ingestion, normalization, serving, auth, permission, audit, search, MDM, replay, and admin flows
  • src/main/resources/ — application config, Flyway migrations, and Admin Console static assets
  • src/test/java/ — unit and integration tests
  • docs/ — architecture, execution contract, implementation status, MCP auth design, and product-completion planning
  • mock-data/ — generated mock customer / order / refund / overlap data used for runnable demos
  • scripts/ — helper scripts such as OAuth PKCE local flow checks
  • docker-compose.yml — local infrastructure bootstrap
  • docs/05-local-runbook.md — detailed local runbook and command reference

Key docs

Recommended reading order:

  1. README.md — what the project is, why it exists, and how to run it quickly
  2. AGENTS.md — repository-wide execution constraints and architecture guardrails
  3. docs/00-llm-wiki-index.md — knowledge map for the rest of the repo
  4. docs/01-architecture-execution-contract.md — implementation contract derived from the architecture
  5. docs/02-implementation-status.md — what is already implemented vs. still out of scope

OSS reviewer checklist

If you are reviewing this repository as an OSS evaluator, the fastest path is:

  1. read the top sections of this README for scope, status, and non-goals
  2. check docs/02-implementation-status.md to see what is implemented versus still intentionally incomplete
  3. inspect CHANGELOG.md for the current tagged capability snapshot
  4. run docker compose up -d postgres redpanda and mvn spring-boot:run
  5. verify the main endpoints:
    • http://localhost:8080/admin
    • http://localhost:8080/.well-known/openid-configuration
    • http://localhost:8080/api/customers?limit=10
  6. run mvn test

Things to keep in mind while reviewing:

  • the documentation states current boundaries explicitly, including what is not yet implemented
  • mock data is part of the runnable demo flow
  • real JKYun production integration is not yet complete
  • the project is designed as a governed reference implementation, not as a minimal MCP toy example

Roadmap

Near-term priorities:

  1. connect real JKYun business data through the existing governed ingestion boundaries
  2. harden OAuth / SSO integration beyond the current demo authorization-server slice
  3. continue pushing MCP-safe permission, masking, audit, and retrieval patterns
  4. validate deployment, operations, and observability for more production-like environments

Medium-term priorities:

  1. evolve local retrieval into production-grade search / vector infrastructure
  2. expand business-domain coverage beyond the current customer / order-centered slice
  3. separate and harden Permission Service responsibilities where needed
  4. improve production readiness for backup, alerting, and multi-environment rollout

Explicit non-goals for the current version

This repository does not currently claim:

  • live JKYun production API integration
  • production-grade multi-node deployment
  • full enterprise permission-service rollout
  • real OpenSearch / Qdrant / external LLM RAG production integration
  • final business-domain coverage for inventory, suppliers, refunds, and export execution

Contributing

Before opening a PR or making architecture-sensitive changes, read:

  1. AGENTS.md
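  2. docs/00-llm-wiki-index.md (llm-wiki Index)
  3. docs/01-architecture-execution-contract.md (Architecture Execution Contract)
  4. docs/02-implementation-status.md (Implementation Status)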

If you want to contribute code or docs, please also check the issue templates and PR template under .github/.

Architecture snapshot

The current implementation focuses on a minimum runnable end-to-end slice:

mock JKYun customer / order data
  -> raw object temporary storage
  -> PostgreSQL transaction writes raw_records + source_change_events + source_event_outbox
  -> raw object linked after transaction commit
  -> source_event_publisher sends Change Envelope to Redpanda
  -> customer / trade normalizer
  -> source_customers/source_trades + identity_map + canonical_customers/canonical_orders
  -> canonical_change_events
  -> customer query API with basic tenant / role checks
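The transactional step in the flow above (raw_records, source_change_events, and source_event_outbox written together, then published afterwards) follows the general transactional outbox pattern. The in-memory sketch below illustrates only that pattern. It does not use PostgreSQL or Redpanda, and `OutboxSketch`, `OutboxRow`, the event naming, and the envelope format are hypothetical stand-ins for the project's real tables and Change Envelope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for the three tables and the publisher; not repository code.
final class OutboxSketch {

    record OutboxRow(long eventId, String envelope) {}

    final List<String> rawRecords = new ArrayList<>();
    final List<String> sourceChangeEvents = new ArrayList<>();
    final List<OutboxRow> pendingOutbox = new ArrayList<>();
    private long nextEventId = 1;

    /** Stand-in for one database transaction: all three writes are applied, or none are. */
    void ingest(String rawPayload, boolean failBeforeCommit) {
        long eventId = nextEventId;
        String event = "customer.changed#" + eventId;
        OutboxRow row = new OutboxRow(eventId, "{\"eventId\":" + eventId + "}");
        if (failBeforeCommit) {
            // Simulated rollback: the staged values above are discarded.
            throw new IllegalStateException("simulated failure before commit");
        }
        // "Commit": the raw record, the change event, and the outbox row become visible together.
        rawRecords.add(rawPayload);
        sourceChangeEvents.add(event);
        pendingOutbox.add(row);
        nextEventId++;
    }

    /** Publisher: a row leaves the outbox only after the broker call returns without throwing. */
    int publishPending(Consumer<String> broker) {
        int sent = 0;
        while (!pendingOutbox.isEmpty()) {
            broker.accept(pendingOutbox.get(0).envelope());
            pendingOutbox.remove(0);
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.ingest("{\"customer\":\"mock-1\"}", false);
        int sent = sketch.publishPending(envelope -> System.out.println("published " + envelope));
        System.out.println("sent=" + sent + ", pending=" + sketch.pendingOutbox.size());
    }
}
```

Because a row is removed only after a successful send, a broker failure leaves it pending for a later retry, which gives at-least-once delivery and means downstream consumers should tolerate duplicates.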

Detailed local runbook

For the full command reference and the longer local walkthrough, see docs/05-local-runbook.md.

The shortest practical local validation loop is:

  1. start infrastructure
     docker compose up -d postgres redpanda
  2. start the application
     mvn spring-boot:run
  3. verify the main endpoints
    • http://localhost:8080/admin
    • http://localhost:8080/.well-known/openid-configuration
    • http://localhost:8080/api/customers?limit=10
     For Admin Console browser checks and responsive screenshots after the application is running:
     npx playwright install chromium
     node scripts/admin-console-verify.mjs
  4. run tests
     mvn test
  5. if you want the full runnable data flow, use the detailed runbook to:
    • ingest mock customers and trades
    • normalize customer and order records
    • build local Search / Vector-like indexes
    • exercise Admin Console, MDM, replay, and backfill flows

For the most accurate statement of implemented boundaries, rely on:

Design and architecture docs
