techySPHINX · techySPHINX · Mar 17, 2026 · Mar 17, 2026
diff --git a/.github/agents/nivesh-backend-ml-integration.agent.md b/.github/agents/nivesh-backend-ml-integration.agent.md
@@ -0,0 +1,57 @@
+---
+name: Nivesh Backend-ML Integration Engineer
+description: "Use when implementing, debugging, or reviewing cross-service work between backend NestJS and ml-services FastAPI, including API contracts, auth propagation, schema validation, retraining workflows, and deployment/runtime alignment."
+tools: [read, search, edit, execute, todo]
+model: "GPT-5 (copilot)"
+argument-hint: "Describe the business goal, affected backend and ml-services modules, contract changes expected, and whether tests should be run."
+user-invocable: true
+---
+You are the Nivesh Backend-ML Integration Engineer, a senior software and ML platform engineer responsible for reliable end-to-end delivery across `backend` and `ml-services`.
+
+## Mission
+Ship production-grade features and fixes that keep backend and ML services tightly aligned, secure, and test-verified.
+
+## Scope
+- Primary scope: `backend/` and `ml-services/`
+- Secondary scope: `docs/`, `docker-compose*.yml`, `k8s/`, `helm/`, `monitoring/`, and environment examples when integration contracts change
+
+## Execution Priorities
+Follow the remediation sequence unless the user sets a different priority:
+1. Phase 0 critical path stability: MLS-001 to MLS-005
+2. Phase 1 security and runtime hardening: MLS-006 to MLS-008
+3. Phase 2 contract and validation governance: MLS-009 to MLS-011
+4. Phase 3 retraining control plane: MLS-012 to MLS-014
+5. Phase 4 integration and release convergence: MLS-015 to MLS-017
+
+## Quality Bar
+- Operate at SWE-benchmark-style rigor: root-cause first, minimal safe patch, evidence-backed validation
+- Never claim a test passed unless it was executed and result captured
+- Prefer deterministic fixes over speculative changes
+
+## Non-Negotiable Constraints
+- Do not change API contracts silently; when changing contracts, update caller code, tests, and docs in the same task
+- Do not weaken auth, API key checks, rate limiting, or validation boundaries
+- Do not leave backend and ml-services endpoint naming inconsistent
+- Do not complete work with unresolved build, type, or test failures caused by your changes
+
+## Operating Workflow
+1. Discover and map current contracts, endpoints, schema expectations, and auth behavior
+2. Identify integration impact across backend callers and ml-services providers
+3. Propose and apply the smallest complete change set that resolves the issue
+4. Update environment templates, deployment manifests, and docs when contracts or runtime behavior change
+5. Run targeted validation first, then broader checks when risk is cross-cutting
+6. Report outcomes with explicit evidence, remaining risks, and next action
+
+## Validation Defaults
+Use relevant commands based on scope and report exact outcomes:
+- ML services tests: `cd ml-services; pytest`
+- ML endpoint tests: `cd ml-services; pytest test/test_model_server.py -v`
+- Backend tests: `cd backend; pnpm test`
+- Backend build checks: `cd backend; pnpm build`
+
+## Output Format
+Return responses in this order:
+1. Objective and resolution summary
+2. File-by-file changes and contract impact
+3. Validation commands executed with pass or fail status
+4. Risks, assumptions, and recommended next step
diff --git a/ml-services-execution-plan/ML_SERVICES_EXECUTION_CHECKLIST.md b/ml-services-execution-plan/ML_SERVICES_EXECUTION_CHECKLIST.md
@@ -0,0 +1,308 @@
+# ML Services Execution Checklist
+
+Document owner: ML Services Engineering
+Document date: 2026-03-17
+Linked strategy document: ML_SERVICES_REMEDIATION_ROADMAP.md
+Linked executive plan: ML_SERVICES_EXECUTIVE_PLAN.md
+Linked sprint plan: SPRINT_WISE_TICKET_PLAN.md
+Execution mode: Principal architect governance with weekly phase gates
+
+## How to Use This Checklist
+
+1. Assign owner and target date for every item before execution starts.
+2. Mark each item done only when objective evidence is captured.
+3. Do not close a phase until all mandatory gate criteria are satisfied.
+4. Record deviations and approved exceptions in the notes section.
+
+Legend:
+
+- Priority: C0 (Critical), C1 (High), C2 (Medium)
+- Status: [ ] Not started, [x] Complete
+
+## A) Program Governance and Pre-Flight
+
+- [ ] Confirm execution owner matrix across ML, Platform, Backend, QA, SRE
+- [ ] Finalize sprint mapping for Phase 0 to Phase 4
+- [ ] Open tracking board with each task ID mapped to engineering ticket
+- [ ] Define release branch strategy and rollback approach
+- [ ] Confirm non-production validation environment availability
+
+Evidence links:
+
+- Owner matrix:
+- Tracking board:
+- Release and rollback plan:
+
+## B) Phase 0 - Critical Path Stabilization (Week 1)
+
+### B1. Intent DAG Reliability (C0)
+
+- [ ] Align intent DAG required columns with real dataset fields
+- [ ] Align intent DAG trainer invocation with actual train function signature
+- [ ] Validate XCom outputs for accuracy and f1 mapping
+- [ ] Execute DAG dry-run and record logs
+
+Evidence:
+
+- DAG run ID:
+- Dry-run log:
+
+### B2. Credit DAG Reliability (C0)
+
+- [ ] Align credit DAG required columns with actual training dataset schema
+- [ ] Align target label naming across DAG and model training logic
+- [ ] Expose or adapt trainer interface to support DAG invocation contract
+- [ ] Execute DAG dry-run and record logs
+
+Evidence:
+
+- DAG run ID:
+- Dry-run log:
+
+### B3. Build and Container Integrity (C0)
+
+- [ ] Fix docker-compose model-server dockerfile path reference
+- [ ] Fix Dockerfile dependency file reference to existing requirements source
+- [ ] Validate local build success
+- [ ] Validate CI build success
+
+Evidence:
+
+- Local build output:
+- CI build run:
+
+### Phase 0 Gate (Mandatory)
+
+- [ ] Intent retraining DAG runs without interface/schema failure
+- [ ] Credit retraining DAG runs without interface/schema failure
+- [ ] model-server image builds successfully in local and CI
+
+Gate approver:
+
+- [ ] Principal architect sign-off
+
+## C) Phase 1 - Security and Runtime Hardening (Week 2 to Week 3)
+
+### C1. Endpoint Security Consistency (C1)
+
+- [ ] Inventory all model-server endpoints by auth requirement
+- [ ] Apply API key and rate-limit dependency to non-health operational endpoints
+- [ ] Apply equivalent protection to drift monitoring router endpoints
+- [ ] Validate authorized and unauthorized behavior with test cases
+
+Evidence:
+
+- Endpoint security matrix:
+- Test report:
+
+### C2. Probe Semantics Correction (C1)
+
+- [ ] Update Kubernetes deployment readiness path to readiness endpoint
+- [ ] Update Kubernetes deployment liveness path to liveness endpoint
+- [ ] Update Helm template probe paths consistently
+- [ ] Validate behavior under dependency failure simulation
+
+Evidence:
+
+- K8s manifest diff:
+- Helm template diff:
+- Probe behavior test log:
+
+### C3. Environment Contract Alignment (C1)
+
+- [ ] Add ML_SERVICE_URL and ML_API_KEY to backend environment example
+- [ ] Add ML_API_KEYS or equivalent policy variable to ml-services environment example
+- [ ] Update integration docs with canonical auth and endpoint settings
+
+Evidence:
+
+- Updated env examples:
+- Docs updates:
+
+### Phase 1 Gate (Mandatory)
+
+- [ ] All non-health operational endpoints reject missing or invalid auth
+- [ ] Probe configuration uses correct readiness and liveness paths
+- [ ] Backend and ml-services env examples are contract-complete
+
+Gate approver:
+
+- [ ] Principal architect sign-off
+
+## D) Phase 2 - Contract and Validation Governance (Week 4 to Week 5)
+
+### D1. Strict API Schema Enforcement (C1)
+
+- [ ] Replace weak dictionary payload contracts with explicit request models
+- [ ] Replace weak response payload contracts where ambiguity exists
+- [ ] Add field constraints and validation messages for invalid requests
+
+Evidence:
+
+- Schema diff:
+- Contract test run:
+
+### D2. Validation Layer Integration (C1)
+
+- [ ] Integrate data_validation checks into inference input boundary
+- [ ] Integrate validation checks into retraining data ingestion path
+- [ ] Add failure-mode tests for malformed payloads and training data
+
+Evidence:
+
+- Validation integration points:
+- Failure-mode tests:
+
+### D3. OpenAPI Governance in CI (C2)
+
+- [ ] Generate OpenAPI artifact in CI pipeline
+- [ ] Add contract drift check policy
+- [ ] Fail CI on unreviewed contract changes
+
+Evidence:
+
+- CI workflow update:
+- Sample contract report:
+
+### Phase 2 Gate (Mandatory)
+
+- [ ] Typed contracts enforced on critical endpoints
+- [ ] Validation checks active for inference and training boundaries
+- [ ] CI contract governance gate enabled
+
+Gate approver:
+
+- [ ] Principal architect sign-off
+
+## E) Phase 3 - Retraining Control Plane Completion (Week 6 to Week 7)
+
+### E1. Manual Retraining Control (C1)
+
+- [ ] Add authenticated manual retraining endpoint(s)
+- [ ] Capture audit metadata for each retraining trigger
+- [ ] Add operator runbook for manual invocation and rollback
+
+Evidence:
+
+- API spec and endpoint test:
+- Runbook:
+
+### E2. Drift to Retraining Path (C1)
+
+- [ ] Define policy thresholds and cool-down windows
+- [ ] Connect drift alert path to controlled retraining trigger flow
+- [ ] Add safety guardrails for repeated trigger suppression
+
+Evidence:
+
+- Trigger policy config:
+- End-to-end scenario test:
+
+### E3. Production Drift Data Source (C1)
+
+- [ ] Replace simulated drift data path with production source adapter
+- [ ] Validate data quality and schema checks on collected drift samples
+- [ ] Verify drift outputs against known baseline snapshots
+
+Evidence:
+
+- Data source integration:
+- Drift validation report:
+
+### Phase 3 Gate (Mandatory)
+
+- [ ] Manual retraining is operational and authenticated
+- [ ] Drift-trigger workflow is implemented with guardrails
+- [ ] Drift checks use real production-aligned data
+
+Gate approver:
+
+- [ ] Principal architect sign-off
+
+## F) Phase 4 - Quality, Integration, and Documentation Convergence (Week 8)
+
+### F1. Test Hardening (C1)
+
+- [ ] Replace permissive status assertions with strict expected outcomes
+- [ ] Add endpoint auth test coverage for operational routes
+- [ ] Add contract regression tests for payload schema boundaries
+
+Evidence:
+
+- Test diff:
+- Test run report:
+
+### F2. Backend Integration Assurance (C1)
+
+- [ ] Add backend integration tests for ml-categorization service path
+- [ ] Validate fallback behavior and API key propagation
+- [ ] Validate endpoint path compatibility between backend and ml-services
+
+Evidence:
+
+- Integration test run:
+- Endpoint compatibility report:
+
+### F3. Documentation Parity (C2)
+
+- [ ] Correct endpoint naming drift in docs and examples
+- [ ] Ensure env setup docs match implementation requirements
+- [ ] Publish final API contract references and operational runbooks
+
+Evidence:
+
+- Docs change set:
+- Published references:
+
+### Phase 4 Gate (Mandatory)
+
+- [ ] Strict test suite passes in CI
+- [ ] Backend integration checks pass
+- [ ] Documentation and implementation are aligned
+
+Gate approver:
+
+- [ ] Principal architect sign-off
+
+## G) Cross-Phase Non-Functional Controls
+
+- [ ] Performance baseline recorded before changes
+- [ ] Performance regression threshold defined and monitored
+- [ ] Security review completed for auth and rate-limit changes
+- [ ] Observability dashboards updated for new endpoints and retraining events
+- [ ] Incident response and rollback drills executed at least once
+
+Evidence:
+
+- Performance report:
+- Security review:
+- Observability update:
+- Drill report:
+
+## H) Final Release Readiness Checklist
+
+- [ ] All phase gates approved
+- [ ] No open C0 or C1 defects
+- [ ] CI pipeline green on required jobs
+- [ ] Deployment manifests validated in target environment
+- [ ] Stakeholder sign-off captured (ML Lead, Platform Lead, Backend Lead, QA, SRE)
+
+Release decision:
+
+- [ ] Go
+- [ ] No-go
+
+Decision notes:
+
+
+## I) Post-Release Validation (First 7 Days)
+
+- [ ] Verify no failed scheduled retraining runs
+- [ ] Verify no unauthorized access attempts are accepted
+- [ ] Verify probe behavior during one controlled dependency disruption test
+- [ ] Verify no endpoint contract mismatch incidents from backend clients
+- [ ] Verify drift monitoring and retraining controls produce expected telemetry
+
+Post-release owner:
+
+Post-release report link: