llm-patch/NOTES.md at main · seismael/llm-patch

Executive Summary

The Context-Isolated "Living Wiki" Copilot is an enterprise-grade AI architecture that completely replaces the Retrieval-Augmented Generation (RAG) paradigm for technical documentation and support systems.

Instead of relying on a vector database to fetch probabilistic text chunks at query time, this pipeline treats documentation as source code. It continuously compiles markdown wikis into isolated neural weight modules (LoRA adapters). At inference time, an API gateway deterministically hot-swaps the exact module required for the specific user context, ensuring zero cross-contamination between product versions or domains.

The following is the exhaustive, end-to-end blueprint for finalizing this pipeline for production release.

Phase 1: The Ingestion & Compilation Daemon (Continuous Integration)

The foundation of the system is the daemon that binds the state of the documentation to the state of the language model's memory. This utilizes the existing KnowledgeFusionOrchestrator and WikiKnowledgeSource.

The Workflow

The Trigger: A technical writer commits a change to api-v2-auth.md in the company's documentation repository, or modifies an Obsidian vault.
The Aggregation: The WikiDocumentAggregator intercepts the filesystem event. It parses the YAML frontmatter and resolves any [[wikilinks]] to inject necessary definitional topology (e.g., pulling in the definition of an internal OAuth wrapper).
The Compilation: The system passes the aggregated markdown to the SakanaT2LGenerator. A sub-second forward pass generates the A and B LoRA matrices.
The Persistence: The LocalSafetensorsRepository (or an S3AdapterRepository) serializes the matrices into a .safetensors file named api_v2_auth_adapter.safetensors, writing a new AdapterManifest to a centralized registry.

Actionable Engineering Task

You must expose this not just as a Python API, but as a headless CLI command designed for CI/CD pipelines.

Deliverable: A CLI command llm-patch watch --dir ./docs --out ./adapters that can be run as a systemd service or a Docker container alongside the documentation repository.

Phase 2: The Semantic Router & Inference Gateway (The "Motor Cortex")

This is where the system transitions from a generation tool to a production serving infrastructure. You must build a FastAPI gateway that intercepts user queries, determines which adapter is required, and serves the response.

The Workflow

The Query: A user asks the support chat: "How do I implement the new authentication flow?" along with metadata indicating they are using "API Version 2".
The Router: The FastAPI gateway evaluates the query. It uses either explicit metadata (the user selected "v2" in a dropdown) or a lightweight, fast embedding classifier (like all-MiniLM-L6-v2) against the AdapterManifest summaries to determine that api_v2_auth_adapter.safetensors is the exact required knowledge state.
The Hot-Swap (LoRAX/vLLM): The gateway routes the request to a multi-LoRA inference server (e.g., LoRAX). The server instantly injects the adapter into the base model's attention layers.
The Execution: The LLM generates the response utilizing the full context window for reasoning, while the factual knowledge is inherently drawn from its newly modified parameters.

Object-Oriented Router Design (Actionable Task)

Implement an extensible routing layer using the Strategy pattern to allow developers to define how adapters are chosen.

import abc
from typing import Optional
from llm_patch.core.models import AdapterManifest

class IAdapterRouter(abc.ABC):
    @abc.abstractmethod
    def route(self, user_query: str, request_metadata: dict) -> Optional[AdapterManifest]:
        """Determines the specific adapter required for the inference context."""
        pass

class MetadataExactMatchRouter(IAdapterRouter):
    """Routes based on exact product version tags provided by the client."""
    def __init__(self, manifest_registry: list[AdapterManifest]):
        self.registry = {m.adapter_id: m for m in manifest_registry}

    def route(self, user_query: str, request_metadata: dict) -> Optional[AdapterManifest]:
        target_id = request_metadata.get("context_id")
        return self.registry.get(target_id)

Phase 3: Autonomous Agent Orchestration via Model Context Protocol (MCP)

To capture the rapidly expanding autonomous agent market, the pipeline must be exposed to orchestration layers like Claude Desktop or universal VS Code drivers via the Model Context Protocol.

The Workflow

The Encounter: An autonomous coding agent is assigned a task utilizing an unfamiliar internal framework.
Tool Invocation: The agent recognizes a knowledge gap and invokes an exposed MCP tool: internalize_knowledge(path="./docs/internal_framework").
Dynamic Compilation: The llm-patch pipeline compiles the framework documentation into a temporary adapter and registers its ID.
Agent Execution: The agent issues its subsequent code-generation prompts specifying the new adapter ID. The model dynamically acquires the framework knowledge, generates production-ready code, and unloads the adapter.

Actionable Engineering Task

Implement an MCP Server utilizing the mcp Python SDK. Expose the KnowledgeFusionOrchestrator.process_document() method as a callable tool, and the multi-LoRA inference endpoint as a resource or prompt enhancement.

Systemic Advantages & Go-To-Market Positioning

When finalizing the documentation and README.md for this specific use case, emphasize the mathematical and architectural superiority over standard solutions:

Eradication of Version Hallucinations: By physically swapping weights rather than retrieving text, a model initialized with a v2 adapter literally possesses zero neurological pathways associated with v3 deprecations. Cross-contamination is mathematically nullified.
Pristine Context Windows: Because the factual data resides in the model parameters, the full 128k+ context window remains available strictly for complex reasoning, multi-step execution logic, and conversational history.
Stateless Infrastructure: The base model remains entirely frozen and stateless. Thousands of enterprise tenants or distinct product lines can be served by a single GPU cluster, injecting kilobyte-sized adapters on a per-request basis with zero downtime.

By delivering the CI/CD compilation CLI, the FastAPI dynamic router, and the MCP server integration, this project evolves from a hypernetwork wrapper into definitive, enterprise-grade AI infrastructure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Executive Summary

Phase 1: The Ingestion & Compilation Daemon (Continuous Integration)

The Workflow

Actionable Engineering Task

Phase 2: The Semantic Router & Inference Gateway (The "Motor Cortex")

The Workflow

Object-Oriented Router Design (Actionable Task)

Phase 3: Autonomous Agent Orchestration via Model Context Protocol (MCP)

The Workflow

Actionable Engineering Task

Systemic Advantages & Go-To-Market Positioning

FilesExpand file tree

NOTES.md

Latest commit

History

NOTES.md

File metadata and controls

Executive Summary

Phase 1: The Ingestion & Compilation Daemon (Continuous Integration)

The Workflow

Actionable Engineering Task

Phase 2: The Semantic Router & Inference Gateway (The "Motor Cortex")

The Workflow

Object-Oriented Router Design (Actionable Task)

Phase 3: Autonomous Agent Orchestration via Model Context Protocol (MCP)

The Workflow

Actionable Engineering Task

Systemic Advantages & Go-To-Market Positioning