alexhambley/RO-Crate-RAG-Validator

RO-Crate RAG Validator

A semantic validation tool for RO-Crate metadata. It checks each entity in an ro-crate-metadata.json against one or more RO-Crate profile specifications, using retrieval-augmented generation (RAG) to pull the relevant rules per entity and an LLM to judge both compliance and usefulness for the intended audience.

Features

  1. Bring your own model: any OpenAI-compatible provider (OpenAI, DeepSeek, Ollama, Together, …) via providers.yaml. Embeddings are configured independently of the chat model.
  2. Actionable output: each finding includes typed fix operations (add / replace / remove) with suggested values.
  3. Context-aware: determines whether metadata is useful for the intended audience.
  4. Two interfaces: a Streamlit UI and a CLI entry point.
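
As a rough illustration of the typed fix operations in item 2, the sketch below applies add / replace / remove operations to a copy of an entity. The field names `op`, `property`, and `value`, the `apply_fix` helper, and the sample values are assumptions made for this example; the tool's actual output schema may differ.

```python
import copy

def apply_fix(entity, fix):
    """Apply one fix operation to a copy of a JSON-LD entity (hypothetical schema)."""
    patched = copy.deepcopy(entity)
    op, prop = fix["op"], fix["property"]
    if op == "add":
        if prop in patched:
            raise ValueError(f"cannot add {prop!r}: already present")
        patched[prop] = fix["value"]
    elif op == "replace":
        if prop not in patched:
            raise KeyError(f"cannot replace {prop!r}: not present")
        patched[prop] = fix["value"]
    elif op == "remove":
        patched.pop(prop, None)  # removing an absent property is a no-op
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return patched

# Illustrative entity and fixes, not output from the tool.
entity = {"@id": "./", "@type": "Dataset", "name": "data", "version": ""}
fixes = [
    {"op": "add", "property": "license",
     "value": {"@id": "https://spdx.org/licenses/CC-BY-4.0"}},
    {"op": "replace", "property": "name",
     "value": "Soil moisture measurements, 2023"},
    {"op": "remove", "property": "version"},
]

patched = entity
for fix in fixes:
    patched = apply_fix(patched, fix)
```

Working on a copy keeps the uploaded crate untouched, so suggested fixes can be reviewed before anything is written back.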

Architecture

flowchart LR
    subgraph UI
        APP[app.py<br/>Streamlit]
    end
    subgraph Core
        CFG[config.py<br/>providers.yaml]
        LLM[llm.py<br/>factories]
        REG[profile_registry.py<br/>RAG]
        VAL[validation.py<br/>CrateValidator]
    end
    APP --> CFG
    APP --> LLM
    APP --> REG
    APP --> VAL
    LLM --> REG
    LLM --> VAL
    VAL --> REG

Data flow

sequenceDiagram
    actor User
    participant App as app.py
    participant Reg as ProfileRegistry
    participant Emb as Embeddings provider
    participant Val as CrateValidator
    participant Chat as Chat provider

    User->>App: upload crate + profiles, pick provider
    App->>Reg: ingest profile markdown
    Reg->>Emb: embed chunks
    Emb-->>Reg: vectors (Chroma)
    loop per entity
        App->>Val: validate_entity(entity, audience)
        Val->>Reg: get_rules_for_entity(@type)
        Reg-->>Val: relevant rules
        Val->>Chat: prompt(rules + entity + audience)
        Chat-->>Val: structured ValidationResult
    end
    Val-->>App: results
    App-->>User: issues, fixes, quality, JSON report
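
The per-entity loop in the diagram can be sketched with in-memory stand-ins. Only the names `validate_entity` and `get_rules_for_entity` come from the diagram; `StubRegistry`, `stub_chat`, the extra function parameters, the result fields, and the placeholder rule text are assumptions for illustration, and the real classes use Chroma and an LLM.

```python
import json

class StubRegistry:
    """Stand-in for ProfileRegistry: maps an @type to rule snippets, no vector store."""
    def __init__(self, rules_by_type):
        self.rules_by_type = rules_by_type

    def get_rules_for_entity(self, entity_type):
        return self.rules_by_type.get(entity_type, [])

def stub_chat(prompt):
    """Stand-in for the chat provider: returns a fixed structured verdict."""
    payload = json.loads(prompt)
    return {"compliant": True, "issues": [], "rules_seen": len(payload["rules"])}

def validate_entity(entity, audience, registry, chat):
    # @type may be a single string or a list in JSON-LD; normalise to a list.
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    rules = [r for t in types for r in registry.get_rules_for_entity(t)]
    prompt = json.dumps({"rules": rules, "entity": entity, "audience": audience})
    return {"id": entity.get("@id"), **chat(prompt)}

crate = {"@graph": [
    {"@id": "./", "@type": "Dataset"},
    {"@id": "#alice", "@type": "Person"},
]}
registry = StubRegistry({"Dataset": ["<placeholder rule text for Dataset>"]})

results = [
    validate_entity(entity, "data curators", registry, stub_chat)
    for entity in crate["@graph"]
]
```

Retrieving rules per entity keeps each prompt limited to the rules relevant to that entity's type, instead of sending the whole profile every time.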

Setup

  1. Create and activate a virtual environment:

    python -m venv .venv && source .venv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt
  3. Provide API keys in a .env file in the project root. Each provider reads the key named by its key_env in providers.yaml:

    OPENAI_API_KEY=sk-...
    DEEPSEEK_API_KEY=sk-...   # only if using DeepSeek
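
A minimal sketch of the `key_env` lookup described in step 3, assuming the `.env` values have already been loaded into the process environment. The `resolve_api_key` helper is hypothetical and not taken from the repository.

```python
import os

def resolve_api_key(provider, environ=os.environ):
    """Return the API key named by the provider's key_env, or fail with a clear message."""
    env_name = provider["key_env"]
    key = environ.get(env_name)
    if not key:
        raise RuntimeError(f"{env_name} is not set; add it to .env or the environment")
    return key

deepseek = {"base_url": "https://api.deepseek.com", "key_env": "DEEPSEEK_API_KEY"}

# A plain dict stands in for the environment so the example runs without real keys.
key = resolve_api_key(deepseek, environ={"DEEPSEEK_API_KEY": "sk-example"})
```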

Configuring providers

Providers are defined in providers.yaml:

deepseek:
  base_url: https://api.deepseek.com
  key_env: DEEPSEEK_API_KEY
  models: [deepseek-chat]
  embedding_models: []        # (DeepSeek has no embeddings API)

Because DeepSeek has no embeddings API, pair it with another embeddings provider in the UI, e.g. DeepSeek chat + OpenAI embeddings. To add a new OpenAI-compatible provider, add an entry with its base_url, the env var holding its key, and its model list.
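
The chat and embeddings pairing can be sketched as a small check over the parsed configuration. The dictionary below mirrors the `deepseek` entry above; the `openai` entry, its model names, and the `pick_models` helper are assumptions for illustration and may not match the repository's providers.yaml.

```python
# Parsed form of providers.yaml; the openai entry is an assumed example.
providers = {
    "deepseek": {
        "base_url": "https://api.deepseek.com",
        "key_env": "DEEPSEEK_API_KEY",
        "models": ["deepseek-chat"],
        "embedding_models": [],
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "key_env": "OPENAI_API_KEY",
        "models": ["gpt-4o-mini"],
        "embedding_models": ["text-embedding-3-small"],
    },
}

def pick_models(providers, chat_provider, embeddings_provider):
    """Return (chat_model, embedding_model), rejecting providers that lack either kind."""
    chat_models = providers[chat_provider].get("models") or []
    embed_models = providers[embeddings_provider].get("embedding_models") or []
    if not chat_models:
        raise ValueError(f"{chat_provider} lists no chat models")
    if not embed_models:
        raise ValueError(f"{embeddings_provider} lists no embedding models")
    return chat_models[0], embed_models[0]

# DeepSeek for chat, OpenAI for embeddings.
chat_model, embedding_model = pick_models(providers, "deepseek", "openai")
```

Selecting DeepSeek for both roles raises a `ValueError` in this sketch, because its `embedding_models` list is empty.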

Running

Streamlit UI

streamlit run app.py

Upload ro-crate-metadata.json and one or more profile .md files, select the chat and embeddings providers, optionally describe the intended audience, and select Validate.

CLI

python validation.py path/to/ro-crate-metadata.json path/to/profile.md

Docker

docker build -t rocrate-validator .
docker run --env-file .env -p 8501:8501 rocrate-validator

Then open http://localhost:8501.

Tests

pip install -r requirements-dev.txt
pytest

Tests mock the LLM and embeddings.
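
A test in this style might look like the sketch below, which replaces the chat model with a `unittest.mock.Mock`. The `validate_with` function and the response shape are hypothetical stand-ins; the repository's own tests may be structured differently.

```python
from unittest.mock import Mock

def validate_with(chat, entity):
    """Hypothetical function under test: sends the entity to the chat model."""
    return chat.invoke({"entity": entity})

def test_validate_uses_mocked_chat():
    chat = Mock()
    # The mock returns a canned verdict, so no network call is made.
    chat.invoke.return_value = {"compliant": False, "issues": ["missing name"]}

    result = validate_with(chat, {"@id": "./", "@type": "Dataset"})

    chat.invoke.assert_called_once()
    assert result["issues"] == ["missing name"]

# pytest would discover the test by name; it is called directly so the sketch runs as a script.
test_validate_uses_mocked_chat()
```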
