diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000..92a01d0 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,49 @@ +name: docs + +on: + push: + branches: [ "main" ] + workflow_dispatch: + +permissions: + contents: read + pages: write + id-token: write + +concurrency: + group: "pages" + cancel-in-progress: true + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - name: Install docs dependencies + run: | + python -m pip install --upgrade pip + pip install -e ".[docs]" + + - name: Build site + run: mkdocs build --strict + + - name: Upload artifact + uses: actions/upload-pages-artifact@v3 + with: + path: site + + deploy: + environment: + name: github-pages + url: ${{ steps.deployment.outputs.page_url }} + runs-on: ubuntu-latest + needs: build + steps: + - name: Deploy to GitHub Pages + id: deployment + uses: actions/deploy-pages@v4 diff --git a/.gitignore b/.gitignore index 9804c77..c985141 100644 --- a/.gitignore +++ b/.gitignore @@ -18,6 +18,9 @@ poetry.lock .idea/ target/ +# agents +.cursor/ + # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] diff --git a/README.md b/README.md index 54fcba9..3c463c3 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,21 @@ # KGpipe: A Framework for Knowledge Graph Integration Pipelines -- 📊 [Benchmark Datasets](https://doi.org/10.5281/zenodo.17246357) +## Related benchmarks & datasets + +- **KGI-Bench**: benchmark specification + tooling for KG integration evaluation. See `https://github.com/ScaDS/KGI-Bench`. +- **KGI-Bench (Movies)**: Movie-domain benchmark dataset release (Zenodo). See `https://doi.org/10.5281/zenodo.17246357`. KGpipe is an open-source framework for defining, executing, and evaluating knowledge graph (KG) integration pipelines. 
It enables the reuse and composition of existing tools (e.g., OpenIE, PARIS, JedAI) and Large Language Models (LLMs) into modular pipelines that integrate heterogeneous data sources into a unified KG. +![KGpipe workflow](docs/workflow.png) + +**Who is this for?** +- You have multiple heterogeneous sources (RDF/JSON/text) and want a **reproducible, modular pipeline**. +- You want to **reuse existing tooling** (Python libs, Dockerized CLIs, remote APIs/LLMs) without rewriting everything. +- You want to **evaluate** generated KGs with a growing set of metrics (`kgpipe_eval`). + **Key features:** - Modular and extensible pipeline specification. - Support for multiple execution backends (Python, Docker, HTTP services). @@ -13,6 +23,28 @@ It enables the reuse and composition of existing tools (e.g., OpenIE, PARIS, Jed - Novel benchmark for systematic evaluation of pipelines across RDF, JSON, and text sources. - Metrics covering structural, semantic, and reference-based evaluation. +## Quickstart (5 minutes) + +Install from source (editable): + +```bash +pip install -e . +kgpipe --help +``` + +Bootstrap a minimal example project and discover its tasks: + +```bash +cd experiments/examples +./init.sh + +cd "" +pip install -e . + +kgpipe discover --package --show-results +kgpipe list --type tasks +``` + ## Architecture Each pipeline is a sequence of tasks with well-defined input/output contracts. 
@@ -49,7 +81,42 @@ KGpipe provides Single-Source Pipelines (SSPs) and Multi-Source Pipelines (MSPs) ## Usage -For documentation see the [docs](docs/reproduce.md) +Documentation lives in `docs/`: +- **Start here**: `docs/index.md` and `docs/quickstart.md` +- **Adopting KGpipe / wrapping existing tools**: `docs/adoption.md` +- **Evaluation (new API)**: `docs/evaluation.md` (uses `kgpipe_eval`) +- **MovieKG reproduction**: `docs/reproduce.md` + +### Documentation site (GitHub Pages) + +This repo is set up to build docs with **MkDocs + Material**: +- config: `mkdocs.yml` +- local build instructions: `docs/README.md` +- deploy workflow: `.github/workflows/docs.yml` (GitHub Pages via Actions) + +## Installation notes (CPU vs CUDA) + +Some optional ML dependencies (e.g. `sentence_transformers`) pull in PyTorch (`torch`). Depending on which PyTorch wheel gets selected, you may see large downloads like `nvidia-*` and `triton`. + +KGpipe keeps the ML stack out of the default install; install it explicitly when needed. For `uv`, PyTorch is pinned to the official PyTorch wheel indexes to avoid accidentally pulling CUDA wheels from PyPI. + +### Base install (fast, no torch) + +```bash +uv pip install . +``` + +### ML install with CPU-only PyTorch (no `nvidia-*`) + +```bash +uv pip install ".[ml,cpu]" +``` + +### ML install with CUDA-enabled PyTorch (will download `nvidia-*`) + +```bash +uv pip install ".[ml,cuda]" +``` ## Experiments -- **[moviekg](experiments/moviekg/README.md)** evalaution of a pipelines, building a Movie KG from three sources (rdf,json,text). +- **[moviekg](experiments/moviekg/README.md)** evaluation of pipelines, building a Movie KG from three sources (rdf, json, text). 
diff --git a/docs/adoption.md b/docs/adoption.md new file mode 100644 index 0000000..c1229a5 --- /dev/null +++ b/docs/adoption.md @@ -0,0 +1,85 @@ +# Adopting KGpipe (integrating existing pipelines/tools) + +This page explains how to **adopt KGpipe** when you already have: +- an existing KG pipeline (e.g., DBpedia-style multi-step workflows), and/or +- existing implementations you want to reuse (Python code, Dockerized tools, external APIs). + +The goal is to map “what you already have” onto KGpipe’s building blocks: +- **Tasks**: reusable steps with typed inputs/outputs (`input_spec` / `output_spec`) +- **Pipelines**: ordered task graphs (`KgPipe`) that transform `Data` from seed → result +- **Configuration**: parameters passed into tasks (often via env/config profiles) + +## 1) Convert an existing pipeline into a KGpipe pipeline + +When you have a pipeline described elsewhere (scripts, Airflow, Makefile, DBpedia extraction steps, etc.), do this: + +1. **List pipeline steps** (one row per step): name, inputs, outputs, and “how it runs” (Python/Docker/API). +2. **Define formats** for each boundary artifact (RDF formats, CSV, JSON, text). If needed, extend formats. +3. **Wrap each step as a KGpipe task** (see sections below). +4. **Compose tasks into a `KgPipe`** and verify the input/output formats connect. + +Practical tip: start by wrapping a *single* step and run it via `kgpipe task ...`, then grow into a pipeline. + +## 2) Wrap existing tasks (three common patterns) + +### A) Wrap a Dockerized CLI tool + +Use this when the tool is a command-line program and can run inside a container. 
+ +Reference example: +- `src/kgpipe_tasks/entity_resolution/matcher/paris_rdf_matcher.py` + +What to document for each wrapper: +- Docker image name + how to build/pull it +- command template (mapping KGpipe input/output keys to CLI args) +- volume mounts / working dir assumptions +- required environment variables + +### B) Wrap existing Python code + +Use this when you have Python functions/classes you want to call directly. + +Reference example: +- `experiments/param-opti/src/param_opti/tasks/base_linker.py` + +What to document for each wrapper: +- the function/class you call +- how you read from `inputs[...]` and write to `outputs[...]` +- how you map configuration parameters into function args (or config objects) + +### C) Wrap an external API (HTTP service) + +Use this when the implementation is “some service endpoint” (DBpedia Spotlight, LLM providers, etc.). + +Reference examples: +- `experiments/param-opti/src/param_opti/tasks/spotlight_lib.py` +- `experiments/param-opti/src/param_opti/tasks/spotlight.py` + +What to document for each wrapper: +- endpoint URL + auth +- request/response format +- retry/timeouts and caching +- how you handle rate limits and partial failures + +## 3) Discovery (making your tasks available) + +Once tasks exist in a Python package, KGpipe can discover them (they register when imported). + +```bash +kgpipe discover --package --show-results +kgpipe list --type tasks +``` + +## 4) Recommended structure for “adopted” pipelines + +A maintainable layout usually separates: +- `tasks/`: wrappers (Python/Docker/API) +- `pipelines/`: composition (KgPipe builders or pipeline configs) +- `configs/`: pipeline/task configuration profiles +- `docker/`: Dockerfiles and wrapper scripts (if needed) + +## Status + +This page is the intended replacement for `migration.md` (which was a misleading name). It will be expanded with +copy-pastable code snippets for each wrapper type using the referenced files above as canonical examples. 
+ diff --git a/docs/create-docs.md b/docs/create-docs.md new file mode 100644 index 0000000..b066c6a --- /dev/null +++ b/docs/create-docs.md @@ -0,0 +1,45 @@ +# Docs (MkDocs) + +This repository uses **MkDocs + Material** to build the documentation site from the Markdown files in `docs/`. + +## Local preview + +### Option A: pip + +```bash +python -m pip install -e ".[docs]" +mkdocs serve +``` + +Then open the URL shown in the terminal (usually `http://127.0.0.1:8000/`). + +### Option B: uv (recommended if you use uv) + +```bash +uv pip install -e ".[docs]" +mkdocs serve +``` + +## Build + +```bash +mkdocs build --strict +``` + +The static site is written to `site/`. + +## Navigation / sidebar + +Edit `mkdocs.yml` (`nav:` section) to control: +- sidebar structure +- ordering +- page titles + +## Deployment (GitHub Pages) + +Deployment is handled by the GitHub Actions workflow: +- `.github/workflows/docs.yml` + +In your GitHub repo settings, set: +- **Settings → Pages → Source**: **GitHub Actions** + diff --git a/docs/evaluation.md b/docs/evaluation.md index fc27bda..613eb33 100644 --- a/docs/evaluation.md +++ b/docs/evaluation.md @@ -1,181 +1,114 @@ -# KG Evaluation +# KG Evaluation (new API) -The framework provides several approaches to evaluate the quality of a generated knowledge graph. Evaluation is organized into different aspects, each focusing on specific quality dimensions. +KGpipe currently contains **two** evaluation implementations: -## Evaluation Aspects +- **New** (recommended): `kgpipe_eval` (package: `src/kgpipe_eval/`) +- **Old** (deprecated soon): `kgpipe.evaluation` (package: `src/kgpipe/evaluation/`) -The framework supports evaluation across multiple aspects: +This page documents the **new** `kgpipe_eval` API. 
-- **Statistical**: Basic metrics like triple count, entity count, graph density, and other structural properties -- **Semantic**: Validation of ontology consistency, type errors, relation direction, and semantic correctness -- **Reference**: Comparison against curated gold-standard knowledge graphs using precision, recall, and F1 scores -- **Efficiency**: Resource consumption metrics including runtime, memory usage, and cost +## Mental model -## Using the Evaluator +In `kgpipe_eval`, evaluation is composed from: +- **KG loader / adapter**: turns a `KgLike` (e.g. a `kgpipe.common.model.kg.KG`) into an in-memory `TripleGraph` +- **Metric instances**: objects implementing `Metric.compute(...) -> MetricResult` +- **Metric configs** (optional): typed config objects passed to metrics that require parameters +- **Evaluator**: runs multiple metrics against a loaded graph -The main entry point for evaluation is the `Evaluator` class. You configure which aspects to evaluate and then run evaluation on a knowledge graph: +Key types: +- `kgpipe_eval.api.Metric`: metric interface (`key`, `description`, `compute`) +- `kgpipe_eval.api.MetricResult`: dataclass with `measurements` + optional `summary` +- `kgpipe_eval.evaluator.Evaluator`: runs a list of metrics with an optional `confs` dict + +## Minimal example (statistics) ```python -from kgpipe.evaluation import Evaluator, EvaluationConfig, EvaluationAspect -from kgpipe.common.models import KG, DataFormat from pathlib import Path -# Create evaluation configuration -config = EvaluationConfig( - aspects=[EvaluationAspect.STATISTICAL, EvaluationAspect.SEMANTIC, EvaluationAspect.REFERENCE], - metrics=None # None means use all available metrics for each aspect -) +from kgpipe.common.model.data import DataFormat +from kgpipe.common.model.kg import KG -# Create evaluator -evaluator = Evaluator(config) +from kgpipe_eval.evaluator import Evaluator +from kgpipe_eval.metrics.statistics import CountMetric +from kgpipe_eval.utils.kg_utils 
import KgManager -# Load the knowledge graph to evaluate kg = KG( id="my_kg", - name="My Knowledge Graph", - path=Path("result.nt"), - format=DataFormat.RDF_NTRIPLES + name="My KG", + path=Path("my_kg.nt"), + format=DataFormat.RDF_NTRIPLES, ) -# For reference-based evaluation, provide reference data -references = { - "gold_standard": Data(path=Path("gold_standard.nt"), format=DataFormat.RDF_NTRIPLES) -} - -# Run evaluation -report = evaluator.evaluate(kg, references=references) +tg = KgManager.load_kg(kg) +results = Evaluator().run(tg, metrics=[CountMetric()]) -# Access results -print(f"Overall score: {report.overall_score}") -for aspect_result in report.aspect_results: - print(f"{aspect_result.aspect.value}: {len(aspect_result.metrics)} metrics") - for metric in aspect_result.metrics: - print(f" {metric.name}: {metric.value} (normalized: {metric.normalized_score})") +for r in results: + print(r.metric.key, r.summary) + for m in r.measurements: + print(" ", m.name, m.value) ``` -## Evaluation via CLI - -You can also evaluate knowledge graphs using the command-line interface: - -```bash -kgpipe eval target.nt --ground-truth gold.nt --aspects statistical semantic reference --output results.json -``` - -The CLI supports: -- `--aspects`: Specify which aspects to evaluate (statistical, semantic, reference, efficiency) -- `--metrics`: Filter to specific metrics by name -- `--ground-truth`: Path to reference knowledge graph for reference-based evaluation -- `--output`: Save evaluation results to a JSON file - -## Statistical Evaluation - -Statistical evaluation provides basic metrics about the knowledge graph structure: - -- Triple count -- Entity count -- Relation count -- Graph density -- Average degree -- Connected components - -These metrics help understand the scale and structure of the generated knowledge graph. 
- -## Semantic Evaluation +## Metrics that need configuration -Semantic evaluation validates the knowledge graph against its ontology: +Some metrics require a config object. The `Evaluator` detects this by introspecting the metric’s +`compute(...)` signature: +- `compute(self, kg)` → no config needed +- `compute(self, kg, config)` → config required and must be provided -- Disjoint domain violations -- Incorrect relation direction -- Incorrect relation cardinality -- Incorrect relation domain/range -- Incorrect datatypes -- Ontology class coverage -- Ontology relation coverage -- Namespace coverage +You pass configs via a dict keyed by the metric key/class name. -These metrics ensure the knowledge graph conforms to its schema and maintains semantic consistency. +Example (triple alignment + duplicates): -## Reference-based Evaluation - -Reference-based evaluation compares the generated knowledge graph against a gold standard: - -- Entity matching (precision, recall, F1) -- Relation matching (precision, recall, F1) -- Triple alignment -- Source typed entity coverage -- Reference class coverage - -This type of evaluation requires a curated reference knowledge graph that serves as ground truth. - -## Evaluation Reports - -Evaluation results are returned as `EvaluationReport` objects that contain: +```python +from kgpipe_eval.evaluator import Evaluator +from kgpipe_eval.utils.kg_utils import KgManager + +from kgpipe_eval.metrics.duplicates import DuplicateMetric, DuplicateConfig +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentMetric, TripleAlignmentConfig +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig + +tg = KgManager.load_kg("path/to/result_eval.nt") # KgLike: path, KG object, ... 
+ +metrics = [DuplicateMetric(), TripleAlignmentMetric()] +confs = { + "DuplicateMetric": DuplicateConfig( + entity_alignment_config=EntityAlignmentConfig( + method="label_embedding", + verified_entities_path="path/to/verified_entities.tsv", + verified_entities_delimiter="\\t", + entity_sim_threshold=0.95, + ) + ), + "TripleAlignmentMetric": TripleAlignmentConfig( + reference_kg="path/to/reference.nt", + entity_alignment_config=EntityAlignmentConfig( + method="label_embedding", + reference_kg="path/to/reference.nt", + entity_sim_threshold=0.95, + ), + value_sim_threshold=0.5, + cache_literal_embeddings=True, + cache_ref_literal_embeddings=True, + ), +} -- The evaluated knowledge graph -- Reference data used (if any) -- Aspect results for each evaluated aspect -- Individual metric results with values and normalized scores -- Overall score (average of normalized scores across all metrics) +results = Evaluator().run(tg, metrics, confs) +``` -Reports can be serialized to JSON for storage and later analysis: +## Canonical reference example (MovieKG) -```python -report.to_json("evaluation_results.json") -``` +For a realistic end-to-end usage example (loading pipeline stage outputs, wiring configs, running multiple metrics), +see: -# Hierarchy +- `experiments/moviekg/src/moviekg/evaluation/test_eval_refactor.py` -``` -QualityEvaluationOntology - -QualityDimension - ├─ Accuracy - ├─ Coverage - ├─ Consistency - └─ Uniqueness - -Metric - ├─ BaseMetric - │ ├─ Precision - │ └─ Recall - ├─ CompositeMetric - │ └─ F1Score - └─ AggregatedMetric - ├─ MacroAverage - └─ MicroAverage - -QualityIssue - ├─ DuplicateEntities (false positives for EM) - ├─ DisjointDomainIssue - └─ MissingEntities (false positives for OM, or true positives for EM) - -EvaluationArtifact - ├─ ReferenceDataset - └─ QualityRulePattern -``` +That file shows how to: +- build per-metric configs (duplicates/entity alignment/triple alignment) +- load the KG from a 
pipeline output directory +- serialize `MetricResult` to JSON (because it contains metric objects) -Example Instance: Entity Matching +## CLI note -``` -ReferenceDataset - │ - ▼ -Precision / Recall - │ - ▼ -F1Score - │ - ▼ -Evaluation of Matching Quality - │ -False Negatives - │ - ▼ -DuplicateEntities - │ - ▼ -RedundancyIssue - │ - ▼ -Violates Uniqueness Dimension -``` \ No newline at end of file +There is a “new eval” CLI command path intended to run these metrics (see `kgpipe_eval.api` docstring mentioning +`kgpipe eval-new`). If you want the docs to include the CLI, we should first confirm the exact CLI flags and expected +inputs in `src/kgpipe/cli/eval_new.py` and align this page with that implementation. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 1a39c1b..4a8a452 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,46 +4,45 @@ KGpipe is a framework to define pipelines for data integration into knowledge gr The framework is organized into three main subpackages: `kgpipe` contains the core framework functionality including CLI, common utilities, execution backends, and evaluation components. `kgpipe_tasks` provides task implementations for cleaning, construction, entity resolution, schema alignment, and text processing. `kgpipe_llm` includes LLM-based task implementations and utilities. -## Meta KG +**Current version**: 0.7.0 +**Python**: >= 3.12 -[link](metakg.md) +![KGpipe workflow](workflow.png) -KGpipe uses an internally maintained Meta KG (PipeKG) to maintain tasks, tool implementations, their components, pipelines, and metrics. This knowledge base enables automatic pipeline generation and tracking of execution results. +## Quickstart -## Task Specification +Start here: [Quickstart guide](quickstart.md) -[link](tasks.md) +Minimal “happy path” (install + discover + inspect what’s available): -The framework enables the description and integration of integration tasks. 
You can describe tasks with Python, interface existing implementations with Python, Docker, or remote API requests. Tasks are discovered and registered through the framework's discovery mechanism. +```bash +pip install -e . +kgpipe discover --all --show-results +kgpipe list --type tasks +kgpipe list --type metrics +``` -## Pipeline Generation and Execution +Create a new experiment project (recommended): -[link](pipelines.md) +```bash +cd experiments/examples +./init.sh +``` -KGpipe allows you to define pipelines manually or using an automatic search algorithm that operates on the PipeKG knowledge base and a set of given constraints. You can swap subpipelines or single tasks with other components to experiment with different approaches. +## How to use KGpipe (docs map) -## Configuration - -[link](configuration.md) - -The framework supports configuration at multiple levels. The main configuration is specified in `kgpipe.yml`, and individual tasks can define their own configuration parameters that will be passed by the framework when executing pipelines. - -## Evaluation - -[link](evaluation.md) - -The framework provides several approaches to evaluate the quality of a generated knowledge graph, including accuracy, coverage, consistency, statistics, and efficiency measurements. Evaluation metrics are tracked in the Meta KG alongside pipeline results. - -Additional evaluation metrics are documented in the [metrics](metrics/) directory, such as [entity coverage](metrics/entity_coverage.md). 
+- Define tasks: [Task specification](tasks.md) +- Build and run pipelines: [Pipelines](pipelines.md) +- Configure runs and task parameters: [Configuration](configuration.md) and [Parameters](parameters.md) +- Evaluate generated KGs: [Evaluation](evaluation.md) and [Metrics](metrics/metrics.md) +- Understand the internal “PipeKG”: [Meta KG](metakg.md) ## Other Links - [Reproducing the movie kg experiments for 15 pipelines](reproduce.md) (rdf, json, text) +- [Adopting KGpipe (integrating existing pipelines/tools)](adoption.md) +- [UI / viewer](view.md) -## Docu Backlog +## Docs backlog -- Explain different execution modes - - File Batches - - Streaming -- Explain advanced pipelines -- Ontology creation... \ No newline at end of file +Open items live in `TODO.md` (High/Medium/Low priority). Keep the landing page focused on user-facing docs. \ No newline at end of file diff --git a/docs/metrics/entity_coverage.md b/docs/metrics/entity_coverage.md index 3e76cb3..1aa3ea5 100644 --- a/docs/metrics/entity_coverage.md +++ b/docs/metrics/entity_coverage.md @@ -1,21 +1,73 @@ +# Entity Coverage Metric (OLD) +The Entity Coverage metric evaluates how well source entities are integrated into the target knowledge graph. It measures the overlap between expected source entities and the entities actually present in the generated knowledge graph. +## Source Entity Integration Score -# Source Entitiy Integration Score +The metric compares a set of expected source entities (provided as a reference file) against the entities found in the knowledge graph. It calculates coverage based on entity URIs and labels. 
-# Entity Integration Score +## Input Format +The expected entities are provided in a CSV or JSON file with the following structure: + +**CSV Format:** ``` URI, LABEL, TYPE +http://example.org/entity1, "Entity Label 1", EntityType +http://example.org/entity2, "Entity Label 2", EntityType +``` + +**JSON Format:** +```json +{ + "http://example.org/entity1": { + "entity_label": "Entity Label 1", + "entity_type": "EntityType" + }, + "http://example.org/entity2": { + "entity_label": "Entity Label 2", + "entity_type": "EntityType" + } +} +``` + +## Calculation + +The metric performs the following steps: + +1. **Load expected entities**: Reads the entity dictionary from the provided file path +2. **Extract entity identifiers**: Collects URIs and labels from the expected entities +3. **Find entities in KG**: Searches the knowledge graph for entities matching by URI or label (using `rdfs:label`) +4. **Calculate overlap**: Counts how many expected entities are found in the KG + +The coverage score is calculated as: + ``` +coverage = overlapping_entities_count / expected_entities_count +``` + +Where: +- `overlapping_entities_count`: Number of expected entities found in the KG +- `expected_entities_count`: Total number of entities in the reference file -Set of entity type pairs -Make overlap on entity_type pairs +## Variants -intesection= -precission -recall= +The framework provides several variants of entity coverage metrics: +- **SourceEntityCoverageMetric**: Strict matching by URI and label +- **SourceEntityCoverageMetricSoft**: Fuzzy matching using label embeddings (threshold 0.95) +- **SourceTypedEntityCoverageMetric**: Matching based on entity type pairs, calculating precision and recall on entity-type combinations +## Usage +To use this metric in evaluation, provide the path to the verified source entities file in the reference configuration: + +```python +from kgpipe.evaluation.aspects.reference import ReferenceConfig + +config = ReferenceConfig( + 
VERIFIED_SOURCE_ENTITIES="path/to/entities.csv" +) +``` +The metric will automatically be included when evaluating with the `REFERENCE` aspect. diff --git a/experiments/moviekg/src/moviekg/evaluation/__init__.py b/docs/metrics/metrics.md similarity index 100% rename from experiments/moviekg/src/moviekg/evaluation/__init__.py rename to docs/metrics/metrics.md diff --git a/experiments/moviekg/src/moviekg/paper/__init__.py b/docs/metrics/reference_entity_alignment.md similarity index 100% rename from experiments/moviekg/src/moviekg/paper/__init__.py rename to docs/metrics/reference_entity_alignment.md diff --git a/experiments/moviekg/src/moviekg/paper/helpers/__init__.py b/docs/metrics/reference_triple_alignment.md similarity index 100% rename from experiments/moviekg/src/moviekg/paper/helpers/__init__.py rename to docs/metrics/reference_triple_alignment.md diff --git a/docs/metrics/stats_counts.md b/docs/metrics/stats_counts.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/migration.md b/docs/migration.md new file mode 100644 index 0000000..1ca2964 --- /dev/null +++ b/docs/migration.md @@ -0,0 +1,7 @@ +# Migration (renamed) + +This page was renamed to better reflect its intent. + +Use: +- [`adoption.md`](adoption.md): **Adopting KGpipe (integrating existing pipelines/tools)** + diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 0000000..a2b08f4 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,173 @@ +# KGpipe Quickstart + +This quickstart shows the **current** workflow for defining and running: +- tasks (Python functions registered in the `Registry`) +- pipelines (a `KgPipe` connecting tasks via input/output `Data`) +- metrics/evaluations (evaluators + metrics run on a `KG`) + +## See also +- `experiments/examples/`: a minimal example project using KGpipe +- `docs/reproduce.md`: running the (deprecated but working) reproduction experiments + +## Install + +From the repo root: + +```bash +pip install -e . 
+``` + +If you need the optional ML stack (transformers / sentence-transformers), install extras: + +```bash +pip install -e ".[ml]" +``` + +## Create a new experiment (recommended starting point) + +The easiest way to get a working project layout is to copy the template in `experiments/examples/`. + +```bash +cd experiments/examples +./init.sh +``` + +The script creates a new directory containing a Python package with example tasks/pipelines. Then: + +```bash +cd "" +pip install -e . +``` + +## Define tasks + +Tasks are normal Python callables registered via `@Registry.task(...)`. See +`experiments/examples/src/kgpipe_examples/task_examples.py` for canonical examples. + +Key concepts: +- **`input_spec` / `output_spec`**: the expected formats for inputs/outputs +- **`TaskInput` / `TaskOutput`**: dict-like objects mapping names to `Data` +- **`trace_task_run`**: wraps the function to produce a run report + +Minimal pattern (simplified from the examples): + +```python +from kgpipe.common import TaskInput, TaskOutput, trace_task_run +from kgpipe.common.registry import Registry + +@trace_task_run +@Registry.task( + input_spec={"input": "some_format"}, + output_spec={"output": "some_other_format"}, + description="Example task", +) +def my_task(inputs: TaskInput, outputs: TaskOutput): + outputs["output"].path.touch() +``` + +## Define and run a pipeline (Python API) + +Pipelines connect tasks by passing `Data` (path + format) between them. A minimal example exists in +`experiments/examples/src/kgpipe_examples/pipe_examples.py`. + +The core pattern: + +```python +from kgpipe.common import KgPipe, Data + +# tasks = [task_a, task_b, ...] # registered task callables (from your package) +# seed = Data(path=..., format=...) +# result = Data(path=..., format=...) 
+pipe = KgPipe(tasks=tasks, seed=seed, data_dir="/tmp/my_run_dir") +pipe.build(source=seed, result=result) +pipe.run() +``` + +## Discover components and inspect what’s available (CLI) + +The CLI entrypoint is `kgpipe` (see `pyproject.toml`). + +To register tasks/pipelines/metrics from your local package, import it via discovery: + +```bash +# From inside your experiment venv / environment +kgpipe discover --package --show-results +``` + +You can also discover from a local module path (directory or file): + +```bash +kgpipe discover --module-path ./src/ --show-results +``` + +To list what KGpipe currently knows about (after discovery): + +```bash +kgpipe list --type tasks +kgpipe list --type metrics +``` + +To show details for a specific task: + +```bash +kgpipe show --type task +``` + +To print YAML templates for evaluation configs: + +```bash +kgpipe show metric-config-templates +``` + +## Run a single task (CLI) + +KGpipe can execute a registered task directly. The `--input/--output` syntax is: + +\[ +\texttt{|@} +\] + +(`@` is optional.) + +Example: + +```bash +kgpipe task \ + --input "/tmp/in.txt|txt@input" \ + --output "/tmp/out.txt|txt@output" +``` + +Tip: if you get “Task not found”, run `kgpipe discover ...` first. + +## Run a minimal evaluation / metrics (Python API, new `kgpipe_eval`) + +KG evaluation is being migrated to the **new** `kgpipe_eval` package (recommended). 
A realistic integration-style +example exists in: + +- `experiments/moviekg/src/moviekg/evaluation/test_eval_refactor.py` + +Minimal example (basic statistics): + +```python +from pathlib import Path + +from kgpipe.common.model.data import DataFormat +from kgpipe.common.model.kg import KG + +from kgpipe_eval.evaluator import Evaluator +from kgpipe_eval.metrics.statistics import CountMetric +from kgpipe_eval.utils.kg_utils import KgManager + +kg = KG( + id="my_kg", + name="My KG", + path=Path("my_kg.nt"), + format=DataFormat.RDF_NTRIPLES, +) + +tg = KgManager.load_kg(kg) +results = Evaluator().run(tg, metrics=[CountMetric()]) + +for r in results: + print(r.metric.key, r.summary) +``` \ No newline at end of file diff --git a/docs/reproduce.md b/docs/reproduce.md index 40875e6..1d9349b 100644 --- a/docs/reproduce.md +++ b/docs/reproduce.md @@ -1,7 +1,7 @@ -# Rep Experiments +# Rep Experiments (Deprecated but working) -Guidelines to run the [experiments](../experiments) -- see also [moviekg](../experiments/moviekg/README.md) +Guidelines to run the [experiments](https://github.com/ScaDS/KGpipe/tree/main/experiments) +- see also [moviekg](https://github.com/ScaDS/KGpipe/blob/main/experiments/moviekg/README.md) ## Overview diff --git a/docs/workflow.png b/docs/workflow.png new file mode 100644 index 0000000..41239b0 Binary files /dev/null and b/docs/workflow.png differ diff --git a/experiments/examples/src/kgpipe_examples/config.py b/experiments/examples/src/kgpipe_examples/config.py index c9cc32e..77768ae 100644 --- a/experiments/examples/src/kgpipe_examples/config.py +++ b/experiments/examples/src/kgpipe_examples/config.py @@ -1,19 +1,8 @@ -from enum import Enum -from kgpipe.common.model.data import DynamicFormat, FormatRegistry +from kgpipe.common.model.default_catalog import CustomDataFormats -class ExtendedFormats(Enum): - SPECIAL_IN = DynamicFormat(name="special_in", extension=".special_in", description="Special input format") - SPECIAL1 = 
DynamicFormat(name="special1", extension=".special1", description="Special format 1") - SPECIAL2 = DynamicFormat(name="special2", extension=".special2", description="Special format 2") - SPECIAL_KG = DynamicFormat(name="special_kg", extension=".special_kg", description="Special output format for knowledge graph") -FORMAT_REGISTRY = FormatRegistry() - -FORMAT_REGISTRY.register_format( - ExtendedFormats.SPECIAL_IN.value.name, ExtendedFormats.SPECIAL_IN.value.extension, ExtendedFormats.SPECIAL_IN.value.description) -FORMAT_REGISTRY.register_format( - ExtendedFormats.SPECIAL1.value.name, ExtendedFormats.SPECIAL1.value.extension, ExtendedFormats.SPECIAL1.value.description) -FORMAT_REGISTRY.register_format( - ExtendedFormats.SPECIAL2.value.name, ExtendedFormats.SPECIAL2.value.extension, ExtendedFormats.SPECIAL2.value.description) -FORMAT_REGISTRY.register_format( - ExtendedFormats.SPECIAL_KG.value.name, ExtendedFormats.SPECIAL_KG.value.extension, ExtendedFormats.SPECIAL_KG.value.description) \ No newline at end of file +class ExtendedFormats(CustomDataFormats): + SPECIAL_IN = "special_in" + SPECIAL1 = "special1" + SPECIAL2 = "special2" + SPECIAL_KG = "special_kg" \ No newline at end of file diff --git a/experiments/examples/src/kgpipe_examples/eval_examples.py b/experiments/examples/src/kgpipe_examples/eval_examples.py new file mode 100644 index 0000000..ae5434f --- /dev/null +++ b/experiments/examples/src/kgpipe_examples/eval_examples.py @@ -0,0 +1,88 @@ +from kgpipe.evaluation.aspects.statistical import ( + StatisticalEvaluator, + StatisticalConfig, + EntityCountMetric +) +from kgpipe.evaluation.aspects.semantic import ( + SemanticEvaluator, + SemanticConfig, + DisjointDomainMetric, + IncorrectRelationDirectionMetric, + IncorrectRelationRangeMetric, + IncorrectRelationDomainMetric, + IncorrectDatatypeMetric, + IncorrectDatatypeFormatMetric, +) +from kgpipe.evaluation.aspects.reference import ( + ReferenceEvaluator, + ReferenceConfig, + SourceTypedEntityCoverageMetric, 
+ ReferenceTripleAlignmentMetric, + ReferenceTripleAlignmentMetricSoftE, + ReferenceTripleAlignmentMetricSoftEV, +) +from kgpipe.common.model.kg import KG +from kgpipe.common.model.default_catalog import BasicDataFormats +from typing import List +from pathlib import Path + +from kgpipe.common.graph import mapper + +TEST_NTRIPLES = """ + . + "itemA" . + "The Hobbit, or There and Back Again" . + . + "9780261102217" . + + . + "itemB" . + "Pride & Prejudice" . + . + "9780199535569" . + + . + "itemC" . + "1984" . + . + "9780452284234" . +""" + +def eval_example(tmp_path: Path): + """Example: Evaluate a KG against a ground truth.""" + + tmp_path = tmp_path / "my_kg.nt" + tmp_path.write_text(TEST_NTRIPLES) + + kg = KG( + id="my_kg", + name="My Knowledge Graph", + path=tmp_path, + format=BasicDataFormats.RDF_NTRIPLES + ) + + statistical_config = StatisticalConfig(name="default") + # semantic_config = SemanticConfig(name="default") + # reference_config = ReferenceConfig( + # name="default" + # REFERENCE_KG_PATH=...) 
+ + statistical_evaluator = StatisticalEvaluator() + # semantic_evaluator = SemanticEvaluator() + # reference_evaluator = ReferenceEvaluator() + + statistical_metrics: List[str] = [EntityCountMetric().name] + # semantic_metrics: List[str] = [DisjointDomainMetric().name, IncorrectRelationDirectionMetric().name, IncorrectRelationRangeMetric().name, IncorrectRelationDomainMetric().name, IncorrectDatatypeMetric().name, IncorrectDatatypeFormatMetric().name] + # reference_metrics: List[str] = [SourceTypedEntityCoverageMetric().name, ReferenceTripleAlignmentMetric().name, ReferenceTripleAlignmentMetricSoftE().name, ReferenceTripleAlignmentMetricSoftEV().name] + + statistical_results = statistical_evaluator.evaluate( + kg, metrics=statistical_metrics, config=statistical_config) + # semantic_results = semantic_evaluator.evaluate( + # kg, metrics=semantic_metrics, config=semantic_config) + # reference_results = reference_evaluator.evaluate( + # kg, metrics=reference_metrics, config=reference_config) + + for metric in statistical_results.metrics: + mapper.metric_run_to_entity(metric) + + return statistical_results #, semantic_results, reference_results \ No newline at end of file diff --git a/experiments/examples/src/kgpipe_examples/pipe_examples.py b/experiments/examples/src/kgpipe_examples/pipe_examples.py index 0a44e3d..79289ea 100644 --- a/experiments/examples/src/kgpipe_examples/pipe_examples.py +++ b/experiments/examples/src/kgpipe_examples/pipe_examples.py @@ -19,6 +19,7 @@ def pipe_example(): tmp_data_dir = tempfile.mkdtemp() input_data = Data(path=os.path.join(tmp_data_dir, "input.special_in"), format=ExtendedFormats.SPECIAL_IN) output_data = Data(path=os.path.join(tmp_data_dir, "output.special_kg"), format=ExtendedFormats.SPECIAL_KG) + input_data.path.touch() tasks = [pipe_task_python, pipe_task_docker, pipe_task_remote] diff --git a/experiments/examples/src/kgpipe_examples/task_examples.py b/experiments/examples/src/kgpipe_examples/task_examples.py index 
9a4fa7f..89345d0 100644 --- a/experiments/examples/src/kgpipe_examples/task_examples.py +++ b/experiments/examples/src/kgpipe_examples/task_examples.py @@ -1,17 +1,26 @@ from kgpipe.common import TaskInput, TaskOutput +from kgpipe.common import trace_task_run from kgpipe.common.model.configuration import ConfigurationProfile, ConfigurationDefinition, Parameter, ParameterType from kgpipe_examples.config import ExtendedFormats +from kgpipe.common.model.default_catalog import BasicTaskCategoryCatalog from kgpipe.common.registry import Registry +@trace_task_run @Registry.task( input_spec={"input": ExtendedFormats.SPECIAL_IN}, - output_spec={"output": ExtendedFormats.SPECIAL1} + output_spec={"output": ExtendedFormats.SPECIAL1}, + category=[BasicTaskCategoryCatalog.entity_resolution], + description="A task that processes a special input and produces a special output" ) def pipe_task_python(inputs: TaskInput, outputs: TaskOutput): # touch output file outputs["output"].path.touch() +# def converts_pdfs: pass +# def extracts_text + +@trace_task_run @Registry.task( input_spec={"input": ExtendedFormats.SPECIAL1}, output_spec={"output": ExtendedFormats.SPECIAL2} @@ -20,6 +29,7 @@ def pipe_task_docker(inputs: TaskInput, outputs: TaskOutput): # touch output file outputs["output"].path.touch() +@trace_task_run @Registry.task( input_spec={"input": ExtendedFormats.SPECIAL2}, output_spec={"output": ExtendedFormats.SPECIAL_KG} @@ -29,14 +39,23 @@ def pipe_task_remote(inputs: TaskInput, outputs: TaskOutput): outputs["output"].path.touch() +@trace_task_run @Registry.task( - input_spec={"input": ExtendedFormats.SPECIAL2}, + input_spec={"input": ExtendedFormats.SPECIAL1}, output_spec={"output": ExtendedFormats.SPECIAL_KG}, + category=[BasicTaskCategoryCatalog.entity_resolution], config_spec=ConfigurationDefinition( name="pipe_task_with_config_spec", description="Configuration specification for the pipe_task_with_config task", parameters=[ - Parameter(name="some_parameter", 
datatype=ParameterType.string, default_value="default", required=False) + Parameter( + name="some_parameter", + native_keys=["some_parameter"], + datatype=ParameterType.string, + default_value="default", + required=False, + allowed_values=[] + ) ] ) ) diff --git a/experiments/examples/src/kgpipe_examples/test_examples.py b/experiments/examples/src/kgpipe_examples/test_examples.py index 08d63f7..c937774 100644 --- a/experiments/examples/src/kgpipe_examples/test_examples.py +++ b/experiments/examples/src/kgpipe_examples/test_examples.py @@ -1,17 +1,213 @@ +from pathlib import Path +from kgpipe.common import Data -def test_python_task_defintion(): + +def test_python_task_execution(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats from kgpipe_examples.task_examples import pipe_task_python - assert pipe_task_python.name == "pipe_task_python" -def test_docker_task_defintion(): + in_file = tmp_path / "input.special_in" + out_file = tmp_path / "output.special1" + in_file.touch() + + report = pipe_task_python.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL_IN)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL1)], + ) + + assert pipe_task_python.name == "pipe_task_python" + assert report.status == "success" + assert out_file.exists() + + +def test_docker_task_execution(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats from kgpipe_examples.task_examples import pipe_task_docker + + in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special2" + in_file.touch() + + report = pipe_task_docker.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL2)], + ) + assert pipe_task_docker.name == "pipe_task_docker" + assert report.status == "success" + assert out_file.exists() -def test_remote_task_defintion(): + +def test_remote_task_execution(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats from 
kgpipe_examples.task_examples import pipe_task_remote + + in_file = tmp_path / "input.special2" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_remote.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL2)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + ) + assert pipe_task_remote.name == "pipe_task_remote" + assert report.status == "success" + assert out_file.exists() + + +def test_config_spec_execution(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats + from kgpipe_examples.task_examples import pipe_task_with_config + from kgpipe.common.model.configuration import ( + ConfigurationProfile, + ParameterBinding, + Parameter, + ParameterType, + ) + + in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_with_config.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + configProfile=ConfigurationProfile( + name="pipe_task_with_config_profile", + definition=pipe_task_with_config.config_spec, + bindings=[ + ParameterBinding( + parameter=Parameter( + name="some_parameter", + native_keys=["some_parameter"], + datatype=ParameterType.string, + default_value="default", + required=False, + allowed_values=[], + ), + value="some", + ) + ], + ), + ) + + assert pipe_task_with_config.name == "pipe_task_with_config" + assert report.status == "success" + assert out_file.exists() + +def test_config_profile_missing_fails(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats + from kgpipe_examples.task_examples import pipe_task_with_config + + in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_with_config.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + # configProfile 
intentionally omitted + ) -def test_pipeline_defintion(): + assert report.status == "failed" + assert report.error is not None + assert "requires a 'config' argument" in report.error + + +def test_config_profile_wrong_type_fails(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats + from kgpipe_examples.task_examples import pipe_task_with_config + + in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_with_config.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + configProfile="not-a-profile", + ) + + assert report.status == "failed" + assert report.error is not None + assert "expects configProfile to be a ConfigurationProfile" in report.error + + +def test_config_profile_spec_mismatch_fails(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats + from kgpipe_examples.task_examples import pipe_task_with_config + from kgpipe.common.model.configuration import ( + ConfigurationProfile, + ConfigurationDefinition, + ) + + in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_with_config.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + configProfile=ConfigurationProfile( + name="mismatching_profile", + definition=ConfigurationDefinition(name="different_spec_name"), + bindings=[], + ), + ) + + assert report.status == "failed" + assert report.error is not None + assert "does not match task config spec" in report.error + + +def test_config_profile_unknown_parameter_fails(tmp_path: Path): + from kgpipe_examples.config import ExtendedFormats + from kgpipe_examples.task_examples import pipe_task_with_config + from kgpipe.common.model.configuration import ( + ConfigurationProfile, + ParameterBinding, + Parameter, + ParameterType, + ) + 
+ in_file = tmp_path / "input.special1" + out_file = tmp_path / "output.special_kg" + in_file.touch() + + report = pipe_task_with_config.run( + inputs=[Data(path=in_file, format=ExtendedFormats.SPECIAL1)], + outputs=[Data(path=out_file, format=ExtendedFormats.SPECIAL_KG)], + configProfile=ConfigurationProfile( + name="pipe_task_with_config_profile_unknown_param", + definition=pipe_task_with_config.config_spec, + bindings=[ + ParameterBinding( + parameter=Parameter( + name="other_parameter", + native_keys=["other_parameter"], + datatype=ParameterType.string, + default_value="default", + required=False, + allowed_values=[], + ), + value="some", + ) + ], + ), + ) + + assert report.status == "failed" + assert report.error is not None + assert "Unknown config parameter" in report.error + +def test_pipeline_definition_executes(): from kgpipe_examples.pipe_examples import pipe_example - \ No newline at end of file + + # Main objective: execute the pipeline example end-to-end without errors. + pipe_example() + +def test_evaluation_example(tmp_path: Path): + from kgpipe_examples.eval_examples import eval_example + eval_example(tmp_path) \ No newline at end of file diff --git a/experiments/moviekg/Makefile b/experiments/moviekg/Makefile index 4eac2f6..e3fcdf7 100644 --- a/experiments/moviekg/Makefile +++ b/experiments/moviekg/Makefile @@ -1,6 +1,6 @@ .PHONY: -DATASET_URL := https://zenodo.org/record/17246358/files/inc_movie_kg_datasets.tar.gz?download=1 +ZENODO_RECORD := 17246357 BASE_DIR := ./data # === Main === @@ -16,13 +16,6 @@ pipelines: pipelines-llm: pytest -v src/moviekg/pipelines/ -k "llm" -evaluation: - pytest -v src/moviekg/evaluation/ -k "not llm" - -paper: - pytest -v src/moviekg/evaluation/test_inc_msp_evaluation.py -k concat; - pytest -v src/moviekg/paper/test_figtab.py; - # === Docker === $(BASE_DIR)/.kgpipe-docker-built: @@ -75,7 +68,9 @@ clean: $(BASE_DIR)/datasets.tar.gz: @mkdir -p $(BASE_DIR) - @cd $(BASE_DIR) && wget $(DATASET_URL) -O datasets.tar.gz + 
@cd $(BASE_DIR) && wget "$$(curl -sL https://zenodo.org/api/records/$(ZENODO_RECORD) \ + | jq -r '.files[] | select(.key=="inc_movie_kg_datasets.tar.gz") | .links.self')" \ + -O datasets.tar.gz $(BASE_DIR)/datasets/.extracted: $(BASE_DIR)/datasets.tar.gz @mkdir -p $(BASE_DIR)/datasets @@ -87,98 +82,40 @@ download-datasets: $(BASE_DIR)/datasets/.extracted datasets-eval: pytest -v src/moviekg/datasets/test_evaluate_film_data.py -# === SSPs === - -test-ssp-all: - time pytest -s -v src/moviekg/pipelines/test_inc_ssp.py - -eval-ssp-all: - time pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py - -test-ssp-classic: - time pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k "not llm_" - -test-ssp-llm: - time pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k "llm_" - -eval-ssp-llm: - time pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k "llm_" - # === RDF === -test-rdf-a: +test-rdf-base: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k rdf_a -eval-rdf-a: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k rdf_a - -test-rdf-b: +test-rdf-alt: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k rdf_b -eval-rdf-b: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k rdf_b - -test-rdf-c: +test-rdf-llm: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k rdf_llm -eval-rdf-c: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k rdf_llm - # === JSON === -test-json-a: +test-json-base: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k json_a -eval-json-a: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k json_a - -test-json-b: +test-json-alt: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k json_b -eval-json-b: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k json_b - -test-json-c: +test-json-llm: pytest -v -s src/moviekg/pipelines/test_inc_ssp.py -k json_llm -eval-json-c: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k json_llm 
- # === TEXT === -test-text-a: +test-text-base: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k text_a -eval-text-a: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k text_a - -test-text-b: +test-text-alt: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k text_b -eval-text-b: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k text_b - -test-text-c: +test-text-llm: pytest -s -v src/moviekg/pipelines/test_inc_ssp.py -k text_llm -eval-text-c: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k text_llm - # === MSPs === test-msp-all: pytest -s -v src/moviekg/pipelines/test_inc_msp.py - -eval-msp-all: - pytest -s -v src/moviekg/evaluation/test_inc_msp_evaluation.py - -eval-msp-rjt: - pytest -s -v src/moviekg/evaluation/test_inc_msp_evaluation.py -k rdf-json-text - -# === Paper === - -concatenate-metrics: - pytest -s -v src/moviekg/evaluation/test_inc_ssp_evaluation.py -k test_concatenate_long_table_rows - -paper-figtab: - pytest -s -v src/moviekg/paper/test_figtab.py diff --git a/experiments/moviekg/README.md b/experiments/moviekg/README.md index d713b49..4363cab 100644 --- a/experiments/moviekg/README.md +++ b/experiments/moviekg/README.md @@ -1,55 +1,64 @@ -# Inc Movie KG +# MovieKG (KGpipe pipelines) -Documentation and experiment code for incremental KG generation and evaluation. +This directory contains **MovieKG pipeline definitions and execution helpers** for running incremental KG construction +pipelines with KGpipe. +Evaluation of the produced KGs is now handled in the **KGI-Bench** repository (Movie benchmark). 
See: +- [KGI-Bench](https://github.com/ScaDS/KGI-Bench) +- [KGI-Bench-Movie](https://github.com/ScaDS/KGI-Bench/tree/main/benchmarks/kgi-bench-movie) +- [KGI-Bench/docs/cli.md](https://scads.github.io/KGI-Bench/#cli) (includes `kgibench evaluate --benchmark movie ...`) -# Dataset Overview +## What’s in here -- 📊 [Benchmark Datasets](https://doi.org/10.5281/zenodo.17246357) +- **Pipeline catalog**: `pipeline.conf` (pipeline variants and their task sequences) +- **Execution helpers**: `src/moviekg/pipelines/` (pytest-driven runners + helpers) +- **Environment templates**: `env`, `docker_env` (copy to `.env` / `docker.env` for local configuration) -A benchmark derived from Wikipedia and DBpedia in the movie domain covering the three entities: `Film,Person,Company` described and connected by 23(+2) attributes. -The dataset consists of the following. +## Running pipelines (local) -Four Splits and three different formats: -- RDF: RDF from DBpedia, in the three namespaces for seed, reference and source data -- JSON: json files built from the tree like subgraphs of each film -- TEXT: abstract text of each film entity from wikipedia +From `experiments/moviekg/`: -Suplmenetary data: -- reference entity matches: for entity matching eval (rdf, json) -- reference entity links: for entity linking eval (text) -- provannce mappings: for tracing json entity mappings -- refernce key mappings: for tracing json to rdf schema matching - -Available in three sizes: -- small 100 films: for development -- medium 1,000 films: for testing -- large 10,000 films: for benchmarking +```bash +cp env .env +make pipelines +``` -# Running +LLM variants: -It is possible to execute the experiemnt in a docker environment. 
-Adapt the `docker.env` file -and choose the dataset size (small, medium, large) +```bash +make pipelines-llm +``` -> LLM tasks are disabled by default to enable them add -> make pipelines-llm as task in [moviekg_docker.sh](../../scripts/moviekg_docker.sh) +Per-pipeline targets are also available (see `Makefile`), e.g.: -Prepare -``` -make setup_docker +```bash +make test-json-base +make test-rdf-base +make test-msp-all ``` -Execution of dataset stats, pipelines, evalaution, and paper content generation -``` +## Running pipelines (Docker workflow) + +This uses the `Makefile` targets to build images + start services and run pipelines inside Docker. + +```bash +cp docker_env docker.env +make setup_docker make run_docker_small ``` -For more detailed information see also [reproduce.md](../../docs/reproduce.md) or [docs](../../docs/) +> Note: LLM pipelines are typically disabled by default in Docker orchestration; enable them by adding the +> `pipelines-llm` step to the orchestration script used in your setup. + +## Dataset overview (high level) -# Directory Structure +- Dataset release: `https://doi.org/10.5281/zenodo.17246357` +- Sizes: `small` (100 films), `medium` (1k), `large` (10k) +- Formats per split: RDF, JSON, TEXT (incremental splits with seed/reference/source) -## Input Structure +## Directory structure + +### Input structure (example) ``` ├── film_100 @@ -84,12 +93,16 @@ For more detailed information see also [reproduce.md](../../docs/reproduce.md) o ├── film_1k[... 
trunc] ``` -## Output Structure +### Output structure (example) + +Pipeline outputs are written under `$OUTPUT_DIR/$DATASET_SELECT//stage_/` and include: +- `result.nt` (and optionally `result_eval.nt`) +- `exec-plan.json`, `exec-report.json` +- `tmp/` intermediate artifacts ``` ├── small -│   ├── all_metrics.csv -│   ├── json_a +│   ├── json_base │   │   ├── stage_1 │   │   │   ├── exec-plan.json │   │   │   ├── exec-report.json @@ -105,9 +118,6 @@ For more detailed information see also [reproduce.md](../../docs/reproduce.md) o │   │   ├── exec-report.json │   │   ├── result.nt │   │   └── tmp/ -│ ├── json_b[... trunc] -│   ├── paper -│   │   ├── test_fig....png -│   │   └── test_tab.....png +│ ├── json_alt[... trunc] └── medium[... trunc] -``` \ No newline at end of file +``` diff --git a/experiments/moviekg/env b/experiments/moviekg/env index 7923c7e..cb9e376 100644 --- a/experiments/moviekg/env +++ b/experiments/moviekg/env @@ -1,12 +1,12 @@ PIPELINE_CONFIG=pipeline.conf -DATASET_SELECT=medium +DATASET_SELECT=small -ONTOLOGY_PATH=/home/marvin/project/KGpipe/experiments/moviekg/movie-ontology.ttl -OUTPUT_DIR=/home/marvin/project/data/out/ +ONTOLOGY_PATH=./data/datasets/film_10k/ontology.ttl +OUTPUT_DIR=./data/results/ -DATASET_SMALL=/home/marvin/project/data/final/film_100 -DATASET_MEDIUM=/home/marvin/project/data/final/film_1k -DATASET_LARGE=/home/marvin/project/data/final/film_10k +DATASET_SMALL=./data/datasets/film_100 +DATASET_MEDIUM=./data/datasets/film_1k +DATASET_LARGE=./data/datasets/film_10k EMBEDDER=sentence-transformer DBPEDIA_ANNOTATE_URL='http://localhost:2222/rest/annotate' @@ -18,4 +18,3 @@ OLLAMA_TOKEN= OPENAI_TOKEN= LLM_ENDPOINT_URL= - diff --git a/experiments/moviekg/eval.sh b/experiments/moviekg/eval.sh deleted file mode 100644 index ecaad6f..0000000 --- 
a/experiments/moviekg/eval.sh +++ /dev/null @@ -1,10 +0,0 @@ -kgpipe eval -c metric_config.yaml \ - -m ReferenceTripleAlignmentMetricSoftEV \ - -m entity_count \ - -m incorrect_relation_direction \ - -m incorrect_relation_cardinality \ - -m incorrect_relation_range \ - -m incorrect_relation_domain \ - -m incorrect_datatype \ - -m incorrect_datatype_format \ - data/out/small/rdf_a/stage_3/result.nt diff --git a/experiments/moviekg/pipeline.conf b/experiments/moviekg/pipeline.conf index 53fe6fc..5c6964e 100644 --- a/experiments/moviekg/pipeline.conf +++ b/experiments/moviekg/pipeline.conf @@ -1,10 +1,6 @@ -# Pipeline Defintion +# Pipeline Defintions -# ======== -# RDF SSPs -# ======== - -rdf_a: +rdf_base: description: "Align source RDF with target KG" config: ENTITY_MATCHING_THRESHOLD: "0.99" @@ -16,23 +12,31 @@ rdf_a: - paris_exchange # 3 Fuse matched RDF (threshold 0.5) - fusion_first_value + # 4 Infer types / align to ontology - type_inference_ontology_simple -rdf_b: +rdf_alt: description: "Align source RDF with target KG with a tabular matching approach" config: ENTITY_MATCHING_THRESHOLD: "0.5" RELATION_MATCHING_THRESHOLD: "0.1" tasks: + # 1 Transform RDF to tabular representation - transform2_rdf_to_csv_v2 + # 2 Match entities (tabular) - pyjedai_entity_matching_v2 + # 3 Keep best match per entity - reduce_to_best_match_per_entity + # 4 Match relations/schemas (tabular) - valentine_csv_matching_v2 + # 5 Aggregate entity + relation matches - aggregate_2matches + # 6 Fuse matched RDF - fusion_first_value + # 7 Infer types / align to ontology - type_inference_ontology_simple -rdf_llm_schema_align_v1: +rdf_llm: description: "Align relations of source RDF with target KG using LLM" config: ENTITY_MATCHING_THRESHOLD: "0.99" @@ -42,96 +46,122 @@ rdf_llm_schema_align_v1: # 1 Use LLM to match relations - llm_task_rdf_ontology_matching_v1 # results in er.json # 2 Map source KG relations to matching target KG relations - # - map_kg_alignments - map_er_match_relations # 3 
Match entities with paris - paris_entity_matching # 4 Exchange matched RDF - paris_exchange + # 5 Aggregate entity + relation matches - aggregate_2matches - # 5 Fuse matched RDF maybe only entities + # 6 Fuse matched RDF (maybe only entities) - fusion_first_value + # 7 Infer types / align to ontology - type_inference_ontology_simple -# ========= -# JSON SSPs -# ========= - -json_a: +json_base: description: "Construct intermediate RDF from JSON" tasks: # 1 Nested tree Json to generic RDF graph - construct_rdf_from_json3 # 2 Match RDF graph with seed - paris_entity_matching - # 3 exchange + # 3 Exchange matches - paris_exchange # 4 Fuse matched RDF (threshold 0.5) - fusion_first_value + # 5 Infer types / align to ontology - type_inference_ontology_simple -json_b: +json_alt: description: "Link JSON objects to target KG" tasks: - # construct TE_Document from JSON + # 1 Construct TE_Document from JSON - construct_linkedrdf_from_json_v3 # extract_json.py - # 4 Fuse matched RDF (threshold 0.5) + # 2 Select / fuse values - select_first_value + # 3 Infer types / align to ontology - type_inference_ontology_simple -json_llm_mapping_v1: +json_llm: description: "Align JSON path to target KG (ontology + sample KG)" tasks: + # 1 Use LLM to map JSON and construct intermediate RDF - llm_task_map_and_construct + # 2 Aggregate intermediate RDF outputs - aggregate_rdf_files + # 3 Match entities with paris - paris_entity_matching + # 4 Exchange matched RDF - paris_exchange + # 5 Fuse matched RDF - fusion_first_value + # 6 Infer types / align to ontology - type_inference_ontology_simple -# ========= -# Text SSPs -# ========= - -text_a: +text_base: description: "Use spoltight build RDF stagging Graph and apply Paris matching" tasks: + # 1 Extract triples with OpenIE - corenlp_openie_extraction + # 2 Convert extraction output to TE JSON - corenlp_exchange + # 3 Link relations (label+alias embedding) - label_alias_embedding_rl + # 4 Link entities with DBpedia Spotlight - 
dbpedia_spotlight_ner_nel + # 5 Convert Spotlight output to TE JSON - dbpedia_spotlight_exchange + # 6 Aggregate TE JSON artifacts - aggregate3_te_json + # 7 Construct RDF staging graph (mappings only) - construct_rdf_from_te_json_mappings_only + # 8 Match entities with paris - paris_entity_matching + # 9 Exchange matched RDF - paris_exchange + # 10 Fuse matched RDF - fusion_first_value + # 11 Infer types / align to ontology - type_inference_ontology_simple - -text_b: +text_alt: description: "(semi expensive) Use mini transformer to link entities and relations (label+alias)" tasks: + # 1 Extract triples with OpenIE - corenlp_openie_extraction # ("Berlin", "is a", "city") + # 2 Convert extraction output to TE JSON - corenlp_exchange + # 3 Link entities (label+alias embedding) - label_alias_embedding_el # ("Berlin" -> http://dbpedia.org/resource/Berlin) + # 4 Link relations (label+alias embedding) - label_alias_embedding_rl # ("is a" -> "rdf:type") + # 5 Aggregate TE JSON artifacts - aggregate3_te_json + # 6 Construct RDF staging graph - construct_rdf_from_te_json + # 7 Select / fuse values - select_first_value + # 8 Infer types / align to ontology - type_inference_ontology_simple -text_llm_triple_extract_v1: +text_llm: description: "Extract RDF from TEXT using LLM" config: LLM_MODEL: "gpt-5-mini" tasks: + # 1 Extract triples using LLM - llm_task_text_triple_extract_v1 + # 2 Link entities (label+alias embedding) - label_alias_embedding_el # ("Berlin" -> http://dbpedia.org/resource/Berlin) + # 3 Link relations (label+alias embedding) - label_alias_embedding_rl # ("is a" -> "rdf:type") + # 4 Aggregate TE JSON artifacts - aggregate3_te_json + # 5 Construct RDF staging graph - construct_rdf_from_te_json + # 6 Select / fuse values - select_first_value + # 7 Infer types / align to ontology - type_inference_ontology_simple # - type_inference_ontology_simple TODO why was this commented diff --git a/experiments/moviekg/src/moviekg/config.py 
b/experiments/moviekg/src/moviekg/config.py index 4460916..3727e00 100644 --- a/experiments/moviekg/src/moviekg/config.py +++ b/experiments/moviekg/src/moviekg/config.py @@ -43,22 +43,16 @@ pipeline_types = { - "rdf_a": "rdf", - "rdf_b": "rdf", - "text_a": "text", - "text_b": "text", - "json_a": "json", - "json_b": "json", + "rdf_base": "rdf", + "rdf_alt": "rdf", + "text_base": "text", + "text_alt": "text", + "json_base": "json", + "json_alt": "json", } llm_pipeline_types = { - "json_llm_mapping_v1": "json", - "rdf_llm_schema_align_v1": "rdf", - "text_llm_triple_extract_v1": "text", + "json_llm": "json", + "rdf_llm": "rdf", + "text_llm": "text", } - -ssp = { - "rdf": "rdf_a", - "json": "json_b", - "text": "text_a" -} \ No newline at end of file diff --git a/experiments/moviekg/src/moviekg/datasets/tmp_remove_seeds.py b/experiments/moviekg/src/moviekg/datasets/tmp_remove_seeds.py new file mode 100644 index 0000000..ea3dded --- /dev/null +++ b/experiments/moviekg/src/moviekg/datasets/tmp_remove_seeds.py @@ -0,0 +1,33 @@ +from moviekg.evaluation.test_eval_refactor import KgBenchData + +""" +for every verified_seed remove in the bench data remove the seed entities and store as verified_entities_no_seed.csv +""" + +import pandas as pd +from pathlib import Path + +bench_data = KgBenchData.from_path(Path("/home/marvin/phd/data/moviekg/datasets/film_1k")) + +for i in range(1, 4): + seed = bench_data.dataset.splits[f"split_{0}"].kg_reference.meta.entities.file + current = bench_data.dataset.splits[f"split_{i}"].kg_reference.meta.entities.file + current_path = bench_data.dataset.splits[f"split_{i}"].kg_reference.meta.entities.file + current_new = current_path.with_name(f"{current_path.stem}_no_seed{current_path.suffix}") + + # remove all lines from current that are in seed and save to new file + with open(current, "r") as f: + current_lines = f.readlines() + with open(seed, "r") as f: + seed_lines = f.readlines() + with open(current_new, "w") as f: + if not current_lines: + 
continue + + # Preserve header (assumes first line is the CSV header) + f.write(current_lines[0]) + + seed_set = set(seed_lines[1:] if seed_lines else []) + for line in current_lines[1:]: + if line not in seed_set: + f.write(line) diff --git a/experiments/moviekg/src/moviekg/evaluation/helpers.py b/experiments/moviekg/src/moviekg/evaluation/helpers.py deleted file mode 100644 index 03ce1d0..0000000 --- a/experiments/moviekg/src/moviekg/evaluation/helpers.py +++ /dev/null @@ -1,201 +0,0 @@ -import json -import tempfile -import re -import shutil -from typing import List, Dict, Tuple -from pathlib import Path -from rdflib import Graph - -from kgpipe.common.models import KG, DataFormat -from kgpipe.evaluation.aspects import reference, semantic, statistical -from kgpipe.evaluation.aspects.reference import ReferenceConfig -from kgpipe.evaluation.base import MetricResult -from kgcore.model.ontology import OntologyUtil - -from moviekg.datasets.pipe_out import StageOut -from moviekg.config import dataset - -ontology_graph = Graph() -if dataset.ontology is None: - raise ValueError("No ontology found") -ontology_graph.parse(dataset.ontology.as_posix()) - -def show_ontology(): - if dataset.ontology is None: - raise ValueError("No ontology found") - ontology = OntologyUtil.load_ontology_from_file(dataset.ontology) - - for class_ in ontology.classes: - print(f"{class_.uri} {class_.label}") - # print(f"{class_.alias} {class_.description}") - print(f"{class_.equivalent}") - print(f"{class_.disjointWith}") - print("-" * 100) - - for property in ontology.properties: - print(f"{property.uri} {property.type} {property.label}") - # print(f"{property.alias} {property.description}") - print(f"{property.domain.uri} {property.range.uri} {property.equivalent}") - print(f"{property.min_cardinality} {property.max_cardinality}") - print("-" * 100) - -show_ontology() - - -def print_long_table_rows(rows: List[dict]): - """ - with correct margin and alignment - """ - max_aspect_length = 
max(len(row["aspect"]) for row in rows) - max_metric_name_length = max(len(row["metric"]) for row in rows) - max_value_length = max(len(str(row["value"])) for row in rows) - max_normalized_length = max(len(str(row["normalized"])) for row in rows) - max_duration_length = max(len(str(row["duration"])) for row in rows) - - print(f"{'Aspect':<{max_aspect_length}} | {'Metric':<{max_metric_name_length}} | {'Value':<{max_value_length}} | {'Normalized':<{max_normalized_length}} | {'Duration':<{max_duration_length}}") - print("-" * (max_aspect_length + max_metric_name_length + max_value_length + max_normalized_length + max_duration_length + 6)) - for row in rows: - print(f"{row['aspect']:<{max_aspect_length}} | {row['metric']:<{max_metric_name_length}} | {row['value']:<{max_value_length}} | {row['normalized']:<{max_normalized_length}} | {row['duration']:<{max_duration_length}}") - -def metrics_to_long_table_rows(metrics: List[MetricResult], pipeline_name: str, stage_name: str) -> List[dict]: - rows = [] - for metric in metrics: - rows.append({ - "pipeline": pipeline_name, - "stage": stage_name, - "aspect": metric.aspect.value, - "metric": metric.name, - "value": metric.value, - "normalized": metric.normalized_score, - "duration": metric.duration, - "details": json.dumps(metric.details, default=str) - }) - return rows - - - -def get_reference_config(stage: StageOut, is_ssp: bool) -> ReferenceConfig: - - # this is a pipeline name based hack to get the source type and source split id - def get_split_id_and_source_type(stage: StageOut, is_ssp: bool = False) -> Tuple[int, str]: - # stage_name is like "stage_1" - split_id = int(stage.stage_name.split("_")[1]) - pipeline_name = stage.root.parent.name - source_ord = pipeline_name.split("_") - - if len(source_ord) != 3 or is_ssp: - source_type = source_ord[0] - else: - source_type = source_ord[split_id-1] - return split_id, source_type - - split_id, source_type = get_split_id_and_source_type(stage) - - meta = 
dataset.splits[f"split_{split_id}"].sources[source_type].meta - verified_source_entities_path = dataset.splits[f"split_{split_id}"].kg_seed.root / "meta/verified_entities.csv" - verified_source_matches_path = meta.root / "verified_matches.csv" - - - kg_reference = dataset.splits[f"split_{split_id}"].kg_reference - if kg_reference is None: - raise ValueError(f"No reference KG found for split {split_id} and source type {source_type}") - reference_path = kg_reference.root / "data_agg.nt" - - kg_seed = dataset.splits[f"split_0"].kg_seed - if kg_seed is None: - raise ValueError(f"No seed KG found for split {0}") - seed_path = kg_seed.root / "data.nt" - - ENTITY_MATCH_THRESHOLD_MAP = { - "json_a": 0.99, - "rdf_a": 0.99, - "rdf_b": 0.5, - "rdf_c": 0.99, - "rdf_llm_schema_align_v1": 0.99 - } - - RELATION_MATCH_THRESHOLD_MAP = { - "json_a": 0.5, - "rdf_a": 0.5, - "rdf_b": 0.1, - "rdf_c": 0.5, - "rdf_llm_schema_align_v1": 0.5 - } - - return ReferenceConfig( - name="reference", - GT_MATCHES=verified_source_matches_path, - GT_MATCHES_TARGET_DATASET=dataset.splits[f"split_{0}"].root.name+"/kg/seed", - RELATION_MATCH_THRESHOLD=RELATION_MATCH_THRESHOLD_MAP.get(stage.root.parent.name, 0.5), - ENTITY_MATCH_THRESHOLD=ENTITY_MATCH_THRESHOLD_MAP.get(stage.root.parent.name, 0.99), - VERIFIED_SOURCE_ENTITIES=verified_source_entities_path, - REFERENCE_KG_PATH=reference_path, - SEED_KG_PATH=seed_path, - TE_LINK_THRESHOLD=0.5, - source_meta=meta, - dataset=dataset, - JSON_EXPECTED_DIR="/home/marvin/project/data/work/json", #TODO cleanup - JSON_EXPECTED_RELATION_FILE="/home/marvin/project/data/final/film_10k/split_0/sources/json/meta/verified_relation_matches.json" # TODO cleanup - ) - -from kgpipe.evaluation.base import MetricResult, EvaluationAspect - -def add_duration_metrics(stage: StageOut) -> MetricResult: - - try: - duration = stage.report.duration - return MetricResult( - aspect=EvaluationAspect.STATISTICAL, - name="duration", - value=duration, - normalized_score=0, - details={ - 
"duration": duration - } - ) - - except Exception as e: - return MetricResult( - aspect=EvaluationAspect.STATISTICAL, - name="duration", - value=0, - normalized_score=0, - details={"error": "No duration found"} - ) - - - -def evaluate_stage(stage: StageOut, is_ssp: bool) -> List[MetricResult]: - result_path = stage.resultKG - if result_path is None: - return [] - - result_kg = KG(id=f"result_{stage.stage_name}", name=f"result_{stage.stage_name}", path=result_path, format=DataFormat.RDF_NTRIPLES,plan=stage.plan) - - result_kg.set_ontology_graph(ontology_graph) - - stat_eval = statistical.StatisticalEvaluator() - ref_eval = reference.ReferenceEvaluator() - sem_eval = semantic.SemanticEvaluator() - - stats_aspect_result = stat_eval.evaluate(result_kg) - ref_aspect_result = ref_eval.evaluate(result_kg, config=get_reference_config(stage, is_ssp)) - sem_aspect_result = sem_eval.evaluate(result_kg) - - metrics = [] - metrics = stats_aspect_result.metrics + ref_aspect_result.metrics + sem_aspect_result.metrics - # metrics = sem_aspect_result.metrics - metrics.append(add_duration_metrics(stage)) - # metrics = ref_aspect_result.metrics - - return metrics - -def replace_with_dict(infile: str, mapping: dict[str, str]) -> None: - with open(infile, encoding="utf-8") as f, \ - tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp: - for line in f: - for key, val in mapping.items(): - line = re.sub(re.escape(key), val, line) - tmp.write(line) - tmp_path = tmp.name - shutil.move(tmp_path, infile) diff --git a/experiments/moviekg/src/moviekg/evaluation/test_inc_msp_evaluation.py b/experiments/moviekg/src/moviekg/evaluation/test_inc_msp_evaluation.py deleted file mode 100644 index fa186d2..0000000 --- a/experiments/moviekg/src/moviekg/evaluation/test_inc_msp_evaluation.py +++ /dev/null @@ -1,61 +0,0 @@ -import pandas as pd -import pytest -import os -from typing import Sequence -from _pytest.compat import NotSetType -from itertools import permutations - -from 
moviekg.datasets.pipe_out import load_pipe_out -from moviekg.evaluation.helpers import evaluate_stage, metrics_to_long_table_rows -from moviekg.pipelines.test_inc_msp import ssp, idfn - -from moviekg.config import OUTPUT_ROOT - -@pytest.mark.parametrize( - "source_1, source_2, source_3", - permutations(list[str](ssp.keys()), 3), - ids=idfn -) -def test_inc_ssp_evaluation(source_1, source_2, source_3): - - output_dir = OUTPUT_ROOT / f"{source_1}_{source_2}_{source_3}" - - pipeline_name = f"{source_1}_{source_2}_{source_3}" - - print("-" * 100) - print(f"Evaluating {source_1}, {source_2}, {source_3}") - print("-" * 100) - - if not output_dir.exists(): - pytest.skip(f"Pipeline output directory {output_dir} not found") - - pipe_out = load_pipe_out(output_dir) - - rows = [] - - for stage in pipe_out.stages: - print("-" * 100) - print(f"{pipeline_name} - Stage: {stage.stage_name}") - print("-" * 100) - - metrics = evaluate_stage(stage, is_ssp=False) - rows.extend(metrics_to_long_table_rows(metrics, pipeline_name, stage.stage_name)) - # break # TODO remove - - metrics_df = pd.DataFrame(rows) - metrics_df.to_csv(OUTPUT_ROOT / f"{pipeline_name}_metrics.csv", index=False) - print("saved metrics to", OUTPUT_ROOT / f"{pipeline_name}_metrics.csv") - -def test_concatenate_long_table_rows(): - # glob - rows = [] - for file in OUTPUT_ROOT.glob("*_metrics.csv"): - if file.name == "all_metrics.csv": - continue - if os.path.getsize(file) < 3: - continue - df = pd.read_csv(file) - rows.extend(df.to_dict(orient="records")) - - metrics_df = pd.DataFrame(rows) - metrics_df.to_csv(OUTPUT_ROOT / "all_metrics.csv", index=False) diff --git a/experiments/moviekg/src/moviekg/evaluation/test_inc_ssp_evaluation.py b/experiments/moviekg/src/moviekg/evaluation/test_inc_ssp_evaluation.py deleted file mode 100644 index 2e62f54..0000000 --- a/experiments/moviekg/src/moviekg/evaluation/test_inc_ssp_evaluation.py +++ /dev/null @@ -1,58 +0,0 @@ -import pytest -import pandas as pd -import os -from 
pathlib import Path - -from moviekg.datasets.pipe_out import load_pipe_out -from moviekg.evaluation.helpers import evaluate_stage, metrics_to_long_table_rows, print_long_table_rows -from moviekg.pipelines.test_inc_ssp import pipeline_types, llm_pipeline_types - -from moviekg.config import OUTPUT_ROOT - -@pytest.mark.parametrize( - "pipeline_name", - list[str](pipeline_types.keys()) + list[str](llm_pipeline_types.keys()) -) -def test_inc_ssp_evaluation(pipeline_name): - - output_dir = OUTPUT_ROOT / pipeline_name - - print("-" * 100) - print(f"Evaluating {pipeline_name}") - print("-" * 100) - - if not output_dir.exists(): - pytest.skip(f"Pipeline output directory {output_dir} not found") - - pipe_out = load_pipe_out(output_dir) - - rows = [] - - for stage in pipe_out.stages: - print("-" * 100) - print(f"{pipeline_name} - Stage: {stage.stage_name}") - print("-" * 100) - - metrics = evaluate_stage(stage, is_ssp=True) - new_rows = metrics_to_long_table_rows(metrics, pipeline_name, stage.stage_name) - print_long_table_rows(new_rows) - rows.extend(new_rows) - # break # TODO remove this - - metrics_df = pd.DataFrame(rows) - metrics_df.to_csv(OUTPUT_ROOT / f"{pipeline_name}_metrics.csv", index=False) - print("saved metrics to", OUTPUT_ROOT / f"{pipeline_name}_metrics.csv") - -def test_concatenate_long_table_rows(): - # glob - rows = [] - for file in OUTPUT_ROOT.glob("*_metrics.csv"): - if file.name == "all_metrics.csv": - continue - if os.path.getsize(file) < 3: - continue - df = pd.read_csv(file) - rows.extend(df.to_dict(orient="records")) - - metrics_df = pd.DataFrame(rows) - metrics_df.to_csv(OUTPUT_ROOT / "all_metrics.csv", index=False) diff --git a/experiments/moviekg/src/moviekg/evaluation/test_ref_dev.py b/experiments/moviekg/src/moviekg/evaluation/test_ref_dev.py deleted file mode 100644 index 1a4d17b..0000000 --- a/experiments/moviekg/src/moviekg/evaluation/test_ref_dev.py +++ /dev/null @@ -1,247 +0,0 @@ -# from pathlib import Path -# import numpy as np -# from 
sentence_transformers import SentenceTransformer -# from rdflib import Graph, URIRef, Literal, RDF, RDFS, XSD -# import re -# from tqdm import tqdm - -# def integrated_entities(path_actual_kg, path_expected_kg): -# pass - -# SOFT_ENTITY_THRESHOLD = 0.75 -# SOFT_VALUES_THRESHOLD = 0.75 - -# def encode(values, model, desc: str): -# embeddings = [] -# for i in tqdm(range(0, len(values), 64), desc=desc): -# batch = values[i:i+64] -# batch_emb = model.encode(batch, show_progress_bar=False) -# embeddings.append(batch_emb) -# return np.vstack(embeddings) - -# def graph_fact_alginment(ga: Graph, ge: Graph): -# te = [ str(s)+str(p)+str(o) for s, p, o in ge ] -# ta = [ str(s)+str(p)+str(o) for s, p, o in ga ] - -# tp = len(set(ta) & set(te)) -# fp = len(set(ta) - set(te)) -# fn = len(set(te) - set(ta)) - -# print(f"TP: {tp}, FP: {fp}, FN: {fn}") -# print(f"Precision: {tp / (tp + fp)}") -# print(f"Recall: {tp / (tp + fn)}") -# print(f"F1: {2 * tp / (2 * tp + fp + fn)}") - -# def clean_label(label: str): -# # remove all non-alphanumeric characters -# cleaned_label = label.replace("_", " ") -# # remove parenthesis text -# cleaned_label = re.sub(r'\([^)]*\)', '', cleaned_label) -# return cleaned_label.strip() - - -# def graph_match_labels_soft(ga: Graph, ge: Graph, model: SentenceTransformer): -# actual_uri_to_abels = {} -# expected_uri_to_abels = {} - -# for s, _, o in ga.triples((None, RDFS.label, None)): -# actual_uri_to_abels[str(s)] = clean_label(str(o)) - -# for s, _, o in ge.triples((None, RDFS.label, None)): -# expected_uri_to_abels[str(s)]= clean_label(str(o)) - -# actual_embeddings = encode(list(actual_uri_to_abels.values()), model, "Encoding actual labels") -# expected_embeddings = encode(list(expected_uri_to_abels.values()), model, "Encoding expected labels") - -# cosine_scores = np.dot(actual_embeddings, expected_embeddings.T) - -# actual_uri_keys = list(actual_uri_to_abels.keys()) -# expected_uri_keys = list(expected_uri_to_abels.keys()) - -# # get best match 
expected uri for each actual uri - -# uri_mappings = {} - -# best_matches = [] -# for i in range(len(actual_uri_keys)): -# best_match = expected_uri_keys[np.argmax(cosine_scores[i])] -# best_score = cosine_scores[i][np.argmax(cosine_scores[i])] -# best_matches.append((best_match, best_score)) - -# for i in range(len(best_matches)): -# if best_matches[i][1] > SOFT_ENTITY_THRESHOLD: -# # la = actual_uri_to_abels[actual_uri_keys[i]].replace(" ", "_") -# # le = expected_uri_to_abels[best_matches[i][0]].replace(" ", "_") -# uri_actual = actual_uri_keys[i] -# uri_expected = best_matches[i][0] -# uri_mappings[uri_actual] = uri_expected - -# return uri_mappings - -# def graph_fact_alginment_soft_entities(ga: Graph, ge: Graph, model: SentenceTransformer): -# uri_mappings = graph_match_labels_soft(ga, ge, model) - -# ga_mapped = Graph() -# for s, p, o in ga: -# if str(s) in uri_mappings: -# s = URIRef(uri_mappings[str(s)]) -# if isinstance(o, URIRef) and str(o) in uri_mappings: -# o = URIRef(uri_mappings[str(o)]) -# ga_mapped.add((s, p, o)) - -# graph_fact_alginment(ga_mapped, ge) - -# # TODO rdf:type is removed for tp calculation -# def graph_fact_alginment_soft_entities_values(ga: Graph, ge: Graph, model: SentenceTransformer): -# uri_mappings = graph_match_labels_soft(ga, ge, model) - -# def get_label(o: URIRef, graph: Graph): -# labels = [str(l) for l in graph.objects(o, RDFS.label)] -# if len(labels) == 0: -# return [] -# else: -# return [clean_label(l) for l in labels] - -# ga_mapped = Graph() -# for s, p, o in ga: -# if str(s) in uri_mappings: -# s = URIRef(uri_mappings[str(s)]) -# if isinstance(o, URIRef): # and p != RDF.type -# for label in get_label(o, ga): -# ga_mapped.add((s, p, Literal(label))) -# else: -# ga_mapped.add((s, p, o)) - -# ge_mapped = Graph() -# for s, p, o in ge: -# if isinstance(o, URIRef): # and p != RDF.type -# for label in get_label(o, ge): -# ge_mapped.add((s, p, Literal(label))) -# else: -# ge_mapped.add((s, p, o)) - -# # encode all values -# 
vas = list(set([str(o) for _, _, o in ga_mapped if not isinstance(o, URIRef)])) -# ves = list(set([str(o) for _, _, o in ge_mapped if not isinstance(o, URIRef)])) - -# va_embeddings = encode(vas, model, "Encoding actual values") -# ve_embeddings = encode(ves, model, "Encoding expected values") - -# v2e_actual = {} -# v2e_expected = {} - -# for idx, v in enumerate(vas): -# v2e_actual[v] = va_embeddings[idx] - -# for idx, v in enumerate(ves): -# v2e_expected[v] = ve_embeddings[idx] - -# tp = 0 -# fp = 0 -# fn = 0 - -# sp_actual = set() - -# # for each (s, p, o) in ga_mapped check if there is a matching value for the same (s, p) in ge -# for s, p in ga_mapped.subject_predicates(unique=True): -# sp_actual.add((s, p)) -# _vas = [str(o) for o in ga_mapped.objects(s, p)] -# _ves = [str(o) for o in ge_mapped.objects(s, p)] -# _vas_embeddings = np.array([v2e_actual[v] for v in _vas]) -# _ves_embeddings = np.array([v2e_expected[v] for v in _ves]) - -# if len(_vas_embeddings) == 0 or len(_ves_embeddings) == 0: -# continue -# cosine_scores = np.dot(_vas_embeddings, _ves_embeddings.T) # (len(_vas_embeddings), len(_ves_embeddings)) - -# for idx in range(len(_vas)): -# best_match = _ves[np.argmax(cosine_scores[idx])] -# best_score = cosine_scores[idx][np.argmax(cosine_scores[idx])] -# if best_score > SOFT_VALUES_THRESHOLD: -# actual_value = _vas[idx] -# reference_value = best_match -# tp += 1 -# # if actual_value == reference_value: -# # # print(f"Found matching value for {s} {p} {actual_value}") -# # pass -# # else: -# # print(f"Found matching value for {s} {p} {actual_value} but not exact reference {reference_value}") -# # print(f"Value actual: {_vas[idx]}, {best_match}, {best_score}") -# # print(f"Value expected: {_ves[np.argmax(cosine_scores[idx])]}") -# else: -# fp += 1 -# # print(f"No matching value for {s} {p} {_vas[idx]} from references {_ves}") - -# sp_expected = set([(s, p) for s, p in ge_mapped.subject_predicates(unique=True)]) -# missing_sp = sp_expected - sp_actual 
-# for s, p in missing_sp: -# for _ in ge_mapped.triples((s, p, None)): -# fn += 1 - -# print(f"TP: {tp}, FP: {fp}, FN: {fn}") -# print(f"Precision: {tp / (tp + fp)}") -# print(f"Recall: {tp / (tp + fn)}") -# print(f"F1: {2 * tp / (2 * tp + fp + fn)}") - -# def reference_alignment(path_actual_kg: Path, path_expected_kg: Path): -# ga = Graph() -# ga.parse(path_actual_kg) - -# ge = Graph() -# ge.parse(path_expected_kg) - -# graph_fact_alginment(ga, ge) - -# def reference_alignment_soft_entities(path_actual_kg: Path, path_expected_kg: Path): - -# model = SentenceTransformer("all-MiniLM-L6-v2") -# model.to("cuda") - -# ga = Graph() -# ga.parse(path_actual_kg) - -# ge = Graph() -# ge.parse(path_expected_kg) - -# graph_fact_alginment_soft_entities(ga, ge, model) - -# def reference_alignment_soft_entities_values(path_actual_kg: Path, path_expected_kg: Path): - -# model = SentenceTransformer("all-MiniLM-L6-v2") -# model.to("cuda") - -# ga = Graph() -# ga.parse(path_actual_kg) - -# ge = Graph() -# ge.parse(path_expected_kg) - -# graph_fact_alginment_soft_entities_values(ga, ge, model) - -# def test_integrated_verified_source_entities(): -# print("Integrated verified source entities") -# path_actual_kg = Path("/home/marvin/project/code/experiments/out_film_100/rdf_a/stage_1/result.nt") -# path_expected_kg = Path("/home/marvin/project/data/final/film_100/split_3/kg/reference/data_agg.nt") -# integrated_entities(path_actual_kg, path_expected_kg) - -# def test_reference_alignment(): -# print("Reference alignment") -# path_actual_kg = Path("/home/marvin/project/code/experiments/out_film_100/rdf_a/stage_1/result.nt") -# path_expected_kg = Path("/home/marvin/project/data/final/film_100/split_3/kg/reference/data_agg.nt") -# reference_alignment(path_actual_kg, path_expected_kg) - -# def test_reference_alignment_soft(): -# print("Reference alignment soft") -# path_actual_kg = Path("/home/marvin/project/code/experiments/out_film_100/text_b/stage_1/result.nt") -# path_expected_kg = 
Path("/home/marvin/project/data/final/film_100/split_3/kg/reference/data_agg.nt") -# reference_alignment_soft_entities(path_actual_kg, path_expected_kg) - -# def test_reference_alignment_soft_entities_values(): -# print("Reference alignment soft entities values") -# path_actual_kg = Path("/home/marvin/project/code/experiments/out_film_100/rdf_a/stage_1/result.nt") -# path_expected_kg = Path("/home/marvin/project/data/final/film_100/split_3/kg/reference/data_agg.nt") -# reference_alignment_soft_entities_values(path_actual_kg, path_expected_kg) - -# if __name__ == "__main__": -# test_integrated_verified_source_entities() -# test_reference_alignment() \ No newline at end of file diff --git a/experiments/moviekg/src/moviekg/evaluation/test_sensitivity.py b/experiments/moviekg/src/moviekg/evaluation/test_sensitivity.py deleted file mode 100644 index 0c68210..0000000 --- a/experiments/moviekg/src/moviekg/evaluation/test_sensitivity.py +++ /dev/null @@ -1,159 +0,0 @@ -from dataclasses import dataclass -from typing import List -from kgpipe.common import KgPipe, Data, DataFormat, KG -from pathlib import Path -from kgpipe.common.models import KgPipePlan -from kgpipe.evaluation.aspects.reference import ( - ReferenceEvaluator, ReferenceConfig, - ER_EntityMatchMetric, ER_RelationMatchMetric, - TE_ExpectedEntityLinkMetric, TE_ExpectedRelationLinkMetric -) -import os -@dataclass -class BinaryClassifier: - tp: int - fp: int - tn: int - fn: int - - def accuracy(self) -> float: - return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn) - - def precision(self) -> float: - return self.tp / (self.tp + self.fp) - -@dataclass -class ThresholdSensitivityResult: - pipeline_name: str - threshold: float - result: BinaryClassifier - -benchdata = Path("/home/marvin/phd/kgpipe/experiments/moviekg/data/datasets/film_10k/") -seed_path = benchdata / "split_0/kg/seed/data.nt" -rdf_path = benchdata / "split_1/sources/rdf/data.nt" -result_dir_path = 
Path(f"data/moviekg/threshold_sensitivity/") - -# reference_evaluator = ReferenceEvaluator() - -def run_paris_pipeline(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - from kgpipe_tasks.tasks import paris_entity_matching, paris_exchange - - pipe_result_dir_path = result_dir_path / f"{pipeline_name}" - pipeline = KgPipe( - name="paris pipeline", - tasks=[paris_entity_matching, paris_exchange], - seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), - data_dir=pipe_result_dir_path / "tmp" - ) - plan = pipeline.build( - source=Data(path=rdf_path, format=DataFormat.RDF_NTRIPLES), - result=Data(path=pipe_result_dir_path / "result.json", format=DataFormat.ER_JSON) - ) - - os.makedirs(pipe_result_dir_path, exist_ok=True) - - with open(pipe_result_dir_path / "exec-plan.json", "w") as f: - f.write(plan.model_dump_json(indent=4)) - - pipeline.run() - -def paris_er_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - config = ReferenceConfig( - name="paris config", - ENTITY_MATCH_THRESHOLD=threshold, - RELATION_MATCH_THRESHOLD=threshold, - GT_MATCHES=benchdata / "split_1/sources/rdf/meta/verified_matches.csv", - GT_MATCHES_TARGET_DATASET="split_0/kg/seed" - ) - - plan = KgPipePlan.model_validate_json(open(result_dir_path / f"{pipeline_name}" / "exec-plan.json").read()) - - kg = KG(id="paris", name="paris", path=Path(f"data/moviekg/paris/{pipeline_name}.nt"), format=DataFormat.RDF_NTRIPLES, plan=plan) - - metric_result = ER_EntityMatchMetric().compute(kg, config=config) - # print(metric_result) - - return metric_result - -def paris_om_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - config = ReferenceConfig( - name="paris config", - ENTITY_MATCH_THRESHOLD=threshold, - RELATION_MATCH_THRESHOLD=threshold, - GT_MATCHES=benchdata / "split_1/sources/rdf/meta/verified_matches.csv", - GT_MATCHES_TARGET_DATASET="split_0/kg/seed" - ) - - plan = 
KgPipePlan.model_validate_json(open(result_dir_path / f"{pipeline_name}" / "exec-plan.json").read()) - - kg = KG(id="paris", name="paris", path=Path(f"data/moviekg/paris/{pipeline_name}.nt"), format=DataFormat.RDF_NTRIPLES, plan=plan) - - metric_result = ER_RelationMatchMetric().compute(kg, config=config) - # print(metric_result) - - return metric_result - -def test_paris(): - # run_paris_pipeline("paris", 0.99) - range_of_thresholds = [0.0, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 1.0] - - er_results = [] - for threshold in range_of_thresholds: - result = paris_er_threshold_sensitivity("paris", threshold) - er_results.append([threshold, result.normalized_score, result.details]) - - print() - print("ER Results:") - for r in er_results: - print(r[0], r[1], r[2]) - - om_results = [] - for threshold in range_of_thresholds: - result = paris_om_threshold_sensitivity("paris", threshold) - om_results.append([threshold, result.normalized_score, result.details]) - - print("OM Results:") - for r in om_results: - print(r[0], r[1], r[2]) - -# def paris_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: -# result = run_paris_pipeline(pipeline_name, threshold) - -# pipeline.run( -# input=[Data(path=Path(f"data/moviekg/paris/{pipeline_name}.nt"), format=DataFormat.RDF_NTRIPLES)], -# output=[Data(path=Path(f"data/moviekg/paris/{pipeline_name}.paris_csv"), format=DataFormat.PARIS_CSV)] -# ) - -# config = ReferenceConfig( -# name="paris config", -# ENTITY_MATCH_THRESHOLD=threshold, -# RELATION_MATCH_THRESHOLD=threshold -# ) - - - - -# # TODO get config from dataset -# # kg = KG(path=Path(f"data/moviekg/paris/{pipeline_name}.nt")) -# # reference_kg = KG(path=Path("data/moviekg/paris/reference.nt")) -# # result = reference_evaluator.evaluate(kg, reference_kg) -# # return result -# pass - -def jedai_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - 
pass - -def valentine_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - pass - -def corenlp_openie_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - pass - -def dbpedia_spotlight_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - pass - -def custom_relation_linking_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - pass - -def custom_entity_linking_threshold_sensitivity(pipeline_name: str, threshold: float) -> List[ThresholdSensitivityResult]: - pass \ No newline at end of file diff --git a/experiments/moviekg/src/moviekg/paper/config.py b/experiments/moviekg/src/moviekg/paper/config.py deleted file mode 100644 index 507a5c0..0000000 --- a/experiments/moviekg/src/moviekg/paper/config.py +++ /dev/null @@ -1,135 +0,0 @@ - -HEADERS = ["pipeline", "stage", "aspect", "metric", "value", "normalized", "duration", "details"] - -# Only keep these classes and aggregate the rest into "Other" -main_classes = [ - "http://kg.org/ontology/Company", - "http://kg.org/ontology/Person", - "http://kg.org/ontology/Film" -] - -name_mapping = { - "rdf_a": r"\sspRDFa", - "rdf_b": r"\sspRDFb", - "rdf_c": r"\sspRDFc", - "rdf_llm_schema_align_v1": r"\sspRDFc", - "json_a": r"\sspJSONa", - "json_b": r"\sspJSONb", - "json_baseA": r"\sspJSONbaseA", - "json_c": r"\sspJSONc", - "json_llm_mapping_v1": r"\sspJSONc", - "text_a": r"\sspTexta", - "text_b": r"\sspTextb", - "text_c": r"\sspTextc", - "text_llm_triple_extract_v1": r"\sspTextc", - "rdf_json_text": r"\mspRJT", - "rdf_text_json": r"\mspRTJ", - "json_rdf_text": r"\mspJRT", - "json_text_rdf": r"\mspJTR", - "text_rdf_json": r"\mspTRJ", - "text_json_rdf": r"\mspTJR", -} - -METRIC_NAME_MAP = { - "entity_count": "EC", - "relation_count": "RC", - "triple_count": "FC", - "class_count": "TC", - "duration": "Time", - "loose_entity_count": "LEC", - 
"shallow_entity_count": "SEC", - # Semantic/Reasoning metrics - "reasoning": "EO", - "disjoint_domain": "EO1", - "incorrect_relation_direction": "EO2", - "incorrect_relation_cardinality": "EO3", - "incorrect_relation_range": "EO4", - "incorrect_relation_domain": "EO5", - "incorrect_datatype": "EO6", - "incorrect_datatype_format": "EO7", - "ontology_class_coverage": "EO8", - "ontology_relation_coverage": "EO9", - "ontology_namespace_coverage": "E10", - # Reference metrics - "ReferenceTripleAlignmentMetric": "RTC", - "ReferenceTripleAlignmentMetricSoftE": "RTC-SoftE", - "ReferenceTripleAlignmentMetricSoftEV": "RTC-SoftEV", - "ReferenceClassCoverageMetric": "RCC", - # ER metrics - "ER_EntityMatchMetric": "ER-EM", - "ER_RelationMatchMetric": "ER-RM", - # TE metrics - "TE_ExpectedEntityLinkMetric": "TE-EEL", - "TE_ExpectedRelationLinkMetric": "TE-ERL", - # Source metrics - "SourceEntityCoverageMetric": "VSEC", - "SourceEntityCoverageMetricSoft": "VSEC-Soft", - "REI_precision": "REI-Precision", - -} - -# long: -# disjoint_domain -# incorrect_relation_domain -# incorrect_relation_range -# incorrect_relation_direction -# incorrect_datatype -# incorrect_datatype_format -# short:ODT OD OR ORD OLT OLF OAvg -SEM_METRIC_SHORT_NAMES = { - # "reasoning" : "EO0", - "disjoint_domain": "$O_{DT}$", - "incorrect_relation_direction": "$O_{RD}$", - "incorrect_relation_cardinality": "$O_{CA}$", - "incorrect_relation_range": "$O_{R}$", - "incorrect_relation_domain": "$O_{D}$", - "incorrect_datatype": "$O_{LT}$", - "incorrect_datatype_format": "$O_{LF}$", - # "ontology_class_coverage": "$O_{CC}$", - # "ontology_relation_coverage": "$O_{RC}$", - # "ontology_namespace_coverage": "$O_{NC}$", -} - -METRIC_NAME_INDEX_PRETTY = [ - ("duration", "Runtime Duration"), - ("triple_count", "Fact/Triple Count"), - ("entity_count", "Entity Count"), - ("relation_count", "Relation Count"), - ("class_count", "Entity Type Count"), - ("Person", "Persons"), - ("Film", "Films"), - ("Company", "Companies"), - # 
("Other", "Other Type"), - ("loose_entity_count", "Empty Entities"), - ("shallow_entity_count", "Shallow Entities"), - # Semantic/Reasoning metrics - # ("reasoning", "Reasoning"), - ("disjoint_domain", "Disjoint Domain"), - ("incorrect_relation_direction", "Incorrect Relation Direction"), - ("incorrect_relation_cardinality", "Incorrect Relation Cardinality"), - ("incorrect_relation_range", "Incorrect Relation Range"), - ("incorrect_relation_domain", "Incorrect Relation Domain"), - ("incorrect_datatype", "Incorrect Datatype"), - ("incorrect_datatype_format", "Incorrect Datatype Format"), - # ("ontology_class_coverage", "Ontology Class Coverage"), - # ("ontology_relation_coverage", "Ontology Relation Coverage"), - # ("ontology_namespace_coverage", "Ontology Namespace Coverage"), - # Source metrics - ("SourceEntityCoverageMetric", "Source Entity Recall"), - ("SourceEntityCoverageMetricSoft", "Source Entity Recall (~ID)"), - ("REI_precision", "Source Entity Precision (~ID)"), - # Reference metrics - ("ReferenceTripleAlignmentMetric", "Reference Alignment (f1)"), - ("ReferenceTripleAlignmentMetricSoftE", "Reference Alignment (~ID) (f1)"), - ("ReferenceTripleAlignmentMetricSoftEV", "Reference Alignment (~ID~Value) (f1)"), - # ("ReferenceClassCoverageMetric", "Reference Class Coverage"), - # ER metrics - ("ER_EntityMatchMetric", "Entity Match (p)"), - ("ER_RelationMatchMetric", "Relation Match (p)"), - # TE metrics - ("TE_ExpectedEntityLinkMetric", "Expected Entity Link (p)"), - ("TE_ExpectedRelationLinkMetric", "Expected Relation Link (p)"), -] - -METRIC_NAME_MAP_PRETTY = {k: v for k, v in METRIC_NAME_INDEX_PRETTY} -SEM_METRIC_LONG_NAMES = {v: k for k, v in SEM_METRIC_SHORT_NAMES.items()} diff --git a/experiments/moviekg/src/moviekg/paper/helpers/agggregate.py b/experiments/moviekg/src/moviekg/paper/helpers/agggregate.py deleted file mode 100644 index adff60e..0000000 --- a/experiments/moviekg/src/moviekg/paper/helpers/agggregate.py +++ /dev/null @@ -1,224 +0,0 @@ 
-import pandas as pd -import json -import numpy as np -from moviekg.paper.config import SEM_METRIC_SHORT_NAMES -from moviekg.paper.helpers.helpers import load_metrics_from_file -from moviekg.config import OUTPUT_ROOT - -def agg_duration_over_stages_per_pipeline(metric_df): - # group by pipeline and stage and take mean of normalized - metric_df = metric_df[metric_df["metric"] == "duration"] - # print(metric_df) - metric_df = metric_df.groupby(["pipeline"])["value"].sum().reset_index() - # add stage column = stage 3 - metric_df["stage"] = "stage_3"# - metric_df["metric"] = "duration" - - # print(metric_df.to_string()) - return metric_df - -def norm_min(min, value): - return (min/value) - -def norm_max(max, value): - return 1 / (max/value) - -def get_average_f1_source_entity_f1(df: pd.DataFrame): - df = df[df["metric"] == "SourceEntityPrecisionMetric"] - - for row in df.itertuples(): - details = json.loads(row.details) - expected_entities_count = details["expected_entities_count"] - found_entities_count = details["found_entities_count"] - overlapping_entities_count = details["overlapping_entities_count"] - possible_duplicates_count = details["possible_duplicates_count"] - overlapping_entities_strict_count = details["overlapping_entities_strict_count"] - - # print(f"pipeline={row.pipeline}, stage={row.stage}, expected_entities_count={expected_entities_count}, found_entities_count={found_entities_count}, overlapping_entities_count={overlapping_entities_count}, possible_duplicates_count={possible_duplicates_count}, overlapping_entities_strict_count={overlapping_entities_strict_count}") - precision = overlapping_entities_strict_count / overlapping_entities_count - precision = precision if precision <= 1.0 else 1.0 - recall = overlapping_entities_count / expected_entities_count - recall = recall if recall <= 1.0 else 1.0 - f1 = 2 * (precision * recall) / (precision + recall) - df.loc[row.Index, "normalized"] = f1 - - df = df[["pipeline", "normalized"]] - - # save as csv - 
- # calculate the average of the metrics - df = df.groupby("pipeline").mean().reset_index() - # set as value for normalized and stage_3 - df["stage"] = "stage_3" - df["metric"] = "SourceEntityF1Metric" - df["value"] = df["normalized"] - df = df[["pipeline", "stage", "metric", "value"]] - - return df - -def aggregate_reference_metrics(df: pd.DataFrame): - metrics = [ - "ReferenceTripleAlignmentMetricSoftEV", - "SourceEntityPrecisionMetric", - ] - - source_entity_f1_df = get_average_f1_source_entity_f1(df) - df = pd.concat([df, source_entity_f1_df]) - - df = df[df["metric"].isin(metrics)] - # if metric is ReferenceTripleAlignmentMetricSoftEV get details["f1"] and set normalized to it - - df.loc[df["metric"] == "ReferenceTripleAlignmentMetricSoftEV", "normalized"] = df[df["metric"] == "ReferenceTripleAlignmentMetricSoftEV"]["details"].apply(lambda x: json.loads(x)["f1_score"]) - - df = df[["pipeline", "stage", "metric", "normalized"]] - - new_rows = [] - for pipeline in df["pipeline"].unique(): - new_rows.append({ - "pipeline": pipeline, - "stage": "stage_3", - "metric": "EntityMatchingMetric", - "normalized": 0.85 - }) - new_rows.append({ - "pipeline": pipeline, - "stage": "stage_3", - "metric": "OntologyMatchingMetric", - "normalized": 0.75 - }) - new_rows.append({ - "pipeline": pipeline, - "stage": "stage_3", - "metric": "EntityLinkingMetric", - "normalized": 0.44 - }) - - df = pd.concat([df, pd.DataFrame(new_rows)]) - - - # for each pipeline and stage = stage_3, calculate the average of the metrics - df = df[df["stage"] == "stage_3"] - - return df - -def aggregate_efficiency_metrics(df: pd.DataFrame): - metrics = ["duration", "memory_peak"] - df = df[df["metric"].isin(metrics)] - df = df[["pipeline", "stage", "metric", "value"]] - # for duration aggregate sum the values for each pipeline and stage - - duration_df = agg_duration_over_stages_per_pipeline(df) - # remove duration - df = df[df["metric"] != "duration"] - df = pd.concat([df, duration_df]) - - df["stage"] 
= "stage_3" - - def get_min_for_metric(metric): - return df[df["metric"] == metric]["value"].min() - - def get_max_for_metric(metric): - return df[df["metric"] == metric]["value"].max() - - for metric in df["metric"].unique(): - min_val = get_min_for_metric(metric) - max_val = get_max_for_metric(metric) - df.loc[df["metric"] == metric, "normalized"] = norm_min(min_val, df["value"]) - - return df - - -def aggregate_semantic_metrics(df: pd.DataFrame): - metrics = list(SEM_METRIC_SHORT_NAMES.keys()) - df = df[df["metric"].isin(metrics)] - df = df[["pipeline", "stage", "metric", "normalized"]] - # for each pipeline and stage = stage_3, calculate the average of the metrics - df = df[df["stage"] == "stage_3"] - - return df - -def aggregate_size_metrics(df: pd.DataFrame): - metrics = ["entity_count", "triple_count"] - df = df[df["metric"].isin(metrics)] - df = df[["pipeline", "stage", "metric", "value"]] - # for each pipeline and stage = stage_3, calculate the average of the metrics - df = df[df["stage"] == "stage_3"] - - # Pivot to compute density per pipeline - wide = df.pivot(index="pipeline", columns="metric", values="value") - - # Compute density = triple_count / entity_count (guard against zero/NaN) - denom = wide["entity_count"] - numer = wide["triple_count"] - density = np.where((denom > 0) & np.isfinite(denom), numer / denom, np.nan) - wide["density"] = density - - - # Return to long format: (pipeline, metric, value) - df = (wide.reset_index() - .melt(id_vars="pipeline", var_name="metric", value_name="value")) - - - def _normalize(group: pd.DataFrame): - vmax = group["value"].max() - vmin = group["value"].min() - - invert_normalization = False - if group.name == "density": - invert_normalization = True - - if invert_normalization: - group["normalized"] = norm_min(vmin, group["value"]) # largest→0, smallest→1 - else: - group["normalized"] = norm_max(vmax, group["value"]) #(group["value"] - vmin) / (vmax - vmin) # smallest→0, largest→1 - - return group - -
df = df.groupby("metric", group_keys=False).apply(_normalize) - - return df - -def mean_scores(df, column_name): - df = df[["pipeline", "normalized"]] - # calculate the average of the metrics - df = df.groupby("pipeline").mean().reset_index() - # rename normalized to semantic - df = df.rename(columns={"normalized": column_name}) - return df - -def aggregate_ranking_df(): - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - - # # replace pipeline name with name_mapping - # metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name) - - # only pipelines where name contains 2 "_" chars - # metric_df = metric_df[metric_df["pipeline"].str.count("_") == 2] TODO - - norm_semantic_df = aggregate_semantic_metrics(metric_df) - norm_semantic_df = norm_semantic_df[["pipeline", "metric", "normalized"]] - agg_semantic_df = mean_scores(norm_semantic_df, "semantic") - - norm_reference_df = aggregate_reference_metrics(metric_df) - norm_reference_df = norm_reference_df[["pipeline", "metric", "normalized"]] - # print(norm_reference_df.to_string()) - agg_reference_df = mean_scores(norm_reference_df, "reference") - - norm_efficiency_df = aggregate_efficiency_metrics(metric_df) - # print(norm_efficiency_df) - norm_efficiency_df = norm_efficiency_df[["pipeline", "metric", "normalized"]] - agg_efficiency_df = mean_scores(norm_efficiency_df, "efficiency") - - norm_size_df = aggregate_size_metrics(metric_df) - norm_size_df = norm_size_df[["pipeline", "metric", "normalized"]] - agg_size_df = mean_scores(norm_size_df, "size") - - norm_df = pd.merge(norm_semantic_df, norm_reference_df, on=["pipeline", "metric", "normalized"], how="outer") - norm_df = pd.merge(norm_df, norm_efficiency_df, on=["pipeline", "metric", "normalized"], how="outer") - norm_df = pd.merge(norm_df, norm_size_df, on=["pipeline", "metric", "normalized"], how="outer") - - agg_df = pd.merge(agg_semantic_df, agg_reference_df, on=["pipeline"], how="left") - agg_df = pd.merge(agg_df, 
agg_efficiency_df, on=["pipeline"], how="left") - agg_df = pd.merge(agg_df, agg_size_df, on=["pipeline"], how="left") - - return norm_df, agg_df \ No newline at end of file diff --git a/experiments/moviekg/src/moviekg/paper/helpers/getter.py b/experiments/moviekg/src/moviekg/paper/helpers/getter.py deleted file mode 100644 index ab7fce4..0000000 --- a/experiments/moviekg/src/moviekg/paper/helpers/getter.py +++ /dev/null @@ -1,552 +0,0 @@ - -import pandas as pd -from collections import defaultdict -import json -from typing import List, Callable - - -type pipeline_name = str -type stage_name = str -type metric_name = str -type metric_value = float -type pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]] -type pipeline_stage_metric_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]] - -""" -Helper file to map final metrics as kgpipe.evaluation... still in progress - -Each getter returns a nested dictionary of pipeline, stage, metric_name -{ - "pipeline": { - "stage": { - "metric_name": value - } - } -} -""" - -# Util - -def dict_for_metric_name(df: pd.DataFrame, metric_name: str, row_name: str = "value") -> pipeline_stage_dict: - df = df[df["metric"] == metric_name] - metric_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for index, row in df.iterrows(): - metric_dict[row["pipeline"]][row["stage"]] = row[row_name] - return metric_dict - -# Statistical metrics - -def sta_entity_count(df: pd.DataFrame): - # only pipeline, stage, value - return dict_for_metric_name(df, "entity_count") - -def sta_fact_count(df: pd.DataFrame): - return dict_for_metric_name(df, "triple_count") - -def sta_type_count(df: pd.DataFrame): - return dict_for_metric_name(df, "class_count") - -def sta_relation_count(df: pd.DataFrame): - return dict_for_metric_name(df, "relation_count") - -def sta_shallow_entity_count(df: pd.DataFrame): - return 
dict_for_metric_name(df, "shallow_entity_count") - -def sta_denisity(df: pd.DataFrame): - fact_count = sta_fact_count(df) - entity_count = sta_entity_count(df) - - metric_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for pipeline, stage_dict in fact_count.items(): - for stage, value in stage_dict.items(): - metric_dict[pipeline][stage] = value / entity_count[pipeline][stage] - - return metric_dict - -def sta_duration(df: pd.DataFrame): - return dict_for_metric_name(df, "duration") - -# def sta_memory_peak(df: pd.DataFrame): -# return dict_for_metric_name(df, "memory_peak") - -# Semantic metrics - -def sem_disjoint_domain(df: pd.DataFrame): - return dict_for_metric_name(df, "disjoint_domain", "normalized") - -def sem_incorrect_relation_direction(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_relation_direction", "normalized") - -def sem_incorrect_relation_cardinality(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_relation_cardinality", "normalized") - -def sem_incorrect_relation_range(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_relation_range", "normalized") - -def sem_incorrect_relation_domain(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_relation_domain", "normalized") - -def sem_incorrect_datatype(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_datatype", "normalized") - -def sem_incorrect_datatype_format(df: pd.DataFrame): - return dict_for_metric_name(df, "incorrect_datatype_format", "normalized") - -# Reference metrics -def ref_kg_f1(df: pd.DataFrame): - df = df[df["metric"] == "ReferenceTripleAlignmentMetricSoftEV"] - - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - # print(details) - f1 = details.get("f1_score", 
-1) - res[row.pipeline][row.stage] = f1 - return res - -def ref_kg_p(df: pd.DataFrame): - df = df[df["metric"] == "ReferenceTripleAlignmentMetricSoftEV"] - - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - # print(details) - p = details["precision"] - res[row.pipeline][row.stage] = p - return res - -def ref_kg_r(df: pd.DataFrame): - df = df[df["metric"] == "ReferenceTripleAlignmentMetricSoftE"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - # print(details) - r = details["recall"] - res[row.pipeline][row.stage] = r - return res - -def ref_source_entity_f1(df: pd.DataFrame) -> pipeline_stage_dict: - df = df[df["metric"] == "SourceEntityPrecisionMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - expected_entities_count = details["expected_entities_count"] - found_entities_count = details["found_entities_count"] - overlapping_entities_count = details["overlapping_entities_count"] - possible_duplicates_count = details["possible_duplicates_count"] - overlapping_entities_strict_count = details["overlapping_entities_strict_count"] - - precision = overlapping_entities_strict_count / overlapping_entities_count - precision = precision if precision <= 1.0 else 1.0 - recall = overlapping_entities_count / expected_entities_count - recall = recall if recall <= 1.0 else 1.0 - f1 = 2 * (precision * recall) / (precision + recall) - df.loc[row.Index, "normalized"] = f1 - res[row.pipeline][row.stage] = f1 - return res - -def ref_source_entity_p(df: 
pd.DataFrame): - df = df[df["metric"] == "SourceEntityPrecisionMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - precision = details["overlapping_entities_strict_count"] / details["overlapping_entities_count"] - res[row.pipeline][row.stage] = precision - return res - -def ref_source_entity_r(df: pd.DataFrame): - df = df[df["metric"] == "SourceEntityPrecisionMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - recall = details["overlapping_entities_count"] / details["expected_entities_count"] - res[row.pipeline][row.stage] = recall - return res - -def ref_source_typed_entity_p(df: pd.DataFrame): - df = df[df["metric"] == "SourceTypedEntityCoverageMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - # print(details) - precision = details.get("fn", -1) - res[row.pipeline][row.stage] = precision - return res - -def ref_source_typed_entity_r(df: pd.DataFrame): - df = df[df["metric"] == "SourceTypedEntityCoverageMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - recall = details.get("recall", -1) - res[row.pipeline][row.stage] = recall - return res - -def ref_entity_matching_f1(df: pd.DataFrame): - df = df[df["metric"] == "ER_EntityMatchMetric"] - - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: 
defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_seed_match_cnt"] - fp = details["false_seed_match_cnt"] - fn = details["false_missing_seed_match_cnt"] - f1 = 2 * tp / (2 * tp + fp + fn) - res[row.pipeline][row.stage] = f1 - return res - -def ref_entity_matching_p(df: pd.DataFrame): - df = df[df["metric"] == "ER_EntityMatchMetric"] - - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_seed_match_cnt"] - fp = details["false_seed_match_cnt"] - fn = details["false_missing_seed_match_cnt"] - precision = tp / (tp + fp) - res[row.pipeline][row.stage] = precision - return res - -def ref_entity_matching_r(df: pd.DataFrame): - df = df[df["metric"] == "ER_EntityMatchMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_seed_match_cnt"] - fp = details["false_seed_match_cnt"] - fn = details["false_missing_seed_match_cnt"] - recall = tp / (tp + fn) - res[row.pipeline][row.stage] = recall - return res - -RM_DEFAULT_FN=24 # 23 + label - -def ref_relation_matching_f1(df: pd.DataFrame): - df = df[df["metric"] == "ER_RelationMatchMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = 
json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_relation_match_cnt"] - fp = details["false_relation_match_cnt"] - fn = RM_DEFAULT_FN - (tp+fp) # details.get("false_missing_relation_match_cnt", 0) - f1 = 2 * tp / (2 * tp + fp + fn) - res[row.pipeline][row.stage] = f1 - return res - - -def ref_relation_matching_p(df: pd.DataFrame): - df = df[df["metric"] == "ER_RelationMatchMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_relation_match_cnt"] - fp = details["false_relation_match_cnt"] - fn = RM_DEFAULT_FN - (tp+fp) # details.get("false_missing_relation_match_cnt", 0) - print(f"tp, fp, fn for {row.pipeline} {row.stage}: {tp}, {fp}, {fn}") - precision = tp / (tp + fp) if (tp + fp) > 0 else 0 - res[row.pipeline][row.stage] = precision - return res - - -def ref_relation_matching_r(df: pd.DataFrame): - df = df[df["metric"] == "ER_RelationMatchMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_relation_match_cnt"] - fp = details["false_relation_match_cnt"] - fn = RM_DEFAULT_FN - (tp+fp) # details.get("false_missing_relation_match_cnt", 0) - recall = tp / (tp + fn) if (tp + fn) > 0 else 0 - res[row.pipeline][row.stage] = recall - return res - -def ref_entity_linking_r(df: pd.DataFrame): - df = df[df["metric"] == "TE_ExpectedEntityLinkMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, 
defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - # print(details) - tp = details["true_link_cnt"] - fp = details["false_link_cnt"] - fn = details["false_missing_link_cnt"] - r = tp / (tp + fn) if (tp + fn) > 0 else 0 - res[row.pipeline][row.stage] = r - return res - -def ref_json_entity_matching_f1(df: pd.DataFrame): - df = df[df["metric"] == "JsonEntityMatchingMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - res[row.pipeline][row.stage] = details["f1_score"] - return res - -def ref_json_entity_matching_p(df: pd.DataFrame): - df = df[df["metric"] == "JsonEntityMatchingMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - res[row.pipeline][row.stage] = details["precision"] - return res - -def ref_json_entity_matching_r(df: pd.DataFrame): - df = df[df["metric"] == "JsonEntityMatchingMetric"] - res: pipeline_stage_dict = defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - res[row.pipeline][row.stage] = details["recall"] - return res - -def ref_json_entity_linking_r(df: pd.DataFrame): - df = df[df["metric"] == "JsonEntityLinkingMetric"] - res: pipeline_stage_dict = 
defaultdict[pipeline_name, defaultdict[stage_name, metric_value]](lambda: defaultdict[stage_name, metric_value](lambda: None)) - for row in df.itertuples(): - details = json.loads(row.details) - if "error" in details: - res[row.pipeline][row.stage] = -1 - continue - res[row.pipeline][row.stage] = details["recall"] - return res - -TABLE_DISPLAY_NAMES = { - # Statistical metrics - sta_entity_count.__name__ : "EC", - sta_fact_count.__name__: "FC", - sta_type_count.__name__: "TC", - sta_relation_count.__name__: "RC", - sta_shallow_entity_count.__name__: "SEC", - sta_denisity.__name__: "D", - sta_duration.__name__: "T", - # sta_memory_peak.__name__: "M", - # Semantic metrics - sem_disjoint_domain.__name__: "ODT", - sem_incorrect_relation_direction.__name__: "ORD", - sem_incorrect_relation_cardinality.__name__: "OCA", - sem_incorrect_relation_range.__name__: "OR", - sem_incorrect_relation_domain.__name__: "OD", - sem_incorrect_datatype.__name__: "OLT", - sem_incorrect_datatype_format.__name__: "OLF", - # Reference metrics - ref_kg_f1.__name__: "RTC", - ref_kg_p.__name__: "RTC-SoftE", - ref_kg_r.__name__: "RTC-SoftE-R", - ref_source_entity_f1.__name__: "VSEC", - ref_source_entity_p.__name__: "VSEC-P", - ref_source_entity_r.__name__: "VSEC-R", - ref_source_typed_entity_p.__name__: "VSEC-P-TE", - ref_source_typed_entity_r.__name__: "VSEC-R-TE", - ref_entity_matching_f1.__name__: "ER-EM", - ref_entity_matching_p.__name__: "ER-EM-P", - ref_entity_matching_r.__name__: "ER-EM-R", - ref_relation_matching_f1.__name__: "ER-RM", - ref_relation_matching_p.__name__: "ER-RM-P", - ref_relation_matching_r.__name__: "ER-RM-R", - ref_entity_linking_r.__name__: "TE-EEL", - ref_json_entity_matching_f1.__name__: "JSON-EM", - ref_json_entity_matching_p.__name__: "JSON-EM-P", - ref_json_entity_matching_r.__name__: "JSON-EM-R", - ref_json_entity_linking_r.__name__: "JSON-EL", -} - -def dict_of_metrics(df: pd.DataFrame, metric_getters: List[Callable[[pd.DataFrame], dict]]) -> 
pipeline_stage_metric_dict: - """ - # call the getter functions for each metric name not the dict_for_metric_name - """ - - # Create a 3-level nested defaultdict: pipeline -> stage -> metric_name -> value - metric_dict = defaultdict(lambda: defaultdict(dict)) - - for metric_getter in metric_getters: - metric_name = metric_getter.__name__ # the metric name (e.g., "sta_entity_count") - metric_data = metric_getter(df) # returns pipeline->stage->value - - if metric_data is None: - continue - - for pipeline, stage_dict in metric_data.items(): - for stage, value in stage_dict.items(): - metric_dict[pipeline][stage][metric_name] = value - - return metric_dict - -def get_pipeline_stage_metric_dict(df: pd.DataFrame, metric_names: List[str]) -> pipeline_stage_metric_dict: - """ - # call the getter functions for each metric name not the dict_for_metric_name - """ - return dict_of_metrics(df, [globals()[f"{metric_name.lower()}"] for metric_name in TABLE_DISPLAY_NAMES.keys()]) - - -def normalize_min_best(values: List[float], value: float) -> float: - def norm_min(min, value): - return (min/value) - return norm_min(min(values), value) - - -def normalize_max_best(values: List[float], value: float) -> float: - # print(f"values: {values}, value: {value}") - def norm_max(max, value): - return 1 / (max/value) - return norm_max(max(values), value) - - -def normalize_metric(psmd: pipeline_stage_metric_dict, metric_name: str, stages: List[str], func: Callable[[list[float], float], float]) -> pipeline_stage_metric_dict: - values_for_metric = [] - - pipelines_to_normalize = [] - stages_to_normalize = [] - - - for pipeline, stage_dict in psmd.items(): - pipelines_to_normalize.append(pipeline) - for stage, metric_dict in stage_dict.items(): - if stage not in stages or metric_name not in metric_dict: - continue - stages_to_normalize.append(stage) - values_for_metric.append(metric_dict[metric_name]) - - for pipeline in pipelines_to_normalize: - for stage in stages_to_normalize: - if 
metric_name not in psmd[pipeline][stage]: - continue - value = psmd[pipeline][stage][metric_name] - if metric_name == sta_fact_count.__name__: - if pipeline in ["json_llm_mapping_v1", "text_llm_triple_extract_v1"]: - values_for_metric=[65000] - else: - values_for_metric=[340000] - print(f"setting max ec for norm {pipeline} {values_for_metric}") - psmd[pipeline][stage][metric_name+"_norm"] = func(values_for_metric, value) - - return psmd - -def update_task_selected_task_metric(psmd: pipeline_stage_metric_dict, metric_name: str) -> pipeline_stage_metric_dict: - - for pipleine, stage_dict in psmd.items(): - if pipleine in ["reference", "seed"]: - continue - for stage, metric_dict in stage_dict.items(): - entity_matching_f1 = metric_dict.get(ref_entity_matching_f1.__name__, -1) - relation_matching_f1 = metric_dict.get(ref_relation_matching_f1.__name__, -1) - entity_linking_r = metric_dict.get(ref_entity_linking_r.__name__, -1) - json_entity_matching_f1 = metric_dict.get(ref_json_entity_matching_f1.__name__, -1) - json_entity_linking_r = metric_dict.get(ref_json_entity_linking_r.__name__, -1) - - if json_entity_matching_f1 != -1: - metric_dict[metric_name] = json_entity_matching_f1 - metric_dict[metric_name+"_spec"] = "JSON ER" - elif entity_matching_f1 != -1: - metric_dict[metric_name] = (entity_matching_f1 + relation_matching_f1) / 2 - metric_dict[metric_name+"_spec"] = "RDF ER" - elif json_entity_linking_r != -1: - metric_dict[metric_name] = json_entity_linking_r - metric_dict[metric_name+"_spec"] = "JSON EL" - else: - metric_dict[metric_name] = entity_linking_r - metric_dict[metric_name+"_spec"] = "TE" - - return psmd - -def agg_avg(values: list[float]) -> float: - return sum(values) / len(values) - -def agg_sum(values: list[float]) -> float: - return sum(values) - -def agg_metric_over_stages(psmd: pipeline_stage_metric_dict, metric_name: str, suffix: str, agg_func: Callable[[list[float]], float]) -> pipeline_stage_metric_dict: - - values_for_metric_by_pipeline = 
defaultdict[pipeline_name, list[float]](lambda: []) - pipelines_to_agg = [] - stages_to_agg = [] - - for pipeline, stage_dict in psmd.items(): - if pipeline in ["reference", "seed"]: - continue - pipelines_to_agg.append(pipeline) - for stage, metric_dict in stage_dict.items(): - if metric_name not in metric_dict: - continue - stages_to_agg.append(stage) - values_for_metric_by_pipeline[pipeline].append(metric_dict[metric_name]) - - for pipeline in pipelines_to_agg: - try: - psmd[pipeline]["stage_3"][metric_name+suffix] = agg_func(values_for_metric_by_pipeline[pipeline]) - except Exception as e: - print(f"Error aggregating metric {metric_name} for pipeline {pipeline}: {e}") - print(values_for_metric_by_pipeline[pipeline]) - psmd[pipeline]["stage_3"][metric_name+suffix] = 0 - return psmd - -def apply_selected_updates(psmd: pipeline_stage_metric_dict) -> pipeline_stage_metric_dict: - normalize_metric(psmd, "sta_entity_count", ["stage_3"], normalize_max_best) - update_task_selected_task_metric(psmd, "ref_selected_task_metric") - agg_metric_over_stages(psmd, "ref_selected_task_metric", "_avg", agg_avg) - agg_metric_over_stages(psmd, "sta_duration", "_sum", agg_sum) - agg_metric_over_stages(psmd, "ref_source_entity_f1", "_avg", agg_avg) - agg_metric_over_stages(psmd, "ref_source_entity_p", "_avg", agg_avg) - agg_metric_over_stages(psmd, "ref_source_entity_r", "_avg", agg_avg) - agg_metric_over_stages(psmd, "ref_source_typed_entity_p", "_avg", agg_avg) - agg_metric_over_stages(psmd, "ref_source_typed_entity_r", "_avg", agg_avg) - # agg_metric_over_stages(psmd, "ref_kg_f1", "_avg", agg_avg) - # agg_metric_over_stages(psmd, "ref_kg_p", "_avg", agg_avg) - # agg_metric_over_stages(psmd, "ref_kg_r", "_avg", agg_avg) - return psmd - -def test_getter(): - from pathlib import Path - from moviekg.paper.helpers.helpers import load_metrics_from_file - print(TABLE_DISPLAY_NAMES.keys()) - df = load_metrics_from_file(Path("/home/marvin/project/data/out/large") / "all_metrics.csv") - 
psmd = dict_of_metrics(df, [globals()[f"{metric_name.lower()}"] for metric_name in TABLE_DISPLAY_NAMES.keys()]) - - - apply_selected_updates(psmd) - - for pipeline, stage_dict in psmd.items(): - for stage, metric_dict in stage_dict.items(): - if pipeline in ["reference", "seed"]: - continue - # print(f"{pipeline} {stage} {metric_dict['ref_selected_task_metric']} {metric_dict['ref_selected_task_metric_spec']}") - if "stage_3" == stage: - # print(f"{pipeline} {stage} {metric_dict['ref_selected_task_metric_agg']}") - print(pipeline) - print(stage) - print(json.dumps(metric_dict, indent=4)) - print("--------------------------------") diff --git a/experiments/moviekg/src/moviekg/paper/helpers/helpers.py b/experiments/moviekg/src/moviekg/paper/helpers/helpers.py deleted file mode 100644 index 161f55e..0000000 --- a/experiments/moviekg/src/moviekg/paper/helpers/helpers.py +++ /dev/null @@ -1,739 +0,0 @@ -from matplotlib.font_manager import font_scalings -import pandas as pd -import numpy as np -import matplotlib.pyplot as plt -import seaborn as sns -import json -from matplotlib.patches import Patch -from matplotlib.ticker import ScalarFormatter -from typing import Dict -import re -import pandas as pd - -import pandas as pd -from typing import List, Optional - -from moviekg.paper.config import HEADERS, main_classes -from moviekg.pipelines.test_inc_ssp import pipeline_types, llm_pipeline_types - - -def load_metrics_from_file(file_path): - # print("Loading metrics from file: ", file_path) - df = pd.read_csv(file_path, names=HEADERS, skiprows=1) - return df - -def plot_growth_v1(df, metrics): - """ - df: pandas DataFrame with columns: - pipeline, stage, aspect, metric, value, normalized, details - metrics: list[str] of metric names to plot - - Generates a subplot for each metric. - Each subplot has x-axis: stage, y-axis: value. - Each pipeline's value is a grouped bar at each stage. - Returns (fig, axes). 
- """ - required_cols = {"pipeline", "stage", "aspect", "metric", "value", "normalized", "details"} - missing = required_cols - set(df.columns) - if missing: - raise ValueError(f"DataFrame is missing required columns: {sorted(missing)}") - - if not isinstance(metrics, (list, tuple)) or len(metrics) == 0: - raise ValueError("`metrics` must be a non-empty list of metric names.") - - # Only keep rows for requested metrics - plot_df = df[df["metric"].isin(metrics)].copy() - if plot_df.empty: - raise ValueError("No rows found for the requested metrics.") - - # Create subplots - n_metrics = len(metrics) - fig, axes = plt.subplots(n_metrics, 1, figsize=(10, max(3.5, 2.8 * n_metrics)), squeeze=False) - axes = axes.ravel() - - # Overall (stable) pipeline order: alphabetical for consistency - all_pipelines = sorted(plot_df["pipeline"].dropna().unique().tolist()) - - for ax, metric in zip(axes, metrics): - mdf = plot_df[plot_df["metric"] == metric].copy() - if mdf.empty: - ax.set_visible(False) - continue - - # Preserve stage order as first-appearance order for this metric - stage_order = pd.Index(mdf["stage"].dropna().astype(str)).drop_duplicates().tolist() - if not stage_order: - ax.set_visible(False) - continue - - # Pivot to stage x pipeline = values - pivot = ( - mdf.assign(stage=pd.Categorical(mdf["stage"].astype(str), categories=stage_order, ordered=True)) - .pivot_table( - index="stage", - columns="pipeline", - values="value", - aggfunc="sum", - ) - .reindex(columns=all_pipelines) # ensure consistent pipeline order - .sort_index() - ) - - # If some pipelines/stages don't exist, fill with 0 (or use NaN if you prefer gaps) - vals = pivot.fillna(0.0).values - stages = pivot.index.astype(str).tolist() - pipelines = pivot.columns.astype(str).tolist() - - n_stages = len(stages) - n_pipes = max(1, len(pipelines)) - - x = np.arange(n_stages, dtype=float) - total_width = 0.8 - bar_w = total_width / n_pipes - - # Center the grouped bars around each stage tick - start = x - 
(total_width / 2) + (bar_w / 2) - - for i, pipe in enumerate(pipelines): - y = pivot[pipe].fillna(0.0).to_numpy() - ax.bar(start + i * bar_w, y, width=bar_w, label=pipe) - - ax.set_title(str(metric)) - ax.set_xlabel("stage") - ax.set_ylabel("value") - ax.set_xticks(x) - ax.set_xticklabels(stages, rotation=0, ha="center") - - # Only show legend if multiple pipelines - if n_pipes > 1: - ax.legend(title="pipeline", frameon=False, ncols=min(3, n_pipes)) - ax.grid(axis="y", linestyle=":", linewidth=0.7, alpha=0.6) - - fig.tight_layout() - return fig, axes - -# --- Hardcoded pipeline colors (light/dark for solos; mid-tone for combined) -PALETTE = { - # JSON solo - "json_a": "#9ecae1", "json_b": "#1f77b4", "json_c": "21f77b4", - # "json_baseA": "#9ecae1", - # RDF solo - "rdf_a": "#a1d99b", "rdf_b": "#2ca02c", "rdf_c": "#3ca02c", - # TEXT solo - "text_a": "#fdd0a2", "text_b": "#ff7f0e", "text_c": "#ff7f0e", - - # JSON mixed → violet - "json_rdf_text": "#756bb1", "json_text_rdf": "#756bb1", - # RDF mixed → teal - "rdf_json_text": "#1c9099", "rdf_text_json": "#1c9099", - # TEXT mixed → red-brown - "text_json_rdf": "#d95f0e", "text_rdf_json": "#d95f0e", -} - -HUE_ORDER = [ - "json_a","json_b","json_rdf_text","json_text_rdf", - "rdf_a","rdf_b","rdf_json_text","rdf_text_json", - "text_a","text_b","text_json_rdf","text_rdf_json" -] - -def plot_growth(df, metrics, kind="bar", references={}): - """ - df: pandas DataFrame with columns: - pipeline, stage, aspect, metric, value, normalized, details - metrics: list[str] of metric names to plot - kind: "bar" or "line" - - Generates a facet plot (subplot per metric). - Each subplot has x-axis: stage, y-axis: value, - with different pipelines distinguished by color.
- """ - required_cols = {"pipeline", "stage", "aspect", "metric", "value", "normalized", "details"} - missing = required_cols - set(df.columns) - if missing: - raise ValueError(f"DataFrame is missing required columns: {sorted(missing)}") - - if not metrics: - raise ValueError("`metrics` must be a non-empty list of metric names.") - - # Filter to requested metrics - plot_df = df[df["metric"].isin(metrics)].copy() - if plot_df.empty: - raise ValueError("No rows found for the requested metrics.") - - # Consistent style - sns.set(style="whitegrid") - - stage_order = list(dict.fromkeys(plot_df["stage"])) - - # sns.set_context("notebook", font_scale=1.2) - - # Facet grid WITHOUT hue to avoid legend kwarg collisions - g = sns.FacetGrid( - plot_df, - col="metric", - col_wrap=len(metrics), - height=len(metrics)*1.6, - aspect=1.5, - sharey=False, - col_order=metrics, - legend_out=False, - ) - - if kind != "bar": - raise ValueError("`kind` must be 'bar' for per-bar labels.") - - # Draw grouped bars with hue specified inside map_dataframe - g.map_dataframe( - sns.barplot, - x="stage", - y="value", - hue="pipeline", - hue_order=HUE_ORDER, - palette=PALETTE, - order=stage_order, - dodge=True, - errorbar=None - ) - - - try: - g._legend.remove() - except Exception: - pass - - # build a single combined legend below everything - handles, labels = g.axes[0].get_legend_handles_labels() - g.fig.legend( - handles, labels, - loc="lower center", - ncol=min(6, len(labels)), # 6 items per row (→ 2 rows for 12 pipelines) - bbox_to_anchor=(0.5, -0.1), # adjust vertical offset - frameon=False - ) - - for ax_idx, ax in enumerate(g.axes.flat): - - # remove x axis label - ax.set_xlabel("") - - # numbers 1 to 3 - for stage_idx in range(1, 4): - value, nvalue, details = get_reference_value(df, metrics[ax_idx], "stage_"+str(stage_idx)) - # print(metrics[ax_idx], value) - xpos = stage_idx - if stage_idx == 0: - ax.axhline(value, ls="--", color="red") - else: - ax.axhline(value, ls="--",
color="black") - - for ax in g.axes.flat: - ax.set_xlabel("") - # tidy up axes - ax.set_xticks(range(len(stage_order))) - ax.set_xticklabels(stage_order) - ax.yaxis.set_major_formatter(ScalarFormatter(useMathText=True)) - ax.ticklabel_format(style='sci', axis='y', scilimits=(0,0)) - ax.grid(True, axis="y", linestyle="--", alpha=0.3) - ax.margins(x=0.02) - - return g - -def _stage_sort_key(s): - """ - Convert 'stage_3' -> 3 for natural sorting; unknown formats go to +inf. - """ - m = re.search(r"(\d+)$", str(s)) - return int(m.group(1)) if m else float("inf") - -def _shorten_iri(iri): - """ - Turn 'http://kg.org/ontology/Person' -> 'Person' for cleaner legends. - """ - return str(iri).rstrip("/").split("/")[-1] - -def _flatten_to_df(nested): - """ - nested: dict like { - 'rdf_a': {'stage_1': {'iri': count, ...}, ...}, - 'reference': {...}, - ... - } - Returns a tidy DataFrame with columns: - Pipeline, Stage, Class, Actual, Expected - """ - - # Split out reference (Expected) from others (Actual) - if "reference" not in nested: - raise ValueError("Input must contain a 'reference' key with expected counts.") - ref = nested["reference"] - pipelines = {k: v for k, v in nested.items() if k != "reference"} - - # Collect all stages/classes across data to ensure aligned zeros - all_stages = sorted( - {s for d in nested.values() for s in d.keys()}, - key=_stage_sort_key - ) - - - - all_classes = sorted( - {c for d in nested.values() for s in d.values() for c in s.keys()} - ) - - - - # Build rows - rows = [] - for pipe, pdata in pipelines.items(): - for stage in all_stages: - for cls in all_classes: - actual = pdata.get(stage, {}).get(cls, 0) - expected = ref.get(stage, {}).get(cls, 0) - if cls not in main_classes: - cls = "Other" - rows.append({ - "Pipeline": pipe, - "Stage": stage, - "Class": cls, - "Actual": actual, - "Expected": expected, - "Class Short": _shorten_iri(cls), - }) - return pd.DataFrame(rows), [ _shorten_iri(c) for c in all_classes ], all_stages, 
list(pipelines.keys()) - -import pandas as pd -import matplotlib.pyplot as plt -from matplotlib.patches import Patch -import seaborn as sns - -def plot_actual_expected_stacked(df, - pipeline_order=None, - stage_order=None, - class_order=None, - col_wrap=3, - height=4, - suptitle="Actual vs Expected (stacked by Class) per Pipeline & Stage"): - # --- prep --- - df = df.copy() - # ensure numeric & fill NAs - for col in ["Actual", "Expected"]: - df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0) - - # Use Class Short as plotting label (cleaner legend) - if "Class Short" not in df.columns: - df["Class Short"] = df["Class"] - - # Default orders (preserve first-seen order) - if pipeline_order is None: - pipeline_order = list(pd.unique(df["Pipeline"])) - if stage_order is None: - stage_order = list(pd.unique(df["Stage"])) - if class_order is None: - class_order = list(pd.unique(df["Class Short"])) - - # aggregate once - gdf = ( - df.groupby(["Pipeline", "Stage", "Class Short"], as_index=False) - .agg(Actual=("Actual","sum"), Expected=("Expected","sum")) - ) - - # full grid to align missing combos to 0 - full_index = pd.MultiIndex.from_product( - [pipeline_order, stage_order], names=["Pipeline","Stage"] - ) - - # pivots: (Pipeline, Stage) × Class - actual = (gdf.pivot_table(index=["Pipeline","Stage"], columns="Class Short", - values="Actual", aggfunc="sum") - .reindex(full_index) - .reindex(columns=class_order) - .fillna(0)) - expected = (gdf.pivot_table(index=["Pipeline","Stage"], columns="Class Short", - values="Expected", aggfunc="sum") - .reindex(full_index) - .reindex(columns=class_order) - .fillna(0)) - - # --- plot --- - sns.set(style="whitegrid") - n_pipes = len(pipeline_order) - ncols = min(col_wrap, n_pipes) - nrows = (n_pipes + ncols - 1) // ncols - fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*height*1.6, nrows*height), squeeze=False, constrained_layout=True) - axes = axes.flatten() - - # palettes - blues = sns.color_palette("Blues",
n_colors=max(3, len(class_order))) - oranges = sns.color_palette("Oranges", n_colors=max(3, len(class_order))) - color_map_actual = {cls: blues[i % len(blues)] for i, cls in enumerate(class_order)} - color_map_expected = {cls: oranges[i % len(oranges)] for i, cls in enumerate(class_order)} - - width = 0.4 - for ax, pipeline in zip(axes, pipeline_order): - act = actual.loc[pipeline] # index=Stage, cols=Class Short - exp = expected.loc[pipeline] # index=Stage, cols=Class Short - - x = range(len(stage_order)) - - # stacked bars - bottom_a = [0.0]*len(stage_order) - bottom_e = [0.0]*len(stage_order) - - for cls in class_order: - a_vals = act[cls].to_numpy() - e_vals = exp[cls].to_numpy() - - ax.bar([xi - 0.2 for xi in x], a_vals, width=width, bottom=bottom_a, color=color_map_actual[cls], edgecolor="none", label="Actual") - ax.bar([xi + 0.2 for xi in x], e_vals, width=width, bottom=bottom_e, color=color_map_expected[cls], edgecolor="none", label="Expected") - - # update bottoms - bottom_a = [b + v for b, v in zip(bottom_a, a_vals)] - bottom_e = [b + v for b, v in zip(bottom_e, e_vals)] - - # cosmetics - ax.set_title(pipeline) - ax.set_xticks(list(x)) - ax.set_xticklabels(stage_order) - ax.set_xlabel("Stage") - ax.set_ylabel("Count") - ax.grid(axis="y", linestyle=":", linewidth=0.7, alpha=0.6) - - # hide any unused axes - for j in range(len(pipeline_order), len(axes)): - fig.delaxes(axes[j]) - - # legend - handles = ( - [Patch(facecolor=color_map_actual[c], label=f"{c} • Actual") for c in class_order] + - [Patch(facecolor=color_map_expected[c], label=f"{c} • Expected") for c in class_order] - ) - - # legend (robust placement) - ncol_leg = min(4, len(handles)) - nrows_leg = int(np.ceil(len(handles) / ncol_leg)) - - leg = fig.legend( - handles=handles, - loc="lower center", - ncol=ncol_leg, - bbox_to_anchor=(0.5, 0.02), # inside the figure, just above bottom - frameon=False - ) - - # Title inside the top of the figure - fig.suptitle(suptitle, y=0.99, fontsize=14) - - #
Give the legend guaranteed space at the bottom, proportional to its rows - # (works alongside constrained_layout) - plt.subplots_adjust(bottom=0.08 + 0.05 * max(0, nrows_leg - 1)) - - return fig - - -def plot_expected_actual_from_nested( - nested, - col_wrap=3, - height=4, - suptitle="Actual vs Expected (stacked by Class) per Pipeline & Stage" -): - """ - nested: dict structured like the user's example. - Creates one subplot per pipeline. For each Stage on that subplot, - draws two stacked bars (Actual & Expected), each stacked by Class. - """ - - df, class_labels, stage_order, pipeline_order = _flatten_to_df(nested) - - # We’ll use the *short* class labels for stacking order & legend - classes = class_labels - - # Prepare nice style - sns.set(style="whitegrid") - g = sns.FacetGrid( - df, - col="Pipeline", - col_wrap=col_wrap, - height=height, - sharey=True, - col_order=pipeline_order - ) - - df[['Actual','Expected']] = df[['Actual','Expected']].fillna(0) - - # Aggregate by Pipeline, Stage, Class, and Class Short - df = ( - df.groupby(['Pipeline', 'Stage', 'Class', 'Class Short'], as_index=False) - .agg({'Actual': 'sum', 'Expected': 'sum'}) - ) - - return plot_actual_expected_stacked(df, pipeline_order, stage_order, ["Other", "Person", "Company", "Film"], col_wrap, height, suptitle) - - -def plot_class_occurence(df): - """ - df: pandas dataframe with columns: pipeline, stage, aspect, metric, value, normalized, details - """ - - # filter df for metrics - df = df[df["metric"].isin(["class_occurrence"])] - # filter details contains unique_classes - df = df[df["details"].str.contains("unique_classes")] - # remove duration column - df = df.drop(columns=["duration"]) - # filter not seed pipeline - df = df[df["pipeline"] != "seed"] - - - class_counts_by_stage_by_pipeline = {} - # for each row - for index, row in df.iterrows(): - details = json.loads(row["details"]) - classes = details["classes"] - if row["pipeline"] not in class_counts_by_stage_by_pipeline: -
class_counts_by_stage_by_pipeline[row["pipeline"]] = {} - # skip stage 0 - if row["stage"] == "stage_0": - continue - if row["stage"] not in class_counts_by_stage_by_pipeline[row["pipeline"]]: - class_counts_by_stage_by_pipeline[row["pipeline"]][row["stage"]] = {} - for class_name, count in classes.items(): - if class_name not in class_counts_by_stage_by_pipeline[row["pipeline"]][row["stage"]]: - class_counts_by_stage_by_pipeline[row["pipeline"]][row["stage"]][class_name] = 0 - class_counts_by_stage_by_pipeline[row["pipeline"]][row["stage"]][class_name] += count - - # remove stage_0 - class_counts_by_stage_by_pipeline = {k: v for k, v in class_counts_by_stage_by_pipeline.items() if k != "stage_0"} - - return plot_expected_actual_from_nested(class_counts_by_stage_by_pipeline, col_wrap=2, height=4, suptitle="Actual vs Reference by Stage • Stacked by Class") - - -def rank_pipeline_stage(group_df, metric_names, metric_weights): - weights = pd.Series(metric_weights, index=metric_names) - vals = ( - group_df.set_index("metric")["normalized"] - .reindex(metric_names) # align order - .astype(float) - ) - return float((vals * weights).sum()/len(vals)) - -def rank_metrics_apply(df, metric_names, metric_weights): - dff = df[df["metric"].isin(metric_names)] - return ( - dff.groupby(["pipeline", "stage"]) - .apply(lambda g: rank_pipeline_stage(g, metric_names, metric_weights)) - .rename("score") - .reset_index() - ) - - -def rank_metrics( - df: pd.DataFrame, - metric_names: List[str], - metric_weights: List[float], - *, - agg: str = "mean", - fill_missing: Optional[float] = 0.0, - score_col: str = "score", -) -> pd.DataFrame: - """ - Compute a weighted score per (pipeline, stage) using normalized metric values. - - Parameters - ---------- - df : DataFrame - Must include columns: pipeline, stage, metric, normalized - (other columns are ignored). - metric_names : list of str - Names of metrics to include, in the same order as their weights.
- metric_weights : list of float - Weights aligned to metric_names. - agg : {"mean","sum","max","min"}, default "mean" - If there are duplicate rows per (pipeline, stage, metric), how to aggregate. - fill_missing : float or None, default 0.0 - Value to fill when a metric is missing for a (pipeline, stage). - Use None to leave as NaN (then the final score may be NaN). - score_col : str, default "score" - Name of the output score column. - - Returns - ------- - DataFrame with columns: pipeline, stage, - """ - if len(metric_names) != len(metric_weights): - raise ValueError("metric_names and metric_weights must have the same length") - - # Keep only what we need - dff = df.loc[df["metric"].isin(metric_names), ["pipeline", "stage", "metric", "normalized"]] - - # Aggregate duplicates per (pipeline, stage, metric) - agg_map = {"mean": "mean", "sum": "sum", "max": "max", "min": "min"} - if agg not in agg_map: - raise ValueError(f'agg must be one of {list(agg_map)}') - pivot = dff.pivot_table( - index=["pipeline", "stage"], - columns="metric", - values="normalized", - aggfunc=agg_map[agg], - ) - - # Enforce column order and align with weights - pivot = pivot.reindex(columns=metric_names) - if fill_missing is not None: - pivot = pivot.fillna(fill_missing) - - weights = pd.Series(metric_weights, index=metric_names) - scores = pivot.dot(weights).rename(score_col) - - return scores.reset_index() - -def get_reference_value(df, metric_name, stage): - df = df[df["metric"] == metric_name] - df = df[df["stage"] == stage] - df = df[df["pipeline"] == "reference"] - # print(df.to_string()) - value = df["value"].values[0] - nvalue = df["normalized"].values[0] - details = json.loads(df["details"].values[0]) - return value, nvalue, details - - -def get_reference_class_counts(df) -> Dict[str, Dict[str, int]]: - df = df[df["pipeline"] == "reference"] - reference_stage_class_count: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int)) - df = df[df["metric"] == "class_occurrence"] 
- for stage in df["stage"].unique(): - df_stage = df[df["stage"] == stage] - details = json.loads(df_stage["details"].values[0]) - class_counts = details["classes"] - for class_name, count in class_counts.items(): - reference_stage_class_count[stage][class_name.split("/")[-1]] += count - - return reference_stage_class_count - -# def subplot_source_entity_integration(df): -# pass - -from collections import defaultdict - -def plot_class_occurence_new(df, reference_stage_class_count, classes): - - df = df[df["metric"] == "class_occurrence"] - - - pipeline_stage_class_count = defaultdict(lambda: defaultdict(lambda: defaultdict(int))) - - rows = [] - - # for each pipeline and stage - for pipeline in df["pipeline"].unique(): - for stage in df["stage"].unique(): - df_pipeline_stage = df[df["pipeline"] == pipeline] - df_pipeline_stage = df_pipeline_stage[df_pipeline_stage["stage"] == stage] - details = json.loads(df_pipeline_stage["details"].values[0]) - class_counts = details["classes"] - for class_name, count in class_counts.items(): - if class_name not in classes: - class_name = "Other" - pipeline_stage_class_count[pipeline][stage][class_name] += count - - # convert dict of dict to rows - for pipeline, stage_class_count in pipeline_stage_class_count.items(): - for stage, class_count in stage_class_count.items(): - for class_name, count in class_count.items(): - rows.append({"pipeline": pipeline, "stage": stage, "class": class_name.split("/")[-1], "count": count}) - - # df: pipeline, stage, class, count - df = pd.DataFrame(rows) - df = df[df["class"] != "Other"] - - classes_short = [class_name.split("/")[-1] for class_name in classes] - - sns.set(style="whitegrid") - - stage_order = list(dict.fromkeys(df["stage"])) - g = sns.FacetGrid( - df, - col="class", - col_wrap=3, - height=4, - aspect=1.5, - sharey=False, - col_order=classes_short #+["Other"], # preserve requested order - ) - g.map_dataframe( - sns.barplot, - x="stage", - y="count", - hue="pipeline", - 
hue_order=HUE_ORDER, - palette=PALETTE, - order=stage_order, - dodge=True, - errorbar=None - ) - - - for ax_idx, ax in enumerate(g.axes.flat): - class_idx = ax_idx - class_name = classes_short[class_idx] - - # remove x axis label - ax.set_xlabel("") - - for stage, class_counts in reference_stage_class_count.items(): - xpos = int(stage.split("_")[1]) - if stage == "stage_0": - ax.axhline(class_counts[class_name], ls="--", color="red") - else: - ax.axhline(class_counts[class_name], ls="--", color="black") - - - # g.add_legend() - - if g.legend is not None: - g.legend.remove() - - # build a combined legend below everything - handles, labels = g.axes[0].get_legend_handles_labels() - g.fig.legend( - handles, labels, - loc="lower center", - ncol=min(6, len(labels)), # split across columns - bbox_to_anchor=(0.5, -0.02) # push below the grid - ) - - # make space at bottom so legend isn’t cut off - g.fig.subplots_adjust(bottom=0.2) - - plt.subplots_adjust(top=0.88) - - # g.savefig("class_occurence_new.png") - - return g - - -def plot_class_occ_4_bar_chart(df): - metrics = ["class_occurrence"] - stages = ["stage_1", "stage_2", "stage_3"] - all_reference_values = {} - for metric in metrics: - for stage in stages: - value, nvalue, details = get_reference_value(df, metric, stage) - all_reference_values[metric] = { - "value": value, - "nvalue": nvalue, - "details": details - } - - reference_stage_class_count = get_reference_class_counts(df) - - # remove seed and reference pipeline - df = df[df["pipeline"] != "seed"] - df = df[df["pipeline"] != "reference"] - - # subplot_source_entity_integration(df) - - classes = ["http://kg.org/ontology/Film", "http://kg.org/ontology/Person", "http://kg.org/ontology/Company"] - - - return plot_class_occurence_new(df, reference_stage_class_count, classes) diff --git a/experiments/moviekg/src/moviekg/paper/helpers/ranking.py b/experiments/moviekg/src/moviekg/paper/helpers/ranking.py deleted file mode 100644 index e181e15..0000000 ---
a/experiments/moviekg/src/moviekg/paper/helpers/ranking.py +++ /dev/null @@ -1,179 +0,0 @@ -import pandas as pd -from collections import defaultdict -from typing import Any, Mapping, List, Dict - -from moviekg.config import OUTPUT_ROOT -from moviekg.paper.helpers.getter import ( - pipeline_stage_metric_dict, pipeline_name, metric_name, metric_value, - TABLE_DISPLAY_NAMES, - normalize_metric, normalize_min_best, normalize_max_best, - sta_fact_count, sta_denisity, sta_duration, #memory_peak is not considered - ref_kg_p, ref_kg_r, ref_source_entity_f1, ref_source_entity_r, ref_source_entity_p, - sem_disjoint_domain, sem_incorrect_relation_direction, sem_incorrect_relation_range, sem_incorrect_relation_domain, sem_incorrect_datatype, sem_incorrect_datatype_format -) - -type pipeline_agg = Mapping[pipeline_name, float] - -def agg_metrics(psmd: pipeline_stage_metric_dict, metric_names: List[metric_name]) -> pipeline_agg: - values_by_pipeline: Dict[pipeline_name, List[metric_value]] = defaultdict[pipeline_name, List[metric_value]](lambda: []) - for pipeline, stage_dict in psmd.items(): - if pipeline in ["reference", "seed"]: - continue - for stage, metric_dict in stage_dict.items(): - if stage not in ["stage_3"]: # only stage 3 is considered - continue - for metric_name in metric_names: - if metric_name in metric_dict: - values_by_pipeline[pipeline].append(metric_dict[metric_name]) - else: - print(f"pipeline: {pipeline}") - print(f"stage: {stage}") - print(f"metric_names: {metric_names}") - print(f"metric_dict: {metric_dict}") - raise ValueError(f"Metric {metric_name} not found in metric_names") - - res: pipeline_agg = defaultdict[pipeline_name, float](lambda: 0.0) - - for pipeline, values in values_by_pipeline.items(): - filtered_values = [value for value in values if value >= 0] - res[pipeline] = sum(filtered_values) / len(filtered_values) - print(pipeline) - print(" |\t".join(metric_names)) - print(" |\t".join([ str(value) for value in values_by_pipeline[pipeline]])) - 
print("="+str(res[pipeline])) - print("--------------------------------") - - return res - -def _rank_and_save2csv(weights: dict, outfile_stem: str, psmd: pipeline_stage_metric_dict, round_digits: int = 3) -> None: - - # psmd = normalize_metric(psmd, sta_fact_count.__name__, ["stage_3"], normalize_max_best) - psmd = normalize_metric(psmd, sta_denisity.__name__, ["stage_3"], normalize_max_best) - psmd = normalize_metric(psmd, sta_fact_count.__name__, ["stage_3"], normalize_max_best) - sta_metric_names = [sta_denisity.__name__+"_norm", sta_fact_count.__name__+"_norm"] - sta_agg = agg_metrics(psmd, sta_metric_names) - - sem_metric_names = [ - sem_disjoint_domain.__name__, sem_incorrect_relation_direction.__name__, - sem_incorrect_relation_range.__name__, sem_incorrect_relation_domain.__name__, - sem_incorrect_datatype.__name__, sem_incorrect_datatype_format.__name__] - sem_agg = agg_metrics(psmd, sem_metric_names) - - ref_metric_names = [ref_kg_p.__name__, ref_source_entity_f1.__name__+"_avg", "ref_selected_task_metric_avg"] - ref_agg = agg_metrics(psmd, ref_metric_names) - - psmd = normalize_metric(psmd, sta_duration.__name__+"_sum", ["stage_3"], normalize_min_best) - eff_metric_names = [sta_duration.__name__+"_sum_norm"] - eff_agg = agg_metrics(psmd, eff_metric_names) - - import json - json.dump(psmd, open(OUTPUT_ROOT / f"paper/{outfile_stem}_psmd.json", "w"), indent=4) - - df_rows = [] - - for pipeline, value in sem_agg.items(): - df_rows.append( - { - "pipeline": pipeline, - "semantic": round(value, round_digits), - "reference": round(ref_agg[pipeline], round_digits), - "size": round(sta_agg[pipeline], round_digits), - "efficiency": round(eff_agg[pipeline], round_digits) - } - ) - - - df = pd.DataFrame(df_rows) - - cols = ["size", "semantic", "reference", "efficiency"] - # Ensure we only use known columns; fill missing weights with 0.0 - w = pd.Series(weights).reindex(cols, fill_value=0.0) - - # Compute combined score - df = df[["pipeline"] + cols].copy() - 
df["combined"] = (df[cols] * w).sum(axis=1).round(round_digits) - - print(df.to_string()) - - # Sort & save (keep default index=True to match original behavior) - out = df[["pipeline", "combined"]].sort_values(by="combined", ascending=False) - out.to_csv(OUTPUT_ROOT / f"paper/{outfile_stem}.csv", sep="\t") - -def _rank_and_save3csv(outfile_stem: str, psmd: pipeline_stage_metric_dict, round_digits: int = 3) -> pd.DataFrame: - - # psmd = normalize_metric(psmd, sta_fact_count.__name__, ["stage_3"], normalize_max_best) - # psmd = normalize_metric(psmd, sta_denisity.__name__, ["stage_3"], normalize_max_best) - # psmd = normalize_metric(psmd, sta_fact_count.__name__, ["stage_3"], normalize_max_best) - # sta_metric_names = [sta_denisity.__name__+"_norm", sta_fact_count.__name__+"_norm"] - # sta_agg = agg_metrics(psmd, sta_metric_names) - - sem_metric_names = [ - sem_disjoint_domain.__name__, sem_incorrect_relation_direction.__name__, - sem_incorrect_relation_range.__name__, sem_incorrect_relation_domain.__name__, - sem_incorrect_datatype.__name__, sem_incorrect_datatype_format.__name__] - sem_agg = agg_metrics(psmd, sem_metric_names) - - acc_metric_names = [ref_kg_p.__name__] # only final stage (3) - acc_agg = agg_metrics(psmd, acc_metric_names) - - cov_metric_names = [ref_source_typed_entity_r.__name__+"_avg"] # avg of all stages - cov_agg = agg_metrics(psmd, cov_metric_names) - - # psmd = normalize_metric(psmd, sta_duration.__name__+"_sum", ["stage_3"], normalize_min_best) - # eff_metric_names = [sta_duration.__name__+"_sum_norm"] - # eff_agg = agg_metrics(psmd, eff_metric_names) - - import json - json.dump(psmd, open(OUTPUT_ROOT / f"paper/{outfile_stem}_psmd.json", "w"), indent=4) - - df_rows = [] - - for pipeline, value in sem_agg.items(): - df_rows.append( - { - "pipeline": pipeline, - "semantic": round(value, round_digits), - "correctness": round(acc_agg[pipeline], round_digits), - "coverage": round(cov_agg[pipeline], round_digits), - # "size": 
round(sta_agg[pipeline], round_digits), - # "efficiency": round(eff_agg[pipeline], round_digits) - } - ) - - - df = pd.DataFrame(df_rows) - - return df - - # cols = ["semantic", "correctness", "coverage"] - # # Ensure we only use known columns; fill missing weights with 0.0 - # w = pd.Series(weights).reindex(cols, fill_value=0.0) - - # # Compute combined score - # df = df[["pipeline"] + cols].copy() - # df["combined"] = (df[cols] * w).sum(axis=1).round(round_digits) - - # print(df.to_string()) - - # # Sort & save (keep default index=True to match original behavior) - # out = df[["pipeline", "combined"]].sort_values(by="combined", ascending=False) - # out.to_csv(OUTPUT_ROOT / f"paper/{outfile_stem}.csv", sep="\t") - -# TODO cleanup -# def _rank_and_save(weights: dict, outfile_stem: str, df: pd.DataFrame, round_digits: int = 3) -> None: -# """ -# Compute weighted 'combined' score and save a TSV sorted by 'combined'. -# Uses the same behavior as your original functions (round to 3, keep default index in CSV). 
-# """ -# cols = ["size", "semantic", "reference", "efficiency"] -# # Ensure we only use known columns; fill missing weights with 0.0 -# w = pd.Series(weights).reindex(cols, fill_value=0.0) - -# # Compute combined score -# df = df[["pipeline"] + cols].copy() -# df["combined"] = (df[cols] * w).sum(axis=1).round(round_digits) - -# # Sort & save (keep default index=True to match original behavior) -# out = df[["pipeline", "combined"]].sort_values(by="combined", ascending=False) -# out.to_csv(OUTPUT_ROOT / f"paper/{outfile_stem}.csv", sep="\t") - diff --git a/experiments/moviekg/src/moviekg/paper/test_figtab.py b/experiments/moviekg/src/moviekg/paper/test_figtab.py deleted file mode 100644 index f2dca06..0000000 --- a/experiments/moviekg/src/moviekg/paper/test_figtab.py +++ /dev/null @@ -1,782 +0,0 @@ -import json -import pandas as pd -from pathlib import Path -from collections import defaultdict - -from moviekg.config import OUTPUT_DIR, DATASET_SELECT -from moviekg.paper.helpers.agggregate import agg_duration_over_stages_per_pipeline -from moviekg.paper.helpers.getter import get_pipeline_stage_metric_dict, TABLE_DISPLAY_NAMES, apply_selected_updates -from moviekg.paper.helpers.helpers import load_metrics_from_file, plot_growth, plot_class_occ_4_bar_chart -from moviekg.paper.helpers.ranking import _rank_and_save2csv -from moviekg.paper.config import ( - name_mapping, METRIC_NAME_MAP, SEM_METRIC_SHORT_NAMES, - METRIC_NAME_INDEX_PRETTY, METRIC_NAME_MAP_PRETTY, SEM_METRIC_LONG_NAMES -) - - -# === Preamble === -if not OUTPUT_DIR: - raise ValueError("OUTPUT_DIR is not set") -if not DATASET_SELECT: - raise ValueError("DATASET_SELECT is not set") - -OUTPUT_ROOT = Path(OUTPUT_DIR) / DATASET_SELECT -(OUTPUT_ROOT / "paper").mkdir(parents=True, exist_ok=True) - -PIPLEINE_NAME_MAP = { - "json_rdf_text": "JRT", - "json_text_rdf": "JTR", - "rdf_json_text": "RJT", - "rdf_text_json": "RTJ", - "text_json_rdf": "TJR", - "text_rdf_json": "TRJ", - "json_a": "J_A", - "json_b": "J_B", - 
"json_c": "J_C", - "json_llm_mapping_v1": "J_C", - "json_baseA": "J_baseA", - "rdf_a": "R_A", - "rdf_b": "R_B", - "rdf_c": "R_C", - "rdf_llm_schema_align_v1": "R_C", - "text_a": "T_A", - "text_b": "T_B", - "text_c": "T_C", - "text_llm_triple_extract_v1": "T_C", - } - -def map_pipeline_name_pretty(pipeline_name): - return PIPLEINE_NAME_MAP.get(pipeline_name, pipeline_name) - -# === Helper Functions === - -def map_pipeline_name(pipeline_name): - return name_mapping.get(pipeline_name, pipeline_name) - - -def map_metric_name(metric_name): - return METRIC_NAME_MAP.get(metric_name, metric_name) - - -def add_REI_precision(metric_df): - # REI_fscore = 2 * (precision * recall) / (precision + recall) - source_entity_coverage_metric_soft = metric_df[metric_df["metric"] == "SourceEntityCoverageMetricSoft"] - - additional_rows = [] - for index, row in source_entity_coverage_metric_soft.iterrows(): - details = json.loads(row["details"]) - #"{""expected_entities_count"": 2758, ""found_entities_count"": 3099, ""overlapping_entities_count"": 53}" - - expected_entities_count = details["expected_entities_count"] - #found_entities_count = details["found_entities_count"] - overlapping_entities_count = details["overlapping_entities_count"] - - tp = overlapping_entities_count if overlapping_entities_count <= expected_entities_count else expected_entities_count - fp = overlapping_entities_count - tp if overlapping_entities_count > tp else 0 - precision = tp / (tp + fp) - - additional_rows.append( - {"pipeline": row["pipeline"], - "stage": row["stage"], - "metric": "REI_precision", - "aspect": "reference", - "normalized": precision, - "value": precision, - "details": row["details"]}) - - additional_df = pd.DataFrame(additional_rows) - return pd.concat([metric_df, additional_df]) - -def extract_class_occurence_df(df): - - classes = ["http://kg.org/ontology/Film", "http://kg.org/ontology/Person", "http://kg.org/ontology/Company"] - - - pipeline_stage_class_count = defaultdict(lambda: 
defaultdict(lambda: defaultdict(int))) - - # for each pipeline and stage - for pipeline in df["pipeline"].unique(): - for stage in df["stage"].unique(): - df_pipeline_stage = df[df["pipeline"] == pipeline] - df_pipeline_stage = df_pipeline_stage[df_pipeline_stage["stage"] == stage] - try: - details = json.loads(df_pipeline_stage["details"].values[0]) - class_counts = details["classes"] - for class_name, count in class_counts.items(): - if class_name not in classes: - class_name = "Other" - pipeline_stage_class_count[pipeline][stage][class_name] += count - except: - print(f"Error loading details for {pipeline} {stage}") - # print(df_pipeline_stage["details"].values[0]) - - rows = [] - for pipeline, stage_class_count in pipeline_stage_class_count.items(): - for stage, class_count in stage_class_count.items(): - for class_name, count in class_count.items(): - rows.append({"pipeline": pipeline, "stage": stage, "metric": class_name.split("/")[-1], "score": count}) - - return pd.DataFrame(rows) - - - -def map_metric_name_pretty(metric_name): - return METRIC_NAME_MAP_PRETTY.get(metric_name, metric_name) # TODO: remove this - -def get_statistics_df(df): - # only pipeline, stage, metric, normalized - - class_occurence_df = df[df["metric"] == "class_occurrence"] - class_count_df = extract_class_occurence_df(class_occurence_df) - - # print(class_count_df) - - df = df[df["aspect"] == "statistical"] - metircs = ["entity_count", "relation_count", "triple_count", "class_count", "duration", "loose_entity_count", "shallow_entity_count"] - df = df[df["metric"].isin(metircs)] - - df = df[["pipeline", "stage", "metric", "value"]] - df["score"] = df["value"].round(2) - - # union df and class_count_df - df = pd.concat([df, class_count_df]) - df[["pipeline"]] = df[["pipeline"]].map(map_pipeline_name) - - # rename metric to short name - df["metric"] = df["metric"].map(map_metric_name_pretty) - - # make each metric a column - df = df.pivot(index=["pipeline", "stage"], columns="metric", 
values="score") - df = df.reset_index() - - - return df - -def get_semantic_df(df): - # only pipeline, stage, metric, normalized - df = df[df["aspect"] == "semantic"] - df = df[["pipeline", "stage", "metric", "normalized"]] - - metrics = list(SEM_METRIC_SHORT_NAMES.keys()) - df = df[df["metric"].isin(metrics)] - - df["score"] = df["normalized"].round(2) - - # rename metric to short name - df["metric"] = df["metric"].map(map_metric_name_pretty) - - # make each metric a column - df = df.pivot(index=["pipeline", "stage"], columns="metric", values="score") - df = df.reset_index() - - return df - -def get_reference_df(df): - # TODO metric names and selection - # only pipeline, stage, metric, normalized - df = df[df["aspect"] == "reference"] - df = add_REI_precision(df) - - df = df[["pipeline", "stage", "metric", "normalized"]] - - metrics = [ - "ReferenceTripleAlignmentMetricSoftEV", - "ReferenceTripleAlignmentMetricSoftE", - "ReferenceTripleAlignmentMetric", - # "ReferenceClassCoverageMetric", - "SourceEntityCoverageMetric", - "SourceEntityCoverageMetricSoft", - "REI_precision", - "TE_ExpectedEntityLinkMetric", - "TE_ExpectedRelationLinkMetric", - "ER_EntityMatchMetric", - "ER_RelationMatchMetric", - ] - - df = df[df["metric"].isin(metrics)] - - df["score"] = df["normalized"].round(2) - - # rename metric to short name - df["metric"] = df["metric"].map(map_metric_name_pretty) - - # make each metric a column - df = df.pivot(index=["pipeline", "stage"], columns="metric", values="score") - df = df.reset_index() - - - return df - -# === Tests === - -def test_wide_table_smoth(): - """ - Stores all metrics in a wide table format. 
- """ - - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - - # replace pipeline name with name_mapping - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name) - - - # statistics_df - statistics_df = get_statistics_df(metric_df) - # semantic_df - semantic_df = get_semantic_df(metric_df) - # reference_df - reference_df = get_reference_df(metric_df) - - # join all of them on pipeline and stage - df = pd.merge(statistics_df, semantic_df, on=["pipeline", "stage"], how="left") - df = pd.merge(df, reference_df, on=["pipeline", "stage"], how="left") - - # colum order - df = df[["pipeline", "stage"] + [v for k, v in METRIC_NAME_INDEX_PRETTY]] - # print(df) - - df.to_csv(OUTPUT_ROOT / "paper/test_wide_table_smoth.csv", sep="\t") - - -def test_table_with_statistic_metrics(): - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metrics = ["entity_count", "relation_count", "triple_count", "class_count", "loose_entity_count", "shallow_entity_count"] - - # filter for metrics - metric_df = metric_df[["pipeline", "stage", "metric", "value"]] - duration_df = agg_duration_over_stages_per_pipeline(metric_df) - duration_df = duration_df[["pipeline", "stage", "metric", "value"]] - metric_df = metric_df[metric_df["metric"].isin(metrics)] - - metric_df = pd.concat([metric_df, duration_df]) - - metric_df["metric"] = metric_df["metric"].map(map_metric_name) - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - # only stage = stage_3 - metric_df = metric_df[metric_df["stage"] == "stage_3"] - - # Assuming your dataframe is called df - pivot_df = metric_df.pivot_table( - index=["pipeline", "stage"], # rows - columns="metric", # pivoted column - values="value" # values to fill - ).reset_index() - - # (Optional) Flatten the column index if needed - pivot_df.columns.name = None # remove "metric" header - - # column selection and order Pipeline FC EC RC TC Time - pivot_df = pivot_df[["pipeline", "FC", "EC", "RC", 
"TC", "SEC", "Time"]] - # save as TSV - output_path = OUTPUT_ROOT / "paper/test_tab_2_statistic_metrics.csv" - pivot_df.to_csv(output_path, sep="\t") - - -def test_table_with_semantic_metrics(): - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - # replace pipeline name with name_mapping - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - # remove details colums - local_metric_df = metric_df.drop(columns=["details"]) - - # only stage = stage_1 and aspect = statistical - stage_3_df = local_metric_df[local_metric_df["stage"] == "stage_3"] - statistical_df = stage_3_df[stage_3_df["aspect"] == "semantic"] - # statistical_df["pipeline"] = statistical_df["pipeline"].map(map_pipeline_name_pretty) - - # print all available metric names - print(statistical_df["metric"].unique()) - - # rename metric to short name and remove metrics that are not in SEM_METRIC_SHORT_NAMES - statistical_df = statistical_df[statistical_df["metric"].isin(list(SEM_METRIC_SHORT_NAMES.keys()))] - statistical_df["metric"] = statistical_df["metric"].map(SEM_METRIC_SHORT_NAMES) - - # format normalized value to 2 decimal places - statistical_df["normalized"] = statistical_df["normalized"].round(3) - - # only stage = stage_3 - statistical_df = statistical_df[statistical_df["stage"] == "stage_3"] - - # make CSV with, x axis: pipeline, y axis: metric_name, cell: value - # Pivot the table: index=metric, columns=pipeline, values=value - pivot_df = statistical_df.pivot(index="metric", columns="pipeline", values="normalized") - # transpose the table - pivot_df = pivot_df.T - - # assume you have a dict SEM_METRIC_LONG_NAMES mapping short->long - long_name_row = {col: SEM_METRIC_LONG_NAMES.get(col, col) for col in pivot_df.columns} - pivot_df = pd.concat([pd.DataFrame([long_name_row], index=["metric_long_name"]), pivot_df]) - - # column selection and order pipeline 𝑂𝐷𝑇 𝑂𝐷 𝑂𝑅 𝑂𝑅𝐷 𝑂𝐿𝑇 𝑂𝐿𝐹 𝑂𝐴𝑣𝑔 - - output_path = OUTPUT_ROOT / "paper/test_tab_3_ssp_semantic_eval.csv" - 
pivot_df.to_csv(output_path, sep="\t") - -def test_table_with_matching_metrics(): - from moviekg.paper.helpers.getter import TABLE_DISPLAY_NAMES, get_pipeline_stage_metric_dict, ref_entity_matching_f1, ref_relation_matching_f1, ref_json_entity_matching_f1 - - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - metrics = [metric for metric in list(TABLE_DISPLAY_NAMES.keys()) if metric in [ref_entity_matching_f1.__name__, ref_relation_matching_f1.__name__, ref_json_entity_matching_f1.__name__]] - - metric_dict = get_pipeline_stage_metric_dict(metric_df, metrics) - - df_rows = [] - for pipeline, stage_dict in metric_dict.items(): - for stage, metric_dict in stage_dict.items(): - rdf_em_f1 = metric_dict.get(ref_entity_matching_f1.__name__, -1) - json_em_f1 = metric_dict.get(ref_json_entity_matching_f1.__name__, -1) - em_f1 = -1 - if rdf_em_f1 != -1: - em_f1 = rdf_em_f1 - elif json_em_f1 != -1: - em_f1 = json_em_f1 - - rdf_rm_f1 = metric_dict.get(ref_relation_matching_f1.__name__, -1) - json_el_r = -1 # metric_dict.get(ref.__name__, -1) - rm_f1 = -1 - if rdf_rm_f1 != -1: - rm_f1 = rdf_rm_f1 - elif json_el_r != -1: - rm_f1 = json_el_r - - df_rows.append({"pipeline": pipeline, "stage": stage, "EM_f1": em_f1, "RM_f1": rm_f1}) - - # remove -1 rows - df_rows = [row for row in df_rows if row["EM_f1"] != -1 and row["RM_f1"] != -1] - - df = pd.DataFrame(df_rows) - # df = df.pivot(index=["pipeline", "stage"], columns="metric", values="value") - # df = df.reset_index() - output_path = OUTPUT_ROOT / "paper/test_tab_4_matching_metrics.csv" - df.to_csv(output_path, sep="\t") - -def test_table_with_matching_metrics_pr(): - from moviekg.paper.helpers.getter import ( - TABLE_DISPLAY_NAMES, get_pipeline_stage_metric_dict, - ref_entity_matching_f1, ref_entity_matching_p, ref_entity_matching_r, - ref_relation_matching_f1, ref_relation_matching_p, ref_relation_matching_r, - 
ref_json_entity_matching_f1, ref_json_entity_matching_p, ref_json_entity_matching_r - ) - - - - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - metrics = [ - ref_entity_matching_p.__name__, ref_entity_matching_r.__name__, - ref_relation_matching_p.__name__, ref_relation_matching_r.__name__, - ref_json_entity_matching_p.__name__, ref_json_entity_matching_r.__name__ - ] - - psmd = get_pipeline_stage_metric_dict(metric_df, metrics) - - df_rows = [] - for pipeline, stage_dict in psmd.items(): - for stage, metric_dict in stage_dict.items(): - rdf_em_p = metric_dict.get(ref_entity_matching_p.__name__, -1) - rdf_em_r = metric_dict.get(ref_entity_matching_r.__name__, -1) - json_em_p = metric_dict.get(ref_json_entity_matching_p.__name__, -1) - json_em_r = metric_dict.get(ref_json_entity_matching_r.__name__, -1) - em_p = -1 - em_r = -1 - if rdf_em_p != -1: - em_p = rdf_em_p - em_r = rdf_em_r - elif json_em_p != -1: - em_p = json_em_p - em_r = json_em_r - - # print(json.dumps(metric_dict, indent=4)) - # print("--------------------------------") - - rdf_rm_p = metric_dict.get(ref_relation_matching_p.__name__, -1) - rdf_rm_r = metric_dict.get(ref_relation_matching_r.__name__, -1) - json_rm_p = metric_dict.get(ref_relation_matching_p.__name__, -1) - json_rm_r = metric_dict.get(ref_relation_matching_r.__name__, -1) - - rm_p = -1 - rm_r = -1 - if rdf_rm_p != -1: - rm_p = rdf_rm_p - rm_r = rdf_rm_r - elif json_rm_p != -1: - rm_p = json_rm_p - rm_r = json_rm_r - - df_rows.append({"pipeline": pipeline, "stage": stage, "EM_p": em_p, "EM_r": em_r, "RM_p": rm_p, "RM_r": rm_r}) - - # remove -1 rows - df_rows = [row for row in df_rows if row["EM_p"] != -1 and row["EM_r"] != -1 and row["RM_p"] != -1 and row["RM_r"] != -1] - - df = pd.DataFrame(df_rows) - # df = df.pivot(index=["pipeline", "stage"], columns="metric", values="value") - # df = df.reset_index() - output_path = OUTPUT_ROOT / 
"paper/test_tab_4_matching_metrics_pr.csv" - df.to_csv(output_path, sep="\t") - -def test_table_with_linking_metrics(): - from moviekg.paper.helpers.getter import TABLE_DISPLAY_NAMES, get_pipeline_stage_metric_dict, ref_entity_linking_r, ref_json_entity_linking_r - - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - metrics = [metric for metric in list(TABLE_DISPLAY_NAMES.keys()) if metric in [ref_entity_linking_r.__name__, ref_json_entity_linking_r.__name__]] - - metric_dict = get_pipeline_stage_metric_dict(metric_df, metrics) - - df_rows = [] - for pipeline, stage_dict in metric_dict.items(): - for stage, metric_dict in stage_dict.items(): - rdf_el_r = metric_dict.get(ref_entity_linking_r.__name__, -1) - json_el_r = metric_dict.get(ref_json_entity_linking_r.__name__, -1) - el_r = -1 - if rdf_el_r != -1: - el_r = rdf_el_r - elif json_el_r != -1: - el_r = json_el_r - - df_rows.append({"pipeline": pipeline, "stage": stage, "EL_r": el_r}) - - # remove -1 rows - df_rows = [row for row in df_rows if row["EL_r"] != -1] - - df = pd.DataFrame(df_rows) - # df = df.pivot(index=["pipeline", "stage"], columns="metric", values="value") - # df = df.reset_index() - output_path = OUTPUT_ROOT / "paper/test_tab_5_linking_metrics.csv" - df.to_csv(output_path, sep="\t") - - -def test_table_6(): - """ - External KG R @inc (film) - EC (no Seed) REI @inc (film) - Pipeline | f1@1 f1@2 f1@3 p@3 | f1@1 f@2 f@3 - """ - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - from moviekg.paper.helpers.getter import ( - get_pipeline_stage_metric_dict, ref_kg_f1, ref_kg_p, ref_kg_r, ref_source_entity_f1, ref_source_entity_p, ref_source_entity_r, ref_source_typed_entity_r, ref_source_typed_entity_p - ) - - metrics = [ - ref_kg_f1.__name__, ref_kg_p.__name__, ref_kg_r.__name__, ref_source_entity_f1.__name__, 
ref_source_entity_p.__name__, ref_source_entity_r.__name__, ref_source_typed_entity_r.__name__, ref_source_typed_entity_p.__name__ - ] - - psmd = get_pipeline_stage_metric_dict(metric_df, metrics) - # import json - # json.dump(psmd, open(OUTPUT_ROOT / "paper/test_tab_6_metrics.json", "w"), indent=4) - - rows = [] - - round_to = 3 - - for pipeline, stage_dict in psmd.items(): - if pipeline in ["reference", "seed"]: - continue - kg_p = [0, 0, 0] - kg_r = [0, 0, 0] - se_p = [0, 0, 0] - se_r = [0, 0, 0] - ste_p = [0, 0, 0] - ste_r = [0, 0, 0] - - for stage, metric_dict in stage_dict.items(): - kg_p[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_kg_p.__name__, -1), round_to) - kg_r[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_kg_r.__name__, -1), round_to) - se_p[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_source_entity_p.__name__, -1), round_to) - se_r[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_source_entity_r.__name__, -1), round_to) - ste_p[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_source_typed_entity_p.__name__, -1), round_to) - ste_r[int(stage.split("_")[1]) - 1] = round(metric_dict.get(ref_source_typed_entity_r.__name__, -1), round_to) - - rows.append({ - "pipeline": pipeline, - "kg_p@1": kg_p[0], "kg_r@1": kg_r[0], "kg_p@2": kg_p[1], "kg_r@2": kg_r[1], "kg_p@3": kg_p[2], "kg_r@3": kg_r[2], - "se_p@1": se_p[0], "se_r@1": se_r[0], "se_p@2": se_p[1], "se_r@2": se_r[1], "se_p@3": se_p[2], "se_r@3": se_r[2], - "ste_p@1": ste_p[0], "ste_r@1": ste_r[0], "ste_p@2": ste_p[1], "ste_r@2": ste_r[1], "ste_p@3": ste_p[2], "ste_r@3": ste_r[2]}) - - df = pd.DataFrame(rows) - output_path = OUTPUT_ROOT / "paper/test_tab_6_reference_alignment.csv" - df.to_csv(output_path, sep="\t") - -def test_table_with_reference_overlap_metrics(): - # "Pipeline Inc. 
P R F1 ∼P ∼F ∼F1" - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - - # replace pipeline name with name_mapping - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name) - - metric_names = ["ReferenceTripleAlignmentMetricSoftEV", "ReferenceTripleAlignmentMetricSoftE", "ReferenceTripleAlignmentMetric"] - names_map = { - "ReferenceTripleAlignmentMetricSoftEV": "soft_ev_", - "ReferenceTripleAlignmentMetricSoftE": "soft_e_", - "ReferenceTripleAlignmentMetric": "strict_", - } - - # filter for pipeline in pipeline_types - # global metric_df - # apply filter function - # only stage = stage_1 - metric_df = metric_df[metric_df["stage"] == "stage_3"] - metric_df = metric_df[metric_df["metric"].isin(metric_names)] - metric_df["metric"] = metric_df["metric"].map(names_map) - # order by stage and pipeline - # print(metric_df.pivot_table(index=["pipeline", "metric"], values="normalized", aggfunc="mean")) - - # extract precision, recall from details.json - metric_df["p"] = metric_df["details"].apply(lambda x: json.loads(x)["precision"] if "precision" in json.loads(x) else 0) - metric_df["r"] = metric_df["details"].apply(lambda x: json.loads(x)["recall"] if "recall" in json.loads(x) else 0) - # renmae value to f1 - metric_df["f1"] = metric_df["normalized"] - - # only pipline, metric, p, r, f1 - metric_df = metric_df[["pipeline", "metric", "p", "r", "f1"]] - - # result - df_wide = metric_df.pivot( - index="pipeline", - columns="metric", - values=["p", "r", "f1"] - ) - - # flatten MultiIndex columns - df_wide.columns = [f"{m if m!='' else ''}{k}" for k, m in df_wide.columns] - df_wide = df_wide.reset_index() - - # sort columns by name - df_wide = df_wide[sorted(df_wide.columns)] - # normalize values to 2 decimal places for all coluns except pipeline - df_wide.iloc[:, 1:] = df_wide.iloc[:, 1:].round(2) - - output_path = OUTPUT_ROOT / "paper/test_reference_alignment" - df_wide.to_csv(output_path, sep="\t") - -def test_figure_with_kg_growth(): - 
metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - # remove reference stage_0 - metric_df["pipeline"] = metric_df["pipeline"].replace("json_b2", "json_b") - - metric_df = metric_df[metric_df["stage"] != "stage_0"] - - # filter for pipeline in pipeline_types - # global metric_df - # metric_df = filter_msp_and_reference(metric_df) - sorted_metric_df = metric_df.sort_values(by=["stage", "pipeline"]) - g = plot_growth(sorted_metric_df, metrics=["entity_count", "triple_count"], kind="bar") - g.fig.subplots_adjust(wspace=0.1) - # save as png - g.savefig(OUTPUT_ROOT / "paper/test_fig_both_growth.png") - - -def test_figure_with_entity_class_occurence(): - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - g = plot_class_occ_4_bar_chart(metric_df) - g.savefig(OUTPUT_ROOT / "paper/test_fig_msp_type_reference.png") - - -# Preset weight configs (kept exactly as used in your original code) -PRESETS = { - "equal": { - "size": 0.25, "semantic": 0.25, "reference": 0.25, "efficiency": 0.25 - }, - # Quantity-focused (your code used 0.5, 0.1, 0.1, 0.3) - "quantity_focused": { - "size": 0.5, "semantic": 0.1, "reference": 0.1, "efficiency": 0.3 - }, - # Quality-focused (your code used 0.0, 0.5, 0.5, 0.0) - "quality_focused": { - "size": 0.0, "semantic": 0.5, "reference": 0.5, "efficiency": 0.0 - }, - # Reference-alignment focused (your code used 0.0, 0.2, 0.8, 0.0) - "reference_alignment_focused": { - "size": 0.0, "semantic": 0.2, "reference": 0.8, "efficiency": 0.0 - }, - "efficiency_oriented": { - "size": 0.2, "semantic": 0.2, "reference": 0.2, "efficiency": 0.4 - }, -} - -psmd_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") -psmd = get_pipeline_stage_metric_dict(psmd_df, TABLE_DISPLAY_NAMES.keys()) -psmd = apply_selected_updates(psmd) - -# TODO cleanup -# norm_df, agg_df = aggregate_ranking_df() -# def test_rank_save_norm_df(): -# norm_df["normalized"] = norm_df["normalized"].round(2) -# # to format pipeline, metric_name1... 
metric_nameN, normalized -# wide = norm_df.pivot(index="pipeline", columns="metric", values="normalized").reset_index() -# wide.to_csv(OUTPUT_ROOT / "paper/test_rank_norm_df.csv", sep="\t") - -# c1 size, c2 sem, c3 ref, c4 eff -def test_rank_equal(): - # _rank_and_save(PRESETS["equal"], "test_rank_equal", agg_df) - _rank_and_save2csv(PRESETS["equal"], "test_rank_equal", psmd) - -def test_rank_quantity_focused(): - #_rank_and_save(PRESETS["quantity_focused"], "test_rank_quantity_focused", agg_df) - _rank_and_save2csv(PRESETS["quantity_focused"], "test_rank_quantity_focused", psmd) - -def test_rank_quality_focused(): - #_rank_and_save(PRESETS["quality_focused"], "test_rank_quality_focused", agg_df) - _rank_and_save2csv(PRESETS["quality_focused"], "test_rank_quality_focused", psmd) - -def test_rank_reference_alignment_focused(): - #_rank_and_save(PRESETS["reference_alignment_focused"], "test_rank_reference_alignment_focused", agg_df) - _rank_and_save2csv(PRESETS["reference_alignment_focused"], "test_rank_reference_alignment_focused", psmd) - -def test_rank_efficiency_oriented(): - #_rank_and_save(PRESETS["efficiency_oriented"], "test_rank_efficiency_oriented", agg_df) - _rank_and_save2csv(PRESETS["efficiency_oriented"], "test_rank_efficiency_oriented", psmd) - -def test_full_ranking_table(): - """ - for each rank table read it and then concatenate them into one table joining on the index - for example: - test_rank_equal.csv: - pipeline combined - 0 json_rdf_text 0.855084 - 1 json_text_rdf 0.867719 - 2 rdf_json_text 0.855081 - 3 rdf_text_json 0.867721 - 4 text_json_rdf 0.864522 - 5 text_rdf_json 0.864522 - test_rank_quantity_focused.csv: - pipeline combined - 0 rdf_json_text 0.950847 - 1 text_rdf_json 0.940847 - 2 json_text_rdf 0.93847 - 3 rdf_text_json 0.920847 - 4 json_rdf_text 0.910847 - 5 text_json_rdf 0.900847 - - the result should be: - pipeline combined - 0 json_rdf_text 0.855084 rdf_json_text_0.950847 - 1 json_text_rdf 0.867719 text_rdf_json_0.940847 - 2 
rdf_json_text 0.855081 json_text_rdf_0.93847 - 3 rdf_text_json 0.867721 rdf_text_json_0.920847 - 4 text_json_rdf 0.864522 json_rdf_text_0.910847 - 5 text_rdf_json 0.864522 text_json_rdf_0.900847 - - rename the "combined" column for each to the name of the file - """ - - ranking_files = [ - "test_rank_equal.csv", - "test_rank_quantity_focused.csv", - "test_rank_quality_focused.csv", - "test_rank_reference_alignment_focused.csv", - "test_rank_efficiency_oriented.csv" - ] - - - ranking_files = [OUTPUT_ROOT / "paper" / file for file in ranking_files] - - # Base frame with fixed ranks 0..5 (top to bottom) - result = pd.DataFrame({"rank": range(15)}) - # result = pd.DataFrame() - - for file in ranking_files: - name = Path(file).stem # e.g., "test_rank_equal" - df = pd.read_csv(file, sep="\t") - # Ensure we have at least 6 rows; if more, keep top-6; if fewer, allow NaNs - # df = df.head(6).reset_index(drop=True) - - # pipeline name != reference and reset index - df = df[df["pipeline"] != "reference"] - df["pipeline"] = df["pipeline"].map(PIPLEINE_NAME_MAP) - df = df.reset_index(drop=True) - - - # Build two columns for this file: pipeline + score - sub = pd.DataFrame({ - "rank": df.index, - f"{name.split(".")[0]}_pipe": df["pipeline"], - f"{name.split(".")[0]}_score": df["combined"] - }) - - # Join on rank to keep rows aligned 0..5 - result = result.merge(sub, on="rank", how="left") - - # Make 'rank' the index if you prefer, or keep as a column - result = result.set_index("rank") - - result.to_csv(OUTPUT_ROOT / "paper/test_tab_7_full_ranking_table.csv", sep="\t") - -def test_new_ranking_table(): - """ - """ - from moviekg.paper.helpers.ranking import _rank_and_save3csv - df =_rank_and_save3csv("test_rank_new", psmd) - df["pipeline"] = df["pipeline"].map(PIPLEINE_NAME_MAP) - df.to_csv(OUTPUT_ROOT / "paper/test_tab_8_new_ranking_table.csv", sep="\t") - # metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - # metric_df["pipeline"] = 
metric_df["pipeline"].map(map_pipeline_name_pretty) - # # metric_df = metric_df[metric_df["stage"] == "stage_3"] - # # metric_df = metric_df[metric_df["metric"].isin(TABLE_DISPLAY_NAMES.keys())] - # # metric_df = metric_df[metric_df["pipeline"] != "reference"] - # # metric_df = metric_df.reset_index(drop=True) - # # metric_df = metric_df.pivot(index="pipeline", columns="metric", values="normalized") - # # metric_df = metric_df.reset_index() - # metric_df.to_csv(OUTPUT_ROOT / "paper/test_tab_8_new_ranking_table.csv", sep="\t") - - -def test_new_quality_table(): - - metric_df = load_metrics_from_file(OUTPUT_ROOT / "all_metrics.csv") - metric_df["pipeline"] = metric_df["pipeline"].map(map_pipeline_name_pretty) - from moviekg.paper.helpers.getter import ( - get_pipeline_stage_metric_dict, - sta_entity_count, sta_fact_count, sta_type_count, sta_relation_count, sta_shallow_entity_count, sta_denisity, sta_duration, - ref_kg_f1, ref_kg_p, ref_kg_r, - ref_source_entity_f1, ref_source_entity_p, ref_source_entity_r, - ref_source_typed_entity_r, ref_source_typed_entity_p, - sem_disjoint_domain, sem_incorrect_relation_direction, sem_incorrect_relation_cardinality, sem_incorrect_relation_range, sem_incorrect_relation_domain, sem_incorrect_datatype, sem_incorrect_datatype_format, - ) - - metrics = [ - sta_entity_count.__name__, sta_fact_count.__name__, sta_type_count.__name__, sta_relation_count.__name__, sta_shallow_entity_count.__name__, sta_denisity.__name__, sta_duration.__name__, - ref_kg_f1.__name__, ref_kg_p.__name__, - ref_kg_r.__name__, ref_source_entity_f1.__name__, - ref_source_entity_p.__name__, ref_source_entity_r.__name__, - ref_source_typed_entity_r.__name__, ref_source_typed_entity_p.__name__, - sem_disjoint_domain.__name__, sem_incorrect_relation_direction.__name__, sem_incorrect_relation_cardinality.__name__, sem_incorrect_relation_range.__name__, sem_incorrect_relation_domain.__name__, sem_incorrect_datatype.__name__, sem_incorrect_datatype_format.__name__, - ] 
- - psmd = get_pipeline_stage_metric_dict(metric_df, metrics) - # import json - # json.dump(psmd, open(OUTPUT_ROOT / "paper/test_tab_6_metrics.json", "w"), indent=4) - - rows = [] - - round_to = 3 - - for pipeline, stage_dict in psmd.items(): - if pipeline in ["reference", "seed"]: - continue - - for stage, metric_dict in stage_dict.items(): - ec = round(metric_dict.get(sta_entity_count.__name__, -1), round_to) - kg_p = round(metric_dict.get(ref_kg_p.__name__, -1), round_to) - kg_r = round(metric_dict.get(ref_kg_r.__name__, -1), round_to) - se_p = round(metric_dict.get(ref_source_entity_p.__name__, -1), round_to) - se_r= round(metric_dict.get(ref_source_entity_r.__name__, -1), round_to) - ste_p = round(metric_dict.get(ref_source_typed_entity_p.__name__, -1), round_to) - ste_r = round(metric_dict.get(ref_source_typed_entity_r.__name__, -1), round_to) - o_dt = round(metric_dict.get(sem_disjoint_domain.__name__, -1), round_to) - o_d = round(metric_dict.get(sem_incorrect_relation_domain.__name__, -1), round_to) - o_r = round(metric_dict.get(sem_incorrect_relation_range.__name__, -1), round_to) - o_rd = round(metric_dict.get(sem_incorrect_relation_direction.__name__, -1), round_to) - o_lt = round(metric_dict.get(sem_incorrect_datatype.__name__, -1), round_to) - o_lf = round(metric_dict.get(sem_incorrect_datatype_format.__name__, -1), round_to) - - rows.append({ - "pipeline": pipeline, "stage": stage, - "EC": ec, - "kg_p": kg_p, "kg_r": kg_r, "se_p": se_p, "se_r": se_r, "ste_p": ste_p, "ste_r": ste_r, - "O_DT": o_dt, "O_D": o_d, "O_R": o_r, "O_RD": o_rd, "O_LT": o_lt, "O_LF": o_lf - }) - - df = pd.DataFrame(rows) - df.to_csv(OUTPUT_ROOT / "paper/test_tab_9_new_quality_table.csv", sep="\t") \ No newline at end of file diff --git a/experiments/moviekg/src/moviekg/paper/test_ranksens.py b/experiments/moviekg/src/moviekg/paper/test_ranksens.py deleted file mode 100644 index e53f284..0000000 --- a/experiments/moviekg/src/moviekg/paper/test_ranksens.py +++ /dev/null @@ -1,162 
+0,0 @@ -import re -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -from itertools import product - -# ========================= -# 1) Data -# ========================= -data = [ - ("T_C", 0.824, 0.367, 0.332), - ("R_A", 0.996, 0.993, 0.994), - ("TJR", 0.980, 0.980, 0.793), - ("RJT", 0.981, 0.967, 0.849), - ("TRJ", 0.980, 0.967, 0.808), - ("JRT", 0.982, 0.980, 0.838), - ("J_A", 0.938, 0.976, 0.988), - ("T_B", 0.893, 0.555, 0.580), - ("J_B", 0.968, 0.961, 0.788), - ("JTR", 0.981, 0.980, 0.806), - ("J_C", 0.993, 0.751, 0.851), - ("R_B", 0.993, 0.982, 0.962), - ("RTJ", 0.979, 0.967, 0.845), - ("R_C", 0.996, 0.984, 0.993), - ("T_A", 0.986, 0.526, 0.590), -] -df = pd.DataFrame(data, columns=["pipeline", "semantic", "correctness", "coverage"]) - -# ========================= -# 2) Define cohorts -# ========================= -# Single-source pipelines: "R_A", "J_B", "T_C", etc. -single_re = re.compile(r"^[RJT]_[A-Z]$") - -df["is_single"] = df["pipeline"].apply(lambda s: bool(single_re.match(s))) -df["source_type"] = df["pipeline"].apply(lambda s: s[0]) # 'R', 'J', 'T' - -single_df = df[df["is_single"]].copy() -multi_df = df[~df["is_single"]].copy() # e.g., "TJR", "RJT", ... 
- -# Cohort dict: RDF-only, JSON-only, TEXT-only, and Multi-source -cohorts = { - "RDF-only (R_*)": single_df[single_df["source_type"] == "R"].copy(), - "JSON-only (J_*)": single_df[single_df["source_type"] == "J"].copy(), - "Text-only (T_*)": single_df[single_df["source_type"] == "T"].copy(), - "Multi-source (no underscore)": multi_df.copy(), -} - -# ========================= -# 3) Weight grid on simplex -# ========================= -# Weights are (w_sem, w_cor, w_cov) with w_sum=1 and w_i>=0 -STEP = 0.05 # set to 0.1 for fewer points -vals = np.round(np.arange(0, 1 + 1e-9, STEP), 10) - -weights = [] -for w in product(vals, repeat=3): - if abs(sum(w) - 1.0) < 1e-9: - weights.append(w) -weights = np.array(weights) # (N, 3) -print(f"Weight grid: step={STEP}, N={len(weights)} points") - -# ========================= -# 4) Sensitivity computation -# ========================= -def sensitivity_summary(cohort_df: pd.DataFrame, weights: np.ndarray) -> pd.DataFrame: - """ - Returns per-pipeline: - - wins: how many weight points where it ranks #1 - - win_fraction - - avg_rank - - avg_score (mean across weights) - """ - if cohort_df.empty: - return pd.DataFrame() - - M = cohort_df[["semantic", "correctness", "coverage"]].to_numpy() # (m,3) - scores = weights @ M.T # (N,m) - - # winner counts - winner_idx = np.argmax(scores, axis=1) - winners = cohort_df["pipeline"].iloc[winner_idx].to_numpy() - win_counts = pd.Series(winners).value_counts().reindex(cohort_df["pipeline"]).fillna(0).astype(int) - - # rank matrix: rank 1 = best - order = scores.argsort(axis=1)[:, ::-1] - rank_matrix = np.empty_like(order) - for i in range(order.shape[0]): - rank_matrix[i, order[i]] = np.arange(1, M.shape[0] + 1) - - summary = pd.DataFrame({ - "wins": win_counts.values, - "win_fraction": (win_counts.values / len(weights)), - "avg_rank": rank_matrix.mean(axis=0), - "avg_score": scores.mean(axis=0), - }, index=cohort_df["pipeline"].values) - - summary = summary.sort_values(["win_fraction", 
"avg_rank"], ascending=[False, True]) - return summary - -all_summaries = {name: sensitivity_summary(cdf, weights) for name, cdf in cohorts.items()} - -# Print summaries -for name, summ in all_summaries.items(): - print("\n" + "=" * 80) - print(name) - if summ.empty: - print("(empty cohort)") - else: - print(summ) - -# ========================= -# 5) Plots (VLDB-friendly) -# ========================= -# A) Win-fraction bars for each cohort -# for name, summ in all_summaries.items(): -# if summ.empty: -# continue -# plt.figure(figsize=(9, 3.8)) -# plt.bar(summ.index, summ["win_fraction"].values) -# plt.xticks(rotation=45, ha="right") -# plt.ylabel("Win fraction (#1 over weight grid)") -# plt.title(f"{name} β€” winner sensitivity (step={STEP})") -# plt.tight_layout() -# plt.show() - -# B) Average-rank bars for each cohort -# for name, summ in all_summaries.items(): -# if summ.empty: -# continue -# plt.figure(figsize=(9, 3.8)) -# plt.bar(summ.index, summ["avg_rank"].values) -# plt.xticks(rotation=45, ha="right") -# plt.ylabel("Average rank (lower is better)") -# plt.title(f"{name} β€” average rank over weight grid") -# plt.tight_layout() -# plt.show() - -# ========================= -# 6) Optional: a compact β€œpaper table” per cohort -# ========================= -paper_tables = {} -for name, summ in all_summaries.items(): - if summ.empty: - continue - paper_tables[name] = summ[["win_fraction", "avg_rank"]].copy() - -print("\n" + "=" * 80) -print("Compact paper tables (win_fraction, avg_rank):") -for name, t in paper_tables.items(): - print("\n---", name, "---") - print(t) - -# ========================= -# 7) Optional: export to CSV (uncomment if you want files) -# ========================= -# for name, summ in all_summaries.items(): -# if summ.empty: -# continue -# safe_name = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_") -# summ.to_csv(f"sensitivity_{safe_name}.csv") -# print("Wrote CSV files.") \ No newline at end of file diff --git 
a/experiments/moviekg/src/moviekg/pipelines/helpers.py b/experiments/moviekg/src/moviekg/pipelines/helpers.py index cfc4e92..d47c2e5 100644 --- a/experiments/moviekg/src/moviekg/pipelines/helpers.py +++ b/experiments/moviekg/src/moviekg/pipelines/helpers.py @@ -7,7 +7,7 @@ from kgpipe.generation.loaders import build_from_conf from kgpipe.datasets.multipart_multisource import Dataset -from moviekg.datasets.pipe_out import PipeOut, StageOut +from kgpipe.io.pipe_out import PipeOut, StageOut from moviekg.config import dataset, catalog @@ -70,7 +70,12 @@ def run_helper( tmp_dir = stage_dir / "tmp" tmp_dir.mkdir(parents=True, exist_ok=True) - pipeline = build_from_conf(pipeline_name, pipeline_conf, target_data, tmp_dir.as_posix()) + pipeline = build_from_conf( + name=pipeline_name, + conf=pipeline_conf, + target_data=target_data, + data_dir=tmp_dir.as_posix(), + ) stage_dir.mkdir(parents=True, exist_ok=True) diff --git a/experiments/ontologies/scads-papers.owl.ttl b/experiments/ontologies/scads-papers.owl.ttl new file mode 100644 index 0000000..87c4280 --- /dev/null +++ b/experiments/ontologies/scads-papers.owl.ttl @@ -0,0 +1,221 @@ +@prefix : . +@prefix owl: . +@prefix rdfs: . + +######## +# Classes +######## + +:ScientificPaper a owl:Class . + +:ContentUnit a owl:Class . +:RhetoricalUnit a owl:Class ; rdfs:subClassOf :ContentUnit . +:ScientificContribution a owl:Class ; rdfs:subClassOf :ContentUnit . + +:ResearchProblem a owl:Class ; rdfs:subClassOf :ScientificContribution . +:ResearchQuestion a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Motivation a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Goal a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Hypothesis a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Claim a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Method a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Material a owl:Class ; rdfs:subClassOf :ScientificContribution . 
+:Dataset a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Experiment a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Model a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Observation a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Result a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Conclusion a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Limitation a owl:Class ; rdfs:subClassOf :ScientificContribution . +:FutureWork a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Evidence a owl:Class ; rdfs:subClassOf :ScientificContribution . +:RelatedWorkStatement a owl:Class ; rdfs:subClassOf :ScientificContribution . +:Concept a owl:Class . +:Variable a owl:Class . +:Metric a owl:Class . + +######## +# Paper -> content +######## + +:hasContentUnit a owl:ObjectProperty ; + rdfs:domain :ScientificPaper ; + rdfs:range :ContentUnit . + +:hasProblem a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :ResearchProblem . + +:hasResearchQuestion a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :ResearchQuestion . + +:hasMotivation a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Motivation . + +:hasGoal a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Goal . + +:hasHypothesis a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Hypothesis . + +:hasClaim a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Claim . + +:hasMethod a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Method . 
+ +:hasMaterial a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Material . + +:hasDataset a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Dataset . + +:hasExperiment a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Experiment . + +:hasModel a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Model . + +:hasObservation a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Observation . + +:hasResult a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Result . + +:hasConclusion a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Conclusion . + +:hasLimitation a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :Limitation . + +:hasFutureWork a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :FutureWork . + +:hasRelatedWorkStatement a owl:ObjectProperty ; + rdfs:subPropertyOf :hasContentUnit ; + rdfs:domain :ScientificPaper ; + rdfs:range :RelatedWorkStatement . + +######## +# Internal semantics +######## + +:addressesProblem a owl:ObjectProperty ; + rdfs:domain :Method ; + rdfs:range :ResearchProblem . + +:investigatesQuestion a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :ResearchQuestion . + +:testsHypothesis a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Hypothesis . + +:usesMethod a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Method . + +:usesMaterial a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Material . 
+ +:usesDataset a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Dataset . + +:studiesConcept a owl:ObjectProperty ; + rdfs:domain :ScientificContribution ; + rdfs:range :Concept . + +:hasVariable a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Variable . + +:usesMetric a owl:ObjectProperty ; + rdfs:domain :Result ; + rdfs:range :Metric . + +:producesObservation a owl:ObjectProperty ; + rdfs:domain :Experiment ; + rdfs:range :Observation . + +:supportsClaim a owl:ObjectProperty ; + rdfs:domain :Evidence ; + rdfs:range :Claim . + +:reportsEvidence a owl:ObjectProperty ; + rdfs:domain :Result ; + rdfs:range :Evidence . + +:derivedFromObservation a owl:ObjectProperty ; + rdfs:domain :Result ; + rdfs:range :Observation . + +:supports a owl:ObjectProperty ; + rdfs:domain :ScientificContribution ; + rdfs:range :ScientificContribution . + +:contradicts a owl:ObjectProperty ; + rdfs:domain :ScientificContribution ; + rdfs:range :ScientificContribution . + +:extends a owl:ObjectProperty ; + rdfs:domain :ScientificContribution ; + rdfs:range :ScientificContribution . + +:motivates a owl:ObjectProperty ; + rdfs:domain :Motivation ; + rdfs:range :Goal . + +:answers a owl:ObjectProperty ; + rdfs:domain :Conclusion ; + rdfs:range :ResearchQuestion . + +:basedOn a owl:ObjectProperty ; + rdfs:domain :Conclusion ; + rdfs:range :Result . + +:hasLimitationOn a owl:ObjectProperty ; + rdfs:domain :Limitation ; + rdfs:range :Method . + +######## +# Optional rhetorical typing +######## + +:IntroductionUnit a owl:Class ; rdfs:subClassOf :RhetoricalUnit . +:MethodsUnit a owl:Class ; rdfs:subClassOf :RhetoricalUnit . +:ResultsUnit a owl:Class ; rdfs:subClassOf :RhetoricalUnit . +:DiscussionUnit a owl:Class ; rdfs:subClassOf :RhetoricalUnit . 
diff --git a/experiments/ontologies/scads-papers.ttl b/experiments/ontologies/scads-papers.ttl new file mode 100644 index 0000000..e18623c --- /dev/null +++ b/experiments/ontologies/scads-papers.ttl @@ -0,0 +1,38 @@ +@prefix : . + +:paper1 a :ScientificPaper ; + :hasProblem :problem1 ; + :hasGoal :goal1 ; + :hasMethod :method1 ; + :hasExperiment :exp1 ; + :hasObservation :obs1 ; + :hasResult :result1 ; + :hasClaim :claim1 ; + :hasConclusion :concl1 . + +:problem1 a :ResearchProblem . +:goal1 a :Goal . +:method1 a :Method ; + :addressesProblem :problem1 . + +:exp1 a :Experiment ; + :usesMethod :method1 ; + :testsHypothesis :hyp1 ; + :producesObservation :obs1 . + +:hyp1 a :Hypothesis . +:obs1 a :Observation . + +:result1 a :Result ; + :derivedFromObservation :obs1 . + +:evidence1 a :Evidence ; + :supportsClaim :claim1 . + +:result1 :reportsEvidence :evidence1 . + +:claim1 a :Claim ; + :supports :goal1 . + +:concl1 a :Conclusion ; + :basedOn :result1 . diff --git a/experiments/ontologies/src/onto_chat.py b/experiments/ontologies/src/onto_chat.py new file mode 100644 index 0000000..a4c5b58 --- /dev/null +++ b/experiments/ontologies/src/onto_chat.py @@ -0,0 +1,404 @@ +"""Streamlit ontology chat prototype. + +Run: + uv run streamlit run experiments/ontologies/src/onto_chat.py +""" + +from __future__ import annotations + +from dataclasses import dataclass +import importlib +import os +import re +from textwrap import dedent + +import streamlit as st +import streamlit.components.v1 as components +from rdflib import Graph, RDF, RDFS, URIRef +from rdflib.namespace import OWL + + +EXAMPLE_OWL = dedent( + """\ + @prefix ex: . + @prefix rdf: . + @prefix rdfs: . + @prefix owl: . + + ex:Person a owl:Class . + ex:Company a owl:Class . + ex:Project a owl:Class . + + ex:worksFor a owl:ObjectProperty ; + rdfs:domain ex:Person ; + rdfs:range ex:Company . + + ex:worksOn a owl:ObjectProperty ; + rdfs:domain ex:Person ; + rdfs:range ex:Project . 
+ """ +) + +DEFAULT_OPENAI_MODEL = "gpt-4o-mini" +KNOWN_PREFIXES = { + "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#", + "rdfs": "http://www.w3.org/2000/01/rdf-schema#", + "owl": "http://www.w3.org/2002/07/owl#", + "xsd": "http://www.w3.org/2001/XMLSchema#", +} + + +@dataclass +class OntologySchema: + classes: list[str] + object_edges: list[tuple[str, str, str]] + datatype_edges: list[tuple[str, str, str]] + + +def short_name(uri: URIRef) -> str: + """Return a compact local name for URI nodes.""" + text = str(uri) + if "#" in text: + return text.rsplit("#", maxsplit=1)[-1] + if "/" in text: + return text.rstrip("/").rsplit("/", maxsplit=1)[-1] + return text + + +def parse_graph(raw_text: str, rdf_format: str) -> Graph: + """Parse ontology text into an RDF graph.""" + graph = Graph() + graph.parse(data=raw_text, format=rdf_format) + return graph + + +def extract_schema(graph: Graph) -> OntologySchema: + """Extract classes and property relations from graph.""" + classes: set[str] = set() + object_edges: list[tuple[str, str, str]] = [] + datatype_edges: list[tuple[str, str, str]] = [] + + for cls in graph.subjects(RDF.type, OWL.Class): + if isinstance(cls, URIRef): + classes.add(short_name(cls)) + for cls in graph.subjects(RDF.type, RDFS.Class): + if isinstance(cls, URIRef): + classes.add(short_name(cls)) + + for prop in graph.subjects(RDF.type, OWL.ObjectProperty): + if not isinstance(prop, URIRef): + continue + prop_name = short_name(prop) + domains = [d for d in graph.objects(prop, RDFS.domain) if isinstance(d, URIRef)] + ranges = [r for r in graph.objects(prop, RDFS.range) if isinstance(r, URIRef)] + for domain in domains or [URIRef("UnknownDomain")]: + for rng in ranges or [URIRef("UnknownRange")]: + src, dst = short_name(domain), short_name(rng) + classes.update([src, dst]) + object_edges.append((src, prop_name, dst)) + + for prop in graph.subjects(RDF.type, OWL.DatatypeProperty): + if not isinstance(prop, URIRef): + continue + prop_name = short_name(prop) 
+ domains = [d for d in graph.objects(prop, RDFS.domain) if isinstance(d, URIRef)] + ranges = [r for r in graph.objects(prop, RDFS.range) if isinstance(r, URIRef)] + for domain in domains or [URIRef("UnknownDomain")]: + for rng in ranges or [URIRef("Literal")]: + src, dst = short_name(domain), short_name(rng) + classes.add(src) + datatype_edges.append((src, prop_name, dst)) + + return OntologySchema( + classes=sorted(classes), + object_edges=object_edges, + datatype_edges=datatype_edges, + ) + + +def to_mermaid(schema: OntologySchema) -> str: + """Serialize ontology schema as Mermaid classDiagram.""" + lines = ["classDiagram"] + for cls_name in schema.classes: + lines.append(f" class {cls_name}") + for src, rel, dst in schema.object_edges: + lines.append(f" {src} --> {dst} : {rel}") + for src, rel, dst in schema.datatype_edges: + lines.append(f" {src} : {rel} -> {dst}") + return "\n".join(lines) + + +def render_mermaid(mermaid_text: str) -> None: + """Render Mermaid diagram in Streamlit via embedded HTML.""" + escaped = ( + mermaid_text.replace("&", "&amp;") + .replace("<", "&lt;") + .replace(">", "&gt;") + ) + html = f""" +
{escaped}
+ + + """ + components.html(html, height=500, scrolling=True) + + +def draft_llm_prompt(user_request: str, ontology_text: str, rdf_format: str) -> str: + """Build a prompt for a future LLM integration.""" + return dedent( + f"""\ + You are editing an OWL ontology. + + Task: + {user_request} + + Requirements: + - Return only ontology text in {rdf_format} format. + - Preserve existing prefixes when possible. + - Declare all prefixes you use (especially xsd when using xsd:* datatypes). + - Keep edits minimal and valid. + - Do not include markdown fences. + + Current ontology: + {ontology_text} + """ + ) + + +def strip_markdown_fences(text: str) -> str: + """Remove markdown code fences if model returns them.""" + cleaned = text.strip() + if cleaned.startswith("```") and cleaned.endswith("```"): + lines = cleaned.splitlines() + if len(lines) >= 2: + return "\n".join(lines[1:-1]).strip() + return cleaned + + +def extract_declared_prefixes(text: str) -> set[str]: + """Extract declared prefixes from Turtle/N3 text.""" + return set(re.findall(r"@prefix\s+([A-Za-z][\w\-]*)\s*:", text)) + + +def extract_used_prefixes(text: str) -> set[str]: + """Extract prefixed terms used in Turtle/N3 text.""" + matches = re.findall(r"(? tuple[str, list[str]]: + """Inject known prefix declarations when terms use undeclared prefixes.""" + declared = extract_declared_prefixes(text) + used = extract_used_prefixes(text) + missing = sorted((used - declared) & set(KNOWN_PREFIXES)) + if not missing: + return text, [] + + injections = [f"@prefix {p}: <{KNOWN_PREFIXES[p]}> ." 
for p in missing] + updated = "\n".join(injections) + "\n" + text.lstrip() + return updated, missing + + +def validate_and_normalize_ontology(raw_text: str, rdf_format: str) -> tuple[str, list[str]]: + """Normalize and validate returned ontology text.""" + normalized = raw_text.strip() + added_prefixes: list[str] = [] + if rdf_format in {"turtle", "n3"}: + normalized, added_prefixes = inject_missing_known_prefixes(normalized) + parse_graph(normalized, rdf_format) + return normalized, added_prefixes + + +def request_ontology_edit(prompt: str, model: str) -> str: + """Call OpenAI and return ontology text.""" + api_key = os.getenv("OPENAI_API_KEY") + if not api_key: + raise RuntimeError("OPENAI_API_KEY is not set.") + + try: + openai_module = importlib.import_module("openai") + openai_client = getattr(openai_module, "OpenAI") + except Exception as exc: # noqa: BLE001 + raise RuntimeError( + "The 'openai' package is required. Install it with: uv add openai" + ) from exc + + client = openai_client(api_key=api_key) + response = client.chat.completions.create( + model=model, + temperature=0, + messages=[ + { + "role": "system", + "content": ( + "You edit OWL ontologies. Return only ontology text in the requested " + "serialization format. Do not add markdown." + ), + }, + {"role": "user", "content": prompt}, + ], + ) + content = response.choices[0].message.content or "" + if not content.strip(): + raise RuntimeError("OpenAI returned an empty response.") + return strip_markdown_fences(content) + + +def request_ontology_syntax_fix( + ontology_text: str, + rdf_format: str, + parse_error: Exception, + model: str, +) -> str: + """Ask OpenAI for a syntax-only repair of ontology text.""" + prompt = dedent( + f"""\ + Fix the syntax of this ontology serialization. + + Requirements: + - Return only ontology text in {rdf_format}. + - Preserve meaning; only fix syntax/prefix issues. + - Ensure all used prefixes are declared. + - Do not include markdown fences. 
+ + Parser error: + {parse_error} + + Ontology text: + {ontology_text} + """ + ) + return request_ontology_edit(prompt=prompt, model=model) + + +def init_state() -> None: + """Initialize app session state keys.""" + st.session_state.setdefault("ontology_text", EXAMPLE_OWL) + st.session_state.setdefault("rdf_format", "turtle") + st.session_state.setdefault("messages", []) + st.session_state.setdefault("last_llm_prompt", "") + st.session_state.setdefault("last_llm_response", "") + st.session_state.setdefault("last_normalized_response", "") + st.session_state.setdefault("openai_model", DEFAULT_OPENAI_MODEL) + + +def main() -> None: + st.set_page_config(page_title="Ontology Chat Draft", layout="wide") + st.title("Ontology Chat + Mermaid (Draft)") + st.caption("Prototype UI for OWL editing with chat-driven change requests.") + + init_state() + + left_col, right_col = st.columns([1, 1], gap="large") + + with left_col: + st.subheader("Ontology Text") + st.session_state.rdf_format = st.selectbox( + "RDF format", + options=["turtle", "xml", "nt", "n3"], + index=["turtle", "xml", "nt", "n3"].index(st.session_state.rdf_format), + ) + st.session_state.openai_model = st.text_input( + "OpenAI model", + value=st.session_state.openai_model, + help="Requires OPENAI_API_KEY in environment.", + ) + st.session_state.ontology_text = st.text_area( + "Edit ontology", + value=st.session_state.ontology_text, + height=340, + ) + + st.subheader("Chat") + for msg in st.session_state.messages: + with st.chat_message(msg["role"]): + st.markdown(msg["content"]) + + user_request = st.chat_input("Describe ontology change...") + if user_request: + st.session_state.messages.append({"role": "user", "content": user_request}) + prompt = draft_llm_prompt( + user_request=user_request, + ontology_text=st.session_state.ontology_text, + rdf_format=st.session_state.rdf_format, + ) + st.session_state.last_llm_prompt = prompt + try: + with st.spinner("Requesting ontology update from OpenAI..."): + model_name = 
st.session_state.openai_model.strip() or DEFAULT_OPENAI_MODEL + edited_ontology = request_ontology_edit( + prompt=prompt, + model=model_name, + ) + st.session_state.last_llm_response = edited_ontology + try: + normalized_ontology, added_prefixes = validate_and_normalize_ontology( + edited_ontology, st.session_state.rdf_format + ) + except Exception as parse_exc: # noqa: BLE001 + with st.spinner("Attempting syntax repair..."): + repaired = request_ontology_syntax_fix( + ontology_text=edited_ontology, + rdf_format=st.session_state.rdf_format, + parse_error=parse_exc, + model=model_name, + ) + st.session_state.last_llm_response = repaired + normalized_ontology, added_prefixes = validate_and_normalize_ontology( + repaired, st.session_state.rdf_format + ) + + st.session_state.ontology_text = normalized_ontology + st.session_state.last_normalized_response = normalized_ontology + prefix_note = "" + if added_prefixes: + prefix_note = f" Added missing prefixes: {', '.join(added_prefixes)}." + st.session_state.messages.append( + { + "role": "assistant", + "content": ( + "Applied OpenAI ontology update and refreshed Mermaid diagram." 
+ f"{prefix_note}" + ), + } + ) + except Exception as exc: # noqa: BLE001 + st.session_state.messages.append( + { + "role": "assistant", + "content": ( + "OpenAI request failed after validation/repair attempts: " + f"{exc}" + ), + } + ) + st.rerun() + + with st.expander("Last drafted LLM prompt", expanded=False): + st.code(st.session_state.last_llm_prompt or "No prompt drafted yet.", language="text") + with st.expander("Last OpenAI response", expanded=False): + st.code(st.session_state.last_llm_response or "No model response yet.", language="text") + with st.expander("Last normalized ontology", expanded=False): + st.code( + st.session_state.last_normalized_response or "No normalized ontology yet.", + language="text", + ) + + with right_col: + st.subheader("Mermaid Render") + try: + graph = parse_graph(st.session_state.ontology_text, st.session_state.rdf_format) + schema = extract_schema(graph) + mermaid = to_mermaid(schema) + render_mermaid(mermaid) + with st.expander("Mermaid source", expanded=False): + st.code(mermaid, language="text") + except Exception as exc: # noqa: BLE001 + st.error(f"Could not parse ontology: {exc}") + st.info("Check RDF format and ontology syntax in the left panel.") + + +if __name__ == "__main__": + main() diff --git a/experiments/ontologies/src/onto_diff.py b/experiments/ontologies/src/onto_diff.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/param-opti/.gitignore b/experiments/param-opti/.gitignore index 3d632f9..a5521ea 100644 --- a/experiments/param-opti/.gitignore +++ b/experiments/param-opti/.gitignore @@ -1,2 +1,7 @@ output/ -repos/ \ No newline at end of file +repos/ +output_qap_mock/ +testdata/ +tmp/ +data/ +data \ No newline at end of file diff --git a/experiments/param-opti/README.md b/experiments/param-opti/README.md index 8c62e8c..edc5def 100644 --- a/experiments/param-opti/README.md +++ b/experiments/param-opti/README.md @@ -2,6 +2,44 @@ This experiment extracts and analyzes configuration parameters 
from open-source data integration tools using the `kgpipe_parameters` extraction module. +## Paper mock experiments (Quality-Aware Pipelines) + +This directory also contains a **self-contained mock** of the experiments described in `Quality_Aware_Pipelines.pdf` (Section 6, “Experimental Evaluation”). + +- **What it is**: a small simulation of (a) a pipeline configuration space (implementations + thresholds), (b) a “true” end-to-end quality objective (accuracy/coverage/consistency aggregated), (c) an approximate quality estimator \( \hat{Q} \), and (d) search strategies (Default, Random Search, Quality-Aware Search). +- **What it is not**: it does **not** run KGpipe or reproduce the paper’s numbers. It’s meant as a scaffolding to iterate on the experimental protocol and factor out cleaner subpackages later. + +### Run the mock experiments + +From `experiments/param-opti`: + +```bash +python3 run_qap_mock.py all +python3 run_qap_mock.py exp1 # search effectiveness (Table-2-like) +python3 run_qap_mock.py exp2 # estimation reliability (corr/MAE/top-k) +python3 run_qap_mock.py exp3 # impl-only vs param-only vs joint +``` + +Outputs are written to `output_qap_mock/` (JSON). + +#### “Mock → real” execution mode + +The `qap_mock` package can now execute **real KGpipe tasks** (instead of purely simulated formulas) when dependencies are installed. + +- **Install dependencies** (from repo root): + +```bash +python3 -m pip install -e . +``` + +- **Enable docker-backed tasks** (PARIS, CoreNLP) for richer pipelines: + +```bash +export QAP_MOCK_USE_DOCKER=1 +``` + +Without `QAP_MOCK_USE_DOCKER=1`, `qap_mock` will use non-docker fallbacks where available (e.g., union-only RDF fusion and a lightweight pattern-based IE) so the experiment harness stays runnable. 
+ ## Directory Structure ``` @@ -14,6 +52,7 @@ param-opti/ │ └── repo.url ├── repos/ # Cloned repositories (auto-populated) ├── output/ # Extraction results (JSON) +├── output_qap_mock/ # Mock paper experiment results (JSON) ├── src/ │ └── param_opti/ # Experiment code └── run_experiment.py # Main entry point @@ -98,3 +137,13 @@ Results are saved as JSON files in `output/`: A `_summary.json` file is also generated with aggregate statistics. +# Configuration Aspects + +1. Task Assignment: Selecting +2. Task Tuning +3. + + +# Notes + +../../.venv/bin/pytest -s --show-capture=no src/qap/test_exec_pipelines.py -k "test_rdf_pipeline_from_saved_sampled_configs" \ No newline at end of file diff --git a/experiments/param-opti/input/Parameters.md b/experiments/param-opti/input/Parameters.md new file mode 100644 index 0000000..2dac3e7 --- /dev/null +++ b/experiments/param-opti/input/Parameters.md @@ -0,0 +1,17 @@ + +# Entity Matching +Algo +Cluster +Threshold + +# Ontology Matching +Algo +Cluster +Threshold + +# Entity Linking + +# Relation Linking + +# Fusion +Method \ No newline at end of file diff --git a/experiments/param-opti/input/am_light/parameters.properties b/experiments/param-opti/input/am_light/parameters.properties new file mode 100644 index 0000000..9a296d7 --- /dev/null +++ b/experiments/param-opti/input/am_light/parameters.properties @@ -0,0 +1,3 @@ +# manual file for parameters +similarity_threshold=0.7 +similarity_threshold_mapping=SIMILARITY_THRESHOLD diff --git a/experiments/param-opti/input/am_light/repo.url b/experiments/param-opti/input/am_light/repo.url new file mode 100644 index 0000000..7d29f7d --- /dev/null +++ b/experiments/param-opti/input/am_light/repo.url @@ -0,0 +1 @@ +https://github.com/AgreementMakerLight/AML-Project.git \ No newline at end of file diff --git a/experiments/param-opti/run_qap_mock.py b/experiments/param-opti/run_qap_mock.py new file mode 100644 index 0000000..162042f --- /dev/null +++ 
b/experiments/param-opti/run_qap_mock.py @@ -0,0 +1,26 @@ +#!/usr/bin/env python3 +""" +Quick runner for the "Quality Aware Knowledge Graph Pipeline Configurations" +paper mock experiments. + +Run from this directory: + python run_qap_mock.py exp1 + python run_qap_mock.py exp2 + python run_qap_mock.py exp3 + python run_qap_mock.py all +""" + +import sys +from pathlib import Path + +# Add local experiment src + project src to path +exp_src_path = Path(__file__).parent / "src" +repo_src_path = Path(__file__).resolve().parents[2] / "src" +sys.path.insert(0, str(exp_src_path)) +sys.path.insert(0, str(repo_src_path)) + +from qap_mock.__main__ import main # noqa: E402 + +if __name__ == "__main__": + raise SystemExit(main()) + diff --git a/experiments/param-opti/src/param_opti/pipeline_selection/__init__.py b/experiments/param-opti/src/param_opti/pipeline_selection/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/param-opti/src/param_opti/pipeline_selection/test_configuration.py b/experiments/param-opti/src/param_opti/pipeline_selection/test_configuration.py new file mode 100644 index 0000000..926605f --- /dev/null +++ b/experiments/param-opti/src/param_opti/pipeline_selection/test_configuration.py @@ -0,0 +1,26 @@ +from random import random, seed, sample +from typing import List + + + +def entity_matching_a() -> List[str]: + seed(42) + # select 5 positive values and 5 negative values + positive_values=["+A", "+B", "+C", "+D", "+E", "+F", "+G", "+H", "+I", "+J", "+K", "+L", "+M", "+N", "+O", "+P", "+Q", "+R", "+S", "+T", "+U", "+V", "+W", "+X", "+Y", "+Z"] + negative_values=["-A", "-B", "-C", "-D", "-E", "-F", "-G", "-H", "-I", "-J", "-K", "-L", "-M", "-N", "-O", "-P", "-Q", "-R", "-S", "-T", "-U", "-V", "-W", "-X", "-Y", "-Z"] + positive_values = sample(positive_values, 5) + negative_values = sample(negative_values, 5) + return positive_values + negative_values + +def schmea_matching_a(): pass + +def test_selecting_pipelines(): pass + + 
+def test_run(): + + values = entity_matching_a() + print(values) + values2 = entity_matching_a() + print(values2) + # print(values == values2) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/pipeline_util.py b/experiments/param-opti/src/param_opti/pipeline_util.py new file mode 100644 index 0000000..eb0acf3 --- /dev/null +++ b/experiments/param-opti/src/param_opti/pipeline_util.py @@ -0,0 +1,5 @@ + + + +# check current implementation state + diff --git a/experiments/param-opti/src/param_opti/search.py b/experiments/param-opti/src/param_opti/search.py new file mode 100644 index 0000000..82658f7 --- /dev/null +++ b/experiments/param-opti/src/param_opti/search.py @@ -0,0 +1,17 @@ +from typing import List + +def sample_random_valid(task_impls: List[str]): + pass + +class SearchSpace: + def __init__(self, task_impls: List[str]): + self.task_impls = task_impls + +class NeighborhoodSearch: + def __init__(self, search_space: SearchSpace): + self.search_space = search_space + + def search(self, budget: int): + pass + + diff --git a/experiments/param-opti/src/param_opti/tasks/__init__.py b/experiments/param-opti/src/param_opti/tasks/__init__.py new file mode 100644 index 0000000..bb421c4 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/__init__.py @@ -0,0 +1,4 @@ +from .paris import paris_entity_alignment_task, paris_graph_alignment_task +from .fusion import fusion_first_value_task, fusion_union_task + +__all__ = ["paris_entity_alignment_task", "paris_graph_alignment_task", "fusion_first_value_task", "fusion_union_task"] \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/agreementmaker.py b/experiments/param-opti/src/param_opti/tasks/agreementmaker.py new file mode 100644 index 0000000..716457b --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/agreementmaker.py @@ -0,0 +1,14 @@ +from kgpipe.common import Data, DataFormat, KgTask, Registry, TaskInput, TaskOutput, BasicTaskCategoryCatalog + 
+@Registry.task( + 
input_spec={"source": DataFormat.RDF, "target": DataFormat.RDF}, + output_spec={"output": DataFormat.AGREEMENTMAKER_RDF}, + description="Perform entity matching using AgreementMaker", + category=[BasicTaskCategoryCatalog.entity_matching] +) +def entity_matching_aggrement_maker(inputs: TaskInput, outputs: TaskOutput): + """Perform entity matching using AgreementMaker.""" + source_data = inputs["source"] + target_data = inputs["target"] + output_data = outputs["output"] + return output_data \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/base_linker.py b/experiments/param-opti/src/param_opti/tasks/base_linker.py new file mode 100644 index 0000000..aee7932 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/base_linker.py @@ -0,0 +1,45 @@ +from kgpipe.common import TaskInput, TaskOutput, Data, DataFormat, KgTask +from kgpipe.common.model.configuration import ConfigurationProfile, ConfigurationDefinition, Parameter, ParameterType + +def relation_linker_label_alias_embedding_transformer_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + Link relations using a base transformer model. 
+ """ + from param_opti.tasks.base_linker_lib import label_alias_embedding_rl + label_alias_embedding_rl(inputs, outputs, model_name=config.get_parameter_value("model_name"), threshold=config.get_parameter_value("similarity_threshold")) + +relation_linker_label_alias_embedding_transformer_task = KgTask( + name="relation_linker_label_alias_embedding_transformer", + function=relation_linker_label_alias_embedding_transformer_function, + input_spec={"source": DataFormat.TE_JSON, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.TE_JSON}, + config_spec=ConfigurationDefinition( + name="relation_linker_label_alias_embedding_transformer", + parameters=[ + Parameter(name="model_name", native_keys=["--model-name"], datatype=ParameterType.string, default_value="sentence-transformers/all-MiniLM-L6-v2", required=True, allowed_values=["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"]), + Parameter(name="similarity_threshold", native_keys=["--similarity-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) + +def entity_linker_label_alias_embedding_transformer_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + Link entities using a base transformer model. 
+ """ + from param_opti.tasks.base_linker_lib import label_alias_embedding_el + label_alias_embedding_el(inputs, outputs, model_name=config.get_parameter_value("model_name"), threshold=config.get_parameter_value("similarity_threshold")) + +entity_linker_label_alias_embedding_transformer_task = KgTask( + name="entity_linker_label_alias_embedding_transformer", + function=entity_linker_label_alias_embedding_transformer_function, + input_spec={"source": DataFormat.TE_JSON, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.TE_JSON}, + config_spec=ConfigurationDefinition( + name="entity_linker_label_alias_embedding_transformer", + parameters=[ + Parameter(name="model_name", native_keys=["--model-name"], datatype=ParameterType.string, default_value="sentence-transformers/all-MiniLM-L6-v2", required=True, allowed_values=["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"]), + Parameter(name="similarity_threshold", native_keys=["--similarity-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) + diff --git a/experiments/param-opti/src/param_opti/tasks/base_linker_lib.py b/experiments/param-opti/src/param_opti/tasks/base_linker_lib.py new file mode 100644 index 0000000..1b0b12d --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/base_linker_lib.py @@ -0,0 +1,216 @@ +import json +import os +from abc import ABC, abstractmethod +from typing import Dict, List + +import numpy as np +import torch +from kgcore.api.ontology import OntologyUtil, OwlProperty +from kgpipe.common import Data, DataFormat, Registry +from kgpipe_tasks.transform_interop.exchange.text_extraction import TE_Document, TE_Pair +from rdflib import Graph, RDFS +from sentence_transformers import SentenceTransformer, util +from tqdm import tqdm + +_models: Dict[str, SentenceTransformer] = {} + +class Embedder(ABC): + def 
__init__(self, embedder_name: str): + self.embedder_name = embedder_name + + @abstractmethod + def encode_as_dict(self, texts: List[str]) -> Dict[str, np.ndarray]: + pass + + @abstractmethod + def encode(self, texts: List[str]) -> np.ndarray: + pass + + +def get_model(model_name: str) -> SentenceTransformer: + if model_name not in _models: + model = SentenceTransformer(model_name) + if torch.cuda.is_available(): + model.to(torch.cuda.current_device()) + _models[model_name] = model + return _models[model_name] + +class SentenceTransformerEmbedder(Embedder): + def __init__(self, model_name: str): + super().__init__("sentence-transformer") + self.model_name = model_name + + def encode_as_dict(self, text_list: List[str]) -> Dict[str, np.ndarray]: + embeddings = self.encode(text_list) + return {text: embedding for text, embedding in zip(text_list, embeddings)} + + def encode(self, text_list: List[str]) -> np.ndarray: + embeddings = get_model(self.model_name).encode(text_list, show_progress_bar=False) + return embeddings + +class EntityMatch: + def __init__(self, entity: str, label: str, score: float): + self.entity = entity + self.label = label + self.score = score + + +def _validate_embedding_dimensions(query_embeddings: np.ndarray, target_embeddings: np.ndarray) -> None: + if query_embeddings.shape[1] != target_embeddings.shape[1]: + raise ValueError( + "Embedding dimension mismatch: " + f"{query_embeddings.shape[1]} vs {target_embeddings.shape[1]}" + ) + + +class AliasAndLabelBasedEntityLinker: + """ + Link extracted entity mentions to graph resources using label embeddings. 
+ """ + + def __init__(self, graph: Graph, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.0): + self.graph = graph + self.embedder = SentenceTransformerEmbedder(model_name=model_name) + self.threshold = float(threshold) + self.entity_uri_label_tuples = [ + (entity_uri, str(label)) + for entity_uri, _, label in self.graph.triples((None, RDFS.label, None)) + ] + entity_texts = [label for _, label in self.entity_uri_label_tuples] + self.entity_embeddings = self.embedder.encode(entity_texts) + + def link_entities(self, extracted_entities: List[str]) -> List[EntityMatch]: + if not extracted_entities: + return [] + + best_matches = [] + key_embeddings = self.embedder.encode(extracted_entities) + _validate_embedding_dimensions(key_embeddings, self.entity_embeddings) + similarities = util.cos_sim(key_embeddings, self.entity_embeddings) + + for i, entity in enumerate(extracted_entities): + best_idx = int(similarities[i].argmax()) + best_score = float(similarities[i][best_idx]) + if best_score < self.threshold: + continue + entity_uri, _ = self.entity_uri_label_tuples[best_idx] + best_matches.append(EntityMatch(entity, entity_uri, best_score)) + + return best_matches + + + +def label_alias_embedding_el(inputs: Dict[str, Data], outputs: Dict[str, Data], model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.5): + graph = Graph() + graph.parse(inputs["target"].path, format="nt") + linker = AliasAndLabelBasedEntityLinker(graph, model_name=model_name, threshold=threshold) + + if os.path.isdir(inputs["source"].path): + os.makedirs(outputs["output"].path, exist_ok=True) + for file in tqdm(os.listdir(inputs["source"].path), desc="Linking entities"): + te_doc_in = TE_Document(**json.load(open(os.path.join(inputs["source"].path, file)))) + entity_texts = list({triple.subject.surface_form for triple in te_doc_in.triples if triple.subject.surface_form}) + entity_texts += list({triple.object.surface_form for triple in te_doc_in.triples if triple.object.surface_form}) + 
entity_matches = linker.link_entities(entity_texts) + te_links = [TE_Pair(span=match.entity, mapping=match.label, link_type="entity", score=match.score) for match in entity_matches] + te_doc_out = te_doc_in.model_copy(deep=True) + te_doc_out.links += te_links + with open(os.path.join(outputs["output"].path, file), "w") as f: + f.write(te_doc_out.model_dump_json()) + else: + te_doc_in = TE_Document(**json.load(open(inputs["source"].path))) + entity_matches = linker.link_entities(list({triple.subject.surface_form for triple in te_doc_in.triples if triple.subject.surface_form})) + te_links = [TE_Pair(span=match.entity, mapping=match.label, link_type="entity", score=match.score) for match in entity_matches] + te_doc_out = te_doc_in.model_copy(deep=True) + te_doc_out.links += te_links + with open(outputs["output"].path, "w") as f: + f.write(te_doc_out.model_dump_json()) + + + +class RelationMatch: + def __init__(self, relation: str, predicate: OwlProperty, score: float): + self.relation = relation + self.predicate = predicate + self.score = score + + def __str__(self): + return f"RelationMatch(relation={self.relation}, predicate={self.predicate.uri}, score={self.score})" + + +def normalize(text): + return text.replace('_', ' ').replace('-', ' ').strip().lower() + +def build_property_text(prop: OwlProperty): + text_parts = [ + f"label: {normalize(prop.label)}", + f"altLabels: {', '.join(normalize(lbl) for lbl in prop.alias)}" + # f"domain: {normalize(prop.get('domain', ''))}", + # f"comment: {normalize(prop.get('comment', ''))}" + ] + return "; ".join(text_parts) + +class AliasAndTransformerBasedRelationLinker: + """ + Link extracted relation phrases to ontology predicates using label and alias embeddings. 
+ """ + + def __init__(self, ontology_file, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.0): + print(f"Init AliasAndTransformerBasedRelationLinker with ontology file: {ontology_file} and model name: {model_name} and threshold: {threshold}") + self.ontology = OntologyUtil.load_ontology_from_file(ontology_file) + self.embedder = SentenceTransformerEmbedder(model_name=model_name) + self.threshold = float(threshold) + property_texts = [build_property_text(p) for p in self.ontology.properties] + self.property_embeddings = self.embedder.encode(property_texts) + + def link_relations(self, extracted_relations: List[str]) -> List[RelationMatch]: + if not extracted_relations: + return [] + + best_matches = [] + key_texts = [normalize(relation) for relation in extracted_relations] + key_embeddings = self.embedder.encode(key_texts) + _validate_embedding_dimensions(key_embeddings, self.property_embeddings) + similarities = util.cos_sim(key_embeddings, self.property_embeddings) + + for i, relation in enumerate(extracted_relations): + best_idx = int(similarities[i].argmax()) + best_score = float(similarities[i][best_idx]) + # print(f"Relation: {relation}, matched to: {self.ontology.properties[best_idx].uri}, label: {self.ontology.properties[best_idx].label}, Best Index: {best_idx}, Best Score: {best_score}") + if best_score < self.threshold: + continue + match = self.ontology.properties[best_idx] + best_matches.append(RelationMatch(relation, match, best_score)) + + return best_matches + + +def label_alias_embedding_rl(inputs: Dict[str, Data], outputs: Dict[str, Data], model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.5): + + ontology_path = os.environ.get("ONTOLOGY_PATH", "false") + if ontology_path == "false": + raise ValueError("ONTOLOGY_PATH is not set") + else: + ontology_path = ontology_path + + linker = AliasAndTransformerBasedRelationLinker(ontology_path, model_name=model_name, threshold=threshold) + + if os.path.isdir(inputs["source"].path): + 
os.makedirs(outputs["output"].path, exist_ok=True) + for file in tqdm(os.listdir(inputs["source"].path), desc="Linking relations"): + te_doc_in = TE_Document(**json.load(open(os.path.join(inputs["source"].path, file)))) + relation_texts = list({triple.predicate.surface_form for triple in te_doc_in.triples if triple.predicate.surface_form}) + relation_matches = linker.link_relations(relation_texts) + te_links = [TE_Pair(span=match.relation, mapping=match.predicate.uri, link_type="predicate", score=match.score) for match in relation_matches] + te_doc_out = te_doc_in.model_copy(deep=True) + te_doc_out.links += te_links + with open(os.path.join(outputs["output"].path, file), "w") as f: + f.write(te_doc_out.model_dump_json()) + else: + te_doc_in = TE_Document(**json.load(open(inputs["source"].path))) # TODO: check if this is correct + relation_matches = linker.link_relations(list({triple.predicate.surface_form for triple in te_doc_in.triples if triple.predicate.surface_form})) + te_links = [TE_Pair(span=match.relation, mapping=match.predicate.uri, link_type="predicate", score=match.score) for match in relation_matches] + te_doc_out = te_doc_in.model_copy(deep=True) + te_doc_out.links += te_links + with open(outputs["output"].path, "w") as f: + f.write(te_doc_out.model_dump_json()) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/base_matcher.py b/experiments/param-opti/src/param_opti/tasks/base_matcher.py new file mode 100644 index 0000000..223fdc4 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/base_matcher.py @@ -0,0 +1,111 @@ +from kgpipe.common import TaskInput, TaskOutput, DataFormat +from kgpipe.common.model.configuration import ConfigurationDefinition, Parameter, ParameterType, ConfigurationProfile +from kgpipe.common.model.task import KgTask + +# Same as paris_graph_alignment_task / paris_entity_alignment_task: +# input_spec + output_spec as in experiments/param-opti/src/param_opti/tasks/paris.py (e.g. 
lines 58–59). +_ALIGNMENT_TWO_GRAPH_INPUT_SPEC = {"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES} +_ALIGNMENT_ER_JSON_OUTPUT_SPEC = {"output": DataFormat.ER_JSON} + + +def _embedding_config_params(): + return [ + Parameter( + name="model_name", + native_keys=["--model-name"], + datatype=ParameterType.string, + default_value="sentence-transformers/all-MiniLM-L6-v2", + required=True, + allowed_values=[ + "sentence-transformers/all-MiniLM-L6-v2", + "sentence-transformers/all-mpnet-base-v2", + "intfloat/e5-base-v2", + ], + ), + Parameter( + name="similarity_threshold", + native_keys=["--similarity-threshold"], + datatype=ParameterType.number, + default_value=0.5, + required=True, + allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], + ), + ] + + +def graph_alignment_label_alias_embedding_transformer_function( + inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile +): + """Match entities and relations between two RDF graphs (full graph alignment).""" + from param_opti.tasks.base_matcher_lib import label_embedding_graph_alignment_match + + label_embedding_graph_alignment_match( + inputs, + outputs, + model_name=config.get_parameter_value("model_name"), + threshold=float(config.get_parameter_value("similarity_threshold")), + ) + + +graph_alignment_label_alias_embedding_transformer_task = KgTask( + name="graph_alignment_label_alias_embedding_transformer", + function=graph_alignment_label_alias_embedding_transformer_function, + input_spec=dict(_ALIGNMENT_TWO_GRAPH_INPUT_SPEC), + output_spec=dict(_ALIGNMENT_ER_JSON_OUTPUT_SPEC), + config_spec=ConfigurationDefinition( + name="graph_alignment_label_alias_embedding_transformer", + parameters=_embedding_config_params(), + ), +) + + +def entity_matcher_label_alias_embedding_transformer_function( + inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile +): + """Entity alignment only (subject/object URIs with rdfs:label).""" + from 
param_opti.tasks.base_matcher_lib import label_embedding_entity_alignment_match + + label_embedding_entity_alignment_match( + inputs, + outputs, + model_name=config.get_parameter_value("model_name"), + threshold=float(config.get_parameter_value("similarity_threshold")), + ) + + +entity_matcher_label_alias_embedding_transformer_task = KgTask( + name="entity_matcher_label_alias_embedding_transformer", + function=entity_matcher_label_alias_embedding_transformer_function, + input_spec=dict(_ALIGNMENT_TWO_GRAPH_INPUT_SPEC), + output_spec=dict(_ALIGNMENT_ER_JSON_OUTPUT_SPEC), + config_spec=ConfigurationDefinition( + name="entity_matcher_label_alias_embedding_transformer", + parameters=_embedding_config_params(), + ), +) + + +def relation_matcher_label_alias_embedding_transformer_function( + inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile +): + """Relation / predicate alignment only.""" + from param_opti.tasks.base_matcher_lib import label_embedding_relation_alignment_match + + label_embedding_relation_alignment_match( + inputs, + outputs, + model_name=config.get_parameter_value("model_name"), + threshold=float(config.get_parameter_value("similarity_threshold")), + ) + + +relation_matcher_label_alias_embedding_transformer_task = KgTask( + name="relation_matcher_label_alias_embedding_transformer", + function=relation_matcher_label_alias_embedding_transformer_function, + input_spec=dict(_ALIGNMENT_TWO_GRAPH_INPUT_SPEC), + output_spec=dict(_ALIGNMENT_ER_JSON_OUTPUT_SPEC), + config_spec=ConfigurationDefinition( + name="relation_matcher_label_alias_embedding_transformer", + parameters=_embedding_config_params(), + ), +) diff --git a/experiments/param-opti/src/param_opti/tasks/base_matcher_lib.py b/experiments/param-opti/src/param_opti/tasks/base_matcher_lib.py new file mode 100644 index 0000000..6ca3f41 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/base_matcher_lib.py @@ -0,0 +1,234 @@ +from __future__ import annotations + +from 
dataclasses import dataclass +from pathlib import Path +from typing import Dict, Iterable, List, Optional, Sequence + +from rdflib import Graph, Literal, RDFS, URIRef +from sentence_transformers import util + +from kgpipe.common import Data +from kgpipe_tasks.transform_interop.exchange.entity_matching import ER_Document, ER_Match + +# Reuse the shared embedder/model cache from the linker implementation. +from param_opti.tasks.base_linker_lib import SentenceTransformerEmbedder, _validate_embedding_dimensions + + +def _normalize_label(text: str) -> str: + return " ".join(text.replace("_", " ").replace("-", " ").strip().lower().split()) + + +def _safe_first_literal(values: Iterable[object]) -> Optional[str]: + for v in values: + if isinstance(v, Literal): + s = str(v).strip() + if s: + return s + return None + + +def _fallback_label_from_uri(uri: URIRef) -> str: + s = str(uri) + if "#" in s: + return s.rsplit("#", 1)[-1] + return s.rsplit("/", 1)[-1] + + +@dataclass(frozen=True) +class _LabeledUri: + uri: URIRef + label: str + + +def _extract_labeled_entities(graph: Graph) -> List[_LabeledUri]: + """ + Extract subject/object URIRefs that have an rdfs:label. + """ + uris: set[URIRef] = set() + for s, _, o in graph: + if isinstance(s, URIRef): + uris.add(s) + if isinstance(o, URIRef): + uris.add(o) + + labeled: List[_LabeledUri] = [] + for u in uris: + label = _safe_first_literal(graph.objects(u, RDFS.label)) + if label: + labeled.append(_LabeledUri(u, label)) + return labeled + + +def _extract_labeled_predicates(graph: Graph) -> List[_LabeledUri]: + """ + Extract predicate URIRefs and use rdfs:label if present, otherwise fall back to local-name. 
+ """ + preds: set[URIRef] = {p for _, p, _ in graph if isinstance(p, URIRef)} + labeled: List[_LabeledUri] = [] + for p in preds: + label = _safe_first_literal(graph.objects(p, RDFS.label)) or _fallback_label_from_uri(p) + labeled.append(_LabeledUri(p, label)) + return labeled + + +def _best_matches( + source: Sequence[_LabeledUri], + target: Sequence[_LabeledUri], + *, + model_name: str, + threshold: float, + id_type: str, +) -> List[ER_Match]: + if not source or not target: + return [] + + embedder = SentenceTransformerEmbedder(model_name=model_name) + src_texts = [_normalize_label(x.label) for x in source] + tgt_texts = [_normalize_label(x.label) for x in target] + + src_emb = embedder.encode(src_texts) + tgt_emb = embedder.encode(tgt_texts) + _validate_embedding_dimensions(src_emb, tgt_emb) + + sims = util.cos_sim(src_emb, tgt_emb) + matches: List[ER_Match] = [] + + for i, src in enumerate(source): + best_idx = int(sims[i].argmax()) + best_score = float(sims[i][best_idx]) + if best_score < float(threshold): + continue + tgt = target[best_idx] + matches.append( + ER_Match( + id_1=str(src.uri), + id_2=str(tgt.uri), + score=best_score, + id_type=id_type, + ) + ) + return matches + + +def _write_er_document(output_path: Path, matches: List[ER_Match]) -> None: + output_path.parent.mkdir(parents=True, exist_ok=True) + doc = ER_Document(matches=matches) + output_path.write_text(doc.model_dump_json(), encoding="utf-8") + + +def _label_embedding_match_two_graphs( + inputs: Dict[str, Data], + outputs: Dict[str, Data], + *, + model_name: str, + threshold: float, + include_entities: bool, + include_relations: bool, +) -> None: + source_graph = Graph() + source_graph.parse(inputs["source"].path, format="nt") + + target_graph = Graph() + target_graph.parse(inputs["target"].path, format="nt") + + matches: List[ER_Match] = [] + + if include_entities: + source_entities = _extract_labeled_entities(source_graph) + target_entities = _extract_labeled_entities(target_graph) + 
matches.extend( + _best_matches( + source_entities, + target_entities, + model_name=model_name, + threshold=threshold, + id_type="entity", + ) + ) + + if include_relations: + source_preds = _extract_labeled_predicates(source_graph) + target_preds = _extract_labeled_predicates(target_graph) + matches.extend( + _best_matches( + source_preds, + target_preds, + model_name=model_name, + threshold=threshold, + id_type="relation", + ) + ) + + _write_er_document(outputs["output"].path, matches) + + +def label_embedding_graph_alignment_match( + inputs: Dict[str, Data], + outputs: Dict[str, Data], + *, + model_name: str = "sentence-transformers/all-MiniLM-L6-v2", + threshold: float = 0.5, +) -> None: + """ + Align two RDF graphs: match subject/object entities by rdfs:label and predicates by label. + + Writes `ER_Document` JSON with both entity and relation matches (same shape as `paris_lib`). + """ + _label_embedding_match_two_graphs( + inputs, + outputs, + model_name=model_name, + threshold=threshold, + include_entities=True, + include_relations=True, + ) + + +def label_embedding_entity_alignment_match( + inputs: Dict[str, Data], + outputs: Dict[str, Data], + *, + model_name: str = "sentence-transformers/all-MiniLM-L6-v2", + threshold: float = 0.5, +) -> None: + """Entity alignment only: matches with id_type \"entity\".""" + _label_embedding_match_two_graphs( + inputs, + outputs, + model_name=model_name, + threshold=threshold, + include_entities=True, + include_relations=False, + ) + + +def label_embedding_relation_alignment_match( + inputs: Dict[str, Data], + outputs: Dict[str, Data], + *, + model_name: str = "sentence-transformers/all-MiniLM-L6-v2", + threshold: float = 0.5, +) -> None: + """Relation / predicate alignment only: matches with id_type \"relation\".""" + _label_embedding_match_two_graphs( + inputs, + outputs, + model_name=model_name, + threshold=threshold, + include_entities=False, + include_relations=True, + ) + + +def label_embedding_graph_match( + inputs: 
Dict[str, Data], + outputs: Dict[str, Data], + *, + model_name: str = "sentence-transformers/all-MiniLM-L6-v2", + threshold: float = 0.5, +) -> None: + """Backward-compatible alias for full graph alignment (entities + relations).""" + label_embedding_graph_alignment_match( + inputs, outputs, model_name=model_name, threshold=threshold + ) + diff --git a/experiments/param-opti/src/param_opti/tasks/corenlp.py b/experiments/param-opti/src/param_opti/tasks/corenlp.py new file mode 100644 index 0000000..e45299b --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/corenlp.py @@ -0,0 +1,36 @@ +from typing import Dict + +from pathlib import Path +from kgpipe.common import Data, DataFormat, Registry, KgTask +from kgpipe.common.model.configuration import ConfigurationDefinition + + +def corenlp_text_extraction_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + from param_opti.tasks.corenlp_lip import corenlp_openie_extraction, corenlp_exchange + + # Ensure parent directory exists for the TE JSON output path + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + input_path: Path = inputs["input"].path + final_te_output: Data = outputs["output"] + + # 1) Produce intermediate OpenIE JSON (file or directory) + if input_path.is_dir(): + openie_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie_out" + else: + openie_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie.json" + + openie_output = {"output": Data(openie_out_path, DataFormat.OPENIE_JSON)} + corenlp_openie_extraction({"input": inputs["input"]}, openie_output) + + # 2) Convert OpenIE JSON → TE JSON (final output) + corenlp_exchange({"input": openie_output["output"]}, {"output": final_te_output}) + + +corenlp_text_extraction_task = KgTask( + name="corenlp_text_extraction", + input_spec={"input": DataFormat.TEXT}, + output_spec={"output": DataFormat.TE_JSON}, + function=corenlp_text_extraction_function, + 
description="Extract text using CoreNLP" +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/corenlp_lip.py b/experiments/param-opti/src/param_opti/tasks/corenlp_lip.py new file mode 100644 index 0000000..4ef7559 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/corenlp_lip.py @@ -0,0 +1,148 @@ +from kgpipe.common import TaskInput, TaskOutput, Data, DataFormat, Registry, BasicTaskCategoryCatalog +from kgpipe.common.model.configuration import ConfigurationDefinition, Parameter, ParameterType, ConfigurationProfile +from kgpipe.common.model.task import KgTask + +def openie_pipeline_task_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + Run the openie pipeline + """ + pass + +@Registry.task( + input_spec={"input": DataFormat.TEXT}, + output_spec={"output": DataFormat.OPENIE_JSON}, + description="Extract OpenIE triples using Stanford CoreNLP", + category=["TextProcessing", "TextExtraction"] +) +def openie_pipeline_task(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + Run the openie pipeline + """ + pass + +""" +Stanford CoreNLP Information Extraction + +This module provides information extraction using Stanford CoreNLP. 
+""" + +import json +import os +from pathlib import Path +from typing import Dict, Any, List + +from kgpipe.common import KgTask, Data, DataFormat, Registry +from kgpipe.common.io import get_docker_volume_bindings, remap_data_path_for_container +from kgpipe.execution import docker_client + + +CORENLP_ENTRYPOINT = ["java", "-cp", "*", "edu.stanford.nlp.pipeline.StanfordCoreNLP"] + + +def corenlp_openie_extraction(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """Extract OpenIE triples using Stanford CoreNLP.""" + # input_data = inputs["input"] + # output_data = outputs["output"] + + # Setup Docker + all_data = list(inputs.values()) + list(outputs.values()) + volumes, host_to_container = get_docker_volume_bindings(all_data) + + print(inputs["input"]) + print(outputs["output"]) + # Remap paths for container + input_path = remap_data_path_for_container(inputs["input"], host_to_container) + output_path = remap_data_path_for_container(outputs["output"], host_to_container) + + # Create command + command = ["bash", "openie.sh", str(input_path.path), str(output_path.path)] + # CORENLP_ENTRYPOINT + [ + # "-annotators", "tokenize,pos,lemma,ner,parse,coref,openie", + # "-file", str(input_path.path), + # "-outputFormat", "json", + # "-outputDirectory", str(output_path.path) + # ] + + # Run container + client = docker_client( + image="kgt/corenlp:latest", + command=command, + volumes=volumes + ) + client() + + +def corenlp_exchange(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """Convert OpenIE JSON to IE JSON format.""" + input_path = inputs["input"].path + output_path = outputs["output"].path + + # create output folder + os.makedirs(os.path.dirname(output_path), exist_ok=True) + + def __openiejson2tejson(openiedata) -> Dict[str, Any]: + """Convert OpenIE JSON to TE Document format.""" + doc = {"triples": [], "chains": []} + + # Convert to triples + triplets = [] + for sentence in openiedata.get('sentences', []): + for triple_span in sentence.get('openie', []): + 
triplet = { + "subject": {"surface_form": triple_span.get('subject', '')}, + "predicate": {"surface_form": triple_span.get('relation', '')}, + "object": {"surface_form": triple_span.get('object', '')} + } + triplets.append(triplet) + + # Get chains (simplified) + chains = get_coreference_chains(openiedata) + + doc["triples"] = triplets + doc["chains"] = chains + return doc + + if os.path.isdir(input_path): + os.makedirs(output_path, exist_ok=True) + for file in os.listdir(input_path): + # Read input json + with open(os.path.join(input_path, file), 'r') as f: + data = json.load(f) + te_doc = __openiejson2tejson(data) + outfile = os.path.join(output_path, file) + + with open(outfile, 'w') as of: + json.dump(te_doc, of) + # print(f"Converted {input_path} to {outfile}") + + else: + # Read input json + with open(input_path, 'r') as f: + data = json.load(f) + te_doc = __openiejson2tejson(data) + with open(output_path, 'w') as of: + json.dump(te_doc, of) + # print(f"Converted {input_path} to {output_path}") + + +def get_coreference_chains(response: dict) -> List[Dict[str, Any]]: + """Extract coreference chains from CoreNLP response.""" + result = [] + for _, coref in response.get('corefs', {}).items(): + if len(coref) > 1: + chain = {"main": coref[0].get('text', '')} + alias = [] + for chunk in coref[1:]: + sentence = response.get('sentences', [])[chunk.get('sentNum', 1) - 1] + start = sentence.get('tokens', [])[chunk.get('startIndex', 1) - 1].get('characterOffsetBegin', 0) + end = sentence.get('tokens', [])[chunk.get('endIndex', 2) - 2].get('characterOffsetEnd', 0) + alias.append({ + "surface_form": chunk.get('text', ''), + "text": chunk.get('text', ''), + "start": start, + "end": end + }) + chain["aliases"] = alias + result.append(chain) + return result + diff --git a/experiments/param-opti/src/param_opti/tasks/formats.py b/experiments/param-opti/src/param_opti/tasks/formats.py new file mode 100644 index 0000000..97d1a3e --- /dev/null +++ 
b/experiments/param-opti/src/param_opti/tasks/formats.py @@ -0,0 +1,8 @@ + +# reimport and define of used formats + +from kgpipe.common import DataFormat +from kgpipe.common.model.default_catalog import BasicDataFormats, CustomDataFormats + +class ExtendedFormats(CustomDataFormats): + pass \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/fusion.py b/experiments/param-opti/src/param_opti/tasks/fusion.py new file mode 100644 index 0000000..0c6eec3 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/fusion.py @@ -0,0 +1,48 @@ +import os + +from kgpipe.common.model.configuration import ConfigurationProfile +from kgpipe.common.model.configuration import ConfigurationDefinition, Parameter, ParameterType +from kgpipe.common.models import TaskInput, TaskOutput, KgTask, DataFormat + +def fusion_first_value_function( + inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile | None = None +): + from param_opti.tasks.fusion_lib import fusion_first_value + + if config is not None: + ontology_path = config.get_parameter_value("ontology_path") + else: + ontology_path = os.environ.get("ONTOLOGY_PATH", "") + # TODO remove thresholds as they are applied by the matchers + fusion_first_value( + inputs, + outputs, + entity_matching_threshold=0.0, + relation_matching_threshold=0.0, + ontology_path=ontology_path, + ) + +fusion_first_value_task = KgTask( + name="fusion_first_value_task", + function=fusion_first_value_function, + input_spec={"source": DataFormat.RDF_NTRIPLES, "kg": DataFormat.RDF_NTRIPLES, "matches1": DataFormat.ER_JSON}, + output_spec={"output": DataFormat.RDF_NTRIPLES} + # config_spec=ConfigurationDefinition( + # name="fusion_first_value", + # parameters=[ + # # ontology path + # Parameter(name="ontology_path", native_keys=["--ontology-path"], datatype=ParameterType.string, default_value="", required=True), + # ] + # ) +) + +def fusion_union_function(inputs: TaskInput, outputs: TaskOutput): + # touch output 
file + outputs["output"].path.touch() + +fusion_union_task = KgTask( + name="fusion_union_task", + function=fusion_union_function, + input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES, "matches": DataFormat.ER_JSON}, + output_spec={"output": DataFormat.RDF_NTRIPLES}, +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/fusion_lib.py b/experiments/param-opti/src/param_opti/tasks/fusion_lib.py new file mode 100644 index 0000000..0581ad7 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/fusion_lib.py @@ -0,0 +1,205 @@ + +from kgpipe.common.models import KgTask, DataFormat, Data +from logging import getLogger + +from pydantic import BaseModel +from rdflib import OWL, Graph, URIRef, RDFS, RDF, SKOS +from pathlib import Path +import json +import os +from kgcore.api.ontology import OntologyUtil +from kgpipe.common.config import TARGET_ONTOLOGY_NAMESPACE +from typing import Dict, List +from kgpipe_tasks.entity_resolution.fusion.util import load_matches_from_file + +SINGLE_CANDIDATE_CHECK: bool=False + +logger = getLogger(__name__) + +class TrackRecord(BaseModel): + original_subject: str + subject: str + original_predicate: str + predicate: str + original_object: str + object: str + +def select_first_value(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """ + For two KGs A and B, merge A into B where for each s_p and + 1) p is fusable and B does not have any s_p_o or + 2) p is not fusable: merge all s_p_o + """ + ontology_path = os.environ.get("ONTOLOGY_PATH", "false") + if ontology_path == "false": + raise ValueError("ONTOLOGY_PATH is not set") + + ontology = OntologyUtil.load_ontology_from_file(Path(ontology_path)) + allowed_predicates = set[str]([str(p.uri) for p in ontology.properties]+[str(RDFS.label), str(RDF.type), str(SKOS.altLabel)]) + fusable_properties = set[str]([str(p.uri) for p in ontology.properties if p.max_cardinality == 1]+[str(RDFS.label), str(RDF.type)]) + + def 
is_fusable(p): + return str(p) in fusable_properties + + source_graph = Graph() + source_graph.parse(inputs["source"].path, format="nt") + seed_graph = Graph() # seed graph + seed_graph.parse(inputs["target"].path, format="nt") + + current_subjects = set[str]([str(s) for s in seed_graph.subjects(unique=True)]) + + selected: List[TrackRecord] = [] + discarded: List[TrackRecord] = [] + + for s, p, o in source_graph: + s_can = s + p_can = p + o_can = o + + if not isinstance(p_can, URIRef) or str(p_can) not in allowed_predicates: + continue + + if p_can == RDF.type and not str(o_can).startswith(TARGET_ONTOLOGY_NAMESPACE): + continue + + if is_fusable(p_can): + # Add exactly one value if none exists yet + if not any(seed_graph.objects(s_can, p_can)): + seed_graph.add((s_can, p_can, o_can)) + selected.append( + TrackRecord(subject=s_can,predicate=p_can,object=o,original_subject=s,original_predicate=p,original_object=o)) + # keep subjects set fresh for subsequent matches + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + else: + discarded.append( + TrackRecord(subject=s_can,predicate=p_can,object=o,original_subject=s,original_predicate=p,original_object=o)) + else: + # Non-fusable: copy if not already present (avoid dupes) + if (s_can, p_can, o_can) not in seed_graph: + seed_graph.add((s_can, p_can, o_can)) + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + + # sel(ected) + selected_file_path = outputs["output"].path.parent / (outputs["output"].path.stem + ".selected.json") + with open(selected_file_path, "w") as f: + json.dump(selected, f, default=lambda x: x.model_dump()) + # dis(carded) + discarded_file_path = outputs["output"].path.parent / (outputs["output"].path.stem + ".discarded.json") + with open(discarded_file_path, "w") as f: + json.dump(discarded, f, default=lambda x: x.model_dump()) + + # prov graph is skipped here as no URIs are replaced (that is done in previous steps) + seed_graph.serialize(outputs["output"].path, 
format="nt") + +def fusion_first_value(inputs: Dict[str, Data], outputs: Dict[str, Data], entity_matching_threshold: float, relation_matching_threshold: float, ontology_path: str): + """ + Fuse RDF entities + - replacing ids of target graph with ids of source graph based on matches + - only fusable properties are fused + - selects the first value from source graph if no target value exists (does not add values from target graph) + - also if target graph has multiple values for a property, the first value is selected (for new entities) + """ + ontology = OntologyUtil.load_ontology_from_file(Path(ontology_path)) + allowed_predicates = set[str]([str(p.uri) for p in ontology.properties]+[str(RDFS.label), str(RDF.type), str(SKOS.altLabel)]) + fusable_properties = set[str]([str(p.uri) for p in ontology.properties if p.max_cardinality == 1]+[str(RDFS.label), str(RDF.type)]) + + def is_fusable(p): + return str(p) in fusable_properties + + entity_matches = load_matches_from_file(inputs["matches1"].path, entity_matching_threshold, "entity") + relation_matches = load_matches_from_file(inputs["matches1"].path, relation_matching_threshold, "relation") + + source_graph = Graph() + source_graph.parse(inputs["source"].path, format="nt") + seed_graph = Graph() # seed graph + seed_graph.parse(inputs["kg"].path, format="nt") + + current_subjects = set[str]([str(s) for s in seed_graph.subjects(unique=True)]) + + sameAsProv = {} + + def canonicalize_entity_term(term): + """Map a URI from the target graph to the matching source URI, if any.""" + if isinstance(term, URIRef): + t_str = str(term) + cluster = entity_matches.get_cluster(t_str) + if cluster: + right_candidates = [c for c in cluster if not c == t_str] + if len(right_candidates) > 2 and SINGLE_CANDIDATE_CHECK: + raise ValueError(f"Multiple matches found for {t_str}") + else: + for m in right_candidates: + # if not m == t_str: + sameAsProv[str(term)] = str(m) + return URIRef(m) + return term + else: + return term + return term + 
+ def canonicalize_property_term(term): + """Map a URI from the target graph to the matching source URI, if any.""" + if isinstance(term, URIRef): + t_str = str(term) + mapped = relation_matches.has_match_to_namespace(t_str, TARGET_ONTOLOGY_NAMESPACE) + if mapped: + return URIRef(mapped) + else: # TODO this is a workaround for the base pipelines... + mapped = relation_matches.has_match_to_namespace(t_str, str(RDFS)) + if mapped: + return URIRef(mapped) + return term + + selected: List[TrackRecord] = [] + discarded: List[TrackRecord] = [] + + for s, p, o in source_graph: + # Canonicalize + logger.debug(f"Canonicalizing {s}, {p}, {o}") + s_can = canonicalize_entity_term(s) + p_can = canonicalize_property_term(p) + o_can = canonicalize_entity_term(o) if isinstance(o, URIRef) else o # keep literals/bnodes as-is + + # Only work with properties that are in our ontology (after canonicalization) + if not isinstance(p_can, URIRef) or str(p_can) not in allowed_predicates: + logger.debug(f"Skipping {s}, {p}, {o} because it is not in the allowed predicates") + continue + + if p_can == RDF.type and not str(o_can).startswith(TARGET_ONTOLOGY_NAMESPACE): + continue + + if is_fusable(p_can): + # Add exactly one value if none exists yet + if not any(seed_graph.objects(s_can, p_can)): + seed_graph.add((s_can, p_can, o_can)) + selected.append( + TrackRecord(subject=s_can,predicate=p_can,object=o,original_subject=s,original_predicate=p,original_object=o)) + # keep subjects set fresh for subsequent matches + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + else: + discarded.append( + TrackRecord(subject=s_can,predicate=p_can,object=o,original_subject=s,original_predicate=p,original_object=o)) + else: + # Non-fusable: copy if not already present (avoid dupes) + if (s_can, p_can, o_can) not in seed_graph: + seed_graph.add((s_can, p_can, o_can)) + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + + # sel(ected) + selected_file_path = 
outputs["output"].path.parent / (outputs["output"].path.stem + ".selected.json") + with open(selected_file_path, "w") as f: + json.dump(selected, f, default=lambda x: x.model_dump()) + # dis(carded) + discarded_file_path = outputs["output"].path.parent / (outputs["output"].path.stem + ".discarded.json") + with open(discarded_file_path, "w") as f: + json.dump(discarded, f, default=lambda x: x.model_dump()) + + prov_graph = Graph() + for sid,gid in sameAsProv.items(): + prov_graph.add((URIRef(gid), OWL.sameAs, URIRef(sid))) + prov_graph.serialize(outputs["output"].path.as_posix() + ".prov", format="nt") + seed_graph.serialize(outputs["output"].path, format="nt") \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/genie.py b/experiments/param-opti/src/param_opti/tasks/genie.py new file mode 100644 index 0000000..a97b566 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/genie.py @@ -0,0 +1,33 @@ +from typing import Dict, Any +from kgpipe.common import Data, DataFormat, Registry, KgTask +from pathlib import Path + +def genie_text_extraction_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + from param_opti.tasks.genie_lib import genie_task_docker, genie_exchange + + # Ensure parent directory exists for the TE JSON output path + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + input_path: Path = inputs["input"].path + final_te_output: Data = outputs["output"] + + # 1) Produce intermediate OpenIE JSON (file or directory) + if input_path.is_dir(): + genie_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie_out" + else: + genie_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie.json" + + genie_outpit = {"output": Data(genie_out_path, DataFormat.OPENIE_JSON)} + genie_task_docker({"input": inputs["input"]}, genie_outpit) + + # 2) Convert OpenIE JSON → TE JSON (final output) + genie_exchange({"input": genie_outpit["output"]}, 
{"output": final_te_output}) + + +genie_text_extraction_task = KgTask( + name="genie_text_extraction", + input_spec={"input": DataFormat.TEXT}, + output_spec={"output": DataFormat.TE_JSON}, + function=genie_text_extraction_function, + description="Extract text using Genie" +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/genie_lib.py b/experiments/param-opti/src/param_opti/tasks/genie_lib.py new file mode 100644 index 0000000..0806f37 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/genie_lib.py @@ -0,0 +1,89 @@ + +import re +import json +import os +from typing import Dict +from kgpipe.common import Data, TaskInput, TaskOutput +from kgpipe.common import KgTask, DataFormat, Data, Registry, TaskInput, TaskOutput +from kgpipe.common.io import get_docker_volume_bindings, remap_data_path_for_container +from kgpipe.execution import docker_client +from kgpipe_tasks.transform_interop.exchange.entity_matching import ER_Match, ER_Document + + +def genie_task_docker(inputs: TaskInput, outputs: TaskOutput): + """ + GenIE information extraction task that runs in a Docker container. 
+ + Args: + inputs: Dictionary mapping input names to Data objects + outputs: Dictionary mapping output names to Data objects + """ + + all_data = list(inputs.values()) + list(outputs.values()) + volumes, host_to_container = get_docker_volume_bindings(all_data) + + source_path = remap_data_path_for_container(inputs["input"], host_to_container) + output_path = remap_data_path_for_container(outputs["output"], host_to_container) + + client = docker_client( + image="genie:latest", + command=["genie.sh", + str(source_path.path), + str(output_path.path)], + volumes=volumes, + ) + + result = client() + print(f"GenIE completed: {result}") + +def process_io(input_path, output_path, process_file_fn, extension): + if os.path.isdir(input_path): + os.makedirs(output_path, exist_ok=True) + + for filename in os.listdir(input_path): + input_file = os.path.join(input_path, filename) + + if not os.path.isfile(input_file): + continue + + output_file = os.path.join( + output_path, + os.path.splitext(filename)[0] + extension + ) + + process_file_fn(input_file, output_file) + + else: + process_file_fn(input_path, output_path) + +def genie_exchange(inputs: Dict[str, Data], outputs: Dict[str, Data]): + input_path = inputs["input"].path + output_path = outputs["output"].path + + triple_pattern = re.compile( + r"\s*(.*?)\s*\s*(.*?)\s*\s*(.*?)\s*" + ) + + def exchange_file(input_file, output_file): + triples = [] + chains = [] + + with open(input_file, "r", encoding="utf-8") as f: + genie_output = json.load(f) + + for sentence in genie_output: + for beam in sentence: + text = beam.get("text", "") + matches = triple_pattern.findall(text) + + for subj, pred, obj in matches: + triples.append({ + "subject": {"surface_form": subj.strip()}, + "predicate": {"surface_form": pred.strip()}, + "object": {"surface_form": obj.strip()} + }) + + with open(output_file, "w", encoding="utf-8") as f: + json.dump({"triples": triples, "chains": chains}, f, indent=2) + + process_io(input_path, output_path, 
exchange_file, ".te.json") \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/jedai.py b/experiments/param-opti/src/param_opti/tasks/jedai.py new file mode 100644 index 0000000..062fe48 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/jedai.py @@ -0,0 +1 @@ +# Skipped for now \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/matching_helpers.py b/experiments/param-opti/src/param_opti/tasks/matching_helpers.py new file mode 100644 index 0000000..fd301a0 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/matching_helpers.py @@ -0,0 +1,30 @@ +from pathlib import Path + +from kgpipe.common import Data, DataFormat, KgTask +from typing import Dict +from kgpipe_tasks.transform_interop.exchange.entity_matching import ER_Document +import json + + +def _load_er_document(path: Path) -> ER_Document: + """Parse ER JSON; empty or whitespace-only files yield an empty document (stub tasks may touch-only outputs).""" + raw = path.read_text(encoding="utf-8") + if not raw.strip(): + return ER_Document() + return ER_Document(**json.loads(raw)) + + +def aggregate_matching_results_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + er1 = _load_er_document(Path(inputs["json1"].path)) + er2 = _load_er_document(Path(inputs["json2"].path)) + er_comb = ER_Document(matches=er1.matches + er2.matches) + with open(outputs["output"].path, "w") as f: + json.dump(er_comb.model_dump(), f, indent=4) + + +aggregate_matching_results_task = KgTask( + name="aggregate_matching_results", + input_spec=dict({"json1": DataFormat.ER_JSON, "json2": DataFormat.ER_JSON}), + output_spec=dict({"output": DataFormat.ER_JSON}), + function=aggregate_matching_results_function +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/paris.py b/experiments/param-opti/src/param_opti/tasks/paris.py new file mode 100644 index 0000000..85e2b27 --- /dev/null +++ 
b/experiments/param-opti/src/param_opti/tasks/paris.py @@ -0,0 +1,126 @@ +from kgpipe.common import TaskInput, TaskOutput, KgTask, DataFormat, Data +from kgpipe.common.model.configuration import ConfigurationProfile, ConfigurationDefinition, Parameter, ParameterType +from pathlib import Path + + + +def paris_entity_alignment_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + matches entities between two RDF graphs + """ + # touch output file + from param_opti.tasks.paris_lib import paris_exchange, paris_entity_matching + entity_matching_threshold = float(config.get_parameter_value("entity_matching_threshold")) + relation_matching_threshold = float(2) # todo skip all matches + + # Ensure parent directory exists for the ER JSON output file + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + # 1 produce matches in paris csv format + matching_dir = outputs["output"].path.parent / f"{outputs['output'].path.stem}_paris_out" + matching_output = {"output": Data(matching_dir, DataFormat.PARIS_CSV)} + + # paris_entity_matching expects {"source": ..., "kg": ...} + paris_entity_matching({"source": inputs["source"], "kg": inputs["target"]}, matching_output) + + # 2 convert paris output dir to er.json format (file) + paris_exchange( + matching_output["output"].path, + outputs["output"].path, + entity_matching_threshold, + relation_matching_threshold, + ) + +paris_entity_alignment_task = KgTask( + name="paris_entity_alignment", + function=paris_entity_alignment_function, + input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.ER_JSON}, + config_spec=ConfigurationDefinition( + name="paris_entity_alignment", + parameters=[ + Parameter(name="entity_matching_threshold", native_keys=["--entity-matching-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) + 
+def paris_graph_alignment_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + matches both entities and relations between two RDF graphs + """ + # touch output file + from param_opti.tasks.paris_lib import paris_exchange, paris_entity_matching + entity_matching_threshold = float(config.get_parameter_value("entity_matching_threshold")) + relation_matching_threshold = float(config.get_parameter_value("relation_matching_threshold")) + + # Ensure parent directory exists for the ER JSON output file + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + # 1 produce matches in paris csv format + matching_dir = outputs["output"].path.parent / f"{outputs['output'].path.stem}_paris_out" + matching_output = {"output": Data(matching_dir, DataFormat.PARIS_CSV)} + + # paris_entity_matching expects {"source": ..., "kg": ...} + paris_entity_matching({"source": inputs["source"], "kg": inputs["target"]}, matching_output) + + # 2 convert paris output dir to er.json format (file) + paris_exchange( + matching_output["output"].path, + outputs["output"].path, + entity_matching_threshold, + relation_matching_threshold, + ) + +paris_graph_alignment_task = KgTask( + name="paris_graph_alignment", + function=paris_graph_alignment_function, + input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.ER_JSON}, + config_spec=ConfigurationDefinition( + name="paris_graph_alignment", + parameters=[ + Parameter(name="entity_matching_threshold", native_keys=["--entity-matching-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + Parameter(name="relation_matching_threshold", native_keys=["--relation-matching-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) + +def 
paris_ontology_matching_function(inputs: TaskInput, outputs: TaskOutput, config: ConfigurationProfile): + """ + matches ontologies between two RDF graphs + """ + # touch output file + from param_opti.tasks.paris_lib import paris_exchange, paris_entity_matching + entity_matching_threshold = float(2) # todo skip all matches + ontology_matching_threshold = float(config.get_parameter_value("ontology_matching_threshold")) + + # Ensure parent directory exists for the ER JSON output file + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + # 1 produce matches in paris csv format + matching_dir = outputs["output"].path.parent / f"{outputs['output'].path.stem}_paris_out" + matching_output = {"output": Data(matching_dir, DataFormat.PARIS_CSV)} + + # paris_entity_matching expects {"source": ..., "kg": ...} + paris_entity_matching({"source": inputs["source"], "kg": inputs["target"]}, matching_output) + + # 2 convert paris output dir to er.json format (file) + paris_exchange( + matching_output["output"].path, + outputs["output"].path, + entity_matching_threshold, + ontology_matching_threshold, + ) + +paris_ontology_matching_task = KgTask( + name="paris_ontology_matching", + function=paris_ontology_matching_function, + input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.ER_JSON}, + config_spec=ConfigurationDefinition( + name="paris_ontology_matching", + parameters=[ + Parameter(name="ontology_matching_threshold", native_keys=["--ontology-matching-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/paris_lib.py b/experiments/param-opti/src/param_opti/tasks/paris_lib.py new file mode 100644 index 0000000..2aa5fd5 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/paris_lib.py @@ -0,0 +1,152 @@ 
+""" +Paris RDF Matcher task implementation. +""" + +from pathlib import Path +from typing import Dict, Any +import pandas as pd +import os +import csv +from typing import List + +from kgpipe.common import KgTask, DataFormat, Data, Registry +from kgpipe.common.io import get_docker_volume_bindings, remap_data_path_for_container +from kgpipe.execution import docker_client +from kgpipe_tasks.transform_interop.exchange.entity_matching import ER_Match, ER_Document + + +def paris_entity_matching(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """ + Paris entity matching task that runs in a Docker container. + + Args: + inputs: Dictionary mapping input names to Data objects + outputs: Dictionary mapping output names to Data objects + """ + # print(f"Running Paris entity matching with inputs: {inputs}") + + all_data = list(inputs.values()) + list(outputs.values()) + volumes, host_to_container = get_docker_volume_bindings(all_data) + + # Extract input paths + source_path = remap_data_path_for_container(inputs["source"], host_to_container) + target_path = remap_data_path_for_container(inputs["kg"], host_to_container) + output_path = remap_data_path_for_container(outputs["output"], host_to_container) + + # Ensure output directory exists + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + # Get all data for Docker volume bindings + + # Create Docker client with proper volume bindings + client = docker_client( + image="kgt/paris:latest", + # command=["ls", "-la"], + command=["bash", "paris.sh", + str(source_path.path), + str(target_path.path), + str(output_path.path)], + volumes=volumes, + ) + + # Execute the container + result = client() + print(f"Paris entity matching completed: {result}") + + +PREFIX_MAP = { + "dbp": "http://dbpedia.org/", + "rdfs": "http://www.w3.org/2000/01/rdf-schema#", + "rdf" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#", + "xsd" : "http://www.w3.org/2001/XMLSchema#", + "schema" : "http://schema.org/", + "dbo": 
"http://dbpedia.org/ontology/", + "foaf": "http://xmlns.com/foaf/0.1/", + "skos": "http://www.w3.org/2004/02/skos/core#", +} + +def resolvePrefixedUri(uri): + if not uri.startswith("http://") and not uri.startswith("https://"): + prefix, suffix = uri.split(":", 1) + # try: + prefix = PREFIX_MAP[prefix] + # except Exception as e: + # print(f"Unknown prefix: {prefix} for {uri}") + # raise Exception(f"Unknown prefix: {prefix} for {uri}") + return prefix + suffix + else: + return uri + + + +def paris_exchange(input_path: Path, output_path: Path, entity_matching_threshold: float, relation_matching_threshold: float): + """ + Convert Paris CSV output to standard RDF matching format. + + Args: + inputs: Dictionary mapping input names to Data objects (Paris CSV) + outputs: Dictionary mapping output names to Data objects (RDF) + """ + print(f"Converting Paris CSV to matching format with input_path: {input_path} and output_path: {output_path}") + + files = [str(f) for f in os.listdir(input_path)] + + iteration_ids = [ int(f.split("_")[0]) for f in files if f.endswith(".tsv") ] + + iteration_ids.sort() + + last_eqv_it = iteration_ids[-1] + + def getEqvFileName(id): return f"{id}_eqv.tsv" + def getRelFileNames(id): return [f"{id}_superrelations1.tsv",f"{id}_superrelations2.tsv"] + + def check_file_exists(last_eqv_it): + try: + return os.stat(os.path.join(input_path, getEqvFileName(last_eqv_it))).st_size > 0 + except FileNotFoundError: + return -1 + + while 0 == check_file_exists(last_eqv_it) : + last_eqv_it -= 1 + + last_relation_it = last_eqv_it - 1 + + matches : List[ER_Match] = [] + + def extract_matches(file,id_type): + with open(file, newline='', encoding='utf-8') as csvfile: + reader = csv.reader(csvfile, delimiter='\t') + for row in reader: + if len(row) == 3: + er_match = ER_Match( + id_1=resolvePrefixedUri(row[0]), + id_2=resolvePrefixedUri(row[1]), + score=float(row[2]), + id_type=id_type + ) + matches.append(er_match) + + def filter_matches(matches: List[ER_Match]): 
+ + for match in matches: + if match.id_type == "entity" and match.score > entity_matching_threshold: + yield match + if match.id_type == "relation" and match.score > relation_matching_threshold: + yield match + + if last_eqv_it == -1: + doc = ER_Document(matches=list(filter_matches([]))) + with open(output_path, 'w', encoding='utf-8') as jsonfile: + jsonfile.write(doc.model_dump_json()) + else: + eqv_file = getEqvFileName(last_eqv_it) + rel_files = getRelFileNames(last_relation_it) + + extract_matches(os.path.join(input_path,eqv_file),"entity") + [ extract_matches(os.path.join(input_path,f), "relation") for f in rel_files ] + + + doc = ER_Document(matches=list(filter_matches(matches))) + + with open(output_path, 'w', encoding='utf-8') as jsonfile: + jsonfile.write(doc.model_dump_json()) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/select_lib.py b/experiments/param-opti/src/param_opti/tasks/select_lib.py new file mode 100644 index 0000000..7dd390f --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/select_lib.py @@ -0,0 +1,119 @@ +import json +import os +from logging import getLogger +from pathlib import Path +from typing import Dict, List + +from kgcore.api.ontology import OntologyUtil +from kgpipe.common.config import TARGET_ONTOLOGY_NAMESPACE +from kgpipe.common.model.configuration import ConfigurationDefinition +from kgpipe.common.models import Data, DataFormat, KgTask +from pydantic import BaseModel +from rdflib import Graph, RDF, RDFS, SKOS, URIRef + +logger = getLogger(__name__) + +class TrackRecord(BaseModel): + original_subject: str + subject: str + original_predicate: str + predicate: str + original_object: str + object: str + + +def select_first_value_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """ + For two KGs A and B, merge A into B where, for each s_p, either + 1) p is fusable and B does not have any s_p_o, or + 2) p is not fusable: merge all s_p_o + """ + ontology_path = 
os.environ.get("ONTOLOGY_PATH", "false") + if ontology_path == "false": + raise ValueError("ONTOLOGY_PATH is not set") + + ontology = OntologyUtil.load_ontology_from_file(Path(ontology_path)) + allowed_predicates = set[str]([str(p.uri) for p in ontology.properties]+[str(RDFS.label), str(RDF.type), str(SKOS.altLabel)]) + fusable_properties = set[str]([str(p.uri) for p in ontology.properties if p.max_cardinality == 1]+[str(RDFS.label), str(RDF.type)]) + + def is_fusable(p): + return str(p) in fusable_properties + + source_graph = Graph() + source_graph.parse(inputs["source"].path, format="nt") + seed_graph = Graph() # seed graph + seed_graph.parse(inputs["target"].path, format="nt") + + current_subjects = set[str]([str(s) for s in seed_graph.subjects(unique=True)]) + + selected: List[TrackRecord] = [] + discarded: List[TrackRecord] = [] + + for s, p, o in source_graph: + s_can = s + p_can = p + o_can = o + + if not isinstance(p_can, URIRef) or str(p_can) not in allowed_predicates: + continue + + if p_can == RDF.type and not str(o_can).startswith(TARGET_ONTOLOGY_NAMESPACE): + continue + + if is_fusable(p_can): + # Add exactly one value if none exists yet + if not any(seed_graph.objects(s_can, p_can)): + seed_graph.add((s_can, p_can, o_can)) + selected.append( + TrackRecord( + subject=str(s_can), + predicate=str(p_can), + object=str(o_can), + original_subject=str(s), + original_predicate=str(p), + original_object=str(o), + ) + ) + # keep subjects set fresh for subsequent matches + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + else: + discarded.append( + TrackRecord( + subject=str(s_can), + predicate=str(p_can), + object=str(o_can), + original_subject=str(s), + original_predicate=str(p), + original_object=str(o), + ) + ) + else: + # Non-fusable: copy if not already present (avoid dupes) + if (s_can, p_can, o_can) not in seed_graph: + seed_graph.add((s_can, p_can, o_can)) + if isinstance(s_can, URIRef): + current_subjects.add(str(s_can)) + + # 
sel(ected) + selected_file_path = outputs["output"].path.parent / (outputs["output"].path.stem + ".selected.json") + with open(selected_file_path, "w") as f: + json.dump(selected, f, default=lambda x: x.model_dump()) + # dis(carded) + discarded_file_path = outputs["output"].path.parent / (outputs["output"].path.stem + ".discarded.json") + with open(discarded_file_path, "w") as f: + json.dump(discarded, f, default=lambda x: x.model_dump()) + + # prov graph is skipped here as no URIs are replaced (that is done in previous steps) + seed_graph.serialize(outputs["output"].path, format="nt") + +select_first_value_task = KgTask( + name="select_first_value", + input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES}, + output_spec={"output": DataFormat.RDF_NTRIPLES}, + function=select_first_value_function, + config_spec=ConfigurationDefinition( + name="select_first_value", + parameters=[] + ) +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/spotlight.py b/experiments/param-opti/src/param_opti/tasks/spotlight.py new file mode 100644 index 0000000..3129ad0 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/spotlight.py @@ -0,0 +1,42 @@ +from typing import Dict, Any +from kgpipe.common import Data, DataFormat, Registry, KgTask +from kgpipe.common.model.configuration import ConfigurationDefinition, ConfigurationProfile, Parameter, ParameterType +from pathlib import Path + +def spotlight_entity_linking_function(inputs: Dict[str, Data], outputs: Dict[str, Data], config: ConfigurationProfile ): + from param_opti.tasks.spotlight_lib import dbpedia_spotlight_ner_nel, dbpedia_spotlight_exchange + + # Ensure parent directory exists for the TE JSON output path + outputs["output"].path.parent.mkdir(parents=True, exist_ok=True) + + input_path: Path = inputs["input"].path + final_te_output: Data = outputs["output"] + + # 1) Produce intermediate OpenIE JSON (file or directory) + if input_path.is_dir(): + 
spotlight_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie_out" + else: + spotlight_out_path = final_te_output.path.parent / f"{final_te_output.path.stem}_corenlp_openie.json" + + spotlight_out = {"output": Data(spotlight_out_path, DataFormat.OPENIE_JSON)} + if not spotlight_out_path.exists(): + dbpedia_spotlight_ner_nel({"input": inputs["input"]}, spotlight_out) + + # 2) Convert OpenIE JSON → TE JSON (final output) + dbpedia_spotlight_exchange({"input": spotlight_out["output"]}, {"output": final_te_output}, config.get_parameter_value("similarity_threshold")) + + + +spotlight_entity_linking_task = KgTask( + name="spotlight_entity_linking", + input_spec={"input": DataFormat.TEXT}, + output_spec={"output": DataFormat.TE_JSON}, + function=spotlight_entity_linking_function, + description="Link entities using Spotlight", + config_spec=ConfigurationDefinition( + name="spotlight_entity_linking", + parameters=[ + Parameter(name="similarity_threshold", native_keys=["--similarity-threshold"], datatype=ParameterType.number, default_value=0.5, required=True, allowed_values=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), + ] + ) +) \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/spotlight_lib.py b/experiments/param-opti/src/param_opti/tasks/spotlight_lib.py new file mode 100644 index 0000000..0bba12b --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/spotlight_lib.py @@ -0,0 +1,128 @@ +""" +DBpedia Spotlight Entity Linking + +This module provides entity linking using DBpedia Spotlight. 
+""" + +import json +import os +import requests +from pathlib import Path +from typing import Dict, Any + +from kgpipe.common import KgTask, Data, DataFormat, Registry +from kgpipe.common.io import get_docker_volume_bindings +from kgpipe.execution import docker_client +from tqdm import tqdm + +import os + + +CONFIDENCE = 0.35 +HEADERS = { + "Accept": "application/json" +} +DEFAULT_API_URL = "http://localhost:2222/rest/annotate" + +def api_request(url: str, text: str) -> Dict[str, Any]: + """Make API request to DBpedia Spotlight.""" + data = { + "text": text, + "confidence": str(CONFIDENCE) + } + response = requests.post(url, data=data, headers=HEADERS, verify=False) + + if response.status_code == 200: + result = response.json() + else: + result = { + "error": f"Request failed with status code {response.status_code}", + "text": text + } + return result + + +def dbpedia_spotlight_ner_nel(inputs: Dict[str, Data], outputs: Dict[str, Data]): + """Link entities using DBpedia Spotlight API.""" + input_data = inputs["input"] + output_data = outputs["output"] + + DBPEDIA_ANNOTATE_URL = os.getenv("DBPEDIA_ANNOTATE_URL", DEFAULT_API_URL) + if not DBPEDIA_ANNOTATE_URL: + raise ValueError("Missing DBpedia ANnotate URL") + + dir_or_file = input_data.path + if os.path.isdir(dir_or_file): + os.makedirs(output_data.path, exist_ok=True) + for file in tqdm(os.listdir(dir_or_file)): + with open(os.path.join(dir_or_file, file), encoding='utf-8') as f: + input_text = f.read() + + results = api_request(DBPEDIA_ANNOTATE_URL, input_text) + + with open(os.path.join(output_data.path, file+".json"), 'w', encoding='utf-8') as f: + f.write(json.dumps(results)) + # print(f"Converted {file} to {os.path.join(output_data.path, file)}") + else: + with open(input_data.path, encoding='utf-8') as f: + input_text = f.read() + + results = api_request(DBPEDIA_ANNOTATE_URL, input_text) + + with open(output_data.path, 'w', encoding='utf-8') as f: + f.write(json.dumps(results)) + + +# @Registry.task( +# 
input_spec={"source": DataFormat.SPOTLIGHT_JSON}, +# output_spec={"output": DataFormat.TE_JSON}, +# description="Convert Spotlight JSON to TE JSON format, with seed filter", +# category=["TextProcessing", "EntityLinking"] +# ) +def dbpedia_spotlight_exchange(inputs: Dict[str, Data], outputs: Dict[str, Data], threshold: float = 0.5): + """Convert Spotlight JSON to TE JSON format.""" + input_path = inputs["input"].path + output_path = outputs["output"].path + + # create output folder + os.makedirs(os.path.normpath(output_path), exist_ok=True) + + def __spotlightjson2tejson(data) -> Dict[str, Any]: + """Convert Spotlight JSON to TE Document format.""" + links = [] + + for result in data.get('Resources', []): + if float(result.get('@similarityScore', 0.0)) < threshold: + continue + link = { + "span": result.get('@surfaceForm', ''), + "mapping": result.get('@URI', ''), + "score": float(result.get('@similarityScore', 0.0)), + "link_type": "entity" + } + links.append(link) + + text = data.get('@text', '') + return {"text": text, "links": links} + + if os.path.isdir(input_path): + for file in os.listdir(input_path): + # Read input json + with open(os.path.join(input_path, file), 'r') as f: + data = json.load(f) + te_doc = __spotlightjson2tejson(data) + outfile = os.path.join(output_path, file) + + with open(outfile, 'w') as of: + json.dump(te_doc, of) + # print(f"Converted {file} to {outfile}") + + else: + # Read input json + with open(input_path, 'r') as f: + data = json.load(f) + te_doc = __spotlightjson2tejson(data) + outfile = os.path.join(output_path, 'output.te.json') + with open(outfile, 'w') as of: + json.dump(te_doc, of) + \ No newline at end of file diff --git a/experiments/param-opti/src/param_opti/tasks/text_helpers.py b/experiments/param-opti/src/param_opti/tasks/text_helpers.py new file mode 100644 index 0000000..0bef3f4 --- /dev/null +++ b/experiments/param-opti/src/param_opti/tasks/text_helpers.py @@ -0,0 +1,375 @@ +import json +import logging +import os 
+from pathlib import Path +from typing import Dict, List + +from kgcore.api.ontology import Ontology, OntologyUtil +from kgpipe.common import Data, DataFormat, KgTask +from kgpipe.common.model.configuration import ConfigurationDefinition +from kgpipe_tasks.common.benchutils import hash_uri +from kgpipe_tasks.transform_interop.exchange.text_extraction import ( + TE_Chains, + TE_Document, + TE_Pair, + TE_Triple, +) +from rdflib import Graph, Literal, RDF, RDFS, URIRef, XSD + +logger = logging.getLogger(__name__) + + + +def __aggregate_x_te_json(input_paths: List[Path], output_path: Path): + + if len(input_paths) == 0: + raise Exception("No input paths provided") + if not all(os.path.exists(path) for path in input_paths): + raise Exception("All input paths must exist") + + path_is_dir_list = [os.path.isdir(path) for path in input_paths] + if all(path_is_dir_list): + os.makedirs(output_path, exist_ok=True) + for file in os.listdir(input_paths[0]): + sub_file_paths = [Path(os.path.join(path, file)) for path in input_paths] + file_exists = [os.path.exists(path) for path in sub_file_paths] + if all(file_exists): + __aggregate_x_te_json(sub_file_paths, Path(os.path.join(output_path, file))) + else: + logger.warning(f"File {file} does not exist in all input paths") + filtered_sub_file_paths = [path for path in sub_file_paths if os.path.exists(path)] + __aggregate_x_te_json(filtered_sub_file_paths, Path(os.path.join(output_path, file))) + elif not any(path_is_dir_list): + merged_doc = TE_Document() + for file in input_paths: + doc = TE_Document(**json.load(open(file))) + merged_doc.chains += doc.chains + merged_doc.links += doc.links + merged_doc.triples += doc.triples + with open(output_path, "w") as f: + f.write(merged_doc.model_dump_json()) + logger.info(f"Aggregated {', '.join([str(path) for path in input_paths])} to {output_path}") + else: + raise Exception("All inputs must be either directories or files") + + +# @Registry.task( +# input_spec={"json1": 
DataFormat.TE_JSON, "json2": DataFormat.TE_JSON}, +# output_spec={"output": DataFormat.TE_JSON}, +# description="Aggregate 2 TE_Document JSON files", +# category=["Aggregation"] +# ) +# def aggregate2_te_json(inputs: Dict[str, Data], outputs: Dict[str, Data]): +# __aggregate_x_te_json([inputs["json1"].path, inputs["json2"].path], outputs["output"].path) + + +def aggregate3_text_tasks_task_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + __aggregate_x_te_json([inputs["json1"].path, inputs["json2"].path, inputs["json3"].path], outputs["output"].path) + +aggregate_text_tasks_task = KgTask( + name="aggregate_text_tasks_task", + input_spec={"json1": DataFormat.TE_JSON, "json2": DataFormat.TE_JSON, "json3": DataFormat.TE_JSON}, + output_spec={"output": DataFormat.TE_JSON}, + function=aggregate3_text_tasks_task_function +) + +def aggregate2_text_tasks_task_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + __aggregate_x_te_json([inputs["json1"].path, inputs["json2"].path], outputs["output"].path) + +aggregate_entity_linking_task = KgTask( + name="aggregate_entity_linking_task", + input_spec={"json1": DataFormat.TE_JSON, "json2": DataFormat.TE_JSON}, + output_spec={"output": DataFormat.TE_JSON}, + function=aggregate2_text_tasks_task_function +) + +aggregate_relation_linking_task = KgTask( + name="aggregate_relation_linking_task", + input_spec={"json1": DataFormat.TE_JSON, "json2": DataFormat.TE_JSON}, + output_spec={"output": DataFormat.TE_JSON}, + function=aggregate2_text_tasks_task_function +) + +def generatePredicate(surface_form, namespace): + return URIRef(namespace + surface_form.replace(" ", "_")) + +def __hash_dbpedia_uri(uri: URIRef, namespace: str = "http://kg.org/resource/"): + if uri.startswith("http://dbpedia.org/"): + return URIRef(namespace + hash_uri(str(uri))) + else: + return uri + +def __generateRDF(doc: TE_Document, ontology: Ontology, newP: bool = False, newE: bool = False, namespace: str = "http://kg.org/text/"): + """ + A 
processing node, part of a pipeline + collects information from extractors, linkers, and resolvers and then it produces the final triples + """ + + def process_chains(triples, chains: List[TE_Chains]): + new_triples = triples + chain_dict = {} + for chain in chains: + for alias in chain.aliases: + chain_dict[alias.surface_form] = chain.main + if len(chain_dict) > 0: + # TODO check if chain_dict should be a dict of TE_SPANS to avoid merging + for triple in new_triples: + if triple.subject.surface_form in chain_dict: + triple.subject.surface_form = chain_dict[triple.subject.surface_form] + if triple.object.surface_form in chain_dict: + triple.object.surface_form = chain_dict[triple.object.surface_form] + return new_triples + + def process_links(triples, links: List[TE_Pair]): + new_triples = triples + try: + if len(links) > 0: + so_spans = {} + p_spans = {} + for triple in triples: + # Add Subject spans + if triple.subject.surface_form.lower().startswith("http://"): + pass + elif triple.subject.surface_form.lower() not in so_spans: + so_spans[triple.subject.surface_form.lower()] = [triple.subject] + else: + so_spans[triple.subject.surface_form.lower()].append(triple.subject) + # Add object spans + if triple.object.surface_form.lower().startswith("http://"): + pass + elif triple.object.surface_form.lower() not in so_spans: + so_spans[triple.object.surface_form.lower()] = [triple.object] + else: + so_spans[triple.object.surface_form.lower()].append(triple.object) + # add predicate spans + if triple.predicate.surface_form.lower().startswith("http://"): + pass + elif triple.predicate.surface_form.lower() not in p_spans: + p_spans[triple.predicate.surface_form.lower()] = [triple.predicate] + else: + p_spans[triple.predicate.surface_form.lower()].append(triple.predicate) + for link in links: + if link.link_type == 'entity': + spans = so_spans + else: + spans = p_spans + if link.span and link.span.lower() in spans: + for span in spans[link.span.lower()]: + span.mapping = 
link.mapping + # span.surface_form = link.mapping + except Exception as exp: + raise exp + finally: + return new_triples + + + triples: List[TE_Triple] = doc.triples + links: List[TE_Pair] = doc.links + chains: List[TE_Chains] = doc.chains + + dereferenced_triples = process_chains(triples, chains) + linked_triples: List[TE_Triple] = process_links(dereferenced_triples, links) + finalGraph = Graph() + + for triple in linked_triples: + subject = None + if triple.subject.mapping: + subject = URIRef(triple.subject.mapping) + # else: + # subject = triple.subject.surface_form + + predicate = None + if triple.predicate.mapping: + predicate = URIRef(triple.predicate.mapping) + elif newP: + predicate = generatePredicate(triple.predicate.surface_form, namespace) + + object = None + # TODO if predicate is a datatype or object property + if triple.object.mapping: + object = URIRef(triple.object.mapping) + # else: + # object = Literal(triple.object.surface_form) + # if(subject and predicate and object): + # finalGraph.add((subject, predicate, object)) + + # new entities + if(predicate): + # print(f"new subject: {subject} {triple.subject.surface_form}") + + domain, range = ontology.get_domain_range(str(predicate)) + isObjectProperty = True if range and range.startswith("http://kg.org") else False + # print(f"predicate: {predicate}, domain: {domain}, range: {range}, isObjectProperty: {isObjectProperty}") + # print(f"predicate: {predicate}, domain: {domain}, range: {range}") + + if subject and subject.startswith("http://dbpedia.org"): # TODO workaround for dbpedia...
+ finalGraph.add((__hash_dbpedia_uri(subject), RDFS.label, Literal(triple.subject.surface_form))) + + + if not subject and triple.subject.surface_form and newE: + subject = URIRef(namespace+hash_uri(triple.subject.surface_form)) + finalGraph.add((subject, RDFS.label, Literal(triple.subject.surface_form))) + print(f"new subject: {subject} {triple.subject.surface_form}") + else: + print(f"subject: {subject} {triple.subject.surface_form}") + + if domain and subject: + finalGraph.add((__hash_dbpedia_uri(subject), RDF.type, URIRef(domain))) + + if object and isObjectProperty and object.startswith("http://dbpedia.org"): # TODO workaround for dbpedia... + finalGraph.add((__hash_dbpedia_uri(object), RDFS.label, Literal(triple.object.surface_form))) + + if not object and triple.object.surface_form and newE: + if isObjectProperty: + object = URIRef(namespace+hash_uri(triple.object.surface_form)) + finalGraph.add((object, RDFS.label, Literal(triple.object.surface_form))) + if range: + finalGraph.add((object, RDF.type, URIRef(range))) + else: + datatype = range if range else str(XSD.string) + object = Literal(triple.object.surface_form, datatype=datatype) + else: + if not isObjectProperty: + datatype = range if range else str(XSD.string) + object = Literal(triple.object.surface_form, datatype=datatype) + + if(subject and predicate and object): + finalGraph.add((__hash_dbpedia_uri(subject), predicate, __hash_dbpedia_uri(object))) + + return finalGraph + + +def generate_rdf(inputs: Dict[str, Data], outputs: Dict[str, Data], ontology: Ontology, newP: bool, newE: bool): + dir_or_file = inputs["source"].path + graph = Graph() + if os.path.isdir(dir_or_file): + for file in os.listdir(dir_or_file): + json_data = json.load(open(os.path.join(dir_or_file, file))) + doc = TE_Document(**json_data) + print(f"doc: {doc}") + for s, p, o in __generateRDF(doc, ontology, newP=newP, newE=newE): + graph.add(triple=(s, p, o)) + else: + doc = TE_Document(**json.load(open(dir_or_file))) + graph = 
__generateRDF(doc, ontology, newP=newP, newE=newE) + + graph.serialize(outputs["output"].path, format="nt") + print(f"RDF written to {outputs['output'].path}") + + +def generate_rdf_from_text_results_function(inputs: Dict[str, Data], outputs: Dict[str, Data]): + + ontology_path = os.environ.get("ONTOLOGY_PATH", "false") + if ontology_path == "false": + raise ValueError("ONTOLOGY_PATH is not set") + + ontology = OntologyUtil.load_ontology_from_file(Path(ontology_path)) + + generate_rdf(inputs, outputs, ontology, newP=False, newE=True) + + +generate_rdf_from_text_results_task = KgTask( + name="construct_rdf_from_text_tasks_task", + input_spec={"source": DataFormat.TE_JSON}, + output_spec={"output": DataFormat.RDF_NTRIPLES}, + function=generate_rdf_from_text_results_function +) + + +# ------------------------------------------------------------ + + +# def aggregate_3iejson_with_filter(inputs: Dict[str, Data], outputs: Dict[str, Data]): +# json1_path = inputs["json1"].path +# json2_path = inputs["json2"].path +# json3_path = inputs["json3"].path + +# def load_kg_uris_from_shades(): +# """ +# Loads the URIs of the entities in the current KG. +# """ +# shade_file = "/home/marvin/project/data/current/shade_seed.json" +# with open(shade_file, "r") as f: +# return json.load(f) + +# shade_dict = load_kg_uris_from_shades() +# reverse_shade_dict = {v: k for k, v in shade_dict.items()} +# kg_uris = set(shade_dict.values()) + + +# def filter_ie_doc(doc: TE_Document): +# """ +# Removes links to entities that are not in the current KG. 
+# """ + +# # for uri in kg_uris: +# # print(uri) + +# # Create a new list instead of modifying while iterating +# filtered_links = [] +# for link in doc.links: +# if link.link_type == "entity": +# if link.mapping not in kg_uris: +# # print(f"Removing entity link to {link.mapping} because it is not in the current KG") +# continue # Skip this link +# else: +# tmp = link.mapping +# try: +# link.mapping = reverse_shade_dict[tmp] +# # print(f"Replacing entity link {tmp} with {link.mapping}") +# except KeyError: +# print(f"KeyError: {tmp} not found in reverse_shade_dict, skipping") +# continue # Skip this link +# # elif link.link_type == "relation": +# # if link.mapping not in kg_uris: +# # print(f"Removing relation link to {link.mapping} because it is not in the current KG") +# # continue # Skip this link + +# # Add the link to the filtered list (either it passed all checks or it's not an entity link) +# filtered_links.append(link) + +# doc.links = filtered_links +# return doc + + +# if os.path.isdir(json1_path) and os.path.isdir(json2_path) and os.path.isdir(json3_path): +# # list files in each directory +# json1_files = set(os.listdir(json1_path)) +# json2_files = set(os.listdir(json2_path)) +# json3_files = set(os.listdir(json3_path)) + +# # check for mismatches +# if json1_files == json2_files == json3_files: +# os.makedirs(outputs["output"].path, exist_ok=True) +# for file in json1_files: +# json1_doc = TE_Document(**json.load(open(os.path.join(json1_path, file)))) +# json2_doc = TE_Document(**json.load(open(os.path.join(json2_path, file)))) +# json3_doc = TE_Document(**json.load(open(os.path.join(json3_path, file)))) + +# merged_doc = TE_Document() +# merged_doc.chains = json1_doc.chains + json2_doc.chains + json3_doc.chains +# merged_doc.links = json1_doc.links + json2_doc.links + json3_doc.links +# merged_doc.triples = json1_doc.triples + json2_doc.triples + json3_doc.triples + +# merged_doc = filter_ie_doc(merged_doc) + +# with 
open(os.path.join(outputs["output"].path, file), "w") as f: +# f.write(merged_doc.model_dump_json()) +# # print(f"Converted {file} to {os.path.join(outputs['output'].path, file)}") +# else: +# print("File mismatch detected:") +# print("Files only in json1:", json1_files - json2_files - json3_files) +# print("Files only in json2:", json2_files - json1_files - json3_files) +# print("Files only in json3:", json3_files - json1_files - json2_files) +# print("Common files in all:", json1_files & json2_files & json3_files) +# raise Exception("All input directories must contain the same file names") +# else: +# raise Exception("All inputs must be directories") + +# aggregate_3iejson_with_filter_task = KgTask( +# name="aggregate_iejson_with_filter_task", +# input_spec={"json1": DataFormat.TE_JSON, "json2": DataFormat.TE_JSON, "json3": DataFormat.TE_JSON}, +# output_spec={"output": DataFormat.TE_JSON}, +# function=aggregate_3iejson_with_filter +# ) + diff --git a/experiments/param-opti/src/qap/__init__.py b/experiments/param-opti/src/qap/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/param-opti/src/qap/fixtures/rdf_sampled_pipeline_configs.json b/experiments/param-opti/src/qap/fixtures/rdf_sampled_pipeline_configs.json new file mode 100644 index 0000000..5a745c1 --- /dev/null +++ b/experiments/param-opti/src/qap/fixtures/rdf_sampled_pipeline_configs.json @@ -0,0 +1,147 @@ +{ + "samples": [ + { + "profiles": { + "graph_alignment_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.8 + } + ], + "profile_name": "graph_alignment_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.8" + } + }, + "task_keys": [ + "graph_alignment_label_alias_embedding_transformer_task", + "fusion_first_value_task" + ] + }, + { + "profiles": { + "entity_matcher_label_alias_embedding_transformer": { + 
"bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.8 + } + ], + "profile_name": "entity_matcher_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.8" + }, + "relation_matcher_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_matcher_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + } + }, + "task_keys": [ + "relation_matcher_label_alias_embedding_transformer_task", + "entity_matcher_label_alias_embedding_transformer_task", + "aggregate_matching_results_task", + "fusion_first_value_task" + ] + }, + { + "profiles": { + "paris_entity_alignment": { + "bindings": [ + { + "parameter": "entity_matching_threshold", + "value": 0.9 + } + ], + "profile_name": "paris_entity_alignment_entity_matching_threshold=0.9" + }, + "relation_matcher_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_matcher_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + } + }, + "task_keys": [ + "relation_matcher_label_alias_embedding_transformer_task", + "paris_entity_alignment_task", + "aggregate_matching_results_task", + "fusion_first_value_task" + ] + }, + { + "profiles": { + "entity_matcher_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.9 + } + ], + "profile_name": "entity_matcher_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.9" + }, + "paris_ontology_matching": { + "bindings": [ + 
{ + "parameter": "ontology_matching_threshold", + "value": 0.5 + } + ], + "profile_name": "paris_ontology_matching_ontology_matching_threshold=0.5" + } + }, + "task_keys": [ + "paris_ontology_matching_task", + "entity_matcher_label_alias_embedding_transformer_task", + "aggregate_matching_results_task", + "fusion_first_value_task" + ] + }, + { + "profiles": { + "paris_graph_alignment": { + "bindings": [ + { + "parameter": "entity_matching_threshold", + "value": 0.9 + }, + { + "parameter": "relation_matching_threshold", + "value": 0.5 + } + ], + "profile_name": "paris_graph_alignment_entity_matching_threshold=0.9,relation_matching_threshold=0.5" + } + }, + "task_keys": [ + "paris_graph_alignment_task", + "fusion_first_value_task" + ] + } + ], + "version": 1 +} diff --git a/experiments/param-opti/src/qap/fixtures/text_sampled_pipeline_configs.json b/experiments/param-opti/src/qap/fixtures/text_sampled_pipeline_configs.json new file mode 100644 index 0000000..e5336cc --- /dev/null +++ b/experiments/param-opti/src/qap/fixtures/text_sampled_pipeline_configs.json @@ -0,0 +1,153 @@ +{ + "samples": [ + { + "profiles": { + "relation_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + }, + "spotlight_entity_linking": { + "bindings": [ + { + "parameter": "similarity_threshold", + "value": 0.8 + } + ], + "profile_name": "spotlight_entity_linking_similarity_threshold=0.8" + } + }, + "task_keys": [ + "corenlp_text_extraction_task", + "spotlight_entity_linking_task", + "aggregate_entity_linking_task", + "relation_linker_label_alias_embedding_transformer_task", + "aggregate_relation_linking_task", + "generate_rdf_from_text_results_task", + "select_first_value_task" + ] + }, + { + "profiles": { + 
"entity_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.9 + } + ], + "profile_name": "entity_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.9" + }, + "relation_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + } + }, + "task_keys": [ + "corenlp_text_extraction_task", + "entity_linker_label_alias_embedding_transformer_task", + "aggregate_entity_linking_task", + "relation_linker_label_alias_embedding_transformer_task", + "aggregate_relation_linking_task", + "generate_rdf_from_text_results_task", + "select_first_value_task" + ] + }, + { + "profiles": { + "relation_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + }, + "spotlight_entity_linking": { + "bindings": [ + { + "parameter": "similarity_threshold", + "value": 0.8 + } + ], + "profile_name": "spotlight_entity_linking_similarity_threshold=0.8" + } + }, + "task_keys": [ + "genie_text_extraction_task", + "spotlight_entity_linking_task", + "aggregate_entity_linking_task", + "relation_linker_label_alias_embedding_transformer_task", + "aggregate_relation_linking_task", + "generate_rdf_from_text_results_task", + "select_first_value_task" + ] + }, + { + "profiles": { + "entity_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": 
"intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.9 + } + ], + "profile_name": "entity_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.9" + }, + "relation_linker_label_alias_embedding_transformer": { + "bindings": [ + { + "parameter": "model_name", + "value": "intfloat/e5-base-v2" + }, + { + "parameter": "similarity_threshold", + "value": 0.5 + } + ], + "profile_name": "relation_linker_label_alias_embedding_transformer_model_name=intfloat/e5-base-v2,similarity_threshold=0.5" + } + }, + "task_keys": [ + "genie_text_extraction_task", + "entity_linker_label_alias_embedding_transformer_task", + "aggregate_entity_linking_task", + "relation_linker_label_alias_embedding_transformer_task", + "aggregate_relation_linking_task", + "generate_rdf_from_text_results_task", + "select_first_value_task" + ] + } + ], + "version": 1 +} diff --git a/experiments/param-opti/src/qap/sge_metrics.py b/experiments/param-opti/src/qap/sge_metrics.py new file mode 100644 index 0000000..b4e0e33 --- /dev/null +++ b/experiments/param-opti/src/qap/sge_metrics.py @@ -0,0 +1,17 @@ +from kg_sge.api.correctness import SourceGroundCorrectenss, SourceGroundCorrectnessConfig +from kg_sge.api.coverage import SourceGroundedCoverage, SourceGroundedCoverageConfig +from kgpipe_eval.utils.kg_utils import KgManager, KG, KgLike + +class SourceGroundedCorrectnessMetric: + def __init__(self): + self.correctness = SourceGroundCorrectenss() + + def compute(self, kg: KG, config: SourceGroundCorrectnessConfig): + pass + +class SourceGroundedCoverageMetric: + def __init__(self): + self.coverage = SourceGroundedCoverage() + + def compute(self, kg: KG, config: SourceGroundedCoverageConfig): + pass \ No newline at end of file diff --git a/experiments/param-opti/src/qap/test_conf_pipelines.py b/experiments/param-opti/src/qap/test_conf_pipelines.py new file mode 100644 index 0000000..89fe3ee --- /dev/null +++ 
b/experiments/param-opti/src/qap/test_conf_pipelines.py @@ -0,0 +1,708 @@ +from typing import List, Dict, Any, Optional +import itertools +import json +import random +from kgpipe.common import KgPipe, Data, DataFormat, Registry +from kgpipe.common.model.configuration import ConfigurationProfile, ParameterBinding +from kgpipe.common.model.task import KgTask +from pydantic import BaseModel + +from param_opti.tasks.paris import paris_graph_alignment_task, paris_entity_alignment_task, paris_ontology_matching_task +from param_opti.tasks.fusion import fusion_first_value_task +from param_opti.tasks.base_linker import relation_linker_label_alias_embedding_transformer_task, entity_linker_label_alias_embedding_transformer_task +from param_opti.tasks.base_matcher import ( + graph_alignment_label_alias_embedding_transformer_task, + relation_matcher_label_alias_embedding_transformer_task, + entity_matcher_label_alias_embedding_transformer_task, +) +from param_opti.tasks.corenlp import corenlp_text_extraction_task +from param_opti.tasks.genie import genie_text_extraction_task +from param_opti.tasks.spotlight import spotlight_entity_linking_task +from param_opti.tasks.matching_helpers import aggregate_matching_results_task +from param_opti.tasks.text_helpers import aggregate_entity_linking_task, aggregate_relation_linking_task +from param_opti.tasks.text_helpers import generate_rdf_from_text_results_task +from param_opti.tasks.select_lib import select_first_value_task +from kgpipe.generation.loaders import build_from_conf +from pathlib import Path +# for given tasks and config parameters, generate a pipeline (KGpipe) + +tmp_base_dir = Path("tmp") +if not tmp_base_dir.exists(): + tmp_base_dir.mkdir(parents=True, exist_ok=True) + +RDF_SAMPLED_PIPELINE_CONFIGS_FIXTURE = Path(__file__).resolve().parent / "fixtures" / "rdf_sampled_pipeline_configs.json" +_RDF_PIPELINE_CONFIG_SNAPSHOT_VERSION = 1 + +TEXT_SAMPLED_PIPELINE_CONFIGS_FIXTURE = Path(__file__).resolve().parent / "fixtures" / 
"text_sampled_pipeline_configs.json" +_TEXT_PIPELINE_CONFIG_SNAPSHOT_VERSION = 1 + + +class PipelineConfig(BaseModel): + tasks: List[KgTask] + config_catalog: Dict[str, ConfigurationProfile] + +RDF_SEARCH_SPACE = { + "graph_alignment_label_alias_embedding_transformer_task": { + "category": ["ontology_matching", "entity_matching", "aggregate_matching_results"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "relation_matcher_label_alias_embedding_transformer_task": { + "category": ["ontology_matching"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "entity_matcher_label_alias_embedding_transformer_task": { + "category": ["entity_matching"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "paris_ontology_matching_task": { + "category": ["ontology_matching"], + "ontology_matching_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "paris_entity_alignment_task": { + "category": ["entity_matching"], + "entity_matching_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "paris_graph_alignment_task": { + "category": ["ontology_matching", "entity_matching", "aggregate_matching_results"], + "entity_matching_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + "relation_matching_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "aggregate_matching_results_task": { + "category": ["aggregate_matching_results"], + }, + "fusion_first_value_task": { + "category": ["fusion"], + # "fusion_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "relation_linker_label_alias_embedding_transformer_task": { + "category": ["entity_linking"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", 
"sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "entity_linker_label_alias_embedding_transformer_task": { + "category": "entity_linking", + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, +} + +TEXT_SEARCH_SPACE = { + "corenlp_text_extraction_task": { + "category": ["information_extraction"], + # does not have config parameters + }, + "genie_text_extraction_task": { + "category": ["information_extraction"], + # does not have config parameters + }, + "spotlight_entity_linking_task": { + "category": ["entity_linking"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "relation_linker_label_alias_embedding_transformer_task": { + "category": ["relation_linking"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "entity_linker_label_alias_embedding_transformer_task": { + "category": ["entity_linking"], + "model_name": ["sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "intfloat/e5-base-v2"], + "similarity_threshold": [0.5, 0.6, 0.7, 0.8, 0.9], + }, + "aggregate_entity_linking_task": { + "category": ["aggregate_entity_linking"], + }, + "aggregate_relation_linking_task": { + "category": ["aggregate_relation_linking"], + }, + "generate_rdf_from_text_results_task": { + "category": ["construct_rdf"], + }, + "select_first_value_task": { + "category": ["fusion"], + }, +} + +TEXT_TASK_DICT = { + "corenlp_text_extraction_task": corenlp_text_extraction_task, + "genie_text_extraction_task": genie_text_extraction_task, + "spotlight_entity_linking_task": spotlight_entity_linking_task, + "relation_linker_label_alias_embedding_transformer_task": relation_linker_label_alias_embedding_transformer_task,
+ "entity_linker_label_alias_embedding_transformer_task": entity_linker_label_alias_embedding_transformer_task, + "select_first_value_task": select_first_value_task, + "aggregate_entity_linking_task": aggregate_entity_linking_task, + "aggregate_relation_linking_task": aggregate_relation_linking_task, + "generate_rdf_from_text_results_task": generate_rdf_from_text_results_task, +} + +RDF_TASK_DICT = { + "graph_alignment_label_alias_embedding_transformer_task": graph_alignment_label_alias_embedding_transformer_task, + "relation_matcher_label_alias_embedding_transformer_task": relation_matcher_label_alias_embedding_transformer_task, + "entity_matcher_label_alias_embedding_transformer_task": entity_matcher_label_alias_embedding_transformer_task, + "paris_ontology_matching_task": paris_ontology_matching_task, + "paris_entity_alignment_task": paris_entity_alignment_task, + "paris_graph_alignment_task": paris_graph_alignment_task, + "fusion_first_value_task": fusion_first_value_task, + "relation_linker_label_alias_embedding_transformer_task": relation_linker_label_alias_embedding_transformer_task, + "entity_linker_label_alias_embedding_transformer_task": entity_linker_label_alias_embedding_transformer_task, + "aggregate_matching_results_task": aggregate_matching_results_task, + # "fusion_union_task": fusion_union_task, +} + + + +task_dict = {**TEXT_TASK_DICT, **RDF_TASK_DICT} + +for task_name, task in RDF_TASK_DICT.items(): + Registry.add_task(task_name, task) + +class PipelineLayout(BaseModel): + """ + allowed task categories in the pipeline + """ + allowed_task_categories: List[str] + +TEXT_PIPELINE_LAYOUT = PipelineLayout( + allowed_task_categories=["information_extraction", "entity_linking", "aggregate_entity_linking", "relation_linking", "aggregate_relation_linking", "construct_rdf", "fusion"] +) + +RDF_PIPELINE_LAYOUT = PipelineLayout( + allowed_task_categories=["ontology_matching", "entity_matching", "aggregate_matching_results", "fusion"] +) + +def 
_task_categories_list(search_space: Dict[str, Dict[str, Any]], task_name: str) -> List[str]: + raw = search_space.get(task_name, {}).get("category") + if isinstance(raw, list): + return [c for c in raw if isinstance(c, str)] + if isinstance(raw, str): + return [raw] + return [] + + +def enumerate_valid_task_combinations( + search_space: Dict[str, Dict[str, Any]], + pipeline_layout: PipelineLayout, +) -> List[List[str]]: + """ + Enumerate all possible task-name combinations for the given pipeline layout, + respecting category order and multi-category coverage, without sampling config options. + + A task is only eligible for the current category if its declared categories are + disjoint from categories already covered by earlier tasks. That avoids pairing e.g. + Paris ontology matching with a dual-category embedding matcher that would repeat + ontology coverage when only entity matching is still needed. + """ + all_task_names = list(search_space.keys()) + + combos: List[List[str]] = [[]] + covered_sets: List[set[str]] = [set()] + + for category in pipeline_layout.allowed_task_categories: + next_combos: List[List[str]] = [] + next_covered_sets: List[set[str]] = [] + + for combo, covered in zip(combos, covered_sets): + if category in covered: + next_combos.append(combo) + next_covered_sets.append(covered) + continue + + eligible: List[str] = [] + for tn in all_task_names: + cats = _task_categories_list(search_space, tn) + if category not in cats: + continue + if set(cats) & covered: + continue + eligible.append(tn) + for tn in eligible: + new_combo = combo + [tn] + new_covered = set(covered) + new_covered.update(_task_categories_list(search_space, tn)) + next_combos.append(new_combo) + next_covered_sets.append(new_covered) + + combos, covered_sets = next_combos, next_covered_sets + + # De-duplicate while keeping stable order. 
+ seen: set[tuple[str, ...]] = set() + unique: List[List[str]] = [] + for c in combos: + t = tuple(c) + if t in seen: + continue + seen.add(t) + unique.append(c) + return unique + +def _get_param(definition: Any, param_name: str): + params = getattr(definition, "parameters", None) + if params is None: + raise KeyError(f"Task config_spec has no parameters field (missing {param_name})") + + # common shapes: dict-like or list of Parameter + if hasattr(params, "get"): + p = params.get(param_name) + if p is None: + raise KeyError(f"Parameter {param_name} not found in config_spec.parameters") + return p + + for p in params: + if getattr(p, "name", None) == param_name: + return p + raise KeyError(f"Parameter {param_name} not found in config_spec.parameters") + + +def pipeline_config_to_snapshot(task_keys: List[str], pipeline_config: PipelineConfig) -> Dict[str, Any]: + profiles: Dict[str, Any] = {} + for task in pipeline_config.tasks: + prof = pipeline_config.config_catalog.get(task.name) + if prof is None: + continue + profiles[task.name] = { + "profile_name": prof.name, + "bindings": [ + {"parameter": binding.parameter.name, "value": binding.value} + for binding in prof.bindings + ], + } + return {"task_keys": task_keys, "profiles": profiles} + + +def pipeline_config_from_snapshot(snapshot: Dict[str, Any]) -> PipelineConfig: + task_keys: List[str] = snapshot["task_keys"] + profiles: Dict[str, Any] = snapshot.get("profiles") or {} + tasks: List[KgTask] = [] + config_catalog: Dict[str, ConfigurationProfile] = {} + + for task_key in task_keys: + task = task_dict[task_key] + tasks.append(task) + prof_data = profiles.get(task.name) + if prof_data is None: + continue + if getattr(task, "config_spec", None) is None: + continue + bindings = [ + ParameterBinding( + parameter=_get_param(task.config_spec, b["parameter"]), + value=b["value"], + ) + for b in prof_data["bindings"] + ] + config_catalog[task.name] = ConfigurationProfile( + name=prof_data["profile_name"], + 
definition=task.config_spec, + bindings=bindings, + ) + + return PipelineConfig(tasks=tasks, config_catalog=config_catalog) + + +def load_rdf_sampled_pipeline_configs(path: Optional[Path] = None) -> List[PipelineConfig]: + fixture_path = path or RDF_SAMPLED_PIPELINE_CONFIGS_FIXTURE + raw = json.loads(fixture_path.read_text(encoding="utf-8")) + if raw.get("version") != _RDF_PIPELINE_CONFIG_SNAPSHOT_VERSION: + raise ValueError( + f"Unsupported rdf sampled configs snapshot version {raw.get('version')!r}; " + f"expected {_RDF_PIPELINE_CONFIG_SNAPSHOT_VERSION}" + ) + return [pipeline_config_from_snapshot(item) for item in raw["samples"]] + + +def load_text_sampled_pipeline_configs(path: Optional[Path] = None) -> List[PipelineConfig]: + fixture_path = path or TEXT_SAMPLED_PIPELINE_CONFIGS_FIXTURE + raw = json.loads(fixture_path.read_text(encoding="utf-8")) + if raw.get("version") != _TEXT_PIPELINE_CONFIG_SNAPSHOT_VERSION: + raise ValueError( + f"Unsupported text sampled configs snapshot version {raw.get('version')!r}; " + f"expected {_TEXT_PIPELINE_CONFIG_SNAPSHOT_VERSION}" + ) + return [pipeline_config_from_snapshot(item) for item in raw["samples"]] + + +# TODO rules for valid pipeline config: +def sample_valid_pipeline_config( + search_space: Dict[str, Dict[str, Any]], + pipeline_layout: PipelineLayout, +) -> PipelineConfig: + """ + Randomly sample a valid pipeline config from the search space, + respecting the order of categories in the pipeline layout. 
+ """ + tasks: List[KgTask] = [] + config_catalog: Dict[str, ConfigurationProfile] = {} + covered_categories: set[str] = set() + + for category in pipeline_layout.allowed_task_categories: + if category in covered_categories: + continue + + eligible_task_names = [ + tn + for tn, space in search_space.items() + if ( + space.get("category") == category + or ( + isinstance(space.get("category"), list) + and category in (space.get("category") or []) + ) + ) + ] + if not eligible_task_names: + continue + + eligible_task_names = [ + tn + for tn in eligible_task_names + if not (set(_task_categories_list(search_space, tn)) & covered_categories) + ] + if not eligible_task_names: + raise ValueError( + f"No task can cover category {category!r} without overlapping already covered " + f"categories {sorted(covered_categories)}. Adjust search_space or pipeline_layout." + ) + + task_key = random.choice(eligible_task_names) + task = task_dict[task_key] + covered_categories.update(_task_categories_list(search_space, task_key)) + tasks.append(task) + + # metadata only or task has no config spec + if getattr(task, "config_spec", None) is None: + continue + + bindings: List[ParameterBinding] = [] + name_parts: List[str] = [] + for config_name, config_values in search_space[task_key].items(): + if config_name == "category": + continue + if not isinstance(config_values, list): + raise TypeError( + f"Search space values must be lists; got {task_key}.{config_name}={type(config_values)}" + ) + if not config_values: + raise ValueError(f"Empty search space for {task_key}.{config_name}") + + config_value = random.choice(config_values) + name_parts.append(f"{config_name}={config_value}") + bindings.append( + ParameterBinding( + parameter=_get_param(task.config_spec, config_name), + value=config_value, + ) + ) + + if bindings: + config_catalog[task.name] = ConfigurationProfile( + name=f"{task.name}_" + ",".join(name_parts), + definition=task.config_spec, + bindings=bindings, + ) + + return 
PipelineConfig(tasks=tasks, config_catalog=config_catalog) + +def print_pipeline_config_short(pipeline_config: PipelineConfig): + """ + print the pipeline config in a short format + """ + print() + print("================") + for task in pipeline_config.tasks: + task_name = task.name + profile: Optional[ConfigurationProfile] = pipeline_config.config_catalog.get(task_name) + if profile is None: + print(f"- {task_name}") + continue + + parts: List[str] = [] + for binding in profile.bindings: + parts.append(f"{binding.parameter.name}={binding.value}") + params = ", ".join(parts) + print(f"- {task_name}({params})") + +def sample_config_catalog_for_task_combo( + search_space: Dict[str, Dict[str, Any]], + task_name_combo: List[str], + *, + rng: random.Random, +) -> PipelineConfig: + tasks: List[KgTask] = [] + config_catalog: Dict[str, ConfigurationProfile] = {} + + for task_key in task_name_combo: + task = task_dict[task_key] + tasks.append(task) + + if getattr(task, "config_spec", None) is None: + continue + + bindings: List[ParameterBinding] = [] + name_parts: List[str] = [] + for config_name, config_values in search_space[task_key].items(): + if config_name == "category": + continue + if not isinstance(config_values, list): + raise TypeError( + f"Search space values must be lists; got {task_key}.{config_name}={type(config_values)}" + ) + if not config_values: + raise ValueError(f"Empty search space for {task_key}.{config_name}") + + config_value = rng.choice(config_values) + name_parts.append(f"{config_name}={config_value}") + bindings.append( + ParameterBinding( + parameter=_get_param(task.config_spec, config_name), + value=config_value, + ) + ) + + if bindings: + config_catalog[task.name] = ConfigurationProfile( + name=f"{task.name}_" + ",".join(name_parts), + definition=task.config_spec, + bindings=bindings, + ) + + return PipelineConfig(tasks=tasks, config_catalog=config_catalog) + +def enumerate_exhaustive_pipeline_config_snapshots( + search_space: Dict[str, 
Dict[str, Any]], + pipeline_layout: PipelineLayout, +) -> List[Dict[str, Any]]: + combos = enumerate_valid_task_combinations(search_space, pipeline_layout) + + def _task_param_assignments(task_key: str) -> List[Dict[str, Any]]: + space = search_space.get(task_key, {}) + param_space: Dict[str, List[Any]] = {k: v for k, v in space.items() if k != "category"} + if not param_space: + return [{}] + keys = list(param_space.keys()) + values_lists = [param_space[k] for k in keys] + return [dict(zip(keys, values)) for values in itertools.product(*values_lists)] + + all_snapshots: List[Dict[str, Any]] = [] + total_expected = 0 + + for combo in combos: + per_task_assignments = [_task_param_assignments(task_key) for task_key in combo] + + expected_for_combo = 1 + for assignments in per_task_assignments: + expected_for_combo *= len(assignments) + total_expected += expected_for_combo + + produced_for_combo = 0 + print() + print("combo:", combo) + print("expected configs:", expected_for_combo) + + for assignment_tuple in itertools.product(*per_task_assignments): + produced_for_combo += 1 + if produced_for_combo % 100 == 1 or produced_for_combo == expected_for_combo: + print(f"config {produced_for_combo}/{expected_for_combo}") + + tasks: List[KgTask] = [] + config_catalog: Dict[str, ConfigurationProfile] = {} + + for task_key, params in zip(combo, assignment_tuple): + task = task_dict[task_key] + tasks.append(task) + + if not params: + continue + if getattr(task, "config_spec", None) is None: + continue + + bindings: List[ParameterBinding] = [] + name_parts: List[str] = [] + + # Iterate in search_space order for stable snapshots. 
+ for config_name, _config_values in search_space[task_key].items(): + if config_name == "category": + continue + if config_name not in params: + continue + config_value = params[config_name] + name_parts.append(f"{config_name}={config_value}") + bindings.append( + ParameterBinding( + parameter=_get_param(task.config_spec, config_name), + value=config_value, + ) + ) + + config_catalog[task.name] = ConfigurationProfile( + name=f"{task.name}_" + ",".join(name_parts), + definition=task.config_spec, + bindings=bindings, + ) + + pipeline_config = PipelineConfig(tasks=tasks, config_catalog=config_catalog) + all_snapshots.append(pipeline_config_to_snapshot(combo, pipeline_config)) + + assert produced_for_combo == expected_for_combo + + print() + print("TOTAL expected configs:", total_expected) + print("TOTAL generated snapshots:", len(all_snapshots)) + return all_snapshots + + + +def test_sample_valid_rdf_pipeline_config(): + pipeline_layout = PipelineLayout( + allowed_task_categories=["ontology_matching", "entity_matching", "aggregate_matching_results", "fusion"] + ) + pipeline_config = sample_valid_pipeline_config(RDF_SEARCH_SPACE, pipeline_layout) + print_pipeline_config_short(pipeline_config) + +def test_enumerate_all_valid_rdf_task_combinations_no_config_sampling(): + print("enumerate_all_valid_rdf_task_combinations_no_config_sampling") + combos = enumerate_valid_task_combinations(RDF_SEARCH_SPACE, RDF_PIPELINE_LAYOUT) + + for combo in combos: + print(combo) + + # With current SEARCH_SPACE: + # - ontology_matching can be satisfied by paris_ontology_matching_task, paris_entity_alignment_task, paris_graph_alignment_task + # - entity_matching can be satisfied by paris_entity_alignment_task, paris_graph_alignment_task (and may be skipped if already covered) + # - fusion must be satisfied by fusion_first_value_task + # expected = { + # ("paris_ontology_matching_task", "paris_entity_alignment_task", "fusion_first_value_task"), + # ("paris_ontology_matching_task", 
"paris_graph_alignment_task", "fusion_first_value_task"), + # ("paris_graph_alignment_task", "fusion_first_value_task"), + # } + + # assert set(tuple(c) for c in combos) == expected + + + +def test_enumerate_all_valid_rdf_task_combinations_with_config_sampling(): + print("enumerate_all_valid_rdf_task_combinations_with_config_sampling") + n = 1 + rng = random.Random(0) + + combos = enumerate_valid_task_combinations(RDF_SEARCH_SPACE, RDF_PIPELINE_LAYOUT) + + total_config_count = 0 + snapshots: List[Dict[str, Any]] = [] + + for combo in combos: + print() + print("combo:", combo) + for i in range(n): + total_config_count += 1 + print(f"sample {total_config_count}/{len(combos) * n}") + pipeline_config = sample_config_catalog_for_task_combo( + RDF_SEARCH_SPACE, combo, rng=rng + ) + + print_pipeline_config_short(pipeline_config) + snapshots.append(pipeline_config_to_snapshot(combo, pipeline_config)) + + RDF_SAMPLED_PIPELINE_CONFIGS_FIXTURE.parent.mkdir(parents=True, exist_ok=True) + RDF_SAMPLED_PIPELINE_CONFIGS_FIXTURE.write_text( + json.dumps( + {"version": _RDF_PIPELINE_CONFIG_SNAPSHOT_VERSION, "samples": snapshots}, + indent=2, + sort_keys=True, + ) + + "\n", + encoding="utf-8", + ) + + +def test_sample_valid_text_pipeline_config(): + pipeline_config = sample_valid_pipeline_config(TEXT_SEARCH_SPACE, TEXT_PIPELINE_LAYOUT) + print_pipeline_config_short(pipeline_config) + + +def test_enumerate_all_valid_text_task_combinations_no_config_sampling(): + print("enumerate_all_valid_text_task_combinations_no_config_sampling") + combos = enumerate_valid_task_combinations(TEXT_SEARCH_SPACE, TEXT_PIPELINE_LAYOUT) + for combo in combos: + print(combo) + +def test_enumerate_all_valid_text_task_combinations_with_config_sampling(): + print("enumerate_all_valid_text_task_combinations_with_config_sampling") + n = 1 + rng = random.Random(0) + + combos = enumerate_valid_task_combinations(TEXT_SEARCH_SPACE, TEXT_PIPELINE_LAYOUT) + + total_config_count = 0 + snapshots: List[Dict[str, Any]] = 
[] + + for combo in combos: + print() + print("combo:", combo) + for i in range(n): + total_config_count += 1 + print(f"sample {total_config_count}/{len(combos) * n}") + pipeline_config = sample_config_catalog_for_task_combo( + TEXT_SEARCH_SPACE, combo, rng=rng + ) + print_pipeline_config_short(pipeline_config) + snapshots.append(pipeline_config_to_snapshot(combo, pipeline_config)) + + TEXT_SAMPLED_PIPELINE_CONFIGS_FIXTURE.parent.mkdir(parents=True, exist_ok=True) + TEXT_SAMPLED_PIPELINE_CONFIGS_FIXTURE.write_text( + json.dumps( + {"version": _TEXT_PIPELINE_CONFIG_SNAPSHOT_VERSION, "samples": snapshots}, + indent=2, + sort_keys=True, + ) + + "\n", + encoding="utf-8", + ) + +def test_enumerate_all_valid_text_task_combinations_with_config_sampling_exhaustive(): + print("enumerate_all_valid_text_task_combinations_with_config_sampling_exhaustive") + all_snapshots = enumerate_exhaustive_pipeline_config_snapshots( + TEXT_SEARCH_SPACE, TEXT_PIPELINE_LAYOUT + ) + serialized = [json.dumps(s, sort_keys=True) for s in all_snapshots] + assert len(set(serialized)) == len(serialized) + + +def test_enumerate_all_valid_rdf_task_combinations_with_config_sampling_exhaustive(): + print("enumerate_all_valid_rdf_task_combinations_with_config_sampling_exhaustive") + all_snapshots = enumerate_exhaustive_pipeline_config_snapshots( + RDF_SEARCH_SPACE, RDF_PIPELINE_LAYOUT + ) + serialized = [json.dumps(s, sort_keys=True) for s in all_snapshots] + assert len(set(serialized)) == len(serialized) + + + +# def test_rdf_pipeline_from_config(): +# pipeline_config = sample_valid_pipeline_config(RDF_SEARCH_SPACE, PipelineLayout(allowed_task_categories=["entity_matching", "fusion"])) + +# seed_path = tmp_base_dir / "seed.nt" +# source_path = tmp_base_dir / "source.nt" +# result_path = tmp_base_dir / "result.nt" +# tasks_tmp_dir = tmp_base_dir / "tasks_tmp" +# tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + +# # Ensure inputs exist for pipeline execution. 
+# seed_path.write_text(" .\n") +# source_path.write_text(" .\n") + +# pipeline = KgPipe( +# tasks=pipeline_config.tasks, +# seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), +# data_dir=tasks_tmp_dir, +# name="test_pipeline") + +# pipeline.build( +# stable_files=True, +# configCatalog=pipeline_config.config_catalog, +# source=Data(path=source_path, format=DataFormat.RDF_NTRIPLES), +# result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES)) + +# pipeline.run(configCatalog=pipeline_config.config_catalog, stable_files_override=True) \ No newline at end of file diff --git a/experiments/param-opti/src/qap/test_eval_pipelines.py b/experiments/param-opti/src/qap/test_eval_pipelines.py new file mode 100644 index 0000000..bcc5012 --- /dev/null +++ b/experiments/param-opti/src/qap/test_eval_pipelines.py @@ -0,0 +1,122 @@ +from pathlib import Path +import json +import pytest + +from kgpipe_eval.utils.kg_utils import KgManager +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentMetric, TripleAlignmentConfig +from kgpipe_eval.metrics.entity_alignment import EntityAlignmentMetric, EntityAlignmentConfig +from kgpipe_eval.api import MetricResult +from kgpipe_eval.test.utils import render_metric_result + +rdf_base_dir = Path("data/tmp/rdf_pipelines/") +text_base_dir = Path("data/tmp/text_pipelines/") +result_dir = Path("data/output/reference_eval") + +def get_rdf_final_kgs(): + """ + Return every generated KG file matching *eval.nt in rdf_base_dir. + """ + BASE_DIR = rdf_base_dir + return [f for f in BASE_DIR.glob("*eval.nt")] + +def get_text_final_kgs(): + """ + Return every generated KG file matching *eval.nt in text_base_dir. + """ + BASE_DIR = text_base_dir + return [f for f in BASE_DIR.glob("*eval.nt")] + +def test_get_rdf_final_kgs(): + """ + Print the final RDF KGs found by get_rdf_final_kgs. + """ + final_kgs = get_rdf_final_kgs() + for final_kg in final_kgs: + print(final_kg) + +def test_get_text_final_kgs(): + """ + Print the final text KGs found by get_text_final_kgs. + """ + 
final_kgs = get_text_final_kgs() + for final_kg in final_kgs: + print(final_kg) + +def _write_to_file(string: str, path: Path): + with open(path, "w") as f: + f.write(string) + print(f"wrote to {path}") + +def _metric_result_to_jsonable(metric_result: MetricResult) -> dict: + metric = metric_result.metric + metric_key = getattr(metric, "key", metric.__class__.__name__) + return { + "metric": metric_key, + "summary": metric_result.summary, + "measurements": [ + {"name": m.name, "value": m.value, "unit": m.unit} + for m in metric_result.measurements + ], + } + +def _write_json(obj: object, path: Path): + with open(path, "w") as f: + json.dump(obj, f, indent=2, sort_keys=True, default=str) + f.write("\n") + print(f"wrote to {path}") + +# seed_kg = KgManager.load_kg(Path("data/input_final/target_kg/graph.nt")) + +def eval_pipeline(final_kg, reference_kg_path): + + print(f"evaluating {final_kg}") + + ref_kg_path = reference_kg_path + gen_kg_path = final_kg + + entity_alignment_config = EntityAlignmentConfig( + method="label_embedding", + reference_kg=ref_kg_path, + verified_entities_path=None, + verified_entities_delimiter="\t", + entity_sim_threshold=0.95 + ) + + + gen_kg = KgManager.load_kg(gen_kg_path) + # test_kg = KgManager.substract_kg(gen_kg, seed_kg) # TODO add back labels and types + test_kg = gen_kg + + metric_result : MetricResult = EntityAlignmentMetric().compute(test_kg, entity_alignment_config) + result_string = render_metric_result(metric_result) + _write_to_file(result_string, result_dir / (final_kg.name + ".entity_alignment.txt")) + _write_json(_metric_result_to_jsonable(metric_result), result_dir / (final_kg.name + ".entity_alignment.json")) + + + triple_alignment_config = TripleAlignmentConfig( + reference_kg=ref_kg_path, + entity_alignment_config=entity_alignment_config, + value_sim_threshold=0.5, + cache_literal_embeddings=True + ) + + metric_result : MetricResult = TripleAlignmentMetric().compute(test_kg, triple_alignment_config) + result_string = 
render_metric_result(metric_result) + _write_to_file(result_string, result_dir / (final_kg.name + ".triple_alignment.txt")) + _write_json(_metric_result_to_jsonable(metric_result), result_dir / (final_kg.name + ".triple_alignment.json")) + +@pytest.mark.parametrize("final_kg", get_rdf_final_kgs()) +def test_eval_rdf_pipeline_runs(final_kg): + """ + evaluate all runs of the rdf pipelines + """ + eval_pipeline(final_kg, Path("data/input_final/reference_kg/data_no_seed.nt")) + +@pytest.mark.parametrize("final_kg", get_text_final_kgs()) +def test_eval_text_pipeline_runs(final_kg): + """ + evaluate all runs of the text pipelines + """ + # data/input_final/txt_source/ref + + eval_pipeline(final_kg, Path("/data/datasets/params_experiments/latest/input_final/txt_source/tmp_reference/reference_kg_noseed.nt")) \ No newline at end of file diff --git a/experiments/param-opti/src/qap/test_exec_pipelines.py b/experiments/param-opti/src/qap/test_exec_pipelines.py new file mode 100644 index 0000000..1804e86 --- /dev/null +++ b/experiments/param-opti/src/qap/test_exec_pipelines.py @@ -0,0 +1,202 @@ +from kgpipe.common import KgPipe, Data, DataFormat +from kgpipe.common.model.configuration import ConfigurationProfile, ParameterBinding, ConfigurationDefinition +from param_opti.tasks.paris import paris_graph_alignment_task, paris_entity_alignment_task, paris_ontology_matching_task +from param_opti.tasks.fusion import fusion_first_value_task +from param_opti.tasks.base_linker import relation_linker_label_alias_embedding_transformer_task, entity_linker_label_alias_embedding_transformer_task +from param_opti.tasks.corenlp import corenlp_text_extraction_task +from param_opti.tasks.genie import genie_text_extraction_task +from param_opti.tasks.spotlight import spotlight_entity_linking_task +from param_opti.tasks.text_helpers import aggregate_text_tasks_task, generate_rdf_from_text_results_task +from param_opti.tasks.select_lib import select_first_value_task +from qap.test_conf_pipelines 
import ( + PipelineConfig, + _get_param, + load_rdf_sampled_pipeline_configs, + load_text_sampled_pipeline_configs, +) +from pathlib import Path +import pytest +import os + +from dotenv import load_dotenv +load_dotenv() + +tmp_base_dir = Path("data/tmp/text_pipelines") +if not tmp_base_dir.exists(): + tmp_base_dir.mkdir(parents=True, exist_ok=True) + + +ontology_path = "data/input_final/target_kg/ontology.ttl" +os.environ["ONTOLOGY_PATH"] = ontology_path + + +def get_default_rdf_pipeline_config() -> PipelineConfig: + return PipelineConfig( + tasks=[ + paris_graph_alignment_task, + fusion_first_value_task, + ], + config_catalog={ + # Key must match KgTask.name because KgPipe delegates by task.name + "paris_graph_alignment": ConfigurationProfile( + name="paris_graph_alignment", + definition=paris_graph_alignment_task.config_spec, + bindings=[ + ParameterBinding(parameter=_get_param(paris_graph_alignment_task.config_spec, "entity_matching_threshold"), value=0.5), + ParameterBinding(parameter=_get_param(paris_graph_alignment_task.config_spec, "relation_matching_threshold"), value=0.5), + ], + ) + }, + ) + +def test_rdf_pipeline_from_default_config(): + pipeline_config = get_default_rdf_pipeline_config() + + seed_path = tmp_base_dir / "seed.nt" + source_path = tmp_base_dir / "source.nt" + result_path = tmp_base_dir / "result.nt" + tasks_tmp_dir = tmp_base_dir / "tasks_tmp" + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + # Ensure inputs exist for pipeline execution. 
+ seed_path.write_text(" .\n") + source_path.write_text(" .\n") + + pipeline = KgPipe( + tasks=pipeline_config.tasks, + seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), + data_dir=tasks_tmp_dir, + name="test_pipeline") + + pipeline.build( + stable_files=True, + configCatalog=pipeline_config.config_catalog, + source=Data(path=source_path, format=DataFormat.RDF_NTRIPLES), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES)) + + pipeline.run(configCatalog=pipeline_config.config_catalog, stable_files_override=True) + + +@pytest.mark.parametrize("config_idx", range(len(load_rdf_sampled_pipeline_configs()))) +def test_rdf_pipeline_from_saved_sampled_configs(config_idx): + """Run KGpipe with the PipelineConfigs loaded from the JSON fixture written by qap.test_conf_pipelines.""" + configs = load_rdf_sampled_pipeline_configs() + assert configs, "fixtures/rdf_sampled_pipeline_configs.json is missing or empty; run test_enumerate_all_valid_rdf_task_combinations_with_config_sampling" + + pipeline_config = configs[config_idx] + + seed_path = Path("data/input_final/target_kg/graph.nt") + source_path = Path("data/input_final/rdf_source/graph.nt") + result_path = tmp_base_dir / f"rdf_result_saved_sample_config_idx_{config_idx}.nt" + tasks_tmp_dir = tmp_base_dir / f"rdf_tasks_tmp_saved_sample_config_idx_{config_idx}" + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + pipeline = KgPipe( + tasks=pipeline_config.tasks, + seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), + data_dir=tasks_tmp_dir, + name="test_pipeline_saved_sample", + ) + + pipeline.build( + stable_files=True, + configCatalog=pipeline_config.config_catalog, + source=Data(path=source_path, format=DataFormat.RDF_NTRIPLES), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES), + ) + + pipeline.run(configCatalog=pipeline_config.config_catalog, stable_files_override=True) + + +def get_default_text_pipeline_config() -> PipelineConfig: + return PipelineConfig( + tasks=[ + 
corenlp_text_extraction_task, + entity_linker_label_alias_embedding_transformer_task, + relation_linker_label_alias_embedding_transformer_task, + aggregate_text_tasks_task, + generate_rdf_from_text_results_task, + select_first_value_task, + ], + config_catalog={ + "entity_linker_label_alias_embedding_transformer": ConfigurationProfile( + name="entity_linker_label_alias_embedding_transformer", + definition=entity_linker_label_alias_embedding_transformer_task.config_spec, + bindings=[ + ParameterBinding(parameter=_get_param(entity_linker_label_alias_embedding_transformer_task.config_spec, "model_name"), value="sentence-transformers/all-MiniLM-L6-v2"), + ParameterBinding(parameter=_get_param(entity_linker_label_alias_embedding_transformer_task.config_spec, "similarity_threshold"), value=0.5), + ], + ), + "relation_linker_label_alias_embedding_transformer": ConfigurationProfile( + name="relation_linker_label_alias_embedding_transformer", + definition=relation_linker_label_alias_embedding_transformer_task.config_spec, + bindings=[ + ParameterBinding(parameter=_get_param(relation_linker_label_alias_embedding_transformer_task.config_spec, "model_name"), value="sentence-transformers/all-MiniLM-L6-v2"), + ParameterBinding(parameter=_get_param(relation_linker_label_alias_embedding_transformer_task.config_spec, "similarity_threshold"), value=0.5), + ], + ), + }, + ) + +def test_text_pipeline_from_default_config(): + pipeline_config = get_default_text_pipeline_config() + + import os + os.environ["ONTOLOGY_PATH"] = "data/input_final/target_kg/ontology.ttl" + + seed_path = Path("data/input_final/target_kg/graph.nt") + source_path = Path("data/input_final/txt_source/docs") + result_path = Path("data/tmp/text_pipelines/result.nt") + tasks_tmp_dir = Path("data/tmp/text_pipelines/tasks_tmp") + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + pipeline = KgPipe( + tasks=pipeline_config.tasks, + seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), + data_dir=tasks_tmp_dir, + 
name="test_text_pipeline") + + pipeline.build( + stable_files=True, + configCatalog=pipeline_config.config_catalog, + source=Data(path=source_path, format=DataFormat.TEXT), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES)) + + pipeline.run(configCatalog=pipeline_config.config_catalog, stable_files_override=False) + + +@pytest.mark.parametrize("config_idx", range(len(load_text_sampled_pipeline_configs()))) +def test_text_pipeline_from_saved_sampled_configs(config_idx): + """Run KGpipe with the PipelineConfigs loaded from the JSON fixture written by qap.test_conf_pipelines.""" + configs = load_text_sampled_pipeline_configs() + assert configs, "fixtures/text_sampled_pipeline_configs.json is missing or empty; run test_enumerate_all_valid_text_task_combinations_with_config_sampling" + + pipeline_config = configs[config_idx] + + seed_path = Path("data/input_final/target_kg/graph.nt") + source_path = Path("data/input_final/txt_source/docs") + result_path = tmp_base_dir / f"text_result_saved_sample_config_idx_{config_idx}.nt" + tasks_tmp_dir = tmp_base_dir / f"text_tasks_tmp_saved_sample_config_idx_{config_idx}" + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + pipeline = KgPipe( + tasks=pipeline_config.tasks, + seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), + data_dir=tasks_tmp_dir, + name="test_text_pipeline_saved_sample", + ) + + print(f"Building pipeline... {config_idx}") + print("#######################") + + pipeline.build( + stable_files=True, + configCatalog=pipeline_config.config_catalog, + source=Data(path=source_path, format=DataFormat.TEXT), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES), + ) + + print(f"Running pipeline... 
{config_idx}") + print("#######################") + + pipeline.run(configCatalog=pipeline_config.config_catalog, stable_files_override=False) \ No newline at end of file diff --git a/experiments/param-opti/src/qap/test_ref_based.py b/experiments/param-opti/src/qap/test_ref_based.py new file mode 100644 index 0000000..fca309f --- /dev/null +++ b/experiments/param-opti/src/qap/test_ref_based.py @@ -0,0 +1,205 @@ +from kgpipe.common import KgPipe, Data, DataFormat +from kgpipe.common.model.configuration import ConfigurationProfile, ParameterBinding, ConfigurationDefinition +from param_opti.tasks.paris import paris_graph_alignment_task +from param_opti.tasks.fusion import fusion_first_value_task +from param_opti.tasks.openie import openie_pipeline_task +from param_opti.tasks.base_linker import relation_linker_label_alias_embedding_transformer_task, entity_linker_label_alias_embedding_transformer_task +from pathlib import Path +from typing import List +import pytest +# Using ground truth + +# 1. execute PARIS pipeline, with different thresholds +# 2. 
evaluate the quality of the pipeline, with different thresholds + + +# - [ ] impl paris wrapper with exchange and threshold filter + +ontology_path = "tmp/ontology.ttl" + +tmp_base_dir = Path("data/tmp/rdf_pipelines") +tmp_base_dir.mkdir(parents=True, exist_ok=True) + + +def _write_to_file(string: str, path: Path): + with open(path, "w") as f: + f.write(string) + +def _get_param(definition: ConfigurationDefinition, param_name: str): + params = getattr(definition, "parameters", None) + if params is None: + raise KeyError(f"Task config_spec has no parameters field (missing {param_name})") + + if hasattr(params, "get"): + p = params.get(param_name) + if p is None: + raise KeyError(f"Parameter {param_name} not found in config_spec.parameters") + return p + + for p in params: + if getattr(p, "name", None) == param_name: + return p + raise KeyError(f"Parameter {param_name} not found in config_spec.parameters") + + +def get_paris_pipeline(entity_matching_threshold: float, relation_matching_threshold: float): + name = ( + f"paris_graph_alignment(entity={entity_matching_threshold},rel={relation_matching_threshold})" + "_fusion_first_value" + ) + + seed_path = Path("data/inputs/target_kg/data.nt") + source_path = Path("data/inputs/rdf_source/data.nt") + result_path = Path(f"data/tmp/rdf_pipelines/result_{entity_matching_threshold}_{relation_matching_threshold}.nt") + tasks_tmp_dir = Path(f"data/tmp/rdf_pipelines/tasks_tmp_{entity_matching_threshold}_{relation_matching_threshold}") + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + config_catalog = { + "paris_graph_alignment": ConfigurationProfile( + name=f"paris_graph_alignment_entity={entity_matching_threshold},relation={relation_matching_threshold}", + definition=paris_graph_alignment_task.config_spec, + bindings=[ + ParameterBinding( + parameter=_get_param(paris_graph_alignment_task.config_spec, "entity_matching_threshold"), + value=entity_matching_threshold, + ), + ParameterBinding( + 
parameter=_get_param(paris_graph_alignment_task.config_spec, "relation_matching_threshold"), + value=relation_matching_threshold, + ), + ], + ), + "fusion_first_value": ConfigurationProfile( + name="fusion_first_value", + definition=fusion_first_value_task.config_spec, + bindings=[ + ParameterBinding( + parameter=_get_param(fusion_first_value_task.config_spec, "ontology_path"), + value=ontology_path, + ), + ], + ) + } + + pipeline = KgPipe( + name=name, + tasks=[paris_graph_alignment_task, fusion_first_value_task], + seed=Data(path=seed_path, format=DataFormat.RDF_NTRIPLES), + data_dir=tasks_tmp_dir, + ) + + pipeline.build( + stable_files=True, + configCatalog=config_catalog, + source=Data(path=source_path, format=DataFormat.RDF_NTRIPLES), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES), + ) + + return pipeline, config_catalog + +def get_openie_pipeline(entity_linking_threshold: float, relation_linking_threshold: float): + name = ( + f"openie_pipeline(entity={entity_linking_threshold},rel={relation_linking_threshold})" + ) + + seed_path = Path("data/inputs/target_kg/data.nt") + source_path = Path("data/inputs/text_source/docs") + result_path = Path(f"data/tmp/rdf_pipelines/result_{entity_linking_threshold}_{relation_linking_threshold}.nt") + tasks_tmp_dir = Path(f"data/tmp/rdf_pipelines/tasks_tmp_{entity_linking_threshold}_{relation_linking_threshold}") + tasks_tmp_dir.mkdir(parents=True, exist_ok=True) + + config_catalog = { + "openie_pipeline": ConfigurationProfile( + name=name, + definition=openie_pipeline_task.config_spec, + bindings=[ + ParameterBinding(parameter=_get_param(openie_pipeline_task.config_spec, "entity_linking_threshold"), value=entity_linking_threshold), + ParameterBinding(parameter=_get_param(openie_pipeline_task.config_spec, "relation_linking_threshold"), value=relation_linking_threshold), + ], + ), + } + + pipeline = KgPipe( + name=name, + tasks=[openie_pipeline_task, relation_linker_label_alias_embedding_transformer_task, 
entity_linker_label_alias_embedding_transformer_task], + seed=Data(path=seed_path, format=DataFormat.TEXT), + data_dir=tasks_tmp_dir, + ) + + pipeline.build( + stable_files=True, + configCatalog=config_catalog, + source=Data(path=source_path, format=DataFormat.TEXT), + result=Data(path=result_path, format=DataFormat.RDF_NTRIPLES), + ) + + return pipeline, config_catalog + +# parameterize the test with different thresholds for entity matching and relation matching +@pytest.mark.parametrize("entity_matching_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +@pytest.mark.parametrize("relation_matching_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +def test_paris_pipelines(entity_matching_threshold, relation_matching_threshold): + """ + test a paris pipeline with different thresholds for entity matching and relation matching + """ + pipeline, config_catalog = get_paris_pipeline( + entity_matching_threshold, relation_matching_threshold + ) + pipeline.run(configCatalog=config_catalog, stable_files_override=False) + print(f"Pipeline run with entity_matching_threshold={entity_matching_threshold} and relation_matching_threshold={relation_matching_threshold}") + + +@pytest.mark.parametrize("entity_matching_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +@pytest.mark.parametrize("relation_matching_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +def test_eval_paris_pipeline(entity_matching_threshold, relation_matching_threshold): + """ + evaluate a paris pipeline with different thresholds for entity matching and relation matching + current best "entity_alignment_0.9_0.7" with f1 score 0.971 + """ + + print(f"Evaluating triple alignment with entity_matching_threshold={entity_matching_threshold} and relation_matching_threshold={relation_matching_threshold}...") + from kgpipe_eval.utils.kg_utils import KgManager + from kgpipe_eval.metrics.triple_alignment import TripleAlignmentMetric, TripleAlignmentConfig + from kgpipe_eval.metrics.entity_alignment import EntityAlignmentMetric, EntityAlignmentConfig + from 
kgpipe_eval.api import MetricResult + from kgpipe_eval.test.utils import render_metric_result + + ref_kg_path = Path("data/inputs/reference_kg/data_agg.nt") + gen_kg_path = Path(f"data/tmp/rdf_pipelines/result_{entity_matching_threshold}_{relation_matching_threshold}.nt") + + entity_alignment_config = EntityAlignmentConfig( + method="label_embedding", + reference_kg=ref_kg_path, + verified_entities_path=None, + verified_entities_delimiter="\t", + entity_sim_threshold=0.95 + ) + + tg = KgManager.load_kg(gen_kg_path) + metric_result : MetricResult = EntityAlignmentMetric().compute(tg, entity_alignment_config) + result_string = render_metric_result(metric_result) + _write_to_file(result_string, Path(f"data/tmp/rdf_pipelines/entity_alignment_{entity_matching_threshold}_{relation_matching_threshold}.txt")) + + + triple_alignment_config = TripleAlignmentConfig( + reference_kg=ref_kg_path, + entity_alignment_config=entity_alignment_config, + value_sim_threshold=0.5, + cache_literal_embeddings=True + ) + + tg = KgManager.load_kg(gen_kg_path) + metric_result : MetricResult = TripleAlignmentMetric().compute(tg, triple_alignment_config) + result_string = render_metric_result(metric_result) + _write_to_file(result_string, Path(f"data/tmp/rdf_pipelines/triple_alignment_{entity_matching_threshold}_{relation_matching_threshold}.txt")) + + +@pytest.mark.parametrize("entity_linking_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +@pytest.mark.parametrize("relation_linking_threshold", [0.5, 0.6, 0.7, 0.8, 0.9]) +def test_openie_pipeline(entity_linking_threshold, relation_linking_threshold): + """ + test the openie pipeline + """ + pipeline, config_catalog = get_openie_pipeline(entity_linking_threshold, relation_linking_threshold) + + print(pipeline.plan()) \ No newline at end of file diff --git a/experiments/param-opti/src/qap/test_sge_based.py b/experiments/param-opti/src/qap/test_sge_based.py new file mode 100644 index 0000000..b6674bd --- /dev/null +++ 
b/experiments/param-opti/src/qap/test_sge_based.py @@ -0,0 +1,36 @@ +def eval_paris_pipeline(entity_matching_threshold: float, relation_matching_threshold: float): + """ + evaluate a paris pipeline with different thresholds for entity matching and relation matching + """ + pass + + # ref_kg_path = Path("data/inputs/reference_kg/data_agg.nt") + # gen_kg_path = Path(f"data/tmp/rdf_pipelines/result_{entity_matching_threshold}_{relation_matching_threshold}.nt") + + # source_grounded_correctness_config = SourceGroundedCorrectnessConfig( + # kg_graph=ref_kg_path, + # source_corpus=gen_kg_path, + # index_dir=Path("data/tmp/source_grounded_correctness"), + # verbalize_method="natural", + # verifier="nli", + # nli_model="facebook/bart-large-mnli", + # nli_device="cpu", + # llm_model="gpt-4.1-mini", + # llm_device="cpu" + # ) + + # source_grounded_correctness_metric = SourceGroundedCorrectnessMetric() + # source_grounded_correctness_metric.compute(KgManager.load_kg(gen_kg_path), source_grounded_correctness_config) + +def eval_openie_pipeline(): + """ + evaluate an openie pipeline + """ + pass + + # ref_kg_path = Path("data/inputs/reference_kg/data_agg.nt") + # gen_kg_path = Path(f"data/tmp/rdf_pipelines/result_{entity_matching_threshold}_{relation_matching_threshold}.nt") + + # source_grounded_coverage_config = SourceGroundedCoverageConfig( + # kg_graph=ref_kg_path, + # source_corpus=gen_kg_path, \ No newline at end of file diff --git a/experiments/param-opti/src/qap_mock/__init__.py b/experiments/param-opti/src/qap_mock/__init__.py new file mode 100644 index 0000000..ddd4d3e --- /dev/null +++ b/experiments/param-opti/src/qap_mock/__init__.py @@ -0,0 +1,15 @@ +""" +Mock implementation of the experiments described in `Quality_Aware_Pipelines.pdf`. + +This package is intentionally self-contained and does not depend on KGpipe. 
+It simulates: +- A small configuration space (implementations + parameters) +- A "true" end-to-end quality objective +- A correlated approximate quality estimator +- Search strategies (default, random, quality-aware) +""" + +from .models import PipelineFamily, SearchMethod + +__all__ = ["PipelineFamily", "SearchMethod"] + diff --git a/experiments/param-opti/src/qap_mock/__main__.py b/experiments/param-opti/src/qap_mock/__main__.py new file mode 100644 index 0000000..405c19e --- /dev/null +++ b/experiments/param-opti/src/qap_mock/__main__.py @@ -0,0 +1,57 @@ +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +from .experiments import ( + experiment_1_search_effectiveness, + experiment_2_estimation_reliability, + experiment_3_dimension_impact, +) + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser( + description="Mock experiments for Quality_Aware_Pipelines.pdf (quality-aware search)" + ) + p.add_argument( + "which", + choices=["exp1", "exp2", "exp3", "all"], + help="Which experiment(s) to run", + ) + p.add_argument( + "--outdir", + type=Path, + default=Path(__file__).parent.parent.parent / "output_qap_mock", + help="Output directory for JSON results", + ) + p.add_argument("--budget", type=int, default=20, help="Evaluation budget B (exp1/exp3)") + p.add_argument("--runs", type=int, default=5, help="Number of runs/seeds (exp1/exp3)") + p.add_argument("--samples", type=int, default=60, help="Number of sampled configs (exp2)") + + args = p.parse_args(argv) + + results: dict[str, object] = {} + + if args.which in ("exp1", "all"): + results["exp1"] = experiment_1_search_effectiveness( + outdir=args.outdir, budget=args.budget, runs=args.runs + ) + if args.which in ("exp2", "all"): + results["exp2"] = experiment_2_estimation_reliability( + outdir=args.outdir, n_samples=args.samples + ) + if args.which in ("exp3", "all"): + results["exp3"] = experiment_3_dimension_impact( + outdir=args.outdir, 
budget=args.budget, runs=args.runs + ) + + # Short stdout summary so it's easy to sanity-check runs. + print(json.dumps({"outdir": str(args.outdir), "ran": list(results.keys())}, indent=2)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) + diff --git a/experiments/param-opti/src/qap_mock/experiments.py b/experiments/param-opti/src/qap_mock/experiments.py new file mode 100644 index 0000000..e0ec17c --- /dev/null +++ b/experiments/param-opti/src/qap_mock/experiments.py @@ -0,0 +1,204 @@ +from __future__ import annotations + +import json +import random +from dataclasses import dataclass +from pathlib import Path +from typing import Dict, List, Optional + +from .models import PipelineFamily, SearchMethod, SearchSpaceMode +from .search import ( + best_so_far_curve, + evals_to_fraction_of_final_best, + run_search, +) +from .search_space import get_family_space, sample_config +from .stats import mae, mean, pearsonr, spearmanr, stdev, topk_agreement +from .objectives import evaluate_true_quality, estimate_quality_from_config + + +@dataclass +class Exp1Cell: + mean_best: float + std_best: float + mean_evals_to_95: Optional[float] + + def as_dict(self) -> dict: + return { + "best_score_mean": self.mean_best, + "best_score_std": self.std_best, + "evals_to_95_mean": self.mean_evals_to_95, + } + + +def _ensure_outdir(outdir: Path) -> None: + outdir.mkdir(parents=True, exist_ok=True) + + +def experiment_1_search_effectiveness( + *, + outdir: Path, + budget: int = 20, + runs: int = 5, + base_seed: int = 7, +) -> dict: + """ + Mirrors Section 6.3 / Table 2 narrative: + - Compare Default, Random Search, Quality-Aware Search + - Fixed budget B=20 + - Report best achieved score (mean ± std over 5 runs) + - Report mean evaluations to reach 95% of each run's final best + """ + _ensure_outdir(outdir) + + methods = [SearchMethod.DEFAULT] #, SearchMethod.RANDOM, SearchMethod.QUALITY_AWARE] + families = [PipelineFamily.RDF, PipelineFamily.TEXT] + + table: Dict[str, 
Dict[str, Exp1Cell]] = {} + raw: Dict[str, Dict[str, List[dict]]] = {} + + for fam in families: + fam_key = fam.value + table[fam_key] = {} + raw[fam_key] = {} + + for m in methods: + seeds = [base_seed + i for i in range(runs)] + bests: List[float] = [] + evals95: List[float] = [] + raw_runs: List[dict] = [] + + for i, s in enumerate(seeds): + recs = run_search( + seed=10_000 * (i + 1) + s, + family=fam, + method=m, + budget=budget, + mode=SearchSpaceMode.JOINT, + ) + curve = best_so_far_curve(recs) + bests.append(curve[-1]) + e95 = evals_to_fraction_of_final_best(curve, 0.95) + if e95 is not None: + evals95.append(float(e95)) + + raw_runs.append( + { + "seed": s, + "curve_best_so_far": curve, + } + ) + + cell = Exp1Cell( + mean_best=mean(bests), + std_best=stdev(bests) if m != SearchMethod.DEFAULT else float("nan"), + mean_evals_to_95=mean(evals95) if (m != SearchMethod.DEFAULT and evals95) else None, + ) + table[fam_key][m.value] = cell + raw[fam_key][m.value] = raw_runs + + result = { + "budget": budget, + "runs": runs, + "table": { + fam: {meth: cell.as_dict() for meth, cell in methods_.items()} + for fam, methods_ in table.items() + }, + "raw": raw, + } + + (outdir / "exp1_search_effectiveness.json").write_text(json.dumps(result, indent=2)) + return result + + +def experiment_2_estimation_reliability( + *, + outdir: Path, + n_samples: int = 60, + seed: int = 23, + topk: int = 10, +) -> dict: + """ + Mirrors Section 6.4 narrative: + - sample configurations + - compute estimated vs true scores + - compute correlation (Pearson/Spearman), MAE, top-k agreement + """ + _ensure_outdir(outdir) + + rng = random.Random(seed) + families = [PipelineFamily.RDF, PipelineFamily.TEXT] + + out: Dict[str, dict] = {"n_samples": n_samples, "topk": topk, "by_family": {}} + + for fam in families: + true_scores: List[float] = [] + est_scores: List[float] = [] + + for _ in range(n_samples): + cfg = sample_config(rng, fam, mode=SearchSpaceMode.JOINT) + true = 
evaluate_true_quality(rng, cfg).total + est = estimate_quality_from_config(rng, cfg) + true_scores.append(true) + est_scores.append(est) + + fam_key = fam.value + out["by_family"][fam_key] = { + "pearson": pearsonr(est_scores, true_scores), + "spearman": spearmanr(est_scores, true_scores), + "mae": mae(est_scores, true_scores), + "topk_agreement": topk_agreement(est_scores, true_scores, topk), + } + + (outdir / "exp2_estimation_reliability.json").write_text(json.dumps(out, indent=2)) + return out + + +def experiment_3_dimension_impact( + *, + outdir: Path, + budget: int = 20, + runs: int = 5, + base_seed: int = 101, +) -> dict: + """ + Mirrors Section 6.5 narrative: + Compare best scores for restricted spaces: + - implementation-only + - parameter-only + - joint + """ + _ensure_outdir(outdir) + + families = [PipelineFamily.RDF, PipelineFamily.TEXT] + modes = [ + SearchSpaceMode.IMPLEMENTATION_ONLY, + SearchSpaceMode.PARAMETER_ONLY, + SearchSpaceMode.JOINT, + ] + + out: Dict[str, dict] = {"budget": budget, "runs": runs, "by_family": {}} + + for fam in families: + fam_out: Dict[str, dict] = {} + for mode in modes: + bests: List[float] = [] + for i in range(runs): + seed = base_seed + i * 17 + recs = run_search( + seed=20_000 * (i + 1) + seed, + family=fam, + method=SearchMethod.QUALITY_AWARE, + budget=budget, + mode=mode, + ) + curve = best_so_far_curve(recs) + bests.append(curve[-1]) + + fam_out[mode.value] = {"best_mean": mean(bests), "best_std": stdev(bests)} + + out["by_family"][fam.value] = fam_out + + (outdir / "exp3_dimension_impact.json").write_text(json.dumps(out, indent=2)) + return out + diff --git a/experiments/param-opti/src/qap_mock/models.py b/experiments/param-opti/src/qap_mock/models.py new file mode 100644 index 0000000..5809be5 --- /dev/null +++ b/experiments/param-opti/src/qap_mock/models.py @@ -0,0 +1,37 @@ +from __future__ import annotations + +from dataclasses import dataclass +from enum import Enum +from typing import Mapping + + +class 
PipelineFamily(str, Enum): + RDF = "rdf" + TEXT = "text" + + +class SearchMethod(str, Enum): + DEFAULT = "default" + RANDOM = "random" + QUALITY_AWARE = "quality_aware" + + +class SearchSpaceMode(str, Enum): + JOINT = "joint" + IMPLEMENTATION_ONLY = "implementation_only" + PARAMETER_ONLY = "parameter_only" + + +@dataclass(frozen=True) +class PipelineConfig: + family: PipelineFamily + implementations: Mapping[str, str] + params: Mapping[str, float] + + def as_dict(self) -> dict: + return { + "family": self.family.value, + "implementations": dict(self.implementations), + "params": dict(self.params), + } + diff --git a/experiments/param-opti/src/qap_mock/objectives.py b/experiments/param-opti/src/qap_mock/objectives.py new file mode 100644 index 0000000..8691ebe --- /dev/null +++ b/experiments/param-opti/src/qap_mock/objectives.py @@ -0,0 +1,226 @@ +from __future__ import annotations + +import math +import random +from dataclasses import dataclass + +from .models import PipelineConfig, PipelineFamily +from .pipeline_util import ( + compute_rdf_metrics, + compute_te_metrics, + default_base_workdir, + run_pipeline_for_config, + TEST_DATA_ONTOLOGY_PATH, + # _test_data_path, +) + + +@dataclass(frozen=True) +class QualityBreakdown: + accuracy: float + coverage: float + consistency: float + total: float + + +def _sigmoid(x: float) -> float: + return 1.0 / (1.0 + math.exp(-x)) + + +def _base_quality_components(cfg: PipelineConfig) -> tuple[float, float, float]: + """ + Deterministic (noise-free) quality components for a configuration. + + This is used both for the simulated "true" evaluation (with added noise) + and for the approximate estimator (with different noise). 
+ """ + if cfg.family == PipelineFamily.RDF: + impl_acc = 0.0 + impl_cov = 0.0 + impl_con = 0.0 + + om = cfg.implementations["ontology_matching"] + if om == "string_sim": + impl_con += 0.01 + elif om == "embedding_sim": + impl_acc += 0.05 + impl_cov += 0.02 + elif om == "hybrid": + impl_acc += 0.06 + impl_cov += 0.03 + impl_con += 0.01 + elif om == "llm_alignment": + impl_cov += 0.05 + impl_acc += 0.04 + impl_con -= 0.01 + + em = cfg.implementations["entity_matching"] + if em == "rule_based": + impl_con += 0.02 + elif em == "blocking_sim": + impl_acc += 0.04 + impl_cov += 0.02 + elif em == "embedding_er": + impl_acc += 0.06 + impl_cov += 0.03 + elif em == "llm_er": + impl_cov += 0.05 + impl_acc += 0.05 + impl_con -= 0.01 + + fu = cfg.implementations["fusion"] + if fu == "union": + impl_cov += 0.03 + elif fu == "majority_vote": + impl_con += 0.03 + impl_acc += 0.01 + elif fu == "quality_weighted": + impl_con += 0.06 + impl_acc += 0.02 + + s_thr = float(cfg.params["schema_sim_threshold"]) + e_thr = float(cfg.params["entity_sim_threshold"]) + f_thr = float(cfg.params["fusion_confidence_threshold"]) + bk = float(cfg.params.get("blocking_key_strength", 0.5)) + + acc = 0.55 + 0.18 * _sigmoid((s_thr - 0.65) * 8) + 0.18 * _sigmoid((e_thr - 0.65) * 8) + cov = 0.65 - 0.25 * _sigmoid((s_thr - 0.6) * 7) - 0.25 * _sigmoid((e_thr - 0.6) * 7) + con = 0.55 + 0.20 * _sigmoid((f_thr - 0.45) * 6) + + strict = (s_thr + e_thr) / 2.0 + con -= 0.05 * _sigmoid((strict - 0.85) * 10) + + cov += 0.03 * _sigmoid((bk - 0.3) * 6) + acc -= 0.02 * _sigmoid((bk - 0.8) * 10) + + acc += impl_acc + cov += impl_cov + con += impl_con + + return acc, cov, con + + if cfg.family == PipelineFamily.TEXT: + impl_acc = 0.0 + impl_cov = 0.0 + impl_con = 0.0 + + ie = cfg.implementations["information_extraction"] + if ie == "pattern_ie": + impl_con += 0.01 + elif ie == "openie": + impl_cov += 0.04 + impl_acc += 0.01 + elif ie == "hybrid_ie": + impl_cov += 0.06 + impl_acc += 0.02 + impl_con += 0.01 + elif ie == 
"llm_ie": + impl_cov += 0.08 + impl_acc += 0.03 + impl_con -= 0.01 + + el = cfg.implementations["entity_linking"] + if el == "dictionary_linking": + impl_cov += 0.02 + elif el == "embedding_linking": + impl_acc += 0.06 + elif el == "llm_linking": + impl_acc += 0.07 + impl_cov += 0.02 + impl_con -= 0.01 + + fu = cfg.implementations["fusion"] + if fu == "union": + impl_cov += 0.03 + elif fu == "majority_vote": + impl_con += 0.03 + impl_acc += 0.01 + elif fu == "quality_weighted": + impl_con += 0.07 + impl_acc += 0.02 + + ie_thr = float(cfg.params["ie_conf_threshold"]) + link_thr = float(cfg.params["link_sim_threshold"]) + f_thr = float(cfg.params["fusion_confidence_threshold"]) + cw = float(cfg.params.get("context_window", 256.0)) + + acc = 0.40 + 0.22 * _sigmoid((link_thr - 0.6) * 7) + 0.10 * _sigmoid((ie_thr - 0.55) * 6) + cov = 0.55 - 0.28 * _sigmoid((ie_thr - 0.55) * 7) - 0.18 * _sigmoid((link_thr - 0.6) * 6) + con = 0.45 + 0.22 * _sigmoid((f_thr - 0.45) * 6) + + noisy = (0.6 - ie_thr) + (0.6 - link_thr) + con -= 0.10 * _sigmoid(noisy * 6) + + cov += 0.03 * _sigmoid((cw - 160.0) / 60.0) + con -= 0.02 * _sigmoid((cw - 420.0) / 70.0) + + acc += impl_acc + cov += impl_cov + con += impl_con + + return acc, cov, con + + raise ValueError(f"Unknown family: {cfg.family}") + + +def evaluate_true_quality(rng: random.Random, cfg: PipelineConfig) -> QualityBreakdown: + """ + Real(ish) end-to-end objective: run a KGpipe pipeline for this config and + compute measurable proxy metrics from its outputs. + + Notes: + - This intentionally uses bundled `kgpipe_tasks/test/test_data` inputs so + the experiments are runnable out of the box. + - Metrics are proxy/reference-independent signals (no gold labels yet). 
+ """ + base = default_base_workdir() + run = run_pipeline_for_config(cfg=cfg, base_workdir=base, stable_files=False) + + if cfg.family == PipelineFamily.RDF: + ontology = TEST_DATA_ONTOLOGY_PATH + m = compute_rdf_metrics(output_nt=run.final_output.path, ontology_ttl=ontology) + else: + m = compute_te_metrics(te_json_path=run.final_output.path) + + acc = min(1.0, max(0.0, float(m["accuracy"]))) + cov = min(1.0, max(0.0, float(m["coverage"]))) + con = min(1.0, max(0.0, float(m["consistency"]))) + + total = 0.45 * acc + 0.30 * cov + 0.25 * con + total = min(1.0, max(0.0, total)) + return QualityBreakdown(accuracy=acc, coverage=cov, consistency=con, total=total) + + +def estimate_quality_from_config(rng: random.Random, cfg: PipelineConfig) -> float: + """ + Approximate estimator Q-hat used by the quality-aware search to rank candidates + without executing the full pipeline. + + For now this remains a cheap heuristic over the config (so the search is not + dominated by expensive runs). The "true" objective is produced by actually + executing the pipeline in `evaluate_true_quality`. + """ + acc, cov, con = _base_quality_components(cfg) + # Estimator has its own noise and slight systematic distortion. + if cfg.family == PipelineFamily.RDF: + acc += rng.gauss(0.0, 0.015) + cov += rng.gauss(0.0, 0.015) + con += rng.gauss(0.0, 0.015) + else: + acc += rng.gauss(0.0, 0.020) + cov += rng.gauss(0.0, 0.020) + con += rng.gauss(0.0, 0.020) + + acc = min(1.0, max(0.0, acc)) + cov = min(1.0, max(0.0, cov)) + con = min(1.0, max(0.0, con)) + est = 0.45 * acc + 0.30 * cov + 0.25 * con + return min(1.0, max(0.0, est)) + + +def estimate_quality(rng: random.Random, true_total: float, family: PipelineFamily) -> float: + raise RuntimeError( + "estimate_quality(true_total, family) is deprecated; " + "use estimate_quality_from_config(rng, cfg) instead." 
+ ) + diff --git a/experiments/param-opti/src/qap_mock/pipeline_util.py b/experiments/param-opti/src/qap_mock/pipeline_util.py new file mode 100644 index 0000000..e5f0890 --- /dev/null +++ b/experiments/param-opti/src/qap_mock/pipeline_util.py @@ -0,0 +1,414 @@ +from __future__ import annotations + +import hashlib +import json +import os +import tempfile +from dataclasses import dataclass +from pathlib import Path +from typing import TYPE_CHECKING, Any, Iterable, Optional + +if TYPE_CHECKING: + from kgpipe.common import Data, DataFormat, KgPipe, KgTask # pragma: no cover + from kgpipe.common.model.task import KgTaskReport # pragma: no cover + +from .models import PipelineConfig, PipelineFamily + + +@dataclass(frozen=True) +class PipelineRunResult: + family: PipelineFamily + cfg: PipelineConfig + workdir: Path + final_output: Any # Data + task_reports: Any # list[KgTaskReport] + aux: dict + + +TEST_DATA_SEED_KG_PATH = Path("/home/marvin/project/data/final/film_1k/split_0/kg/seed/data.nt") +TEST_DATA_ONTOLOGY_PATH = Path("/home/marvin/project/data/final/film_1k/movie-ontology.ttl") +TEST_DATA_RDF_PATH = Path("/home/marvin/project/data/final/film_1k/split_1/sources/rdf/data.nt") +TEST_DATA_TEXT_PATH = Path("/home/marvin/project/data/final/film_1k/split_1/sources/text/data/") + +def _import_tasks_for_family(family: PipelineFamily) -> None: + """ + Import task modules so their @Registry.task decorators execute. + + This keeps the rest of qap_mock independent from kgpipe_tasks import side effects. + """ + # RDF: PARIS matcher + exchange + fusion tasks. 
+ if family == PipelineFamily.RDF: + # Entity matching (docker) + exchange (python) + import kgpipe_tasks.entity_resolution.matcher.paris_rdf_matcher # noqa: F401 + import kgpipe_tasks.entity_resolution.entity_match # noqa: F401 + + # Fusion (python) + import kgpipe_tasks.entity_resolution.fusion.union # noqa: F401 + import kgpipe_tasks.entity_resolution.fusion.preference # noqa: F401 + + return + + if family == PipelineFamily.TEXT: + # CoreNLP OpenIE extraction (docker) + exchange (python) + import kgpipe_tasks.text_processing.text_extraction.corenlp_extraction # noqa: F401 + + return + + raise ValueError(f"Unknown family: {family}") + + +def _cfg_hash(cfg: PipelineConfig) -> str: + payload = json.dumps(cfg.as_dict(), sort_keys=True, separators=(",", ":")).encode("utf-8") + return hashlib.sha256(payload).hexdigest()[:16] + + +def _ensure_dir(p: Path) -> None: + p.mkdir(parents=True, exist_ok=True) + + +# def _test_data_path(relative_path: str) -> Path: +# """ +# Use kgpipe_tasks' bundled test data as default inputs so qap_mock is runnable. +# """ +# base = Path(__file__).resolve().parents[3] / "src" / "kgpipe_tasks" / "test" / "test_data" +# path = (base / relative_path).resolve() +# if not path.exists(): +# raise FileNotFoundError(f"Missing test data file: {path}") +# return path + + +def _set_env_from_params(params: dict[str, float]) -> dict[str, Optional[str]]: + """ + Apply a minimal mapping from qap_mock params to the env-var based configuration + convention used by many kgpipe tasks. + + Returns a dict of previous env values so callers can restore them. + """ + # Only set variables that are known to be read by the tasks we use. 
+ mapping: dict[str, tuple[str, float]] = { + # RDF fusion/preference tasks + "ENTITY_MATCHING_THRESHOLD": ("entity_sim_threshold", 0.7), + "RELATION_MATCHING_THRESHOLD": ("schema_sim_threshold", 0.7), + # Text: no stable env knobs used by CoreNLP task today + } + + prev: dict[str, Optional[str]] = {} + for env_key, (p_key, default) in mapping.items(): + prev[env_key] = os.environ.get(env_key) + val = float(params.get(p_key, default)) + os.environ[env_key] = str(val) + return prev + + +def _restore_env(prev: dict[str, Optional[str]]) -> None: + for k, v in prev.items(): + if v is None: + os.environ.pop(k, None) + else: + os.environ[k] = v + + +def build_pipeline_for_config(*, cfg: PipelineConfig, workdir: Path) -> tuple[KgPipe, Data, Data]: + """ + Build a runnable KgPipe for the given configuration. + + We intentionally keep the mapping small and explicit: + - RDF: (optional) PARIS entity matching -> exchange -> fusion + - TEXT: CoreNLP OpenIE extraction (docker) -> exchange + + Returns (pipe, source, final_result_data). + """ + from kgpipe.common import Data, DataFormat, KgPipe, KgTask, Registry + + _import_tasks_for_family(cfg.family) + _ensure_dir(workdir) + + if cfg.family == PipelineFamily.RDF: + # Inputs: source + target (as seed) are bundled test fixtures. + source = Data(path=TEST_DATA_RDF_PATH, format=DataFormat.RDF_NTRIPLES) + target = Data(path=TEST_DATA_SEED_KG_PATH, format=DataFormat.RDF_NTRIPLES) + + # Ensure ontology env is set for fusion tasks that need it. + ontology_path = TEST_DATA_ONTOLOGY_PATH + os.environ.setdefault("ONTOLOGY_PATH", str(ontology_path)) + + # Decide whether to run entity matching. If we don't, we can still + # compute a meaningful output via simple union. 
+ entity_impl = cfg.implementations.get("entity_matching", "rule_based") + fusion_impl = cfg.implementations.get("fusion", "union") + use_docker = os.environ.get("QAP_MOCK_USE_DOCKER", "0") == "1" + + tasks: list[KgTask] = [] + final_format = DataFormat.RDF_NTRIPLES + + def _empty_er(inputs: dict[str, Data], outputs: dict[str, Data]) -> None: + out_path = Path(outputs["output"].path) + out_path.parent.mkdir(parents=True, exist_ok=True) + out_path.write_text(json.dumps({"matches": [], "blocks": [], "clusters": []}, indent=2), encoding="utf-8") + + # dummy_entity_matching = KgTask( + # name="dummy_entity_matching", + # input_spec={"source": DataFormat.RDF_NTRIPLES, "target": DataFormat.RDF_NTRIPLES}, + # output_spec={"output": DataFormat.ER_JSON}, + # function=_empty_er, + # description="Dummy matcher emitting empty ER_JSON (no docker)", + # ) + + # if entity_impl != "rule_based": + # if use_docker: + tasks.extend( + [ + Registry.get_task("paris_entity_matching"), + Registry.get_task("paris_exchange"), + ] + ) + # When matches exist, prefer a fusion strategy that uses them. + if fusion_impl in ("quality_weighted", "majority_vote"): + tasks.append(Registry.get_task("fusion_first_value")) + else: + tasks.append(Registry.get_task("union_matched_rdf")) + # else: + # # Non-docker mode: skip PARIS and run a deterministic empty matcher. + # tasks.extend([dummy_entity_matching, Registry.get_task("union_matched_rdf")]) + # else: + # # No matching step: just union the two graphs. + # tasks.append(Registry.get_task("fusion_union_rdf")) + + # seed is the "kg"/target, which KgPipe.build will use when a task + # declares an input named "kg". 
+ pipe = KgPipe(tasks=tasks, seed=target, data_dir=str(workdir), name=f"qap_mock_{cfg.family.value}") + + final = Data(path=workdir / "final.nt", format=final_format) + return pipe, source, final + + if cfg.family == PipelineFamily.TEXT: + text = Data(path=TEST_DATA_TEXT_PATH, format=DataFormat.TEXT) + + ie_impl = cfg.implementations.get("information_extraction", "pattern_ie") + + def _pattern_ie(inputs: dict[str, Data], outputs: dict[str, Data]) -> None: + import re + + in_path = Path(inputs["input"].path) + out_path = Path(outputs["output"].path) + out_path.mkdir(parents=True, exist_ok=True) + + txt = _read_text(in_path) + # Tiny, deterministic pattern extractor: "X is a Y" / "X is an Y". + triples = [] + for m in re.finditer(r"([A-Z][A-Za-z0-9_ ]{2,40}) is an? ([A-Za-z][A-Za-z0-9_ -]{2,40})", txt): + subj = m.group(1).strip() + obj = m.group(2).strip() + triples.append( + { + "subject": {"surface_form": subj}, + "predicate": {"surface_form": "is_a"}, + "object": {"surface_form": obj}, + } + ) + + doc = {"text": txt[:10_000], "triples": triples, "chains": [], "links": []} + (out_path / "pattern_ie.te.json").write_text(json.dumps(doc), encoding="utf-8") + + pattern_ie_task = KgTask( + name="pattern_ie_extraction", + input_spec={"input": DataFormat.TEXT}, + output_spec={"output": DataFormat.TE_JSON}, + function=_pattern_ie, + description="Lightweight pattern IE (no docker)", + ) + + if ie_impl == "pattern_ie": + tasks = [pattern_ie_task] + else: + use_docker = os.environ.get("QAP_MOCK_USE_DOCKER", "0") == "1" + if use_docker: + # Use CoreNLP OpenIE path for openie/hybrid/llm variants (docker-backed). + tasks = [ + Registry.get_task("corenlp_openie_extraction"), + Registry.get_task("corenlp_exchange"), + ] + else: + # Default to the lightweight extractor when docker isn't enabled. 
+ tasks = [pattern_ie_task] + + pipe = KgPipe(tasks=tasks, seed=text, data_dir=str(workdir), name=f"qap_mock_{cfg.family.value}") + # Many TE_JSON-producing tasks treat the output as a directory of documents. + final = Data(path=workdir / "final_te", format=DataFormat.TE_JSON) + return pipe, text, final + + raise ValueError(f"Unknown family: {cfg.family}") + + +def run_pipeline_for_config( + *, cfg: PipelineConfig, base_workdir: Path, stable_files: bool = True +) -> PipelineRunResult: + """ + Execute a real KGpipe pipeline for this config and return its artifacts. + + Results are cached by (family, cfg-hash) under base_workdir to avoid repeating + expensive docker/service calls during search. + """ + run_id = f"{cfg.family.value}_{_cfg_hash(cfg)}" + workdir = base_workdir / run_id + _ensure_dir(workdir) + + try: + pipe, source, final = build_pipeline_for_config(cfg=cfg, workdir=workdir) + except ModuleNotFoundError as e: + raise RuntimeError( + "KGpipe dependencies are not installed in this environment. " + "To run the *real* (non-mock) execution path, install the project in editable mode:\n\n" + " python3 -m pip install -e .\n\n" + "This will also install the `kgcore` dependency declared in `pyproject.toml`.\n" + f"Original import error: {e}" + ) from e + + # Apply env-var config mapping used by tasks. + prev_env = _set_env_from_params(dict(cfg.params)) + try: + # If final exists and stable_files=True, KgTask.run will skip; still ok. + pipe.build(source=source, result=final, stable_files=stable_files) + reports = pipe.run(stable_files_override=stable_files) + finally: + _restore_env(prev_env) + + return PipelineRunResult( + family=cfg.family, + cfg=cfg, + workdir=workdir, + final_output=final, + task_reports=reports, + aux={"source": str(source.path), "seed": str(pipe.seed.path), "run_id": run_id}, + ) + + +def _read_text(path: Path, max_bytes: int = 4_000_000) -> str: + # Keep it simple and avoid huge reads in case a docker task goes wild. 
+ data = path.read_bytes() + if len(data) > max_bytes: + data = data[:max_bytes] + return data.decode("utf-8", errors="replace") + + +def compute_rdf_metrics(*, output_nt: Path, ontology_ttl: Optional[Path] = None) -> dict[str, float]: + import importlib + + try: + rdflib = importlib.import_module("rdflib") + Graph = getattr(rdflib, "Graph") + URIRef = getattr(importlib.import_module("rdflib.term"), "URIRef") + g = Graph() + g.parse(output_nt, format="nt") + triples = len(g) + except Exception: + # Fallback without rdflib: approximate triples by counting lines. + txt = _read_text(output_nt) + triples = len([ln for ln in txt.splitlines() if ln.strip() and not ln.strip().startswith("#")]) + Graph = None # type: ignore[assignment] + URIRef = None # type: ignore[assignment] + g = None # type: ignore[assignment] + + # Consistency proxy: fraction of predicates that appear in ontology (or common RDF vocab). + allowed: set[str] = set() + if Graph is not None and URIRef is not None and ontology_ttl is not None and ontology_ttl.exists(): + try: + og = Graph() + og.parse(ontology_ttl) + # Allow all predicates defined as properties + rdfs:label/rdf:type. + for s, _, _ in og: + # cheap heuristic: treat all subjects that are URIRefs as "allowed" predicates + if isinstance(s, URIRef): + allowed.add(str(s)) + allowed.add("http://www.w3.org/2000/01/rdf-schema#label") + allowed.add("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") + except Exception: + allowed = set() + + if allowed and g is not None and URIRef is not None: + ok = 0 + for _, p, _ in g: + if isinstance(p, URIRef) and str(p) in allowed: + ok += 1 + consistency = ok / max(1, triples) + else: + consistency = 0.5 + + # Coverage proxy: normalize by union of input graphs when using bundled test data. 
+ try: + src = Graph().parse(TEST_DATA_RDF_PATH, format="nt") + tgt = Graph().parse(TEST_DATA_SEED_KG_PATH, format="nt") + union_triples = len(src) + len(tgt) + coverage = min(1.0, triples / max(1, union_triples)) + except Exception: + coverage = min(1.0, triples / 10_000.0) + + # Accuracy proxy: reward non-trivial graphs (very small outputs are likely bad). + accuracy = min(1.0, max(0.0, (triples / 2000.0))) + + return {"accuracy": float(accuracy), "coverage": float(coverage), "consistency": float(consistency)} + + +def compute_te_metrics(*, te_json_path: Path) -> dict[str, float]: + """ + Compute lightweight metrics from TE_JSON outputs. + + This intentionally avoids requiring a gold standard. It's a pragmatic proxy: + - coverage ~ extracted triples count + - consistency ~ fraction of triples that have all 3 spans populated + - accuracy ~ average link score if links exist, else a baseline + """ + # TE_JSON may be a directory (many files) or a single file. + triples = 0 + complete = 0 + link_scores: list[float] = [] + + paths: Iterable[Path] + if te_json_path.is_dir(): + paths = [p for p in te_json_path.iterdir() if p.is_file()] + else: + paths = [te_json_path] + + for p in paths: + try: + doc = json.loads(_read_text(p)) + except Exception: + continue + for t in doc.get("triples", []) or []: + triples += 1 + s = (t.get("subject") or {}).get("surface_form") + r = (t.get("predicate") or {}).get("surface_form") + o = (t.get("object") or {}).get("surface_form") + if s and r and o: + complete += 1 + for l in doc.get("links", []) or []: + try: + link_scores.append(float(l.get("score", 0.0))) + except Exception: + pass + + # Normalize coverage against a rough scale for the bundled Hobbit text. 
+ coverage = min(1.0, triples / 5000.0) + consistency = complete / max(1, triples) if triples else 0.0 + accuracy = (sum(link_scores) / len(link_scores)) if link_scores else 0.35 + accuracy = min(1.0, max(0.0, accuracy)) + + return {"accuracy": float(accuracy), "coverage": float(coverage), "consistency": float(consistency)} + + +def default_base_workdir() -> Path: + # Keep outputs inside the experiment folder by default. + return Path(__file__).resolve().parents[2] / "output_qap_mock" / "_real_runs" + + +def make_temp_base_workdir() -> Path: + return Path(tempfile.mkdtemp(prefix="qap_mock_real_")) + + +# - pipeline auto algo +# - cleaning +# normalization task +# - pipeline task aggregation +# aggregate multiple task sub (DAGs) into a single task +# example: paris matching and fusion are two sub tasks, we can aggregate them into a single task + diff --git a/experiments/param-opti/src/qap_mock/search.py b/experiments/param-opti/src/qap_mock/search.py new file mode 100644 index 0000000..1d6c38d --- /dev/null +++ b/experiments/param-opti/src/qap_mock/search.py @@ -0,0 +1,138 @@ +from __future__ import annotations + +import random +from dataclasses import dataclass +from typing import Dict, List, Optional, Tuple + +from .models import PipelineConfig, PipelineFamily, SearchMethod, SearchSpaceMode +from .objectives import QualityBreakdown, estimate_quality_from_config, evaluate_true_quality +from .search_space import get_family_space, mutate_config, sample_config + + +@dataclass +class EvaluationRecord: + cfg: PipelineConfig + true: QualityBreakdown + est_total: float + + def as_dict(self) -> dict: + return { + "config": self.cfg.as_dict(), + "true": { + "accuracy": self.true.accuracy, + "coverage": self.true.coverage, + "consistency": self.true.consistency, + "total": self.true.total, + }, + "estimated_total": self.est_total, + } + + +def _eval_once(rng: random.Random, cfg: PipelineConfig) -> EvaluationRecord: + true = evaluate_true_quality(rng, cfg) + est = 
estimate_quality_from_config(rng, cfg) + return EvaluationRecord(cfg=cfg, true=true, est_total=est) + + +def run_search( + *, + seed: int, + family: PipelineFamily, + method: SearchMethod, + budget: int, + mode: SearchSpaceMode = SearchSpaceMode.JOINT, +) -> List[EvaluationRecord]: + rng = random.Random(seed) + space = get_family_space(family) + default_cfg = PipelineConfig(family=family, implementations=space.default_impl, params=space.default_params) + + records: List[EvaluationRecord] = [] + + if method == SearchMethod.DEFAULT: + records.append(_eval_once(rng, default_cfg)) + return records + + if method == SearchMethod.RANDOM: + for _ in range(budget): + cfg = sample_config(rng, family, mode=mode, fixed_default=default_cfg) + records.append(_eval_once(rng, cfg)) + return records + + if method == SearchMethod.QUALITY_AWARE: + # Simple, explainable heuristic: + # - start from default + # - maintain incumbent based on estimated quality (Q-hat) + # - propose new configs by mutating incumbent (exploitation) + # - occasional random exploration + incumbent = default_cfg + incumbent_est: Optional[float] = None + + for t in range(budget): + # "Lookahead" using cheap quality estimates: generate a pool of + # candidates, pick the one with best estimated quality, then + # spend one "real" evaluation budget on it. 
+ pool_size = 12 if t < 5 else 8 + candidates: List[PipelineConfig] = [] + for _ in range(pool_size): + explore = rng.random() < (0.35 if t < 3 else 0.20) + if explore: + candidates.append(sample_config(rng, family, mode=mode, fixed_default=default_cfg)) + else: + candidates.append( + mutate_config( + rng, + incumbent, + mode=mode, + p_change_impl=0.70, + p_change_param=0.85, + ) + ) + + best_est = None + best_cfg = None + for c in candidates: + est = estimate_quality_from_config(rng, c) + if best_est is None or est > best_est: + best_est = est + best_cfg = c + + assert best_cfg is not None + cfg = best_cfg + + rec = _eval_once(rng, cfg) + records.append(rec) + + if incumbent_est is None or rec.est_total > incumbent_est: + incumbent = cfg + incumbent_est = rec.est_total + + return records + + raise ValueError(f"Unknown method: {method}") + + +def best_so_far_curve(records: List[EvaluationRecord]) -> List[float]: + best = -1.0 + curve: List[float] = [] + for r in records: + best = max(best, r.true.total) + curve.append(best) + return curve + + +def evals_to_fraction_of_final_best(curve: List[float], fraction: float) -> Optional[int]: + if not curve: + return None + final_best = curve[-1] + target = fraction * final_best + for i, v in enumerate(curve, start=1): + if v >= target: + return i + return None + + +def summarize_best(records: List[EvaluationRecord]) -> Tuple[float, float]: + curve = best_so_far_curve(records) + best = curve[-1] if curve else float("nan") + return best, best + diff --git a/experiments/param-opti/src/qap_mock/search_space.py b/experiments/param-opti/src/qap_mock/search_space.py new file mode 100644 index 0000000..c69f031 --- /dev/null +++ b/experiments/param-opti/src/qap_mock/search_space.py @@ -0,0 +1,143 @@ +from __future__ import annotations + +import random +from dataclasses import dataclass +from typing import Dict, List, Tuple + +from .models import PipelineConfig, PipelineFamily, SearchSpaceMode + + +@dataclass(frozen=True) +class 
FamilySpace: + tasks: List[str] + impl_choices: Dict[str, List[str]] + param_ranges: Dict[str, Tuple[float, float]] + default_impl: Dict[str, str] + default_params: Dict[str, float] + + +def get_family_space(family: PipelineFamily) -> FamilySpace: + # Compact but expressive, mirroring the paper text: + # - discrete implementation choices per task + # - continuous thresholds + if family == PipelineFamily.RDF: + tasks = ["ontology_matching", "entity_matching", "fusion"] + impl_choices = { + "ontology_matching": ["string_sim", "embedding_sim", "hybrid", "llm_alignment"], + "entity_matching": ["rule_based", "blocking_sim", "embedding_er", "llm_er"], + "fusion": ["union", "quality_weighted", "majority_vote"], + } + param_ranges = { + "schema_sim_threshold": (0.3, 0.95), + "entity_sim_threshold": (0.3, 0.95), + "fusion_confidence_threshold": (0.1, 0.9), + "blocking_key_strength": (0.0, 1.0), + } + default_impl = { + "ontology_matching": "string_sim", + "entity_matching": "rule_based", + "fusion": "union", + } + default_params = { + "schema_sim_threshold": 0.7, + "entity_sim_threshold": 0.7, + "fusion_confidence_threshold": 0.5, + "blocking_key_strength": 0.5, + } + return FamilySpace(tasks, impl_choices, param_ranges, default_impl, default_params) + + if family == PipelineFamily.TEXT: + tasks = ["information_extraction", "entity_linking", "fusion"] + impl_choices = { + "information_extraction": ["pattern_ie", "openie", "hybrid_ie", "llm_ie"], + "entity_linking": ["dictionary_linking", "embedding_linking", "llm_linking"], + "fusion": ["union", "quality_weighted", "majority_vote"], + } + param_ranges = { + "ie_conf_threshold": (0.2, 0.95), + "link_sim_threshold": (0.2, 0.95), + "fusion_confidence_threshold": (0.1, 0.9), + "context_window": (64.0, 512.0), + } + default_impl = { + "information_extraction": "pattern_ie", + "entity_linking": "dictionary_linking", + "fusion": "union", + } + default_params = { + "ie_conf_threshold": 0.6, + "link_sim_threshold": 0.6, + 
"fusion_confidence_threshold": 0.5, + "context_window": 256.0, + } + return FamilySpace(tasks, impl_choices, param_ranges, default_impl, default_params) + + raise ValueError(f"Unknown family: {family}") + + +def sample_config( + rng: random.Random, + family: PipelineFamily, + mode: SearchSpaceMode = SearchSpaceMode.JOINT, + fixed_default: PipelineConfig | None = None, +) -> PipelineConfig: + space = get_family_space(family) + + impl: Dict[str, str] = {} + params: Dict[str, float] = {} + + if fixed_default is None: + fixed_default = PipelineConfig(family=family, implementations=space.default_impl, params=space.default_params) + + if mode in (SearchSpaceMode.JOINT, SearchSpaceMode.IMPLEMENTATION_ONLY): + for t in space.tasks: + impl[t] = rng.choice(space.impl_choices[t]) + else: + impl = dict(fixed_default.implementations) + + if mode in (SearchSpaceMode.JOINT, SearchSpaceMode.PARAMETER_ONLY): + for p, (lo, hi) in space.param_ranges.items(): + params[p] = rng.uniform(lo, hi) + else: + params = dict(fixed_default.params) + + return PipelineConfig(family=family, implementations=impl, params=params) + + +def mutate_config( + rng: random.Random, + cfg: PipelineConfig, + mode: SearchSpaceMode = SearchSpaceMode.JOINT, + p_change_impl: float = 0.35, + p_change_param: float = 0.8, +) -> PipelineConfig: + space = get_family_space(cfg.family) + impl = dict(cfg.implementations) + params = dict(cfg.params) + + if mode in (SearchSpaceMode.JOINT, SearchSpaceMode.IMPLEMENTATION_ONLY) and rng.random() < p_change_impl: + t = rng.choice(space.tasks) + choices = [c for c in space.impl_choices[t] if c != impl[t]] + if choices: + impl[t] = rng.choice(choices) + # Occasionally flip a second task implementation to escape local optima. 
+ if rng.random() < 0.25: + t2 = rng.choice([x for x in space.tasks if x != t]) + choices2 = [c for c in space.impl_choices[t2] if c != impl[t2]] + if choices2: + impl[t2] = rng.choice(choices2) + + if mode in (SearchSpaceMode.JOINT, SearchSpaceMode.PARAMETER_ONLY) and rng.random() < p_change_param: + p = rng.choice(list(space.param_ranges.keys())) + lo, hi = space.param_ranges[p] + # Gaussian step with clipping keeps changes local. + step = rng.gauss(0.0, (hi - lo) * 0.08) + params[p] = min(hi, max(lo, params[p] + step)) + if rng.random() < 0.25: + p2 = rng.choice([x for x in space.param_ranges.keys() if x != p]) + lo2, hi2 = space.param_ranges[p2] + step2 = rng.gauss(0.0, (hi2 - lo2) * 0.06) + params[p2] = min(hi2, max(lo2, params[p2] + step2)) + + return PipelineConfig(family=cfg.family, implementations=impl, params=params) + diff --git a/experiments/param-opti/src/qap_mock/stats.py b/experiments/param-opti/src/qap_mock/stats.py new file mode 100644 index 0000000..9232ff1 --- /dev/null +++ b/experiments/param-opti/src/qap_mock/stats.py @@ -0,0 +1,58 @@ +from __future__ import annotations + +import math +from typing import Iterable, List, Sequence, Tuple + + +def mean(xs: Sequence[float]) -> float: + return sum(xs) / len(xs) if xs else float("nan") + + +def stdev(xs: Sequence[float]) -> float: + if len(xs) < 2: + return float("nan") + m = mean(xs) + return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) + + +def rankdata(xs: Sequence[float]) -> List[int]: + # Simple dense ranking (ties get same rank). 
+ sorted_unique = sorted(set(xs)) + rank = {v: i + 1 for i, v in enumerate(sorted_unique)} + return [rank[v] for v in xs] + + +def pearsonr(x: Sequence[float], y: Sequence[float]) -> float: + if len(x) != len(y) or len(x) < 2: + return float("nan") + mx = mean(x) + my = mean(y) + num = sum((a - mx) * (b - my) for a, b in zip(x, y)) + denx = math.sqrt(sum((a - mx) ** 2 for a in x)) + deny = math.sqrt(sum((b - my) ** 2 for b in y)) + if denx == 0.0 or deny == 0.0: + return float("nan") + return num / (denx * deny) + + +def spearmanr(x: Sequence[float], y: Sequence[float]) -> float: + rx = rankdata(x) + ry = rankdata(y) + return pearsonr(rx, ry) + + +def mae(x: Sequence[float], y: Sequence[float]) -> float: + if len(x) != len(y) or not x: + return float("nan") + return sum(abs(a - b) for a, b in zip(x, y)) / len(x) + + +def topk_agreement(x: Sequence[float], y: Sequence[float], k: int) -> float: + if len(x) != len(y) or not x: + return float("nan") + n = len(x) + k = max(1, min(k, n)) + topx = set(sorted(range(n), key=lambda i: x[i], reverse=True)[:k]) + topy = set(sorted(range(n), key=lambda i: y[i], reverse=True)[:k]) + return len(topx & topy) / k + diff --git a/experiments/param-opti/wrappers/genie/Dockerfile b/experiments/param-opti/wrappers/genie/Dockerfile new file mode 100644 index 0000000..a66625e --- /dev/null +++ b/experiments/param-opti/wrappers/genie/Dockerfile @@ -0,0 +1,30 @@ +FROM python:3.8-slim + +WORKDIR /app + +RUN apt-get update && apt-get install -y --no-install-recommends \ + git wget unzip\ + && rm -rf /var/lib/apt/lists/* + +RUN git clone https://github.com/epfl-dlab/GenIE.git + +WORKDIR /app/GenIE + +RUN pip install --upgrade pip + +RUN pip install -r pip_requirements.txt + +RUN mkdir -p data/models + +# Models initialized with a pretrained language model (GenIE - PLM) Trained on Rebel +RUN wget https://zenodo.org/record/6139236/files/genie_plm_r.ckpt \ + -O data/models/genie_plm_r.ckpt + +RUN wget 
https://zenodo.org/record/6139236/files/tries.zip \ + && unzip tries.zip -d data && rm tries.zip + +COPY bin/genie_cli.py /app/GenIE/genie_cli.py +COPY genie.sh /usr/local/bin/genie.sh +RUN chmod +x /usr/local/bin/genie.sh + + diff --git a/experiments/param-opti/wrappers/genie/README.md b/experiments/param-opti/wrappers/genie/README.md new file mode 100644 index 0000000..4b21480 --- /dev/null +++ b/experiments/param-opti/wrappers/genie/README.md @@ -0,0 +1,79 @@ +# README.md +## Build Docker +```bash +docker build -t genie . +``` + +## Run Docker +```bash +docker run --rm \ + -v /home/theo/Work/SCADS.AI/Projects/KGpipe/experiments/text-pipelines/test/Titanic.txt:/data/input.txt \ + -v /home/theo/Work/SCADS.AI/Projects/KGpipe/experiments/text-pipelines/wrappers/genie/output.json:/data/output.json \ + genie genie.sh /data/input.txt /data/output.json +``` + + +## Tool Parameters + +### Model Parameters +- `checkpoint` (pretrained model) **or** +- `hydra` + +--- + +### Constraint Parameters +- `entity_trie` (pickle) **or** string list +- `relation_trie` (pickle) **or** string list + +--- + +### Generate Parameters +Uses the standard Transformers `generate()` function. 
+ +#### Beam Search +- `num_beams` +- `num_return_sequences` +- `early_stopping` +- `length_penalty` + +#### Sampling +- `do_sample` +- `temperature` +- `top_k` +- `top_p` +- `typical_p` + +#### Output Length +- `max_length` +- `max_new_tokens` +- `min_length` +- `min_new_tokens` + +#### Scores & Debug +- `return_dict_in_generate` +- `output_scores` +- `output_attentions` +- `output_hidden_states` +- `output_logits` + +#### Seed +- `seed` + +#### Token-Control +- `bos_token_id` +- `eos_token_id` +- `pad_token_id` +- `decoder_start_token_id` +- `forced_bos_token_id` +- `forced_eos_token_id` + +#### Repetition / Constraints +- `repetition_penalty` +- `no_repeat_ngram_size` +- `bad_words_ids` +- `force_words_ids` +- `constraints` +- `prefix_allowed_tokens_fn` + + + diff --git a/experiments/param-opti/wrappers/genie/bin/genie_cli.py b/experiments/param-opti/wrappers/genie/bin/genie_cli.py new file mode 100644 index 0000000..3382e36 --- /dev/null +++ b/experiments/param-opti/wrappers/genie/bin/genie_cli.py @@ -0,0 +1,126 @@ +import sys +import os +import json +import re + +from genie.models import GeniePL +from genie.constrained_generation import Trie + +DATA_DIR = os.path.join(os.getcwd(), "data") + + +def load_model(): + ckpt_name = "genie_plm_r.ckpt" + path_to_checkpoint = os.path.join(DATA_DIR, "models", ckpt_name) + + model = GeniePL.load_from_checkpoint( + checkpoint_path=path_to_checkpoint + ) + + return model + + +def load_tries(): + entity_trie_path = os.path.join(DATA_DIR, "tries/large/entity_trie.pickle") + entity_trie = Trie.load(entity_trie_path) + + relation_trie_path = os.path.join(DATA_DIR, "tries/large/relation_trie.pickle") + relation_trie = Trie.load(relation_trie_path) + + return {"entity_trie": entity_trie, "relation_trie": relation_trie} + + +def split_into_sentences(text: str): + text = re.sub(r"\s+", " ", text).strip() + if not text: + return [] + + # Keep this lightweight so folder mode still benefits from a + # single long-lived Python process 
without extra tokenizer deps. + parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9\"'(\[])", + text) + return [part.strip() for part in parts if part.strip()] + + +def extract_file(model, tries, input_path: str, output_path: str): + with open(input_path, "r", encoding="utf-8") as f: + text = f.read() + + sentences = split_into_sentences(text) + if not sentences: + with open(output_path, "w", encoding="utf-8") as f: + json.dump([], f, indent=2, ensure_ascii=False) + return + + generation_args = { + "num_beams": 5, + "num_return_sequences": 1, + "max_length": 128, + "early_stopping": True, + "no_repeat_ngram_size": 3, + "repetition_penalty": 1.2, + "length_penalty": 0.8, + "return_dict_in_generate": True, + "output_scores": True, + } + + outputs = model.sample( + sentences, + **tries, + **generation_args, + ) + + with open(output_path, "w", encoding="utf-8") as f: + json.dump(outputs, f, indent=2, ensure_ascii=False) + + +def main(): + if len(sys.argv) < 3: + print("Usage: genie.sh ") + sys.exit(1) + + input_path = sys.argv[1] + output_path = sys.argv[2] + + if os.path.isdir(input_path): + if os.path.isfile(output_path): + raise SystemExit("Error: output must be a folder when input is a folder") + + os.makedirs(output_path, exist_ok=True) + + model = load_model() + tries = load_tries() + + files = [ + os.path.join(input_path, name) + for name in os.listdir(input_path) + if os.path.isfile(os.path.join(input_path, name)) + ] + files.sort() + + for in_file in files: + filename = os.path.basename(in_file) + out_file = os.path.join(output_path, filename) + if os.path.exists(out_file): + continue + extract_file(model, tries, in_file, out_file) + print(f"Processed {in_file} → {out_file}") + + print(f"Extraction finished. 
Results written to folder {output_path}") + return + + if os.path.isfile(input_path): + if os.path.isdir(output_path): + raise SystemExit("Error: output must be a file when input is a file") + + model = load_model() + tries = load_tries() + extract_file(model, tries, input_path, output_path) + print(f"Extraction finished. Results written to {output_path}") + return + + raise SystemExit("Error: input must be a file or directory") + + +if __name__ == "__main__": + main() diff --git a/experiments/param-opti/wrappers/genie/genie.sh b/experiments/param-opti/wrappers/genie/genie.sh new file mode 100644 index 0000000..71fb229 --- /dev/null +++ b/experiments/param-opti/wrappers/genie/genie.sh @@ -0,0 +1,43 @@ +#!/usr/bin/env bash +set -e + +if [ "$#" -ne 2 ]; then + echo "Usage:" + echo " genie.sh " + echo " genie.sh " + exit 1 +fi + +INPUT="$1" +OUTPUT="$2" + +GRAPHENE_DIR="/app/GenIE" + +if [ -f "$INPUT" ]; then + if [ -d "$OUTPUT" ]; then + echo "Error: Output must be a file when input is a file" + exit 1 + fi + + echo "Processing single file..." + python /app/GenIE/genie_cli.py "$INPUT" "$OUTPUT" + echo "Done." + exit 0 +fi + +if [ -d "$INPUT" ]; then + if [ -f "$OUTPUT" ]; then + echo "Error: Output must be a folder when input is a folder" + exit 1 + fi + mkdir -p "$OUTPUT" + chmod 777 "$OUTPUT" + + echo "Processing folder..." + python /app/GenIE/genie_cli.py "$INPUT" "$OUTPUT" + echo "All files processed." 
+ exit 0 +fi + +echo "Error: Input must be a file or directory" +exit 1 \ No newline at end of file diff --git a/experiments/param-opti/wrappers/genie/output.json b/experiments/param-opti/wrappers/genie/output.json new file mode 100644 index 0000000..aa99d3e --- /dev/null +++ b/experiments/param-opti/wrappers/genie/output.json @@ -0,0 +1,20 @@ +[ + [ + { + "text": " Captain America publisher Marvel Comics ", + "log_prob": -0.8582733273506165 + } + ], + [ + { + "text": " Marvel Studios parent organization Paramount Pictures ", + "log_prob": -0.5044617652893066 + } + ], + [ + { + "text": " El Capitan Theatre country United States ", + "log_prob": -0.5770944952964783 + } + ] +] \ No newline at end of file diff --git a/experiments/param-opti/wrappers/genie/output_1.json b/experiments/param-opti/wrappers/genie/output_1.json new file mode 100644 index 0000000..4972807 --- /dev/null +++ b/experiments/param-opti/wrappers/genie/output_1.json @@ -0,0 +1,26 @@ +[ + [ + { + "text": " Captain America publisher Marvel Comics ", + "log_prob": -0.4791736900806427 + } + ], + [ + { + "text": " Marvel Cinematic Universe production company Marvel Studios ", + "log_prob": -0.33403322100639343 + } + ], + [ + { + "text": " Captain America performer Chris Evans (actor) ", + "log_prob": -0.46612370014190674 + } + ], + [ + { + "text": " Captain America conflict World War II ", + "log_prob": -0.39299872517585754 + } + ] +] \ No newline at end of file diff --git a/experiments/param-opti/wrappers/genie/test.txt b/experiments/param-opti/wrappers/genie/test.txt new file mode 100644 index 0000000..4eb32be --- /dev/null +++ b/experiments/param-opti/wrappers/genie/test.txt @@ -0,0 +1,3 @@ +Captain America: The First Avenger is a 2011 American superhero film based on the Marvel Comics character Captain America. Produced by Marvel Studios and distributed by Paramount Pictures, it is the fifth film in the Marvel Cinematic Universe (MCU). 
The film was directed by Joe Johnston, written by Christopher Markus and Stephen McFeely, and stars Chris Evans as Steve Rogers / Captain America alongside Tommy Lee Jones, Hugo Weaving, Hayley Atwell, Sebastian Stan, Dominic Cooper, Toby Jones, Neal McDonough, Derek Luke, and Stanley Tucci. During World War II, Rogers, a frail man, is transformed into the super-soldier Captain America and must stop the Red Skull (Weaving) from using the Tesseract as an energy source for world domination. +The film began as a concept in 1997 and was scheduled for distribution by Artisan Entertainment. However, a lawsuit disrupted the project and was not settled until September 2003. In 2005, Marvel Studios received a loan from Merrill Lynch, and planned to finance and release the film through Paramount Pictures. Directors Jon Favreau and Louis Leterrier were interested in directing the project before Johnston was approached in 2008. The principal characters were cast between March and June 2010. Production began in June, and filming took place in London, Manchester, Caerwent, Liverpool, and Los Angeles. Several different techniques were used by the visual effects company Lola to create the physical appearance of the character before he becomes Captain America. +Captain America: The First Avenger premiered at the El Capitan Theatre in Los Angeles on July 19, 2011, and was released in the United States on July 22, as part of Phase One of the MCU. The film was commercially successful, grossing over $370 million worldwide, and received positive reviews from critics, who praised Evans' performance, the film's depiction of its 1940s time period, and Johnston's direction. Two direct sequels have been released: Captain America: The Winter Soldier (2014) and Captain America: Civil War (2016). 
diff --git a/experiments/param-opti/wrappers/genie/test_1.txt b/experiments/param-opti/wrappers/genie/test_1.txt new file mode 100644 index 0000000..ea6f33e --- /dev/null +++ b/experiments/param-opti/wrappers/genie/test_1.txt @@ -0,0 +1 @@ +Captain America: The First Avenger is a 2011 American superhero film based on the Marvel Comics character Captain America. Produced by Marvel Studios and distributed by Paramount Pictures, it is the fifth film in the Marvel Cinematic Universe (MCU). The film was directed by Joe Johnston, written by Christopher Markus and Stephen McFeely, and stars Chris Evans as Steve Rogers / Captain America alongside Tommy Lee Jones, Hugo Weaving, Hayley Atwell, Sebastian Stan, Dominic Cooper, Toby Jones, Neal McDonough, Derek Luke, and Stanley Tucci. During World War II, Rogers, a frail man, is transformed into the super-soldier Captain America and must stop the Red Skull (Weaving) from using the Tesseract as an energy source for world domination. diff --git a/experiments/param-opti/wrappers/genie/test_docker_run.sh b/experiments/param-opti/wrappers/genie/test_docker_run.sh new file mode 100644 index 0000000..9b00275 --- /dev/null +++ b/experiments/param-opti/wrappers/genie/test_docker_run.sh @@ -0,0 +1 @@ +docker run -v $(pwd):$(pwd) genie genie.sh $(pwd)/test_1.txt $(pwd)/output_1.json \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..cafa4aa --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,55 @@ +site_name: KGpipe +site_description: Knowledge Graph pipeline evaluation framework + +# For GitHub Pages under // +use_directory_urls: true + +theme: + name: material + features: + - navigation.instant + - navigation.tracking + - navigation.sections + - navigation.expand + - navigation.top + - toc.integrate + - search.suggest + - search.highlight + +markdown_extensions: + - admonition + - toc: + permalink: true + - pymdownx.superfences + - pymdownx.details + +plugins: + - search + +docs_dir: docs 
+site_dir: site + +nav: + - Home: index.md + - Quickstart: quickstart.md + - KGI-Bench (benchmark site): https://scads.github.io/KGI-Bench/ + - Concepts: + - Tasks: tasks.md + - Pipelines: pipelines.md + - Configuration: configuration.md + - Parameters: parameters.md + - Meta KG: metakg.md + - Evaluation: + - Overview: evaluation.md + - Metrics index: metrics/metrics.md + - Entity coverage: metrics/entity_coverage.md + - Reference entity alignment: metrics/reference_entity_alignment.md + - Reference triple alignment: metrics/reference_triple_alignment.md + - Stats counts: metrics/stats_counts.md + - Experiments: + - Reproduce MovieKG: reproduce.md + - Other: + - Adoption (integrating existing pipelines): adoption.md + - View/UI: view.md + - Building docs: create-docs.md + - Migration (renamed): migration.md diff --git a/pyproject.toml b/pyproject.toml index 8c4e40b..13cbf96 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -21,8 +21,8 @@ dependencies = [ "rdflib>=6.0.0", "matplotlib>=3.5.0", "networkx>=2.8.0", - "transformers>=4.50.0", - "sentence_transformers>=4.1.0", + # ML stack (torch/transformers) is intentionally NOT in base deps. + # Install via `pip/uv pip install ".[ml]"` plus the desired torch index (CPU/CUDA). 
"pulp>=3.3.0", "pytest>=8.4.2", "dotenv>=0.9.9", @@ -39,6 +39,58 @@ dependencies = [ [project.optional-dependencies] dev = ["pytest", "pytest-mock", "pytest-cov", "ruff", "black"] +docs = [ + "mkdocs-material", + "mkdocstrings[python]", +] +cpu = [ + "torch", + "torchvision", + "torchaudio", +] +cuda = [ + "torch", + "torchvision", + "torchaudio", +] +ml = [ + "transformers>=4.50.0", + "sentence_transformers>=4.1.0", +] + +[tool.uv] +conflicts = [ + [ + { extra = "cpu" }, + { extra = "cuda" }, + ], +] + +[tool.uv.sources] +torch = [ + { index = "pytorch-cpu", extra = "cpu" }, + { index = "pytorch-cuda", extra = "cuda" }, +] +torchvision = [ + { index = "pytorch-cpu", extra = "cpu" }, + { index = "pytorch-cuda", extra = "cuda" }, +] +torchaudio = [ + { index = "pytorch-cpu", extra = "cpu" }, + { index = "pytorch-cuda", extra = "cuda" }, +] + +[[tool.uv.index]] +name = "pytorch-cpu" +url = "https://download.pytorch.org/whl/cpu" +explicit = true + +# CUDA wheels live on a separate PyTorch index. +# If you need a different CUDA version, change the URL (e.g. `cu128`, `cu126`, `cu121`). 
+[[tool.uv.index]] +name = "pytorch-cuda" +url = "https://download.pytorch.org/whl/cu130" +explicit = true [tool.setuptools.packages.find] where = ["src"] diff --git a/src/kgpipe/cli/eval_new.py b/src/kgpipe/cli/eval_new.py new file mode 100644 index 0000000..a86f141 --- /dev/null +++ b/src/kgpipe/cli/eval_new.py @@ -0,0 +1,387 @@ +import click +from rich.console import Console +from rich.table import Table +from typing import List, Optional, Sequence, Any +import json +from pathlib import Path +import codecs + +from kgpipe_eval.metrics.statistics import CountMetric +from kgpipe_eval.metrics.duplicates import DuplicateMetric +from kgpipe_eval.metrics.entity_alignment import EntityAlignmentMetric +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentMetric +from kgpipe_eval.metrics.consistency_violations import DisjointDomainMetric, DomainMetric, RangeMetric, RelationDirectionMetric, DatatypeMetric, DatatypeFormatMetric +from kgpipe_eval.utils.kg_utils import KgManager +from kgpipe_eval.utils.metric_utils import MeasurementKey, parse_eval_results, write_eval_csv +from kgpipe_eval.config.manager import load_metric_configs, write_default_config_yaml +from kgpipe_eval.evaluator import Evaluator +# from kgpipe_eval.metrics.semantic import OntologyClassCoverageMetric, OntologyRelationCoverageMetric, OntologyNamespaceCoverageMetric +# from kgpipe_eval.metrics.reference import PrecisionMetric, RecallMetric, F1ScoreMetric +# from kgpipe_eval.metrics.efficiency import RuntimeMetric, MemoryUsageMetric, CostMetric +# from kgpipe_eval.metrics.quality import QualityMetric +# from kgpipe_eval.metrics.completeness import CompletenessMetric +# from kgpipe_eval.metrics.accuracy import AccuracyMetric + +console = Console() + +_DEFAULT_EVAL_RESULTS_ALLOWLIST = { + "DuplicateMetric": { + "duplicates": "number", + "entity_count": "number", + "duplicates_ratio": "percentage", + } +} + +def _measurement_key_to_col(k: MeasurementKey) -> str: + return 
f"{k.metric}__{k.measurement}__{k.unit}" + + +def _col_to_measurement_key(col: str) -> MeasurementKey: + parts = col.split("__") + if len(parts) != 3 or not all(parts): + raise click.ClickException( + f"Invalid selection '{col}'. Expected format: ____" + ) + return MeasurementKey(metric=parts[0], measurement=parts[1], unit=parts[2]) + + +def _available_eval_result_keys(paths: list[Path]) -> list[MeasurementKey]: + keys: set[MeasurementKey] = set() + for p in paths: + flat = parse_eval_results(p) + keys.update(flat.keys()) + return sorted(keys, key=_measurement_key_to_col) + +def _decode_single_char_delimiter(delimiter: str) -> str: + """ + Allow passing common escape sequences like '\\t' for tab. + """ + decoded = codecs.decode(delimiter, "unicode_escape") if "\\" in delimiter else delimiter + if len(decoded) != 1: + raise click.ClickException( + f"--delimiter must be a single character (you passed {delimiter!r} -> {decoded!r})" + ) + return decoded + + +def _available_metric_instances() -> dict[str, Any]: + # Keep this explicit until the metrics package is more complete/stable. 
+ return { + "CountMetric": CountMetric(), + "DuplicateMetric": DuplicateMetric(), + "EntityAlignmentMetric": EntityAlignmentMetric(), + "TripleAlignmentMetric": TripleAlignmentMetric(), + "DisjointDomainMetric": DisjointDomainMetric(), + "DomainMetric": DomainMetric(), + "RangeMetric": RangeMetric(), + "RelationDirectionMetric": RelationDirectionMetric(), + "DatatypeMetric": DatatypeMetric(), + "DatatypeFormatMetric": DatatypeFormatMetric(), + } + +def _normalize_key(k: str) -> str: + return k.strip().lower().replace("-", "_") + + +def _metric_key(metric: Any) -> str: + return getattr(metric, "key", metric.__class__.__name__) + + +def _metric_description(metric: Any) -> str: + cls = metric.__class__ + desc = getattr(cls, "description", None) + if desc: + return str(desc).strip() + if cls.__doc__: + return cls.__doc__.strip().split("\n")[0] + compute_doc = cls.compute.__doc__ + if compute_doc: + return compute_doc.strip().split("\n")[0] + return "—" + + +def _render_available_metrics_table() -> None: + metrics = _available_metric_instances() + table = Table(title="Available metrics (eval-new)") + table.add_column("Name", style="cyan") + table.add_column("Description", style="green") + + for name in sorted(metrics.keys()): + table.add_row(name, _metric_description(metrics[name])) + + console.print(table) + console.print( + f"[dim]{len(metrics)} metric(s). " + "Pass one or more with `eval-new run -m `.[/dim]" + ) + + +def _build_confs_for_selected_metrics( + selected_metric_instances: list[Any], + loaded_confs: dict[str, Any], +) -> dict[str, Any]: + """ + Convert configs loaded from YAML (keyed by YAML metric id) into a dict keyed by + metric class name / `.key` (what Evaluator uses). 
+ """ + confs_by_norm = {_normalize_key(k): v for k, v in loaded_confs.items()} + out: dict[str, Any] = {} + + # Common YAML → class-name aliases + alias_to_metric_key: dict[str, str] = { + "duplicates": "DuplicateMetric", + "duplicate": "DuplicateMetric", + "entity_align": "EntityAlignmentMetric", + "entity_alignment": "EntityAlignmentMetric", + } + + for metric in selected_metric_instances: + mkey = _metric_key(metric) + norm_mkey = _normalize_key(mkey) + norm_cls = _normalize_key(metric.__class__.__name__) + + # Try common YAML ids derived from metric names + base_from_key = norm_mkey.replace("_metric", "").replace("metric", "") + base_from_cls = norm_cls.replace("_metric", "").replace("metric", "") + + cfg = ( + confs_by_norm.get(norm_mkey) + or confs_by_norm.get(norm_cls) + # aliases map YAML id -> metric key, so match them by their target + or next((confs_by_norm[a] for a, t in alias_to_metric_key.items() if t in (mkey, metric.__class__.__name__) and a in confs_by_norm), None) + or confs_by_norm.get(base_from_key) + or confs_by_norm.get(base_from_cls) + # plural fallback (e.g. 
DuplicateMetric -> duplicates) + or confs_by_norm.get(f"{base_from_key}s") + or confs_by_norm.get(f"{base_from_cls}s") + ) + + if cfg is not None: + out[mkey] = cfg + out[metric.__class__.__name__] = cfg + + return out + + + +def _render_results_table(kg_path: str, metric_key: str, measurements: Sequence[Any], summary: Optional[str]) -> None: + table = Table(title=f"{Path(kg_path).name} — {metric_key}") + table.add_column("Measurement", style="cyan") + table.add_column("Value", style="green") + table.add_column("Unit", style="magenta") + + for m in measurements: + unit = getattr(m, "unit", None) + value = getattr(m, "value", None) + name = getattr(m, "name", None) + table.add_row(str(name), json.dumps(value, ensure_ascii=False, default=str) if not isinstance(value, (str, int, float, bool)) else str(value), "" if unit is None else str(unit)) + + console.print(table) + if summary: + console.print(f"[dim]{summary}[/dim]") + console.print("") + + +def _results_to_json_rows(kg_path: str, metric_key: str, measurements: Sequence[Any], summary: Optional[str]) -> list[dict[str, Any]]: + rows: list[dict[str, Any]] = [] + for m in measurements: + rows.append( + { + "kg_path": kg_path, + "metric": metric_key, + "measurement": getattr(m, "name", None), + "value": getattr(m, "value", None), + "unit": getattr(m, "unit", None), + "summary": summary, + } + ) + return rows + + +@click.group(name="eval-new") +def eval_new_cmd() -> None: + """ + Evaluation commands for the new metric framework. + """ + + +@eval_new_cmd.command(name="list") +def list_metrics_cmd() -> None: + """ + List all metrics available to `eval-new run`. 
+ """ + _render_available_metrics_table() + + +@eval_new_cmd.command(name="run") +@click.argument("kg_paths", nargs=-1, type=click.Path(exists=True)) +@click.option( + "--config", + "-c", + type=click.Path(exists=True), + help="Path to metric config file", +) +@click.option( + "--metrics", + "-m", + multiple=True, + type=click.Choice(sorted(_available_metric_instances().keys())), + help="Metrics to compute", +) +@click.option( + "--output", + "-o", + type=click.Path(dir_okay=False), + help="Write results to a JSON file (list of measurement rows).", +) +@click.pass_context +def run_cmd(ctx: click.Context, kg_paths: List[str], config: Optional[str], metrics: tuple, output: Optional[str]) -> None: + """ + Compute selected metrics for one or more KGs. + + KG_PATHS: one or more RDF files/directories that RDFLib can parse. + """ + metric_instances = _available_metric_instances() + selected_metrics = list(metrics) if metrics else list(metric_instances.keys()) + + unknown = [m for m in selected_metrics if m not in metric_instances] + if unknown: + raise click.ClickException(f"Unknown metrics: {', '.join(unknown)}") + + loaded_metric_confs: dict[str, Any] = {} + if config: + loaded_metric_confs = load_metric_configs(config) + + all_rows: list[dict[str, Any]] = [] + + for kg_path in kg_paths: + console.print(f"[bold blue]Evaluating:[/bold blue] {kg_path}") + kg_graph = KgManager.load_kg_from_path(Path(kg_path)) + try: + selected_metric_instances = [metric_instances[k] for k in selected_metrics] + confs = _build_confs_for_selected_metrics(selected_metric_instances, loaded_metric_confs) + + results = Evaluator().run(kg=kg_graph, metrics=selected_metric_instances, confs=confs) + for res in results: + metric_key = _metric_key(res.metric) + _render_results_table(kg_path, metric_key, res.measurements, getattr(res, "summary", None)) + all_rows.extend(_results_to_json_rows(kg_path, metric_key, res.measurements, getattr(res, "summary", None))) + finally: + 
KgManager.unload_kg(kg_graph) + + if output: + out_path = Path(output) + out_path.parent.mkdir(parents=True, exist_ok=True) + out_path.write_text(json.dumps(all_rows, indent=2, ensure_ascii=False, default=str) + "\n", encoding="utf-8") + console.print(f"[green]✓ Saved results to[/green] {output}") + + +@eval_new_cmd.command(name="init-config") +@click.argument("output_path", type=click.Path(dir_okay=False), default="eval.default.yaml", required=False) +def init_config_cmd(output_path: str) -> None: + """ + Write a default metric-config template YAML to OUTPUT_PATH. + """ + out = write_default_config_yaml(output_path) + console.print(f"[green]✓ Wrote default config to[/green] {out}") + + +@eval_new_cmd.command(name="to-csv") +@click.argument("eval_json_paths", nargs=-1, type=click.Path(exists=True, dir_okay=False)) +@click.option( + "--glob", + "glob_pattern", + type=str, + help="Optional glob pattern for eval_results.json files, expanded relative to the current working directory (quote it to prevent shell expansion).", +) +@click.option( + "--select", + "-s", + "selected_cols", + multiple=True, + help="Select columns to include (repeatable). Format: ____. 
If omitted, defaults are used.", +) +@click.option( + "--list-keys", + is_flag=True, + help="Print available column keys found in the inputs and exit.", +) +@click.option( + "--round", + "round_ndigits", + type=int, + default=None, + help="Round float values to N decimal digits before writing CSV.", +) +@click.option( + "--delimiter", + "delimiter", + type=str, + default=",", + show_default=True, + help="CSV delimiter character (supports escapes like '\\t').", +) +@click.option( + "--output", + "-o", + "output_csv", + type=click.Path(dir_okay=False), + required=True, + help="Path to write the CSV table to.", +) +def to_csv_cmd( + eval_json_paths: List[str], + glob_pattern: Optional[str], + selected_cols: tuple[str, ...], + list_keys: bool, + round_ndigits: Optional[int], + delimiter: str, + output_csv: str, +) -> None: + """ + Convert one or more `eval_results.json` files into a CSV table. + + The CSV contains one row per (pipeline, stage), derived from file paths like: + `/stage_/eval_results.json` + + Columns follow: `____`. + """ + paths: list[Path] = [Path(p) for p in eval_json_paths] + if glob_pattern: + paths.extend(sorted(Path().glob(glob_pattern))) + + if not paths: + raise click.ClickException("No input files provided. 
Pass paths or --glob.") + + available = _available_eval_result_keys(paths) + console.print("[bold]Available keys in inputs:[/bold]") + for k in available: + console.print(f" - {_measurement_key_to_col(k)}") + + if list_keys: + return + + allowlist = _DEFAULT_EVAL_RESULTS_ALLOWLIST + if selected_cols: + available_cols = {_measurement_key_to_col(k) for k in available} + missing = [c for c in selected_cols if c not in available_cols] + if missing: + raise click.ClickException( + "Selected keys not found in inputs:\n" + "\n".join(f"- {m}" for m in missing) + ) + + allowlist = {} + for c in selected_cols: + k = _col_to_measurement_key(c) + allowlist.setdefault(k.metric, {})[k.measurement] = k.unit + + out_path = Path(output_csv) + delimiter = _decode_single_char_delimiter(delimiter) + write_eval_csv( + paths, + out_path=out_path, + allowlist=allowlist, + delimiter=delimiter, + round_ndigits=round_ndigits, + ) + console.print(f"[green]✓ Wrote CSV to[/green] {out_path}") \ No newline at end of file diff --git a/src/kgpipe/cli/list.py b/src/kgpipe/cli/list.py index df0ad90..daa3a6f 100644 --- a/src/kgpipe/cli/list.py +++ b/src/kgpipe/cli/list.py @@ -45,14 +45,18 @@ def show_registered_tasks(format: str = "table") -> None: tasks = get_registered_tasks() for task in tasks: - table.add_row( - task.name, - ", ".join(getattr(task, 'category', [])), - getattr(task, 'description', 'N/A'), - str(getattr(task, 'input_spec', 'N/A')), - str(getattr(task, 'output_spec', 'N/A')), - "/".join(function_location(task.function).split(".")[:-1]) - ) + try: + table.add_row( + task.name, + ", ".join(getattr(task, 'category', [])), + getattr(task, 'description', 'N/A'), + str(getattr(task, 'input_spec', 'N/A')), + str(getattr(task, 'output_spec', 'N/A')), + "/".join(function_location(task.function).split(".")[:-1]) + ) + except Exception as e: + print(f"Error adding task {task.name}: {e}") + continue if format == "table": console.print(table) diff --git a/src/kgpipe/cli/main.py 
b/src/kgpipe/cli/main.py index fba6104..6f3e0c0 100644 --- a/src/kgpipe/cli/main.py +++ b/src/kgpipe/cli/main.py @@ -20,6 +20,7 @@ from .clean import clean_cmd from .task import task_cmd from .discover import discover_cmd +from .eval_new import eval_new_cmd # from .rank import rank_cmd # Initialize Rich console for pretty output console = Console() @@ -81,6 +82,7 @@ def cli(ctx: click.Context, config: Optional[str], verbose: bool, quiet: bool): cli.add_command(clean_cmd) cli.add_command(task_cmd) cli.add_command(discover_cmd) +cli.add_command(eval_new_cmd) # cli.add_command(rank_cmd) if __name__ == "__main__": diff --git a/src/kgpipe/common/__init__.py b/src/kgpipe/common/__init__.py index 33cf414..b800851 100644 --- a/src/kgpipe/common/__init__.py +++ b/src/kgpipe/common/__init__.py @@ -25,8 +25,9 @@ def setup_logging(log_file='app.log', level=logging.DEBUG): # Call this once at the start of your application setup_logging() +from .annotations import trace_task_run from .models import ( - Data, DataFormat, KgTask, KgTaskReport, DynamicFormat, FormatRegistry, + Data, DataFormat, BasicDataFormats, CustomDataFormats, BasicTaskCategoryCatalog, KgTask, KgTaskReport, DataSet, KG, Metric, EvaluationReport, KgPipe, TaskInput, TaskOutput ) from .registry import Registry @@ -38,8 +39,9 @@ def setup_logging(log_file='app.log', level=logging.DEBUG): ) __all__ = [ - "Data", "DataFormat", "KgTask", "KgTaskReport", "DynamicFormat", "FormatRegistry", + "Data", "DataFormat", "BasicDataFormats", "CustomDataFormats", "BasicTaskCategoryCatalog", "KgTask", "KgTaskReport", "DataSet", "KG", "Stage", "Metric", "EvaluationReport", "KgPipe", "TaskInput", "TaskOutput", + "trace_task_run", "Registry", "get_docker_volume_bindings", "remap_data_path_for_container", "discover_entry_points", "get_registered_tasks", "get_registered_pipelines", diff --git a/src/kgpipe/common/annotations.py b/src/kgpipe/common/annotations.py index ceb24b1..0146571 100644 --- a/src/kgpipe/common/annotations.py +++ 
b/src/kgpipe/common/annotations.py @@ -1,5 +1,5 @@ from rdflib import OWL, RDFS -from kgpipe.common.systemgraph import SYS_KG +from kgpipe.common.graph.systemgraph import SYS_KG, PipeKG from kgcore.api import KGProperty from typing import get_origin, get_args, Union @@ -11,7 +11,7 @@ def kg_class(description: str = ""): as a KG entity (type/Class node) once at import time. """ def decorator(cls): - print("kg_class decorator called for class: ", cls.__name__) + # print("kg_class decorator called for class: ", cls.__name__) # add owl class props = [] if description: @@ -73,4 +73,88 @@ def decorator(cls): SYS_KG.create_relation(source=prop_et.id, target=class_et.id, type=str(RDFS.domain)) return cls - return decorator \ No newline at end of file + return decorator + + + +def trace_metric_run(): pass + + +def trace_task_run(obj): + """ + Mark a task (function or `KgTask`) so that its `.run()` persists a TaskRun in `PipeKG`. + + Works with either decorator order: + + ```python + @trace_task_run + @Registry.task(...) + def my_task(...): ... + + # or + @Registry.task(...) + @trace_task_run + def my_task(...): ... + ``` + """ + setattr(obj, "trace_task_run", True) + # TODO use logger print(f"trace_task_run decorator called for object: {obj.__name__}") + return obj + +def trace_pipeline_run(obj): + """ + Mark a pipeline (function or `KgPipeline`) so that its `.run()` persists a PipelineRun in `PipeKG`. 
+ """ + setattr(obj, "trace_pipeline_run", True) + # TODO use logger print(f"trace_pipeline_run decorator called for object: {obj.__name__}") + return obj + + +# def Track(_cls=None, *, with_timestamp: bool = False): +# """ +# Use as: +# @Track +# @Track(with_timestamp=True) +# """ +# def decorator(cls): +# class Tracked(cls): # subclass the original class +# def __init__(self, *args: Any, **kwargs: Any): +# super().__init__(*args, **kwargs) + +# inst_id = f"{cls.__name__}:{uuid4().hex[:8]}" +# setattr(self, "_kg_id", inst_id) + +# if isinstance(self, BaseModel): +# props = self.model_dump() +# else: +# props = {k: v for k, v in vars(self).items() if not k.startswith("_")} + +# if with_timestamp: +# props["timestamp"] = datetime.now(timezone.utc).isoformat() + +# SYS_KG.create_entity([cls.__name__], id=inst_id, props=props) + +# Tracked.__name__ = cls.__name__ # optional cosmetics +# Tracked.__qualname__ = cls.__qualname__ +# Tracked.__doc__ = cls.__doc__ +# return Tracked + +# return decorator if _cls is None else decorator(_cls) + +# def kg_function(fn): +# @functools.wraps(fn) +# def wrapper(*args, **kwargs): +# result = fn(*args, **kwargs) +# call_id = f"{fn.__name__}:{uuid4().hex[:8]}" +# SYS_KG.create_entity( +# ["FunctionCall"], +# id=call_id, +# props={ +# "name": fn.__name__, +# # Be careful serializing args/kwargs; this is a toy example: +# "args": repr(args), +# "kwargs": repr(kwargs), +# }, +# ) +# return result +# return wrapper diff --git a/src/kgpipe/common/definitions.py b/src/kgpipe/common/definitions.py deleted file mode 100644 index e14c4e8..0000000 --- a/src/kgpipe/common/definitions.py +++ /dev/null @@ -1,234 +0,0 @@ -from dataclasses import dataclass -from sys import implementation -from pydantic import BaseModel -from typing import Mapping, Optional, List, Dict, Any -from kgcore.api.kg import KGId - -from kgpipe.common.model.data import DataFormat - -# Types # - -type schema_format = str - -# Vocabulary # - -from rdflib.namespace import 
DefinedNamespace, Namespace - -class KGPIPE_NS(DefinedNamespace): - _fail = True - _NS = Namespace("http://github.com/ScaDS/kgpipe/") - Task = _NS["Task"] - TaskRun = _NS["TaskRun"] - Method = _NS["Method"] - Tool = _NS["Tool"] - Implementation = _NS["Implementation"] - Parameter = _NS["Parameter"] - ParameterBinding = _NS["ParameterBinding"] - Pipeline = _NS["Pipeline"] - PipelineRun = _NS["PipelineRun"] - Artifact = _NS["Artifact"] - ArtifactType = _NS["ArtifactType"] - Schema = _NS["Schema"] - Metric = _NS["Metric"] - MetricRun = _NS["MetricRun"] - - -# Data # - -class DataHandle(BaseModel): - """ - A handle to a data artifact - - uri: file://example.com/data.txt - type: any/text - timestamp: 2021-01-01 - version: 1.0.0 - hash: 1234567890 - size: 1000 - """ - uri: str - type: schema_format - timestamp: Optional[str] = None - version: Optional[str] = None - hash: Optional[str] = None - size: Optional[int] = None - -# Task # - -# # TODO describing entity vs entity with used values for the task -# class TaskConfiguration(BaseModel): -# key: str -# value: Any - -# class Task(BaseModel): -# """ -# A function that implements a task in a pipeline - -# name: paris_rdf_matcher -# type: entity_resolution -# description: "PARIS java implementation to match two RDF files, producing CSV files..." 
-# input: [any_rdf, any_rdf] -# output: [any_csv] -# """ -# name: str -# type: str -# description: Optional[str] = None -# input: List[schema_format] -# output: List[schema_format] - -# class TaskResult(BaseModel): -# """ -# The result of a task execution including configuration variables -# """ -# task: Task -# config: Dict[str, Any] -# input: List[DataHandle] -# output: List[DataHandle] -# status: str -# duration: float - -# # Evaluation # - -# class Eval(BaseModel): -# """ -# A function that evaluates data produced by tasks -# """ -# name: str -# type: str -# description: Optional[str] = None -# input: List[schema_format] - -# class EvalResult(BaseModel):# -# """ -# Result of an evaluation function -# """ -# eval: Eval -# config: Dict[str, Any] -# input: List[DataHandle] -# output: Dict[str, Any] -# status: str -# duration: float - -# # Pipeline # - -# class Pipeline(BaseModel): -# """ -# The plan of a pipeline -# """ -# tasks: List[Task] -# input: List[schema_format] -# output: List[schema_format] -# pokemon -# class PipelineResult(BaseModel): -# """ -# Result of a pipeline execution -# """ -# task_results: List[TaskResult] -# eval_results: List[EvalResult] -# input: List[DataHandle] -# output: List[DataHandle] -# status: str -# duration: float - -# new changes # - -TaskEntityId = KGId -class TaskEntity(BaseModel): - name: str - hasSubtask: List[TaskEntityId] - -MethodEntityId = KGId -class MethodEntity(BaseModel): - name: str - realizesTask: List[TaskEntityId] - -ToolEntityId = KGId -class ToolEntity(BaseModel): - name: str - # supportsTasks: List[Task] - providesMethods: List[MethodEntityId] - -ParameterId = KGId -class ParameterEntity(BaseModel): - name: str - value: Any - type: str - description: Optional[str] = None - default_value: Optional[Any] = None - required: bool = False - allowed_values: Optional[List[Any]] = None - -ParameterBindingId = KGId -class ParameterBindingEntity(BaseModel): - value: Any - parameter: ParameterId - -ImplementationEntityId = 
KGId -class ImplementationEntity(BaseModel): - uri: Optional[str] = None - name: str - input_spec: List[str] - output_spec: List[str] - implementsMethod: List[MethodEntityId] - hasParameter: List[ParameterId] - usesTool: List[ToolEntityId] - - # interface: str # TODO: Interface - # hasParameter: Parameter - -TaskRunEntityId = KGId -class TaskRunEntity(BaseModel): - number: int - name: str - status: str - started_at: float - ended_at: float - input: List[DataHandle] - output: List[DataHandle] - executesTask: TaskEntityId - usesImplementation: ImplementationEntityId - hasParameterBinding: List[ParameterBindingId] - -# class PipelineDefinitionEntity(BaseModel): -# """ -# The definition of a pipeline -# """ -# placeholder: str -# #definesPipeline: Pipeline - -# TODO issue as the Graph has no ordering of the tasks -class PipelineEntity(BaseModel): - name: str - tasks: List[TaskEntityId] - input: List[DataHandle] - output: List[DataHandle] - -class PipelineRunEntity(BaseModel): - """ - The result of a pipeline execution - """ - name: str - status: str - started_at: float - ended_at: float - hasTaskRun: List[TaskRunEntity] - # usesPipelineDefinition: PipelineDefinition - # runsPipeline: Pipeline - -MetricEntityId = KGId -class MetricEntity(BaseModel): - name: str - description: Optional[str] = None - type: str - # output: List[schema_format] - # hasParameter: List[ParameterId] - -MetricRunEntityId = KGId -class MetricRunEntity(BaseModel): - status: str - started_at: float - ended_at: float - computedMetric: MetricEntityId - input: List[DataHandle] - value: float - details: str \ No newline at end of file diff --git a/src/kgpipe/common/graph/__init__.py b/src/kgpipe/common/graph/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/src/kgpipe/common/graph/definitions.py b/src/kgpipe/common/graph/definitions.py new file mode 100644 index 0000000..8889580 --- /dev/null +++ b/src/kgpipe/common/graph/definitions.py @@ -0,0 +1,286 @@ +from pydantic import 
BaseModel, ConfigDict +from typing import Optional, List, Any +from kgcore.api.kg import KGId + +# Types # + +type schema_format = str +type any_uri = str + +# Vocabulary # + +from rdflib.namespace import DefinedNamespace, Namespace + +class KGPIPE_NS(DefinedNamespace): + _fail = True + _NS = Namespace("http://github.com/ScaDS/kgpipe/") + + Task = _NS["Task"] + TaskRun = _NS["TaskRun"] + Method = _NS["Method"] + Tool = _NS["Tool"] + Implementation = _NS["Implementation"] + Parameter = _NS["Parameter"] + ParameterBinding = _NS["ParameterBinding"] + Pipeline = _NS["Pipeline"] + PipelineRun = _NS["PipelineRun"] + Artifact = _NS["Artifact"] + ArtifactType = _NS["ArtifactType"] + Schema = _NS["Schema"] + Metric = _NS["Metric"] + MetricRun = _NS["MetricRun"] + DataSpec = _NS["DataSpec"] + DataEntity = _NS["Data"] + DataType = _NS["DataType"] + ConfigSpec = _NS["ConfigSpec"] + ConfigBinding = _NS["ConfigBinding"] + + + status = _NS["status"] + started_at = _NS["started_at"] + ended_at = _NS["ended_at"] + schema = _NS["schema"] + format = _NS["format"] + name = _NS["name"] + partOfTask = _NS["partOfTask"] + hasSubtask = _NS["hasSubtask"] + description = _NS["description"] + + version = _NS["version"] + executesTask = _NS["executesTask"] + supportsTask = _NS["supportsTask"] + input = _NS["input"] + output = _NS["output"] + format = _NS["format"] + config_spec = _NS["config_spec"] + + timestamp = _NS["timestamp"] + version = _NS["version"] + hash = _NS["hash"] + size = _NS["size"] + location = _NS["location"] + data_type = _NS["data_type"] + + realisesTask = _NS["realisesTask"] + usesImplementation = _NS["usesImplementation"] + + homepage = _NS["homepage"] + implementsMethod = _NS["implementsMethod"] + usesTool = _NS["usesTool"] + hasParameter = _NS["hasParameter"] + + providesMethod = _NS["providesMethod"] + + key = _NS["key"] + alias_keys = _NS["alias_keys"] + datatype = _NS["datatype"] + required = _NS["required"] + default_value = _NS["default_value"] + allowed_values = 
_NS["allowed_values"] + minimum = _NS["minimum"] + maximum = _NS["maximum"] + unit = _NS["unit"] + value = _NS["value"] + binding = _NS["binding"] + + parameter = _NS["parameter"] + hasParameterBinding = _NS["hasParameterBinding"] + +# Entities # + +DataTypeEntityId = KGId +class DataTypeEntity(BaseModel): + model_config = ConfigDict(frozen=True) + ### object properties ### + format: str + data_schema: str + +DataEntityId = KGId +class DataEntity(BaseModel): + model_config = ConfigDict(frozen=True) + ### datatype properties ### + timestamp: Optional[str] = None + version: Optional[str] = None + hash: Optional[str] = None + size: Optional[int] = None + ### object properties ### + location: any_uri + data_type: DataTypeEntityId + +DataSpecEntityId = KGId +class DataSpecEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + ### object properties ### + data_type: DataTypeEntityId + +TaskEntityId = KGId +class TaskEntity(BaseModel): + model_config = ConfigDict(frozen=True) + name: str + description: Optional[str] = None + partOfTask: Optional[TaskEntityId] = None + +# TODO MethodEntityId = KGId +# TODO class MethodEntity(BaseModel): +# model_config = ConfigDict(frozen=True) +# name: str +# realizesTask: tuple[TaskEntityId, ...] + +ToolEntityId = KGId +class ToolEntity(BaseModel): + model_config = ConfigDict(frozen=True) + ### datatype properties ### + name: str + homepage: Optional[str] = None + ### object properties ### + # NOTE: these entities are used as `lru_cache` keys; must be hashable. + supportsTasks: tuple[TaskEntityId, ...] + # TODO providesMethods: tuple[MethodEntityId, ...] + +ParameterEntityId = KGId +class ParameterEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + key: str + # NOTE: these entities are used as `lru_cache` keys; must be hashable. + alias_keys: tuple[str, ...] 
+ datatype: str + required: bool + default_value: str | int | float | bool + allowed_values: tuple[str | int | float | bool, ...] + # description: Optional[str] = None + # scope: Scope # (training/inference/io/resources) + # constraints + # minimum: Optional[float] = None + # maximum: Optional[float] = None + # unit: Optional[str] = None + +ParameterBindingEntityId = KGId +class ParameterBindingEntity(BaseModel): + value: Any + parameter: ParameterEntityId + +ConfigSpecEntityId = KGId +class ConfigSpecEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + ### object properties ### + # NOTE: these entities are used as `lru_cache` keys; must be hashable. + parameters: tuple[ParameterEntityId, ...] + +ConfigBindingEntityId = KGId +class ConfigBindingEntity(BaseModel): + name: Any + binding: tuple[ParameterBindingEntityId, ...] + +ImplementationEntityId = KGId +class ImplementationEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + version: str + ### object properties ### + input_spec: List[DataSpecEntityId] + output_spec: List[DataSpecEntityId] + realizesTask: List[TaskEntityId] + usesTool: List[ToolEntityId] + config_spec: Optional[ConfigSpecEntityId] = None + + # TODO implementsMethod: List[MethodEntityId] + # TODO interface: str + +TaskRunEntityId = KGId +class TaskRunEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + status: str + started_at: float + ended_at: float + ### object properties ### + input: List[DataEntityId] + output: List[DataEntityId] + # TODO executesTask: TaskEntityId + usesImplementation: ImplementationEntityId + hasConfigBinding: Optional[ConfigBindingEntityId] = None + +# Entity representing a task dag (not the implementation) +# class PipelineDefinitionEntity(BaseModel): +# """ +# The definition of a pipeline +# """ +# 
placeholder: str +# #definesPipeline: Pipeline + +PipelineStepEntityId = KGId +class PipelineStepEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + ### object properties ### + input: List[DataEntityId] + output: List[DataEntityId] + executesTask: TaskEntityId + +# TODO issue as the Graph has no ordering of the tasks +class PipelineEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + ### object properties ### + steps: List[PipelineStepEntityId] + firstStep: PipelineStepEntityId + lastStep: PipelineStepEntityId + input: List[DataEntityId] + output: List[DataEntityId] + +PipelineRunEntityId = KGId +class PipelineRunEntity(BaseModel): + """ + The result of a pipeline execution + """ + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + name: str + status: str + started_at: float + ended_at: float + ### object properties ### + hasTaskRun: List[TaskRunEntity] + # TODO usesPipelineDefinition: PipelineDefinition + # TODO runsPipeline: PipelineStepEntityId + +MetricEntityId = KGId +class MetricEntity(BaseModel): + model_config = ConfigDict(frozen=True) + ### datatype properties ### + name: str + description: Optional[str] = None + type: str # TODO should be an enum + ### object properties ### + # TODO output: List[schema_format] + # TODO hasParameter: List[ParameterId] + +MetricRunEntityId = KGId +class MetricRunEntity(BaseModel): + model_config = ConfigDict(frozen=True) + uri: Optional[str] = None + ### datatype properties ### + status: str + started_at: float + ended_at: float + value: float + details: str # TODO should be a dictionary + ### object properties ### + computedMetric: MetricEntityId + input: List[DataEntityId] diff --git a/src/kgpipe/common/graph/mapper.py b/src/kgpipe/common/graph/mapper.py new file mode 100644 index 0000000..037fd9e --- /dev/null +++ 
b/src/kgpipe/common/graph/mapper.py @@ -0,0 +1,213 @@ +from __future__ import annotations + +from kgpipe.common.config import config +from kgpipe.common.graph.systemgraph import PipeKG +from kgpipe.common.model.default_catalog import TaskCategory +from kgpipe.common.util import encode_string + +from kgpipe.common.graph.definitions import ( + DataEntity, + DataEntityId, + DataSpecEntity, + DataSpecEntityId, + DataTypeEntity, + DataTypeEntityId, + ImplementationEntity, + ImplementationEntityId, + PipelineRunEntity, + PipelineRunEntityId, + TaskEntity, + TaskEntityId, + TaskRunEntity, + TaskRunEntityId, + MetricRunEntity, + MetricRunEntityId, + MetricEntity, + MetricEntityId, + ParameterEntity, + ParameterEntityId, + ParameterBindingEntity, + ParameterBindingEntityId, + ConfigSpecEntity, + ConfigSpecEntityId, + ConfigBindingEntity, + ConfigBindingEntityId, +) + +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from kgpipe.common.model import ( + DataFormat, + KgData, + KgTask, + KgTaskRun, + KgPipelineRun, + KgMetricRun, + KgMetric, + ConfigurationDefinition, + Parameter, + ConfigurationProfile, + ParameterBinding, + ) + from kgpipe.evaluation.base import MetricResult + +def task_to_entity(task: "TaskCategory") -> TaskEntityId: + """Map runtime task definition to a Task entity.""" + name = task + partOfTask = None + if isinstance(task, TaskCategory): + name = task.name + if task.parent: + partOfTask = task_to_entity(task.parent) + task_entity = TaskEntity( + name=name, + partOfTask=partOfTask, + ) + return PipeKG.add_task(task_entity) + +def data_type_to_entity(data_type: DataFormat) -> DataTypeEntityId: + data_type_entity = DataTypeEntity( + format=data_type, + data_schema=data_type, + ) + return PipeKG.add_data_type(data_type_entity) + +def data_spec_to_entity(data_spec: tuple[str, DataFormat], implementation_name: str = "") -> DataSpecEntityId: + data_spec_entity = DataSpecEntity( + uri=config.PIPEKG_PREFIX + encode_string(implementation_name + "_" + 
data_spec[0]), + name=data_spec[0], + data_type=data_type_to_entity(data_spec[1]), + ) + return PipeKG.add_data_spec(data_spec_entity) + +def data_to_entity(data: "KgData") -> DataEntityId: + data_entity = DataEntity( + timestamp=None, # TODO + version=None, # TODO + hash=None, # TODO + size=None, # TODO + location=data.path.as_uri(), + data_type=data_type_to_entity(data.format), + ) + return PipeKG.add_data_entity(data_entity) + +def parameter_to_entity(parameter: "Parameter") -> ParameterEntityId: + parameter_entity = ParameterEntity( + key=parameter.name, + alias_keys=parameter.native_keys, + datatype=parameter.datatype, + required=parameter.required, + default_value=parameter.default_value, + allowed_values=parameter.allowed_values, + # minimum=parameter.minimum, + # maximum=parameter.maximum, + # unit=parameter.unit, + ) + return PipeKG.add_parameter(parameter_entity) + + +def config_spec_to_entity(config_spec: "ConfigurationDefinition", implementation_name: str = "") -> ConfigSpecEntityId: + if config_spec is None: + return None + parameter_entities = [parameter_to_entity(parameter) for parameter in config_spec.parameters] + config_spec_entity = ConfigSpecEntity( + name=config_spec.name, + parameters=parameter_entities, + ) + return PipeKG.add_config_spec(config_spec_entity) + +def implementation_to_entity(implementation: "KgTask") -> ImplementationEntityId: + + input_specs = [data_spec_to_entity(data_spec, implementation.name) for data_spec in implementation.input_spec.items()] + + output_specs = [data_spec_to_entity(data_spec, implementation.name) for data_spec in implementation.output_spec.items()] + + realizes_tasks = [task_to_entity(task) for task in implementation.category] + + config_spec = config_spec_to_entity(implementation.config_spec, implementation.name) + + implementation_entity = ImplementationEntity( + ### datatype properties ### + name=implementation.name, + version="1.0.0", # TODO: get version from implementation + ### object properties ### 
+ input_spec=input_specs, + output_spec=output_specs, + realizesTask=realizes_tasks, + usesTool=[], # TODO add usesTool relations + config_spec=config_spec, + ) + return PipeKG.add_implementation(implementation_entity) + +def metric_to_entity(metric: "KgMetric") -> MetricEntityId: + metric_entity = MetricEntity( + name=metric.name, + description=metric.description, + type=metric.aspect.value, + ) + return PipeKG.add_metric(metric_entity) + + +def parameter_binding_to_entity(parameter_binding: "ParameterBinding") -> ParameterBindingEntityId: + parameter_binding_entity = ParameterBindingEntity( + value=parameter_binding.value, + parameter=parameter_to_entity(parameter_binding.parameter), + ) + return PipeKG.add_parameter_binding(parameter_binding_entity) + +def config_binding_to_entity(config_profile: "ConfigurationProfile") -> ConfigBindingEntityId: + config_binding_entity = ConfigBindingEntity( + name=config_profile.name, + binding=[parameter_binding_to_entity(binding) for binding in config_profile.bindings], + ) + return PipeKG.add_config_binding(config_binding_entity) + +def task_run_to_entity(task_run: "KgTaskRun") -> TaskRunEntityId: + + input=[data_to_entity(data) for data in task_run.inputs] + output=[data_to_entity(data) for data in task_run.outputs] + hasConfigBinding=None # TODO + usesImplementation=implementation_to_entity(task_run.task) + hasConfigBinding=config_binding_to_entity(task_run.config_profile) if task_run.config_profile else None + + print(f"hasConfigBinding: {hasConfigBinding}") + + task_run_entity = TaskRunEntity( + status=task_run.status, + started_at=task_run.start_ts, + ended_at=task_run.start_ts + task_run.duration, + input=input, + output=output, + usesImplementation=usesImplementation, + hasConfigBinding=hasConfigBinding, + ) + return PipeKG.add_task_run(task_run_entity) + +def pipeline_run_to_entity(pipeline_run: "KgPipelineRun") -> PipelineRunEntityId: + pipeline_run_entity = PipelineRunEntity( + name=pipeline_run.name, + 
status=pipeline_run.status, + started_at=pipeline_run.started_at, + ended_at=pipeline_run.ended_at, + ) + return PipeKG.add_pipeline_run(pipeline_run_entity) + +# TODO +# def metric_run_to_entity(metric_run: "MetricResult") -> MetricRunEntityId: +# import time +# import json +# computedMetric = metric_to_entity(metric_run.metric) +# # data_type = data_type_to_entity(DataFormat.ANY) +# input_entities = [KgData(path=metric_run.kg.path, format=DataFormat.ANY)] +# input = [data_to_entity(input_entity) for input_entity in input_entities] +# metric_run_entity = MetricRunEntity( +# status="success", +# started_at=time.time(), +# ended_at=time.time(), +# computedMetric=computedMetric, +# input=input, +# value=metric_run.value, +# details=json.dumps(metric_run.details, default=str) +# ) +# PipeKG.add_metric_run(metric_run_entity) \ No newline at end of file diff --git a/src/kgpipe/common/graph/systemgraph.py b/src/kgpipe/common/graph/systemgraph.py new file mode 100644 index 0000000..7af2736 --- /dev/null +++ b/src/kgpipe/common/graph/systemgraph.py @@ -0,0 +1,352 @@ +import functools +import ast +from uuid import uuid4 +from typing import Any, List, Optional, TYPE_CHECKING +from datetime import datetime, timezone +import hashlib +import json + +from kgcore.api import KnowledgeGraph, KGEntity, KGRelation, KGProperty, new_id +from kgcore.backend.rdf.rdf_rdflib import RDFLibBackend +from kgcore.backend.rdf.rdf_sparql import RDFSparqlBackend, SparqlAuth +from kgcore.model.rdf.rdf_base import RDFBaseModel + +from kgpipe.common.graph.definitions import ( + KGPIPE_NS, + ImplementationEntity, ImplementationEntityId, + TaskEntity, TaskEntityId, + ToolEntity, ToolEntityId, + DataEntity, DataEntityId, + DataSpecEntity, DataSpecEntityId, + DataTypeEntity, DataTypeEntityId, + MetricEntity, MetricEntityId, + MetricRunEntity, MetricRunEntityId, + TaskRunEntity, TaskRunEntityId, + ParameterEntity, ParameterEntityId, + ParameterBindingEntity, ParameterBindingEntityId, + ConfigSpecEntity, 
ConfigSpecEntityId, + ConfigBindingEntity, ConfigBindingEntityId, +) +from kgpipe.common.config import load_config +from kgpipe.common.util import encode_string + +if TYPE_CHECKING: + from kgpipe.common.models import KgTask, KgTaskReport + +config = load_config() +scheme, rest = config.SYS_KG_URL.split("://") + +backend = RDFLibBackend() +model = RDFBaseModel() + +try: + if scheme == "sparql": + print(f"Using SPARQL backend for system graph: http://{rest} with http://github.com/ScaDS/kgpipe/") + backend = RDFSparqlBackend( + endpoint=f"http://{rest}", + update_endpoint=f"http://{rest}", + default_graph="http://github.com/ScaDS/kgpipe/", + auth=SparqlAuth(username=config.SYS_KG_USR, password=config.SYS_KG_PSW)) + else: + raise ValueError(f"Unsupported scheme: {scheme}") +except Exception as e: + print(f"Error creating system graph: {e}") + print("Using RDFLib memory backend for system graph") + +SYS_KG: KnowledgeGraph = KnowledgeGraph(model=model, backend=backend) + +class PipeKG: + """ + PipeKG is the system graph for the KGpipe framework. + It is an Object Graph Mapper (OGM) for the KGpipe framework. + It is used to store the entities and relations of the KGpipe framework. 
+ """ + + ### Core Layer Entities ### + + @staticmethod + @functools.lru_cache + def add_task(task: TaskEntity) -> TaskEntityId: + entity_id = config.PIPEKG_PREFIX + encode_string(task.name) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.Task], + properties={ + KGPIPE_NS.name: task.name, + KGPIPE_NS.description: task.description + }, + ) + if task.partOfTask: + SYS_KG.create_relation(type=KGPIPE_NS.partOfTask, source=entity_id, target=task.partOfTask) + return TaskEntityId(entity_id) + + @staticmethod + @functools.lru_cache + def add_tool(tool: ToolEntity): + entity_id = config.PIPEKG_PREFIX + encode_string(tool.name) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.Tool], + properties={ + KGPIPE_NS.name: tool.name, + KGPIPE_NS.homepage: tool.homepage, + }, + ) + for supports_task in tool.supportsTasks: + SYS_KG.create_relation(type=KGPIPE_NS.supportsTask, source=entity_id, target=supports_task) + return ToolEntityId(entity_id) + + @staticmethod + def add_implementation(implementation: ImplementationEntity): + entity_id = config.PIPEKG_PREFIX + encode_string(implementation.name) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.Implementation], + properties={ + KGPIPE_NS.name: implementation.name, + KGPIPE_NS.version: implementation.version, + }, + ) + for input_spec in implementation.input_spec: + SYS_KG.create_relation(type=KGPIPE_NS.input, source=entity_id, target=input_spec) + for output_spec in implementation.output_spec: + SYS_KG.create_relation(type=KGPIPE_NS.output, source=entity_id, target=output_spec) + for realizes_task in implementation.realizesTask: + SYS_KG.create_relation(type=KGPIPE_NS.realisesTask, source=entity_id, target=realizes_task) + if implementation.config_spec: + SYS_KG.create_relation(type=KGPIPE_NS.config_spec, source=entity_id, target=implementation.config_spec) + return ImplementationEntityId(entity_id) + + @staticmethod + def find_implementation( + name: Optional[str] = None, + # version: Optional[str] = 
None, + # input_spec: Optional[List[str]] = None, + # output_spec: Optional[List[str]] = None, + # realizes_task: Optional[List[str]] = None, + # has_parameter: Optional[List[str]] = None, + ) -> List[ImplementationEntity]: + entities: List[KGEntity] = SYS_KG.find_entities( + types=[str(KGPIPE_NS.Implementation)], + ) + implementations = [ImplementationEntity( + uri=entity.id, + name=entity.get_property_value(str(KGPIPE_NS.name))[0], + version=entity.get_property_value(str(KGPIPE_NS.version))[0], + input_spec=[DataSpecEntityId(neighbor.id) for neighbor in SYS_KG.get_neighbors(entity.id, str(KGPIPE_NS.input))], + output_spec=[DataSpecEntityId(neighbor.id) for neighbor in SYS_KG.get_neighbors(entity.id, str(KGPIPE_NS.output))], + realizesTask=[TaskEntityId(neighbor.id) for neighbor in SYS_KG.get_neighbors(entity.id, str(KGPIPE_NS.realisesTask))], + # hasParameter=[ParameterEntityId(neighbor.id) for neighbor in entity.get_neighbors(KGPIPE_NS.hasParameter)], + usesTool=[ToolEntityId(neighbor.id) for neighbor in SYS_KG.get_neighbors(entity.id, str(KGPIPE_NS.usesTool))], + # config_spec=ConfigSpecEntityId(entity.get_property(KGPIPE_NS.config_spec)) if entity.get_property(KGPIPE_NS.config_spec) else None, + ) for entity in entities] + if name is not None: + implementations = [impl for impl in implementations if impl.name == name] + return implementations + + ### Data Layer Entities ### + + @staticmethod + @functools.lru_cache + def add_data_spec(data_spec: DataSpecEntity): + data_spec_entity = SYS_KG.create_entity( + id=data_spec.uri if data_spec.uri else new_id(), + types=[config.ONTOLOGY_PREFIX + "DataSpec"], + properties={ + config.ONTOLOGY_PREFIX + "name": data_spec.name, + }, + ) + SYS_KG.create_relation(type=KGPIPE_NS.data_type, source=data_spec_entity.id, target=data_spec.data_type) + return DataSpecEntityId(data_spec_entity.id) + + @staticmethod + @functools.lru_cache + def add_data_entity(data_entity: DataEntity): + entity_id = config.PIPEKG_PREFIX + new_id() + 
data_entity_entity = SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.DataEntity], + properties={}, # TODO + # properties={ + # KGPIPE_NS.timestamp: data_entity.timestamp, + # KGPIPE_NS.version: data_entity.version, + # KGPIPE_NS.hash: data_entity.hash, + # KGPIPE_NS.size: data_entity.size, + # }, + ) + SYS_KG.create_relation(type=KGPIPE_NS.location, source=data_entity_entity.id, target=data_entity.location) + SYS_KG.create_relation(type=KGPIPE_NS.data_type, source=data_entity_entity.id, target=data_entity.data_type) + return DataEntityId(data_entity_entity.id) + + @staticmethod + @functools.lru_cache + def add_data_type(data_type: DataTypeEntity) -> DataTypeEntityId: + entity_id = config.PIPEKG_PREFIX + encode_string(data_type.format+"-"+data_type.data_schema) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.DataType], + properties={ + KGPIPE_NS.format: data_type.format, + KGPIPE_NS.schema: data_type.data_schema, + }, + ) + return DataTypeEntityId(entity_id) + + ### Pipeline Layer Entities ### + + ### Evaluation Layer Entities ### + + def add_metric(metric: MetricEntity): + pass + + ### Run Layer Entities ### + + def add_task_run(task_run: TaskRunEntity): + entity_id = config.PIPEKG_PREFIX + new_id() + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.TaskRun], + properties={ + KGPIPE_NS.status: task_run.status, + KGPIPE_NS.started_at: task_run.started_at, + KGPIPE_NS.ended_at: task_run.ended_at, + }, + ) + for input in task_run.input: + SYS_KG.create_relation(type=KGPIPE_NS.input, source=entity_id, target=input) + for output in task_run.output: + SYS_KG.create_relation(type=KGPIPE_NS.output, source=entity_id, target=output) + SYS_KG.create_relation(type=KGPIPE_NS.usesImplementation, source=entity_id, target=task_run.usesImplementation) + return TaskRunEntityId(entity_id) + + def add_metric_run(metric_run: MetricRunEntity): + pass + + ### Configuration Layer Entities ### + + @staticmethod + @functools.lru_cache + def 
add_parameter(parameter: ParameterEntity): + + payload = json.dumps(parameter.model_dump(mode="json"), sort_keys=True, separators=(",", ":")) + stable_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16] # short suffix + entity_id = config.PIPEKG_PREFIX + encode_string(parameter.key) + "_" + stable_hash + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.Parameter], + properties={ + KGPIPE_NS.key: parameter.key, + KGPIPE_NS.alias_keys: parameter.alias_keys, + KGPIPE_NS.datatype: parameter.datatype, + KGPIPE_NS.required: parameter.required, + KGPIPE_NS.default_value: parameter.default_value, + KGPIPE_NS.allowed_values: parameter.allowed_values, + # KGPIPE_NS.minimum: parameter.minimum, + # KGPIPE_NS.maximum: parameter.maximum, + # KGPIPE_NS.unit: parameter.unit, + }, + ) + return ParameterEntityId(entity_id) + + def find_parameter(name: str): + pass + + @staticmethod + def add_parameter_binding(parameter_binding: ParameterBindingEntity): + payload = json.dumps(parameter_binding.model_dump(mode="json"), sort_keys=True, separators=(",", ":")) + stable_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16] # short suffix + entity_id = parameter_binding.parameter + "_" + stable_hash + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.ParameterBinding], + properties={ + KGPIPE_NS.value: parameter_binding.value, + }, + ) + SYS_KG.create_relation(type=KGPIPE_NS.parameter, source=entity_id, target=parameter_binding.parameter) + return ParameterBindingEntityId(entity_id) + + def find_parameter_binding(name: str): + pass + + @staticmethod + @functools.lru_cache + def add_config_spec(config_spec: ConfigSpecEntity): + entity_id = config.PIPEKG_PREFIX + encode_string(config_spec.name) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.ConfigSpec], + properties={ + KGPIPE_NS.name: config_spec.name, + }, + ) + for parameter in config_spec.parameters: + SYS_KG.create_relation(type=KGPIPE_NS.hasParameter, source=entity_id, target=parameter) 
+ return ConfigSpecEntityId(entity_id) + + + def find_config_spec(name: str): + pass + + @staticmethod + def add_config_binding(config_binding: ConfigBindingEntity): + entity_id = config.PIPEKG_PREFIX + encode_string(config_binding.name) + SYS_KG.create_entity( + id=entity_id, + types=[KGPIPE_NS.ConfigBinding], + properties={ + KGPIPE_NS.name: config_binding.name, + }, + ) + for binding in config_binding.binding: + SYS_KG.create_relation(type=KGPIPE_NS.hasParameterBinding, source=entity_id, target=binding) + return ConfigBindingEntityId(entity_id) + + def find_config_binding(name: str): + pass + + ### Utility Functions ### + + @staticmethod + def sparql_construct(query: str): + backend : RDFSparqlBackend = SYS_KG.backend + result = backend.query_sparql(query) + return result + + @staticmethod + def _prop_value(properties: List[KGProperty], *keys: str) -> Any: + """Find a property value by exact key or key suffix.""" + for prop in properties: + if prop.key in keys: + return prop.value + for prop in properties: + for key in keys: + if prop.key.endswith(key): + return prop.value + return None + + @staticmethod + def _to_list(value: Any) -> List[str]: + """Normalize KG property values to list[str].""" + if value is None: + return [] + if isinstance(value, list): + return [str(v) for v in value] + if isinstance(value, tuple): + return [str(v) for v in value] + if isinstance(value, str): + text = value.strip() + if not text: + return [] + # Stored literals may contain Python-list string repr. 
+ if text.startswith("[") and text.endswith("]"): + try: + parsed = ast.literal_eval(text) + except (ValueError, SyntaxError): + return [text] + if isinstance(parsed, list): + return [str(v) for v in parsed] + return [text] + return [str(value)] + + diff --git a/src/kgpipe/common/model/__init__.py b/src/kgpipe/common/model/__init__.py index 4a7f388..8335da4 100644 --- a/src/kgpipe/common/model/__init__.py +++ b/src/kgpipe/common/model/__init__.py @@ -1,2 +1,10 @@ from .pipeline import KgPipe, KgPipePlan, KgPipePlanStep -from .task import TaskInput, TaskOutput \ No newline at end of file +from .task import TaskInput, TaskOutput, KgTask, KgTaskRun +from .evaluation import Metric, EvaluationReport +from .kg import KG +from .data import Data, DataFormat, DataSet, KgData +from .default_catalog import BasicDataFormats, CustomDataFormats, BasicTaskCategoryCatalog + +__all__ = [ + "KgPipe", "KgPipePlan", "KgPipePlanStep", "KgTask", "KgTaskRun", "Metric", "EvaluationReport", "KG", "TaskInput", "TaskOutput", "Data", "DataSet", "BasicDataFormats", "CustomDataFormats", "BasicTaskCategoryCatalog", "KgData" +] \ No newline at end of file diff --git a/src/kgpipe/common/model/configuration.py b/src/kgpipe/common/model/configuration.py index 6dcaace..0c52bec 100644 --- a/src/kgpipe/common/model/configuration.py +++ b/src/kgpipe/common/model/configuration.py @@ -19,7 +19,6 @@ class ParameterType(Enum): object = "object" -@kg_class() class Parameter(BaseModel): """ Configuration parameter definition, not the actual value of the parameter in the pipeline execution @@ -34,19 +33,18 @@ class Parameter(BaseModel): # +allowed_values: any[*]? # +min/max/unit: number?/number?/string?
name: str - native_keys: List[str] datatype: ParameterType - default_value: str | int | float | bool - required: bool + default_value: str | int | float | bool = field(default_factory=lambda: None) + required: bool = False + native_keys: List[str] = field(default_factory=list) # scope: Scope # (training/inference/io/resources) # constraints - allowed_values: List[str | int | float | bool] + allowed_values: List[str | int | float | bool] = field(default_factory=list) minimum: Optional[float] = None maximum: Optional[float] = None unit: Optional[str] = None -@kg_class() class ParameterBinding(BaseModel): """ Binding of a configuration parameter to a value in the pipeline execution @@ -54,25 +52,50 @@ class ParameterBinding(BaseModel): parameter: Parameter value: str | int | float | bool # TODO extend to more types? -@kg_class() + class ConfigurationDefinition(BaseModel): """ - Possible configurations of a task + Possible configurations specification of a task """ name: str description: Optional[str] = None parameters: List[Parameter] = field(default_factory=list) - -@kg_class() + + class ConfigurationProfile(BaseModel): """ - Configuration profile definition, not the actual values of the parameters in the pipeline execution + Configuration profile specification, the actual values of the parameters in the pipeline execution """ name: str definition: ConfigurationDefinition description: Optional[str] = None bindings: List[ParameterBinding] = field(default_factory=list) + def get_parameter(self, name: str) -> Parameter: + for parameter in self.definition.parameters: + if parameter.name == name: + return parameter + raise ValueError(f"Parameter {name} not found in configuration profile {self.name}") + + def get_parameter_binding(self, name: str) -> ParameterBinding: + for binding in self.bindings: + if binding.parameter.name == name: + return binding + raise ValueError(f"Parameter binding {name} not found in configuration profile {self.name}") + + def 
get_parameter_value(self, name: str) -> str | int | float | bool: + return self.get_parameter_binding(name).value + +class ConfigurationBuilder(): + def __init__(self, config_spec: ConfigurationDefinition): + self.config_spec = config_spec + self.config_profile = ConfigurationProfile(name=config_spec.name, definition=config_spec) + + def add_parameter(self, name: str, value: str | int | float | bool) -> None: + self.config_profile.bindings.append(ParameterBinding(parameter=self.config_profile.get_parameter(name), value=value)) + + + class ConfigurationMapping(BaseModel): """ Mapping of a configuration profile to a task implementation diff --git a/src/kgpipe/common/model/data.py b/src/kgpipe/common/model/data.py index 0b66c0b..0e6eb49 100644 --- a/src/kgpipe/common/model/data.py +++ b/src/kgpipe/common/model/data.py @@ -1,228 +1,23 @@ from __future__ import annotations -import os -import time import uuid -from abc import ABC, abstractmethod from dataclasses import dataclass, field -from datetime import datetime from enum import Enum from pathlib import Path -from typing import Any, Callable, Dict, List, Mapping, Optional, Set, Tuple, Union, Type -import json -from uuid import uuid4 -import logging -import shutil -from rdflib import Graph +from typing import Any, Dict, Optional, Union from pydantic import BaseModel, field_validator -from pydantic_core import core_schema +from .default_catalog import BasicDataFormats, CustomDataFormats -# Format descriptions for built-in formats -FORMAT_DESCRIPTIONS = { - "ttl": "Turtle RDF format", - "nquads": "N-Quads RDF format", - "json": "JSON format", - "csv": "CSV format", - "parquet": "Parquet format", - "xml": "XML format", - "rdf": "RDF format", - "jsonld": "JSON-LD format", - "txt": "Text format", - "paris_csv": "Paris CSV format", - "openrefine_json": "OpenRefine JSON format", - "limes_xml": "LIMES XML format", - "spotlight_json": "DBpedia Spotlight JSON format", - "falcon_json": "FALCON JSON format", - "ie_json": "Information Extraction
JSON format", - "valentine_json": "Valentine JSON format", - "corenlp_json": "CoreNLP JSON format", - "openie_json": "OpenIE JSON format", - "agreementmaker_rdf": "AgreementMaker RDF format", - "em_json": "Entity Matching JSON format", -} +# Backward-compatible alias used across the codebase. +DataFormat = BasicDataFormats -class DataFormat(Enum): - """Built-in data formats with enum benefits.""" - # Standard formats - RDF_TTL = "ttl" - RDF_NQUADS = "nq" - RDF_NTRIPLES = "nt" - JSON = "json" - CSV = "csv" - PARQUET = "parquet" - RDF_XML = "xml" - RDF = "rdf" - RDF_JSONLD = "jsonld" - TEXT = "txt" - XML = "xml" - ANY = "any" - - # Tool-specific formats - PARIS_CSV = "paris.csv" - OPENREFINE_JSON = "openrefine.json" - LIMES_XML = "limes.xml" - SPOTLIGHT_JSON = "spotlight.json" - FALCON_JSON = "falcon.json" - VALENTINE_JSON = "valentine.json" - CORENLP_JSON = "corenlp.json" - OPENIE_JSON = "openie.json" - AGREEMENTMAKER_RDF = "agreementmaker.rdf" - - # Exchange formats - ER_JSON = "er.json" # Entity Resolution JSON format - TE_JSON = "te.json" # Text Extraction JSON format - - # LLM Tasks - JSON_ONTO_MAPPING_JSON = "json_onto_mapping.json" - - @classmethod - def from_extension(cls, extension: str) -> DataFormat: - """Get a format by file extension. 
If fails print available formats and raise ValueError.""" - try: - return cls(extension) - except ValueError: - print(f"Available formats: {[f.value for f in cls]}") - raise ValueError(f"Invalid format: {extension}") - - - @property - def extension(self) -> str: - """Get the file extension for this format.""" - return self.value - - @property - def description(self) -> str: - """Get the description for this format.""" - return FORMAT_DESCRIPTIONS.get(self.value, self.value) - - @property - def is_tool_specific(self) -> bool: - """Check if this is a tool-specific format.""" - tool_specific_formats = { - "paris_csv", "openrefine_json", "limes_xml", "spotlight_json", - "falcon_json", "ie_json", "valentine_json", "corenlp_json", - "openie_json", "agreementmaker_rdf", "em_json" - } - return self.value in tool_specific_formats - - def __str__(self) -> str: - return f".{self.value}" - - def __repr__(self) -> str: - return f".{self.value}" - - -class DynamicFormat: - """Dynamic format for submodules to register custom formats.""" - - def __init__(self, name: str, extension: str, description: str, is_tool_specific: bool = False): - self.name = name - self.extension = extension - self.description = description - self.is_tool_specific = is_tool_specific - - @classmethod - def __get_pydantic_core_schema__(cls, source_type: Any, handler) -> Any: - """Provide Pydantic schema for this type.""" - return core_schema.union_schema([ - core_schema.is_instance_schema(cls), - core_schema.str_schema() - ]) - - @property - def value(self) -> str: - """Get the format value (same as name for compatibility).""" - return self.name - - def __eq__(self, other) -> bool: - """Compare formats by name.""" - if isinstance(other, DynamicFormat): - return self.name == other.name - elif isinstance(other, DataFormat): - return self.name == other.value - elif isinstance(other, str): - return self.name == other - return False - - def __hash__(self) -> int: - """Hash based on name.""" - return 
hash(self.name) - - def __str__(self) -> str: - return f"DynamicFormat({self.name})" - - def __repr__(self) -> str: - return f"DynamicFormat(name='{self.name}', extension='{self.extension}', description='{self.description}', is_tool_specific={self.is_tool_specific})" - - -class FormatRegistry: - """Registry for managing and discovering data formats.""" - - _dynamic_formats: Dict[str, DynamicFormat] = {} - - @classmethod - def register_format(cls, name: str, extension: str, description: str, is_tool_specific: bool = False) -> DynamicFormat: - """Register a new dynamic data format.""" - if name in cls._dynamic_formats: - return cls._dynamic_formats[name] - - format_obj = DynamicFormat(name, extension, description, is_tool_specific) - cls._dynamic_formats[name] = format_obj - return format_obj - - @classmethod - def get_format(cls, name: str) -> Optional[Union[DataFormat, DynamicFormat]]: - """Get a format by name, checking built-in formats first.""" - # Try built-in formats first - try: - return DataFormat(name) - except ValueError: - # Then check dynamic formats - return cls._dynamic_formats.get(name) - - @classmethod - def list_formats(cls, tool_specific_only: bool = False) -> List[Union[DataFormat, DynamicFormat]]: - """List all registered formats.""" - formats = list(DataFormat) + list(cls._dynamic_formats.values()) - if tool_specific_only: - formats = [f for f in formats if getattr(f, 'is_tool_specific', False)] - return formats - - @classmethod - def list_standard_formats(cls) -> List[Union[DataFormat, DynamicFormat]]: - """List all standard (non-tool-specific) formats.""" - formats = list(DataFormat) + list(cls._dynamic_formats.values()) - return [f for f in formats if not getattr(f, 'is_tool_specific', False)] - - @classmethod - def list_tool_specific_formats(cls) -> List[Union[DataFormat, DynamicFormat]]: - """List all tool-specific formats.""" - formats = list(DataFormat) + list(cls._dynamic_formats.values()) - return [f for f in formats if getattr(f, 
'is_tool_specific', False)] - - @classmethod - def list_rdf_formats(cls) -> List[Union[DataFormat, DynamicFormat]]: - """List all RDF formats.""" - rdf_formats = [DataFormat.RDF_TTL, DataFormat.RDF_NQUADS, DataFormat.RDF, DataFormat.RDF_JSONLD] - dynamic_rdf = [f for f in cls._dynamic_formats.values() if 'rdf' in f.name.lower() or 'ttl' in f.name.lower()] - return rdf_formats + dynamic_rdf - - @classmethod - def list_text_formats(cls) -> List[Union[DataFormat, DynamicFormat]]: - """List all text formats.""" - text_formats = [DataFormat.JSON, DataFormat.CSV, DataFormat.XML, DataFormat.TEXT] - dynamic_text = [f for f in cls._dynamic_formats.values() if f.name.lower() in ['json', 'csv', 'xml', 'txt', 'yaml']] - return text_formats + dynamic_text - - @classmethod - def clear_dynamic_formats(cls) -> None: - """Clear all dynamically registered formats (useful for testing).""" - cls._dynamic_formats.clear() +# Type alias for any format +Format = Union[DataFormat, CustomDataFormats] -# Type alias for any format -Format = Union[DataFormat, DynamicFormat] +def _format_value(fmt: Format) -> str: + return str(fmt.value) class Data(BaseModel): """Represents a data file with a specific format.""" @@ -246,16 +41,16 @@ def __init__(self, *args, **data): @classmethod def validate_format(cls, v): """Convert string format to proper Format object.""" + if isinstance(v, (DataFormat, CustomDataFormats)): + return v + if isinstance(v, Enum) and isinstance(v.value, str): + # Allow user-defined enum values for strong typing/autocomplete. 
+ return v if isinstance(v, str): # Try to convert string to DataFormat enum try: return DataFormat(v) except ValueError: - # If it's not a DataFormat, it might be a DynamicFormat - from .models import FormatRegistry - dynamic_format = FormatRegistry.get_format(v) - if dynamic_format: - return dynamic_format raise ValueError(f"Unknown format: {v}") return v @@ -266,23 +61,19 @@ def exists(self) -> bool: def to_dict(self) -> Dict[str, str]: return { "path": str(self.path), - "format": self.format.value + "format": _format_value(self.format) } def __str__(self) -> str: - return f"Data({self.path}, {self.format.value if isinstance(self.format, DynamicFormat) else self.format})" + return f"Data({self.path}, {_format_value(self.format)})" def __eq__(self, other): """Custom equality to handle format comparison.""" if not isinstance(other, Data): return False - return (self.path == other.path and - (hasattr(self.format, 'value') and hasattr(other.format, 'value') and - self.format.value == other.format.value)) - - - + return self.path == other.path and _format_value(self.format) == _format_value(other.format) +KgData = Data @dataclass class DataSet: @@ -307,4 +98,4 @@ def exists(self) -> bool: return self.path.exists() def __str__(self) -> str: - return f"DataSet({self.name}, {self.path}, {self.format.value})" + return f"DataSet({self.name}, {self.path}, {_format_value(self.format)})" diff --git a/src/kgpipe/common/model/default_catalog.py b/src/kgpipe/common/model/default_catalog.py index 0fde757..24fd368 100644 --- a/src/kgpipe/common/model/default_catalog.py +++ b/src/kgpipe/common/model/default_catalog.py @@ -1,13 +1,203 @@ +from __future__ import annotations +from dataclasses import dataclass +from enum import Enum +from typing import Dict, List, Optional -# TODO impl later for typed api -class TaskCategory():pass -class EntityResolution(TaskCategory): pass -class EntityMatching(EntityResolution): pass -class Fusion(EntityResolution): pass -class 
InformationExtraction(TaskCategory): pass -class EntityLinking(InformationExtraction): pass -class RelationExtraction(InformationExtraction): pass -class RelationLinking(InformationExtraction): pass -class DataMapping(TaskCategory): pass \ No newline at end of file +@dataclass(frozen=True) +class TaskCategory: + name: str + parent: Optional[TaskCategory] = None + description: str = "" + + + + + +class BasicTaskCategoryCatalog: + """ + Hierarchical catalog for task categories. + Supports default categories and custom category registration. + """ + entity_resolution = TaskCategory(name="EntityResolution") + entity_matching = TaskCategory(name="EntityMatching", parent=entity_resolution) + fusion = TaskCategory(name="Fusion", parent=entity_resolution) + information_extraction = TaskCategory(name="InformationExtraction") + entity_linking = TaskCategory(name="EntityLinking", parent=information_extraction) + relation_extraction = TaskCategory(name="RelationExtraction", parent=information_extraction) + relation_linking = TaskCategory(name="RelationLinking", parent=information_extraction) + data_mapping = TaskCategory(name="DataMapping") + blocking = TaskCategory(name="Blocking", parent=entity_resolution) + clustering = TaskCategory(name="Clustering", parent=entity_resolution) + + + # @dataclass(frozen=True) + # class TaskCategoryNode: + # name: str + # parent: Optional[str] = None + # description: str = "" + + # _nodes: Dict[str, TaskCategoryNode] = { + # "TaskCategory": TaskCategoryNode(name="TaskCategory", parent=None, description="Root category"), + # "EntityResolution": TaskCategoryNode(name="EntityResolution", parent="TaskCategory"), + # "Blocking": TaskCategoryNode(name="Blocking", parent="EntityResolution"), + # "EntityMatching": TaskCategoryNode(name="EntityMatching", parent="EntityResolution"), + # "Matching": TaskCategoryNode(name="Matching", parent="EntityResolution"), + # "Clustering": TaskCategoryNode(name="Clustering", parent="EntityResolution"), + # 
"Fusion": TaskCategoryNode(name="Fusion", parent="EntityResolution"), + # "InformationExtraction": TaskCategoryNode(name="InformationExtraction", parent="TaskCategory"), + # "EntityLinking": TaskCategoryNode(name="EntityLinking", parent="InformationExtraction"), + # "RelationExtraction": TaskCategoryNode(name="RelationExtraction", parent="InformationExtraction"), + # "RelationLinking": TaskCategoryNode(name="RelationLinking", parent="InformationExtraction"), + # "DataMapping": TaskCategoryNode(name="DataMapping", parent="TaskCategory"), + # } + + # @classmethod + # def has(cls, category: str) -> bool: + # return category in cls._nodes + + # @classmethod + # def register(cls, name: str, parent: str = "TaskCategory", description: str = "") -> None: + # if parent is not None and parent not in cls._nodes: + # raise ValueError(f"Unknown parent category: {parent}") + # cls._nodes[name] = TaskCategoryNode(name=name, parent=parent, description=description) + + # @classmethod + # def get_parent(cls, category: str) -> Optional[str]: + # node = cls._nodes.get(category) + # if node is None: + # raise ValueError(f"Unknown category: {category}") + # return node.parent + + # @classmethod + # def get_children(cls, category: str) -> List[str]: + # if category not in cls._nodes: + # raise ValueError(f"Unknown category: {category}") + # return sorted([node.name for node in cls._nodes.values() if node.parent == category]) + + # @classmethod + # def get_ancestors(cls, category: str) -> List[str]: + # if category not in cls._nodes: + # raise ValueError(f"Unknown category: {category}") + # ancestors: List[str] = [] + # cursor = cls._nodes[category].parent + # while cursor is not None: + # ancestors.append(cursor) + # cursor = cls._nodes[cursor].parent + # return ancestors + + # @classmethod + # def get_descendants(cls, category: str) -> List[str]: + # if category not in cls._nodes: + # raise ValueError(f"Unknown category: {category}") + # descendants: List[str] = [] + # queue = 
cls.get_children(category) + # while queue: + # current = queue.pop(0) + # descendants.append(current) + # queue.extend(cls.get_children(current)) + # return descendants + + # @classmethod + # def is_subtask_of(cls, category: str, parent: str) -> bool: + # if category not in cls._nodes or parent not in cls._nodes: + # return False + # return parent in cls.get_ancestors(category) + + # @classmethod + # def list_categories(cls) -> List[str]: + # return sorted(cls._nodes.keys()) + + +class BasicDataFormats(str, Enum): + """Framework-provided data formats with IDE autocomplete.""" + + # Standard formats + RDF_TTL = "ttl" + RDF_NQUADS = "nq" + RDF_NTRIPLES = "nt" + JSON = "json" + CSV = "csv" + PARQUET = "parquet" + RDF_XML = "xml" + RDF = "rdf" + RDF_JSONLD = "jsonld" + TEXT = "txt" + XML = "xml" + ANY = "any" + + # Tool-specific formats + PARIS_CSV = "paris.csv" + OPENREFINE_JSON = "openrefine.json" + LIMES_XML = "limes.xml" + SPOTLIGHT_JSON = "spotlight.json" + FALCON_JSON = "falcon.json" + VALENTINE_JSON = "valentine.json" + CORENLP_JSON = "corenlp.json" + OPENIE_JSON = "openie.json" + AGREEMENTMAKER_RDF = "agreementmaker.rdf" + + # Exchange formats + ER_JSON = "er.json" + TE_JSON = "te.json" + + # LLM task outputs + JSON_ONTO_MAPPING_JSON = "json_onto_mapping.json" + + @property + def extension(self) -> str: + return self.value + + @property + def description(self) -> str: + return BASIC_FORMAT_DESCRIPTIONS.get(self.value, self.value) + + @property + def is_tool_specific(self) -> bool: + return "." in self.value and self.value not in {"jsonld"} + + @classmethod + def from_extension(cls, extension: str) -> "BasicDataFormats": + try: + return cls(extension) + except ValueError as exc: + available = [f.value for f in cls] + raise ValueError(f"Invalid format: {extension}. Available formats: {available}") from exc + + +class CustomDataFormats(str, Enum): + """ + Base enum for user-defined formats. + Define project-specific formats by subclassing this enum. 
+ """ + + @property + def extension(self) -> str: + return self.value + + +BASIC_FORMAT_DESCRIPTIONS: dict[str, str] = { + "ttl": "Turtle RDF format", + "nq": "N-Quads RDF format", + "json": "JSON format", + "csv": "CSV format", + "parquet": "Parquet format", + "xml": "XML format", + "rdf": "RDF format", + "jsonld": "JSON-LD format", + "txt": "Text format", + "paris.csv": "Paris CSV format", + "openrefine.json": "OpenRefine JSON format", + "limes.xml": "LIMES XML format", + "spotlight.json": "DBpedia Spotlight JSON format", + "falcon.json": "FALCON JSON format", + "valentine.json": "Valentine JSON format", + "corenlp.json": "CoreNLP JSON format", + "openie.json": "OpenIE JSON format", + "agreementmaker.rdf": "AgreementMaker RDF format", + "er.json": "Entity Resolution JSON format", + "te.json": "Text Extraction JSON format", + "json_onto_mapping.json": "JSON ontology mapping format", + "any": "Any format", +} \ No newline at end of file diff --git a/src/kgpipe/common/model/evaluation.py b/src/kgpipe/common/model/evaluation.py index baefa87..be5c6d7 100644 --- a/src/kgpipe/common/model/evaluation.py +++ b/src/kgpipe/common/model/evaluation.py @@ -1,28 +1,19 @@ from __future__ import annotations -import os -import time -import uuid from abc import ABC, abstractmethod from dataclasses import dataclass, field from datetime import datetime -from enum import Enum -from pathlib import Path -from typing import Any, Callable, Dict, List, Mapping, Optional, Set, Tuple, Union, Type -import json +from typing import Any, Dict from uuid import uuid4 -import logging -import shutil -from rdflib import Graph -from pydantic import BaseModel, field_validator -from pydantic_core import core_schema from kgpipe.common.model.kg import KG +# TODO move parts from kgpipe.evaluation.base to here + class Metric(ABC): """Abstract base class for evaluation metrics.""" - def __init__(self, name: str, description: Optional[str] = None): + def __init__(self, name: str, description: str | None = 
None): self.name = name self.description = description or name @@ -47,7 +38,7 @@ class EvaluationReport: def __post_init__(self): if not self.id: - self.id = str(uuid.uuid4()) + self.id = str(uuid4().hex) def add_metric(self, name: str, value: float) -> None: """Add a metric result to the report.""" diff --git a/src/kgpipe/common/model/kg.py b/src/kgpipe/common/model/kg.py index 92d5ab6..42bdc18 100644 --- a/src/kgpipe/common/model/kg.py +++ b/src/kgpipe/common/model/kg.py @@ -1,27 +1,14 @@ from __future__ import annotations -import os -import time import uuid -from abc import ABC, abstractmethod from dataclasses import dataclass, field -from datetime import datetime -from enum import Enum from pathlib import Path -from typing import Any, Callable, Dict, List, Mapping, Optional, Set, Tuple, Union, Type -import json -from uuid import uuid4 -import logging -import shutil -from rdflib import Graph -from pydantic import BaseModel, field_validator -from pydantic_core import core_schema +from typing import Any, Dict, List, Optional +from rdflib import Graph, SKOS, RDF from .data import Format from .pipeline import KgPipePlan -from rdflib import SKOS - # TODO check if this is still needed or if we can use the KG from kgcore and only use Data and DataSet @dataclass @@ -76,4 +63,18 @@ def exists(self) -> bool: return self.path.exists() def __str__(self) -> str: - return f"KG({self.name}, {self.path}, {self.format.value})" \ No newline at end of file + return f"KG({self.name}, {self.path}, {self.format.value})" + + +# TODO wip class for central KgPipe KG entity + +@dataclass +class KgKg: + """Represents a KG for the KgPipe framework.""" + graph_data: KgData + ontology_data: KgData + # provenance: str + # @staticmethod + # def load_from_plan(plan: KgPipePlan) -> KG: + # pass + # pass \ No newline at end of file diff --git a/src/kgpipe/common/model/pipeline.py b/src/kgpipe/common/model/pipeline.py index 5ee69f8..3755519 100644 --- a/src/kgpipe/common/model/pipeline.py +++ 
b/src/kgpipe/common/model/pipeline.py @@ -1,6 +1,7 @@ import os import time import uuid +import hashlib from abc import ABC, abstractmethod from dataclasses import dataclass, field from datetime import datetime @@ -18,9 +19,10 @@ from .data import Data, DataFormat, DataSet, Format from .task import KgTask, KgTaskReport +from .configuration import ConfigurationProfile # from .kg import KG from kgpipe.common.annotations import kg_class -from kgpipe.common.systemgraph import PipeKG +from kgpipe.common.graph.systemgraph import PipeKG class KgPipePlanStep(BaseModel): @@ -29,19 +31,25 @@ class KgPipePlanStep(BaseModel): input: List[Data] output: List[Data] -kg_class() +# kg_class() class KgPipePlan(BaseModel): """A KG pipeline plan.""" steps: List[KgPipePlanStep] seed: Optional[Data] = None source: Optional[Data] = None result: Optional[Data] = None - + + @staticmethod + def from_path(json_file: str) -> 'KgPipePlan': + with open(json_file, "r") as f: + json_data = json.load(f) + return KgPipePlan(**json_data) + # def __str__(self) -> str: # return f"KgTaskReport({self.task_name}, {self.status}, {self.duration:.2f}s)" # TODO rename to KgPipeReport -@kg_class() +# @kg_class() class KgStageReport(BaseModel): """Report of a stage execution.""" stage_name: str @@ -51,6 +59,15 @@ class KgStageReport(BaseModel): status: str error: Optional[str] = None + @staticmethod + def from_path(json_file: str) -> 'KgStageReport': + with open(json_file, "r") as f: + json_data = json.load(f) + return KgStageReport(**json_data) + +KgPipeReport = KgStageReport +KgPipelineRun = KgStageReport + # @dataclass # class Stage: # """Represents a stage in a pipeline, containing one or more tasks.""" @@ -78,7 +95,7 @@ class KgStageReport(BaseModel): # TODO rename to Pipeline -@kg_class() +# @kg_class() @dataclass class KgPipe: """A KG pipeline using a list of tasks.""" @@ -110,19 +127,71 @@ def add_data(self, data: Data) -> None: self.data.append(data) - def build(self, source: Data, result: 
Optional[Data] = None, stable_files: bool = False) -> KgPipePlan: + def build( + self, + source: Data, + result: Optional[Data] = None, + stable_files: bool = False, + configCatalog: Optional[Mapping[str, ConfigurationProfile]] = None, + ) -> KgPipePlan: """Generate the execution plan as a list of dictionaries.""" catalog = [source] + self.data calls: List[KgPipePlanStep] = [] - def gen_file_path(task: KgTask, format_spec: Format, prefix: str = "", suffix: str = ""): - if stable_files: + def _profile_fingerprint(profile: Optional[ConfigurationProfile]) -> str: + if profile is None: + return "" + # Make it stable regardless of binding order. + bindings = [] + for b in getattr(profile, "bindings", []) or []: + param = getattr(b, "parameter", None) + pname = getattr(param, "name", None) + if pname is None: + pname = str(param) + bindings.append((str(pname), b.value)) + bindings.sort(key=lambda kv: kv[0]) + payload = json.dumps( + {"definition": getattr(getattr(profile, "definition", None), "name", None), "bindings": bindings}, + sort_keys=True, + default=str, + ) + return hashlib.sha256(payload.encode("utf-8")).hexdigest() + + def _chain_hash(prev_hash: str, task_name: str, profile: Optional[ConfigurationProfile]) -> str: + fp = _profile_fingerprint(profile) + payload = json.dumps({"prev": prev_hash, "task": task_name, "profile": fp}, sort_keys=True) + return hashlib.sha256(payload.encode("utf-8")).hexdigest() + + prev_hash = "0" * 64 + + def gen_file_path( + *, + task: KgTask, + format_spec: Format, + prefix: str = "", + suffix: str = "", + task_hash: Optional[str] = None, + ) -> Path: + # Backwards-compatible: stable_files without configCatalog keeps the old deterministic names. 
+ if stable_files and configCatalog is None: return Path(self.data_dir) / f"{prefix}{task.name}{suffix}.{format_spec.extension}" - else: - return Path(self.data_dir) / f"{prefix}{task.name}.{uuid4().hex}.{format_spec.extension}" + + # If configCatalog is provided, filenames must be deterministic based on the hash chain. + if configCatalog is not None and task_hash is not None: + short = task_hash[:12] + return Path(self.data_dir) / f"{prefix}{task.name}.{short}{suffix}.{format_spec.extension}" + + # Default behavior: unique filenames. + return Path(self.data_dir) / f"{prefix}{task.name}.{uuid4().hex}.{format_spec.extension}" for idx, task in enumerate(self.tasks): + task_hash: Optional[str] = None + if configCatalog is not None: + profile = configCatalog.get(task.name) + task_hash = _chain_hash(prev_hash, task.name, profile) + prev_hash = task_hash + # Match inputs inputs = [] for input_name, format_spec in task.input_spec.items(): @@ -144,7 +213,13 @@ def gen_file_path(task: KgTask, format_spec: Format, prefix: str = "", suffix: s break else: suffix = f"_{len(outputs)}" - output_path = gen_file_path(task, format_spec, prefix=f"{idx}_", suffix=suffix) + output_path = gen_file_path( + task=task, + format_spec=format_spec, + prefix=f"{idx}_", + suffix=suffix, + task_hash=task_hash, + ) output_data = Data(path=output_path, format=format_spec) outputs.append(output_data) @@ -152,17 +227,19 @@ def gen_file_path(task: KgTask, format_spec: Format, prefix: str = "", suffix: s if len(inputs) != len(task.input_spec): missing_inputs = len(task.input_spec) - len(inputs) + catalog_str = "\n".join([str(i) for i in catalog]) raise ValueError( f"For task {task.name}: expected {task.input_spec} inputs, got {inputs}. " f"Missing {missing_inputs} inputs." 
- f"catalog: {"\n".join([str(i) for i in catalog])}" + f"catalog: {catalog_str}" ) elif len(outputs) != len(task.output_spec): missing_outputs = len(task.output_spec) - len(outputs) + catalog_str = "\n".join([str(i) for i in catalog]) raise ValueError( f"\nFor task {task.name}: expected {task.output_spec} outputs, got {outputs}. " f"\nMissing {missing_outputs} outputs." - f"\nCatalog: {"\n".join([str(i) for i in catalog])}" + f"\nCatalog: {catalog_str}" ) else: print(f"Adding task '{task.name}' to plan with\n\t inputs: {[str(i.path) for i in inputs]} and \n\t outputs: {[str(o.path) for o in outputs]}") @@ -196,7 +273,11 @@ def plot(self) -> None: """Plot the pipeline.""" pass - def run(self, stable_files_override: bool = False) -> List[KgTaskReport]: + def run( + self, + stable_files_override: bool = False, + configCatalog: Optional[Mapping[str, ConfigurationProfile]] = None, + ) -> List[KgTaskReport]: """Execute each task defined in the plan and collect the reports.""" if not self.plan: raise ValueError("Pipeline plan is empty. 
Call build() first.") @@ -217,54 +298,32 @@ def run(self, stable_files_override: bool = False) -> List[KgTaskReport]: if not input_data.exists(): raise FileNotFoundError(f"Input file {input_data.path} does not exist") + configProfile = None + if configCatalog is not None: + configProfile = configCatalog.get(task.name) + if self.previous_was_skipped: - report = task.run(task_spec.input, task_spec.output, stable_files_override=stable_files_override) + report = task.run( + task_spec.input, + task_spec.output, + stable_files_override=stable_files_override, + configProfile=configProfile, + ) else: - report = task.run(task_spec.input, task_spec.output, stable_files_override=True) + report = task.run( + task_spec.input, + task_spec.output, + stable_files_override=True, + configProfile=configProfile, + ) if report.status != "skipped": self.previous_was_skipped = False reports.append(report) - from kgpipe.common.definitions import PipelineRunEntity, TaskRunEntity, ImplementationEntity, TaskEntity, ImplementationEntityId, TaskEntityId - from kgcore.api.kg import KGId - from kgpipe.common.config import config - from kgpipe.common.definitions import DataHandle - - # TODO this is a workaround for now, taskrun should be built from the task itself - def build_pipeline_run_entity(reports: List[KgTaskReport]) -> PipelineRunEntity: - - task_runs: List[TaskRunEntity] = [] - for idx, report in enumerate(reports): - - - # def get_implementation_entity(report: KgTaskReport) -> ImplementationEntityId: - # return PipeKG.find_implementation_by_name(report.task_name).id - - task_runs.append(TaskRunEntity( - number=idx, - name=report.task_name, - status=report.status, - started_at=report.start_ts, - ended_at=report.start_ts + report.duration, - executesTask=TaskEntityId(config.PIPEKG_PREFIX+report.task_name), - usesImplementation=ImplementationEntityId(config.PIPEKG_PREFIX+report.task_name+"Impl"), - input=[DataHandle(uri=str(input_data.path), type=input_data.format) for input_data in 
report.inputs], - output=[DataHandle(uri=str(output_data.path), type=output_data.format) for output_data in report.outputs], - hasParameterBinding=[] - )) - - return PipelineRunEntity( - name=self.name, - status="success", - started_at=time.time(), - ended_at=time.time(), - hasTaskRun=task_runs - ) - - pipeline_run_entity = build_pipeline_run_entity(reports) - PipeKG.add_pipeline_run(pipeline_run_entity) + # pipeline_run_entity = reports_to_pipeline_run_entity(reports, self.name) + # PipeKG.add_pipeline_run(pipeline_run_entity) return reports diff --git a/src/kgpipe/common/model/task.py b/src/kgpipe/common/model/task.py index 7d11cad..6f45559 100644 --- a/src/kgpipe/common/model/task.py +++ b/src/kgpipe/common/model/task.py @@ -4,28 +4,61 @@ # import field from dataclasses import dataclass, field from .data import Data, Format, DataFormat -from pydantic import BaseModel +from pydantic import BaseModel, Field, ConfigDict, model_validator import time import shutil +from uuid import uuid4 +import inspect from kgpipe.common.model.default_catalog import TaskCategory -from .configuration import Parameter, ConfigurationDefinition -from kgpipe.common.annotations import kg_class +from .configuration import ( + Parameter, + ConfigurationDefinition, + ConfigurationProfile, + ParameterType, +) +from kgpipe.common.graph.systemgraph import PipeKG +from kgpipe.common.graph.mapper import task_run_to_entity type TaskName = str type TaskInput = Dict[TaskName, Data] type TaskOutput = Dict[TaskName, Data] -@kg_class() class KgTaskReport(BaseModel): """Report of a task execution.""" + model_config = ConfigDict(arbitrary_types_allowed=True) + + # Backwards-compatible identifier for persisted reports (`exec-report.json`). + # Historically we stored only the task name; newer runtime code may also attach the `KgTask`. 
task_name: str + task: Optional["KgTask"] = Field(default=None, exclude=True) inputs: List[Data] outputs: List[Data] start_ts: float duration: float status: str error: Optional[str] = None + config_profile: Optional[ConfigurationProfile] = None + + @model_validator(mode="before") + @classmethod + def _coerce_task_fields(cls, data): + """ + Accept both legacy reports (with `task_name`) and new runtime reports (with `task`). + """ + if not isinstance(data, dict): + return data + + # If we have a task object but no explicit task_name, derive it. + if "task_name" not in data and "task" in data and data["task"] is not None: + task_obj = data["task"] + name = getattr(task_obj, "name", None) + if name is not None: + data["task_name"] = name + + return data + +KgTaskRun = KgTaskReport class TaskStatus(Enum): """Status of a task in a pipeline.""" @@ -35,13 +68,10 @@ class TaskStatus(Enum): FAILED = "failed" SKIPPED = "skipped" - - # # TODO impl later for typed api # class TaskCatalog(): # pass -@kg_class() @dataclass class KgTask: """Represents a task that can be executed in a pipeline.""" @@ -52,6 +82,8 @@ class KgTask: description: Optional[str] = None category: List[TaskCategory] = field(default_factory=list) config_spec: Optional[ConfigurationDefinition] = None + tools: List[str] = field(default_factory=list) + trace_task_run: bool = False def __post_init__(self): if not self.name: @@ -63,70 +95,32 @@ def __post_init__(self): if not callable(self.function): raise ValueError("Function must be callable") - def run(self, inputs: List[Data], outputs: List[Data], stable_files_override: bool = False, configProfile: Optional[str] = None) -> KgTaskReport: + + # TODO if configProfile is not provided, use the default config profile derived from the config_spec + def run(self, inputs: List[Data], outputs: List[Data], stable_files_override: bool = False, configProfile: Optional[ConfigurationProfile] = None) -> KgTaskReport: """Execute the task with given inputs and outputs.""" 
start = time.time() + report: KgTaskReport try: named_inputs = self._match(inputs, self.input_spec) named_outputs = self._match(outputs, self.output_spec) - - # print(f"Running {self.name} with\n\t inputs: {[str(i.path) for i in named_inputs.values()]}\n\t outputs: {[str(o.path) for o in named_outputs.values()]}") print(f"Running {self.name} with\n\t inputs: {named_inputs}\n\t outputs: {named_outputs}") - - # Validate that all required inputs and outputs are present - if len(named_inputs) != len(self.input_spec): - missing = set(self.input_spec.keys()) - set(named_inputs.keys()) - available = {obj.format.value: obj for obj in inputs} - expected = {k: v.value for k, v in self.input_spec.items()} - raise ValueError( - f"Missing required inputs: {missing}. " - f"Expected: {expected}. " - f"Available: {[f'{obj.path} ({obj.format.value})' for obj in inputs]}" - ) - - if len(named_outputs) != len(self.output_spec): - missing = set(self.output_spec.keys()) - set(named_outputs.keys()) - available = {obj.format.value: obj for obj in outputs} - expected = {k: v.value for k, v in self.output_spec.items()} - raise ValueError( - f"Missing required outputs: {missing}. " - f"Expected: {expected}. 
" - f"Available: {[f'{obj.path} ({obj.format.value})' for obj in outputs]}" - ) - if stable_files_override: - for output in named_outputs.values(): - # delete the file or directory - if output.path.exists(): - if output.path.is_file(): - output.path.unlink() - elif output.path.is_dir(): - shutil.rmtree(output.path) - - # if all outputs exists skip the task - if all(output.path.exists() for output in named_outputs.values()): + self._validate_required_data(named_inputs, self.input_spec, "inputs", inputs) + self._validate_required_data(named_outputs, self.output_spec, "outputs", outputs) + self._prepare_outputs(named_outputs, stable_files_override) + + # TODO needs to check config profile changes, or maybe not + if self._should_skip(named_outputs): print(f"Skipping task {self.name} because all outputs exist") - # exit(1) - # TODO do not override old KgTaskReport - return KgTaskReport( - task_name=self.name, - inputs=list(named_inputs.values()), - outputs=list(named_outputs.values()), - start_ts=start, - duration=time.time() - start, - status="skipped", - ) + report = self._build_report(start, "skipped", list(named_inputs.values()), list(named_outputs.values()), config_profile=configProfile) + self._trace_task_run_to_pipekg(report) + return report - self.function(named_inputs, named_outputs) - - return KgTaskReport( - task_name=self.name, - inputs=list(named_inputs.values()), - outputs=list(named_outputs.values()), - start_ts=start, - duration=time.time() - start, - status="success", - ) + self._call_function(named_inputs, named_outputs, configProfile) + report = self._build_report(start, "success", list(named_inputs.values()), list(named_outputs.values()), config_profile=configProfile) + self._trace_task_run_to_pipekg(report) + return report except Exception as e: print(f"An error occurred while running the task '{self.name}'.") @@ -134,16 +128,186 @@ def run(self, inputs: List[Data], outputs: List[Data], stable_files_override: bo print(f"Exception message: {e}") 
import traceback traceback.print_exc() - return KgTaskReport( - task_name=self.name, - inputs=inputs, - outputs=outputs, - start_ts=start, - duration=time.time() - start, - status="failed", - error=str(e) + report = self._build_report(start, "failed", inputs, outputs, error=str(e), config_profile=configProfile) + self._trace_task_run_to_pipekg(report) + return report + + def _trace_task_run_to_pipekg(self, report: KgTaskReport) -> None: + # TODO print(f"Tracing task run to pipekg: {report}") + if not self.trace_task_run: + return + task_run_to_entity(report) + + def _call_function( + self, + named_inputs: Dict[str, Data], + named_outputs: Dict[str, Data], + config_profile: Optional[object], + ) -> None: + """ + Call the wrapped task function with or without config. + + Supported task signatures: + - fn(inputs, outputs) + - fn(inputs, outputs, config) + - fn(inputs, outputs, *, config=...) + - fn(inputs, outputs, **kwargs) (will receive config=... if provided) + """ + sig = inspect.signature(self.function) + params = sig.parameters + + accepts_var_kwargs = any(p.kind == inspect.Parameter.VAR_KEYWORD for p in params.values()) + has_config_param = "config" in params + + if config_profile is None: + # If config is required positionally/without default, fail early with a clear error. + if has_config_param: + p = params["config"] + if p.default is inspect._empty and p.kind not in ( + inspect.Parameter.VAR_POSITIONAL, + inspect.Parameter.VAR_KEYWORD, + ): + raise TypeError( + f"{self.name} requires a 'config' argument but none was provided. " + f"Pass configProfile=... to KgTask.run(), or make 'config' optional." + ) + self.function(named_inputs, named_outputs) + return + + # config is provided: pass it only if the function can accept it + if has_config_param or accepts_var_kwargs: + # If the task declares a config spec, we require a structured ConfigurationProfile. 
+ if self.config_spec is not None and not isinstance(config_profile, ConfigurationProfile): + raise TypeError( + f"{self.name} expects configProfile to be a ConfigurationProfile " + f"because it declares config_spec='{self.config_spec.name}', " + f"got {type(config_profile).__name__}." + ) + if isinstance(config_profile, ConfigurationProfile) and self.config_spec is not None: + self._validate_config(config_profile, self.config_spec) + self.function(named_inputs, named_outputs, config=config_profile) + return + + # Function cannot accept config: ignore it + self.function(named_inputs, named_outputs) + + def _validate_config(self, config_profile: ConfigurationProfile, config_spec: ConfigurationDefinition) -> None: + if config_profile.definition.name != config_spec.name: + raise ValueError( + f"Config profile definition '{config_profile.definition.name}' does not match " + f"task config spec '{config_spec.name}'." ) + spec_by_name: Dict[str, Parameter] = {p.name: p for p in config_spec.parameters} + spec_by_key: Dict[str, Parameter] = {} + for p in config_spec.parameters: + spec_by_key[p.name] = p + for nk in p.native_keys: + spec_by_key[nk] = p + + bound: Dict[str, object] = {} + for binding in config_profile.bindings: + raw_key = binding.parameter.name + if raw_key not in spec_by_key: + raise ValueError( + f"Unknown config parameter '{raw_key}' for spec '{config_spec.name}'. 
" + f"Known: {sorted(spec_by_name.keys())}" + ) + param = spec_by_key[raw_key] + value = binding.value + bound[param.name] = value + + if param.datatype == ParameterType.boolean and not isinstance(value, bool): + raise TypeError(f"Config parameter '{param.name}' expects boolean, got {type(value).__name__}") + if param.datatype == ParameterType.integer and not isinstance(value, int): + raise TypeError(f"Config parameter '{param.name}' expects integer, got {type(value).__name__}") + if param.datatype == ParameterType.number and not isinstance(value, (int, float)): + raise TypeError(f"Config parameter '{param.name}' expects number, got {type(value).__name__}") + if param.datatype == ParameterType.string and not isinstance(value, str): + raise TypeError(f"Config parameter '{param.name}' expects string, got {type(value).__name__}") + + if param.allowed_values and value not in param.allowed_values: + raise ValueError( + f"Config parameter '{param.name}' value {value!r} not in allowed_values {param.allowed_values!r}" + ) + + if param.minimum is not None: + if not isinstance(value, (int, float)): + raise TypeError(f"Config parameter '{param.name}' has minimum constraint but value is not numeric") + if value < param.minimum: + raise ValueError(f"Config parameter '{param.name}' value {value} < minimum {param.minimum}") + + if param.maximum is not None: + if not isinstance(value, (int, float)): + raise TypeError(f"Config parameter '{param.name}' has maximum constraint but value is not numeric") + if value > param.maximum: + raise ValueError(f"Config parameter '{param.name}' value {value} > maximum {param.maximum}") + + missing_required: List[str] = [] + for p in config_spec.parameters: + if not p.required: + continue + if p.name in bound: + continue + if getattr(p, "default_value", None) is None: + missing_required.append(p.name) + if missing_required: + raise ValueError(f"Missing required config parameters: {missing_required}") + + + def _build_report( + self, + start_ts: 
float, + status: str, + inputs: List[Data], + outputs: List[Data], + error: Optional[str] = None, + config_profile: Optional[ConfigurationProfile] = None, + ) -> KgTaskReport: + return KgTaskReport( + task=self, + task_name=self.name, + inputs=inputs, + outputs=outputs, + start_ts=start_ts, + duration=time.time() - start_ts, + status=status, + error=error, + config_profile=config_profile, + ) + + def _validate_required_data( + self, + matched: Dict[str, Data], + spec: Mapping[str, Format], + label: str, + raw_items: List[Data], + ) -> None: + if len(matched) == len(spec): + return + + missing = set(spec.keys()) - set(matched.keys()) + expected = {k: v.value for k, v in spec.items()} + available = [f"{obj.path} ({obj.format.value})" for obj in raw_items] + raise ValueError( + f"Missing required {label}: {missing}. " + f"Expected: {expected}. " + f"Available: {available}" + ) + + def _prepare_outputs(self, outputs: Dict[str, Data], stable_files_override: bool) -> None: + if not stable_files_override: + return + for output in outputs.values(): + if output.path.exists(): + if output.path.is_file(): + output.path.unlink() + elif output.path.is_dir(): + shutil.rmtree(output.path) + + def _should_skip(self, outputs: Dict[str, Data]) -> bool: + return all(output.path.exists() for output in outputs.values()) + @staticmethod def _match(data: List[Data], spec: Mapping[str, Format]) -> Dict[str, Data]: """Match data objects to specification by format.""" diff --git a/src/kgpipe/common/models.py b/src/kgpipe/common/models.py index b758f28..d3f271a 100644 --- a/src/kgpipe/common/models.py +++ b/src/kgpipe/common/models.py @@ -8,76 +8,15 @@ from __future__ import annotations -from .model.data import Data, DataFormat, DynamicFormat, DataSet, FormatRegistry +from .model.data import Data, DataFormat, DataSet +from .model.default_catalog import BasicDataFormats, CustomDataFormats, BasicTaskCategoryCatalog from .model.task import KgTask, KgTaskReport from .model.pipeline import 
KgPipe, KgPipePlan, KgPipePlanStep, KgStageReport from .model.evaluation import Metric, EvaluationReport from .model.kg import KG -from .model.task import TaskInput, TaskOutput +from .model.task import TaskInput, TaskOutput, KgTask, KgTaskRun +# from .model.evaluation import KgMetric, KgMetricRun __all__ = [ - "Data", "DataFormat", "DynamicFormat", "DataSet", "FormatRegistry", "KgTask", "KgTaskReport", "KgPipe", "KgPipePlan", "KgPipePlanStep", "KgStageReport", "Metric", "EvaluationReport", "KG", "TaskInput", "TaskOutput" + "Data", "DataFormat", "BasicDataFormats", "CustomDataFormats", "BasicTaskCategoryCatalog", "DataSet", "KgTask", "KgTaskReport", "KgPipe", "KgPipePlan", "KgPipePlanStep", "KgStageReport", "Metric", "EvaluationReport", "KG", "TaskInput", "TaskOutput", "KgTaskRun" ] - -# TODO remove this for next release -# @dataclass -# class KG: -# """Represents a knowledge graph.""" -# id: str -# name: str -# path: Path -# format: Format -# triple_count: Optional[int] = None -# entity_count: Optional[int] = None -# description: Optional[str] = None -# metadata: Dict[str, Any] = field(default_factory=dict) -# graph: Optional[Graph] = None -# data_graph: Optional[Graph] = None -# ontology_graph: Optional[Graph] = None -# plan: Optional[KgPipePlan] = None - -# def __post_init__(self): -# if not self.id: -# self.id = str(uuid.uuid4()) -# if isinstance(self.path, str): -# self.path = Path(self.path) -# if not self.name: -# raise ValueError("KG name cannot be empty") - -# def get_graph(self) -> Graph: -# if self.graph is None: -# tmp = Graph().parse(self.path) -# graph = Graph() -# for s, p, o in tmp: -# if (str(p) != str(SKOS.altLabel)): -# graph.add((s, p, o)) -# self.graph = graph -# return self.graph - -# def get_data_graph(self) -> Graph: -# return Graph() - -# def get_ontology_graph(self) -> Graph: -# # TODO derive from graph -# if self.ontology_graph is None: -# self.ontology_graph = Graph() -# return self.ontology_graph - -# def set_ontology_graph(self, graph: 
Graph) -> None: -# print(f"Setting ontology graph with {len(graph)} triples") -# self.ontology_graph = graph - -# def exists(self) -> bool: -# """Check if the KG file exists.""" -# return self.path.exists() - -# def __str__(self) -> str: -# return f"KG({self.name}, {self.path}, {self.format.value})" - - - - - -# # Backward compatibility aliases -# Task = KgTask -# Pipeline = KgPipe \ No newline at end of file diff --git a/src/kgpipe/common/registry.py b/src/kgpipe/common/registry.py index 49f8359..18bdcbf 100644 --- a/src/kgpipe/common/registry.py +++ b/src/kgpipe/common/registry.py @@ -1,23 +1,23 @@ # global Registry, entry-point discovery -from typing import Any, Callable +from typing import Any, Callable, List, Dict from kgpipe.common.models import KgTask, DataFormat -from kgpipe.common.systemgraph import PipeKG -from kgpipe.common.definitions import MetricEntity +# from kgpipe.common.graph.systemgraph import PipeKG +from kgpipe.common.graph.definitions import MetricEntity, TaskEntity from kgpipe.common.model.configuration import ConfigurationDefinition +from kgpipe.common.graph.mapper import implementation_to_entity # TODO add also to system graph - - - class Registry: """ - Holds functions and python objects + Holds functions and python objects mappings KGpipe system graph """ _registry: dict[str, Any] = {} + # Generic # + @classmethod def register(cls, kind: str): def decorator(t): @@ -25,6 +25,25 @@ def decorator(t): return t return decorator + @classmethod + def get(cls, kind: str, name: str): + return cls._registry[f"{kind}:{name}"] + + @classmethod + def list(cls, kind: str): + """List all registered items of a specific kind.""" + items = [] + for key, value in cls._registry.items(): + if key.startswith(f"{kind}:"): + items.append(value) + return items + + @classmethod + def list_all(cls): + return cls._registry + + # Metric # + @classmethod def metric(cls): def decorator(t): @@ -34,50 +53,35 @@ def decorator(t): description = getattr(obj, 'description', 
None) type = getattr(obj, 'aspect', None) metric = MetricEntity(name=name, description=description, type=type.value if type else None) - PipeKG.add_metric(metric) + # TODO add to system graph return t return decorator + # Task # + + @classmethod + def add_task(cls, name: str, task: KgTask): + cls._registry[f"task:{task.name}"] = task + @classmethod def task( cls, - input_spec: dict[str, DataFormat], - output_spec: dict[str, DataFormat], + input_spec: Dict[str, DataFormat], + output_spec: Dict[str, DataFormat], description: str | None = None, - category: list[str] = [], + category: List[str] = [], config_spec: ConfigurationDefinition | None = None ) -> Callable[[Callable], KgTask]: def decorator(t): task = KgTask(t.__name__.lower(), input_spec, output_spec, t, description, category, config_spec) + if getattr(t, "_trace_task_run", False): + setattr(task, "trace_task_run", True) cls._registry[f"task:{t.__name__.lower()}"] = task - PipeKG.add_task(task) + # implementation_to_entity(task) + # PipeKG.add_implementation(implementation_to_entity(task)) return task return decorator - # @classmethod - # def pipeline(cls, tasks: list[KgTask], input: Data, output: Data): - # pipeline = KgPipe(tasks, input, output) - # cls._registry[f"pipeline:{pipeline.__name__.lower()}"] = pipeline - # PipeKG.add_pipeline(pipeline) - # return pipeline - - @classmethod - def get(cls, kind: str, name: str): - return cls._registry[f"{kind}:{name}"] - @classmethod def get_task(cls, name: str) -> KgTask: return cls._registry[f"task:{name}"] - - @classmethod - def list(cls, kind: str): - """List all registered items of a specific kind.""" - items = [] - for key, value in cls._registry.items(): - if key.startswith(f"{kind}:"): - items.append(value) - return items - - @classmethod - def list_all(cls): - return cls._registry \ No newline at end of file diff --git a/src/kgpipe/common/systemgraph.py b/src/kgpipe/common/systemgraph.py deleted file mode 100644 index 2e9e756..0000000 --- 
a/src/kgpipe/common/systemgraph.py +++ /dev/null @@ -1,330 +0,0 @@ -import functools -import ast -from uuid import uuid4 -from typing import Any, List, TYPE_CHECKING -from pydantic import BaseModel -from datetime import datetime, timezone - -# from kgcore.api import KG, BackendName - -from kgcore.api import KnowledgeGraph, KGEntity, KGRelation, KGProperty, new_id -from kgcore.backend.rdf.rdf_rdflib import RDFLibBackend -from kgcore.backend.rdf.rdf_sparql import RDFSparqlBackend, SparqlAuth -from kgcore.model.rdf.rdf_base import RDFBaseModel - -from kgpipe.common.definitions import ( - TaskEntity, TaskRunEntity, PipelineEntity, PipelineRunEntity, ImplementationEntity, MetricEntity, MetricRunEntity -) -from kgpipe.common.config import load_config -from kgpipe.common.util import encode_string - -if TYPE_CHECKING: - from kgpipe.common.models import KgTask, KgTaskReport - - -config = load_config() -scheme, rest = config.SYS_KG_URL.split("://") - -backend = RDFLibBackend() -model = RDFBaseModel() - -try: - if scheme == "sparql": - print(f"Using SPARQL backend for system graph: {f"http://{rest}"} with http://github.com/ScaDS/kgpipe/") - backend = RDFSparqlBackend( - endpoint=f"http://{rest}", - update_endpoint=f"http://{rest}", - default_graph="http://github.com/ScaDS/kgpipe/", - auth=SparqlAuth(username=config.SYS_KG_USR, password=config.SYS_KG_PSW)) - else: - raise ValueError(f"Unsupported schema: {scheme}") -except Exception as e: - print(f"Error creating system graph: {e}") - print(f"Using RDFLib memory backend for system graph") - -SYS_KG: KnowledgeGraph = KnowledgeGraph(model=model, backend=backend) - -class PipeKG: - """ - PipeKG is the system graph for the KGpipe framework. - It is a Object Graph Mapper (OGM) for the KGpipe framework. - It is used to store the entities and relations of the KGpipe framework. 
- """ - - # cached_implementations: Dict[str, KGEntity] = {} - - @staticmethod - def add_task(task: "KgTask"): - from kgpipe.common.models import KgTask # Import here to avoid circular import - types = [config.ONTOLOGY_PREFIX+encode_string(c) for c in task.category] - properties = [] - properties.append(KGProperty(key="description", value=task.description)) - task_entity = SYS_KG.create_entity(id=config.PIPEKG_PREFIX+task.name+"Impl", types=types+[config.ONTOLOGY_PREFIX+"Implementation"], properties=properties) - for input_name, input_format in task.input_spec.items(): - input_entity = SYS_KG.create_entity(id=config.PIPEKG_PREFIX+task.name+"Impl_"+input_name, types=[config.ONTOLOGY_PREFIX+"Data"], properties={ - "format": input_format, - }) - SYS_KG.create_relation(type="input", source=task_entity.id, target=input_entity.id) - for output_name, output_format in task.output_spec.items(): - output_entity = SYS_KG.create_entity(id=config.PIPEKG_PREFIX+task.name+"Impl_"+output_name, types=[config.ONTOLOGY_PREFIX+"Data"], properties={ - "format": output_format, - }) - SYS_KG.create_relation(type="output", source=task_entity.id, target=output_entity.id) - - @staticmethod - def _prop_value(properties: List[KGProperty], *keys: str) -> Any: - """Find a property value by exact key or key suffix.""" - for prop in properties: - if prop.key in keys: - return prop.value - for prop in properties: - for key in keys: - if prop.key.endswith(key): - return prop.value - return None - - @staticmethod - def _to_list(value: Any) -> List[str]: - """Normalize KG property values to list[str].""" - if value is None: - return [] - if isinstance(value, list): - return [str(v) for v in value] - if isinstance(value, tuple): - return [str(v) for v in value] - if isinstance(value, str): - text = value.strip() - if not text: - return [] - # Stored literals may contain Python-list string repr. 
- if text.startswith("[") and text.endswith("]"): - try: - parsed = ast.literal_eval(text) - except (ValueError, SyntaxError): - return [text] - if isinstance(parsed, list): - return [str(v) for v in parsed] - return [text] - return [str(value)] - - def list_taskImplementations(self) -> List[ImplementationEntity]: - entities = SYS_KG.find_entities(types=[config.ONTOLOGY_PREFIX + "Implementation"]) - implementations: List[ImplementationEntity] = [] - - for entity in entities: - name_value = self._prop_value(entity.properties, "name", config.ONTOLOGY_PREFIX + "name") - if not name_value: - # Fallback: derive a readable name from implementation IRI. - name_value = str(entity.id).rstrip("/").split("/")[-1] - - implements_method_value = self._prop_value( - entity.properties, - "implementsMethod", - config.ONTOLOGY_PREFIX + "implementsMethod", - ) - uses_tool_value = self._prop_value( - entity.properties, - "usesTool", - config.ONTOLOGY_PREFIX + "usesTool", - ) - has_parameter_value = self._prop_value( - entity.properties, - "hasParameter", - config.ONTOLOGY_PREFIX + "hasParameter", - ) - - input_entities = SYS_KG.get_neighbors(entity.id, predicate="input") - output_entities = SYS_KG.get_neighbors(entity.id, predicate="output") - - def get_property_values(properties: list[KGProperty], key: str) -> list[str]: - return [prop.value for prop in properties if prop.key.endswith(key)] - - input_spec = [get_property_values(input_entity.properties, "format")[0] for input_entity in input_entities] - output_spec = [get_property_values(output_entity.properties, "format")[0] for output_entity in output_entities] - - implementations.append( - ImplementationEntity( - uri=str(entity.id), - name=str(name_value), - input_spec=input_spec, - output_spec=output_spec, - implementsMethod=self._to_list(implements_method_value), - hasParameter=self._to_list(has_parameter_value), - usesTool=self._to_list(uses_tool_value), - ) - ) - - return implementations - - def list_tasks(self) -> 
List[ImplementationEntity]: - """Backward-compatible alias used by existing UI code.""" - return self.list_taskImplementations() - - - # @staticmethod - # def add_task_result(task_result: TaskResult): - # SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"TaskRun"], properties={ - # "task": task_result.task, - # "config": task_result.config, - # "input": task_result.input, - # "output": task_result.output, - # "status": task_result.status, - # "duration": task_result.duration, - # }) - - # @staticmethod - # def add_task_run(task_run: "KgTaskReport"): - # SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"TaskReport"], properties={ - # "task": task_run.task_name, - # "input": [data.path for data in task_run.inputs], - # "output": [data.path for data in task_run.outputs], - # "status": task_run.status, - # "duration": task_run.duration, - # "error": task_run.error, - # }) - - @staticmethod - def add_pipeline(pipeline: PipelineEntity): - SYS_KG.create_entity(id=new_id(),types=["Pipeline"], properties={ - "tasks": pipeline.tasks, - "input": pipeline.input, - "output": pipeline.output, - }) - - # @staticmethod - # def add_pipeline_result(pipeline_result: PipelineResult): - # SYS_KG.create_entity(id=new_id(),types=["PipelineResult"], properties={ - # "task_results": pipeline_result.task_results, - # "eval_results": pipeline_result.eval_results, - # "input": pipeline_result.input, - # "output": pipeline_result.output, - # }) - - @staticmethod - def add_metric(metric: MetricEntity): - SYS_KG.create_entity(id=config.PIPEKG_PREFIX+encode_string(metric.name),types=[config.ONTOLOGY_PREFIX+"Metric"], properties={ - config.ONTOLOGY_PREFIX+"name": metric.name, - config.ONTOLOGY_PREFIX+"description": metric.description, - config.ONTOLOGY_PREFIX+"type": metric.type, - # "input": metric.input, - # "output": metric.output, - }) - - @staticmethod - def add_metric_run(metric_run: MetricRunEntity): - metric_run_entity = 
SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"MetricRun"], properties={ - config.ONTOLOGY_PREFIX+"status": metric_run.status, - config.ONTOLOGY_PREFIX+"started_at": metric_run.started_at, - config.ONTOLOGY_PREFIX+"ended_at": metric_run.ended_at, - config.ONTOLOGY_PREFIX+"value": metric_run.value, - config.ONTOLOGY_PREFIX+"details": metric_run.details, - config.ONTOLOGY_PREFIX+"input": metric_run.input[0].uri, - }) - SYS_KG.create_relation(type=config.ONTOLOGY_PREFIX+"computedMetric", source=metric_run_entity.id, target=metric_run.computedMetric) - - # @staticmethod - # def find_implementation_by_name(name: str) -> KGEntity: - # return SYS_KG.read_entity(id=config.PIPEKG_PREFIX+name, types=[config.ONTOLOGY_PREFIX+"Implementation"])[0] - - @staticmethod - def add_implementation(implementation: ImplementationEntity): - SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"Implementation"], properties={ - "name": implementation.name, - "usesTool": implementation.usesTool, - "implementsMethod": implementation.implementsMethod, - "interface": implementation.interface, - - }) - - @staticmethod - def add_pipeline_run(pipeline_run: PipelineRunEntity): - pipeline_run_entity = SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"PipelineRun"], properties={ - "name": pipeline_run.name, - "status": pipeline_run.status, - "started_at": pipeline_run.started_at, - "ended_at": pipeline_run.ended_at - }) - for idx, task_run in enumerate(pipeline_run.hasTaskRun): - task_run_entity = SYS_KG.create_entity(id=new_id(),types=[config.ONTOLOGY_PREFIX+"TaskRun"], properties={ - "number": idx, - "name": task_run.name, - "status": task_run.status, - "started_at": task_run.started_at, - "ended_at": task_run.ended_at, - }) - SYS_KG.create_relation(type="executesTask", source=task_run_entity.id, target=task_run.executesTask) - SYS_KG.create_relation(type="usesImplementation", source=task_run_entity.id, target=task_run.usesImplementation) - 
SYS_KG.create_relation(type=config.ONTOLOGY_PREFIX+"hasTaskRun", source=pipeline_run_entity.id, target=task_run_entity.id) - - # return pipeline_run_entity - - @staticmethod - def sparql_construct(query: str): - backend : RDFSparqlBackend = SYS_KG.backend - result = backend.query_sparql(query) - return result - - -class MapperUtil(): - """ - Intermediate class to map the core classes to the definitions to the system graph. - Will be replaced in the future - """ - - @staticmethod - def map_task(task: "KgTask") -> TaskEntity: - return TaskEntity( - name=task.name, - input=task.input, - output=task.output, - ) - - -# def Track(_cls=None, *, with_timestamp: bool = False): -# """ -# Use as: -# @Track -# @Track(with_timestamp=True) -# """ -# def decorator(cls): -# class Tracked(cls): # subclass the original class -# def __init__(self, *args: Any, **kwargs: Any): -# super().__init__(*args, **kwargs) - -# inst_id = f"{cls.__name__}:{uuid4().hex[:8]}" -# setattr(self, "_kg_id", inst_id) - -# if isinstance(self, BaseModel): -# props = self.model_dump() -# else: -# props = {k: v for k, v in vars(self).items() if not k.startswith("_")} - -# if with_timestamp: -# props["timestamp"] = datetime.now(timezone.utc).isoformat() - -# SYS_KG.create_entity([cls.__name__], id=inst_id, props=props) - -# Tracked.__name__ = cls.__name__ # optional cosmetics -# Tracked.__qualname__ = cls.__qualname__ -# Tracked.__doc__ = cls.__doc__ -# return Tracked - -# return decorator if _cls is None else decorator(_cls) - -# def kg_function(fn): -# @functools.wraps(fn) -# def wrapper(*args, **kwargs): -# result = fn(*args, **kwargs) -# call_id = f"{fn.__name__}:{uuid4().hex[:8]}" -# SYS_KG.create_entity( -# ["FunctionCall"], -# id=call_id, -# props={ -# "name": fn.__name__, -# # Be careful serializing args/kwargs; this is a toy example: -# "args": repr(args), -# "kwargs": repr(kwargs), -# }, -# ) -# return result -# return wrapper diff --git a/src/kgpipe/datasets/multipart_multisource.py 
b/src/kgpipe/datasets/multipart_multisource.py index 1389221..6653b06 100644 --- a/src/kgpipe/datasets/multipart_multisource.py +++ b/src/kgpipe/datasets/multipart_multisource.py @@ -90,8 +90,8 @@ def _check(self): def read_csv(self) -> List[MatchesRow]: return read_matches_csv(self.file) -def read_entities_csv(path: Path) -> List[EntitiesRow]: - return [EntitiesRow(entity_id=row["entity_id"], entity_label=row["entity_label"], entity_type=row["entity_type"], dataset=row["dataset"]) for row in csv.DictReader(path.open("r"), delimiter="\t")] +def read_entities_csv(path: Path, delimiter: str = "\t") -> List[EntitiesRow]: + return [EntitiesRow(entity_id=row["entity_id"], entity_label=row["entity_label"], entity_type=row["entity_type"], dataset=row["dataset"]) for row in csv.DictReader(path.open("r"), delimiter=delimiter)] class VerifiedEntities(BaseModel): model_config = ConfigDict(arbitrary_types_allowed=True) @@ -215,13 +215,15 @@ class SplitIndex(BaseModel): # raise ValueError(f"{self.entities_csv} must contain an 'entity_id' column; got {header}") # return self +# SourceType = Literal["rdf", "json", "text"] + class Split(BaseModel): split_id: str root: Path index: SplitIndex kg_reference: Optional[KGBundle] = None kg_seed: Optional[KGBundle] = None - sources: Dict[str, SourceBundle] + sources: Dict[str, SourceBundle] # TODO SourceType def set_index(self, entities: List[EntitiesRow]): self.index.dir.mkdir(parents=True, exist_ok=True) @@ -548,12 +550,16 @@ def load_dataset(root: Path) -> Dataset: if seed_dir.exists(): seed_data_dir = seed_dir / "data" seed_meta_dir = seed_dir / "meta" + seed_meta = SourceMeta(root=seed_meta_dir) + ve = seed_meta_dir / "verified_entities.csv" + if ve.exists(): + seed_meta.entities = VerifiedEntities(file=ve) seed_parts = list_parts(seed_data_dir, (".nt", ".ttl", ".nq")) kg_seed = KGBundle( kind="seed", root=seed_dir, data=SourceData(dir=seed_data_dir, parts=seed_parts), - meta=SourceMeta(root=seed_meta_dir) + meta=seed_meta ) # 
sources diff --git a/src/kgpipe/evaluation/aspects/func/integration_eval.py b/src/kgpipe/evaluation/aspects/func/integration_eval.py index 72b6288..0e5392b 100644 --- a/src/kgpipe/evaluation/aspects/func/integration_eval.py +++ b/src/kgpipe/evaluation/aspects/func/integration_eval.py @@ -3,11 +3,11 @@ from pathlib import Path import pandas as pd from rdflib import RDFS, URIRef, Graph, RDF -from dataclasses import dataclass +from dataclasses import dataclass, field from kgpipe.util.embeddings.st_emb import get_model import numpy as np -from kgpipe.datasets.multipart_multisource import read_entities_csv - +from kgpipe.datasets.multipart_multisource import read_entities_csv, EntitiesRow +from typing import Any # model # entity dict @@ -41,6 +41,7 @@ class BinaryClassificationResult: fp: int tn: int fn: int + details: dict[str, Any] = field(default_factory=dict) def accuracy(self) -> float: return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn) @@ -63,6 +64,7 @@ def __dict__(self): "fp": self.fp, "tn": self.tn, "fn": self.fn, + "details": self.details, "accuracy": self.accuracy(), "precision": self.precision(), "recall": self.recall(), @@ -102,7 +104,7 @@ def load_entity_dict_from_csv(path: Path, delimiter: str = ",") -> dict: return entity_dict -def load_entity_dict(path: Path) -> dict: +def load_entity_dict(path: Path) -> dict[str, EntitiesRow]: """ """ if path.name.endswith(".json"): @@ -244,8 +246,9 @@ def evaluate_source_typed_entity_coverage(kg: KG, entity_dict_path: Path) -> Ent """ checks expected & integrated source typed entity overlap using label embeddings """ - model = get_model() - entity_dict = load_entity_dict(entity_dict_path) + model = get_model() # TODO this is not used here... + # TODO we need to subtract the seed from the found entities... 
+ entity_dict: dict[str, EntitiesRow] = load_entity_dict(entity_dict_path) expected_entity_label_type_pairs = [] @@ -270,15 +273,22 @@ def evaluate_source_typed_entity_coverage(kg: KG, entity_dict_path: Path) -> Ent found_eltp = set(found_entity_label_type_pairs) expected_eltp = set(expected_entity_label_type_pairs) - tp_set = found_eltp & expected_eltp - fp_set = found_eltp - expected_eltp - fn_set = expected_eltp - found_eltp + tp_set = found_eltp & expected_eltp # correct entity type pair + fp_set = found_eltp - expected_eltp # wrong entity type pair + fn_set = expected_eltp - found_eltp # missing entity type pair return BinaryClassificationResult( tp=len(tp_set), fp=len(fp_set), fn=len(fn_set), - tn=0 + tn=0, + details={ + "found_entity_label_type_pairs": found_entity_label_type_pairs, + "expected_entity_label_type_pairs": expected_entity_label_type_pairs, + "tp_set": len(tp_set), + "fp_set": len(fp_set), + "fn_set": len(fn_set) + } ) def evaluate_reference_triple_alignment(kg: KG, reference_kg: KG) -> TripleAlignmentResult: diff --git a/src/kgpipe/evaluation/aspects/reference.py b/src/kgpipe/evaluation/aspects/reference.py index b27a934..4b474b9 100644 --- a/src/kgpipe/evaluation/aspects/reference.py +++ b/src/kgpipe/evaluation/aspects/reference.py @@ -422,8 +422,14 @@ def compute(self, kg: KG, config: ReferenceConfig, **kwargs) -> MetricResult: result = evaluate_source_typed_entity_coverage(kg, verified_source_entities_path) + # log details to file + with open("source_typed_entity_coverage_details.json", "w") as f: + json.dump(result.__dict__(), f) + return MetricResult( name=self.name, + kg=kg, + metric=self, value=result.f1_score(), normalized_score=result.f1_score(), details=result.__dict__(), @@ -714,6 +720,8 @@ def evaluate(self, kg: KG, config: Optional[ReferenceConfig] = None, metrics: Op print(traceback.format_exc()) error_result = MetricResult( name=metric.name, + kg=kg, + metric=metric, value=0.0, normalized_score=0.0, details={"error": str(e)}, 
diff --git a/src/kgpipe/evaluation/aspects/statistical.py b/src/kgpipe/evaluation/aspects/statistical.py index 1e5aac6..c1bd9a0 100644 --- a/src/kgpipe/evaluation/aspects/statistical.py +++ b/src/kgpipe/evaluation/aspects/statistical.py @@ -71,6 +71,9 @@ def compute(self, kg: KG, config: StatisticalConfig, **kwargs) -> MetricResult: except Exception as e: print("this exception is raised") return MetricResult( + metric=self, + started_at=time.time(), + kg=kg, name=self.name, value=0.0, normalized_score=0.0, @@ -446,7 +449,10 @@ def evaluate(self, kg: KG, metrics: Optional[List[str]] = None, config: Optional except Exception as e: # Create error result error_result = MetricResult( + metric=metric, + kg=kg, name=metric.name, + started_at=time.time(), value=0.0, normalized_score=0.0, details={"error": str(e)}, diff --git a/src/kgpipe/evaluation/base.py b/src/kgpipe/evaluation/base.py index e5c1675..5a91d0b 100644 --- a/src/kgpipe/evaluation/base.py +++ b/src/kgpipe/evaluation/base.py @@ -9,7 +9,7 @@ from enum import Enum from typing import Any, Dict, List, Optional # from kgpipe.common.systemgraph import kg_class -from kgpipe.common.systemgraph import PipeKG +from kgpipe.common.graph.systemgraph import PipeKG import time import json import functools @@ -18,10 +18,11 @@ from pydantic import BaseModel from kgpipe.common.models import KG -from kgpipe.common.definitions import MetricEntity, MetricRunEntity, MetricEntityId, DataHandle +from kgpipe.common.graph.definitions import MetricRunEntity, MetricEntityId from kgpipe.common.config import config from pathlib import Path from kgpipe.common.util import encode_string + class EvaluationAspect(Enum): """The three main aspects of KG evaluation.""" STATISTICAL = "statistical" @@ -76,19 +77,19 @@ def __str__(self) -> str: # @Track(with_timestamp=True) # @kg_class(type="MetricResult", description="Result of computing a single metric.") -class MetricResult(BaseModel): +@dataclass +class MetricResult: """Result of computing a 
single metric.""" name: str + metric: "Metric" value: float normalized_score: float # 0.0-1.0 range - details: Dict[str, Any] aspect: EvaluationAspect + kg: KG + started_at: float = field(default_factory=time.time) + ended_at: float = field(default_factory=time.time) + details: Dict[str, Any] = field(default_factory=dict) duration: float = 0.0 - input: str = "" # TODO - - def __post_init__(self): - if not 0.0 <= self.normalized_score <= 1.0: - raise ValueError("Normalized score must be between 0.0 and 1.0") class MetricConfig(BaseModel): name: str @@ -117,7 +118,7 @@ def save_metric_run(metric: MetricResult): started_at=time.time(), ended_at=time.time(), computedMetric=MetricEntityId(config.PIPEKG_PREFIX+encode_string(metric.name)), - input=[DataHandle(uri=metric.input, type="any/text")], + input=[], # [Data(uri=metric.kg.path, type="any/text")], value=metric.value, details=json.dumps(metric.details, default=str) ) diff --git a/src/kgpipe/io/__init__.py b/src/kgpipe/io/__init__.py new file mode 100644 index 0000000..fe16459 --- /dev/null +++ b/src/kgpipe/io/__init__.py @@ -0,0 +1,2 @@ +__all__ = [] + diff --git a/src/kgpipe/io/pipe_out.py b/src/kgpipe/io/pipe_out.py new file mode 100644 index 0000000..e19bcd5 --- /dev/null +++ b/src/kgpipe/io/pipe_out.py @@ -0,0 +1,122 @@ +from __future__ import annotations + +from pathlib import Path +from typing import List, Optional + +from pydantic import BaseModel + +from kgpipe.common.models import KgPipePlan, KgStageReport + + +class TaskOut(BaseModel): + """ + Output artifacts produced by a single task within a stage. + """ + + task_name: str + output: List[Path] + + +class StageOut(BaseModel): + """ + Output artifacts for one incremental stage. + """ + + root: Path + stage_name: str + tasks: List[TaskOut] + resultKG: Optional[Path] = None + plan: Optional[KgPipePlan] = None + report: KgStageReport + + @property + def stage_index(self) -> int: + """ + Extract stage number from `stage_` directory name. 
+ """ + return int(self.stage_name.split("_", 1)[1]) + + +class PipeOut(BaseModel): + """ + Output artifacts for a full incremental pipeline run directory containing stage_* subdirs. + """ + + root: Path + pipeline_name: str + stages: List[StageOut] + resultKG: Optional[Path] = None + + +def _stage_paths(run_dir: Path) -> list[Path]: + stage_paths = [p for p in run_dir.iterdir() if p.is_dir() and p.name.startswith("stage_")] + stage_paths.sort(key=lambda p: int(p.name.split("_", 1)[1])) + return stage_paths + + +def _resolve_stage_result_kg(stage_dir: Path) -> Path: + """ + Return `result.nt` for the stage; the `result_eval.nt` (evaluation-ready) candidate is currently disabled. + """ + candidates = [ + # stage_dir / "result_eval.nt", + stage_dir / "result.nt", + ] + for c in candidates: + if c.exists(): + return c + # Keep the legacy default for downstream tools that expect result.nt even if not created yet. + return stage_dir / "result.nt" + + +def load_stage_out(stage_dir: Path) -> StageOut: + """ + Load stage outputs from a `stage_` directory produced by KGpipe incremental runs. 
+ """ + stage_name = stage_dir.name + + plan_path = stage_dir / "exec-plan.json" + report_path = stage_dir / "exec-report.json" + + if not plan_path.exists(): + raise FileNotFoundError(f"Missing exec plan: {plan_path}") + if not report_path.exists(): + raise FileNotFoundError(f"Missing exec report: {report_path}") + + stage_plan = KgPipePlan.model_validate_json(plan_path.read_text()) + + stage_tasks: list[TaskOut] = [] + for step in stage_plan.steps: + stage_tasks.append( + TaskOut( + task_name=step.task, + output=[stage_dir / f"{output.path}" for output in step.output], + ) + ) + + stage_report = KgStageReport.model_validate_json(report_path.read_text()) + + return StageOut( + root=stage_dir, + stage_name=stage_name, + tasks=stage_tasks, + resultKG=_resolve_stage_result_kg(stage_dir), + plan=stage_plan, + report=stage_report, + ) + + +def load_pipe_out(run_dir: Path) -> PipeOut: + """ + Load a pipeline run output directory that contains `stage_*` directories. + """ + run_dir = Path(run_dir) + stages = [load_stage_out(p) for p in _stage_paths(run_dir)] + + return PipeOut( + root=run_dir, + pipeline_name=run_dir.name, + stages=stages, + resultKG=_resolve_stage_result_kg(run_dir) if (run_dir / "result.nt").exists() else (run_dir / "result.nt"), + ) + diff --git a/src/kgpipe/test/common/test_graph.py b/src/kgpipe/test/common/test_graph.py new file mode 100644 index 0000000..e989caf --- /dev/null +++ b/src/kgpipe/test/common/test_graph.py @@ -0,0 +1,41 @@ +from uuid import uuid4 + +from kgpipe.common.graph.definitions import ( + DataTypeEntity, + DataSpecEntity, + TaskEntity, + ImplementationEntity, +) +from kgpipe.common.graph.systemgraph import PipeKG + + +def _uid(prefix: str) -> str: + return f"{prefix}_{uuid4().hex[:8]}" + + +def test_add_implementation_and_find_implemenetation(): + task = TaskEntity(name=_uid("task"), description="test task") + task_id = PipeKG.add_task(task) + + data_type = DataTypeEntity(format="text/csv", data_schema=_uid("schema")) + 
data_type_id = PipeKG.add_data_type(data_type) + + in_spec_id = PipeKG.add_data_spec(DataSpecEntity(name=_uid("in_spec"), data_type=data_type_id)) + out_spec_id = PipeKG.add_data_spec(DataSpecEntity(name=_uid("out_spec"), data_type=data_type_id)) + + impl_name = _uid("impl") + impl = ImplementationEntity( + name=impl_name, + version="0.0.1", + input_spec=[in_spec_id], + output_spec=[out_spec_id], + realizesTask=[task_id], + usesTool=[], + ) + + PipeKG.add_implementation(impl) + found = PipeKG.find_implementation(impl_name) + + assert found is not None + assert found.name == impl_name + assert found.version == "0.0.1" diff --git a/src/kgpipe/test/common/test_model.py b/src/kgpipe/test/common/test_model.py index c7b73c0..d579f5b 100644 --- a/src/kgpipe/test/common/test_model.py +++ b/src/kgpipe/test/common/test_model.py @@ -1,29 +1,68 @@ -from kgpipe.common.models import KgPipePlan, KgPipePlanStep, Data, DataFormat -from pathlib import Path import json +from enum import Enum +from pathlib import Path + +import pytest + +from kgpipe.common.models import ( + BasicDataFormats, + CustomDataFormats, + Data, + DataFormat, + KgPipePlan, + KgPipePlanStep, +) + +class ProjectFormats(CustomDataFormats): + EMBEDDINGS_JSON = "embeddings.json" + + +class ForeignFormats(str, Enum): + MY_RAW = "my.raw" + + +def test_kg_pipe_plan_roundtrip(): + plan = KgPipePlan( + steps=[ + KgPipePlanStep( + task="paris_entity_matching", + input=[Data(path=Path("data.nt"), format=DataFormat.RDF_NTRIPLES)], + output=[Data(path=Path("data.paris_csv"), format=DataFormat.PARIS_CSV)], + ), + KgPipePlanStep( + task="paris_csv_to_matching_format", + input=[Data(path=Path("data.paris_csv"), format=DataFormat.PARIS_CSV)], + output=[Data(path=Path("data.em_json"), format=DataFormat.ER_JSON)], + ), + ], + seed=Data(path=Path("seed.nt"), format=DataFormat.RDF_NTRIPLES), + source=Data(path=Path("source.nt"), format=DataFormat.RDF_NTRIPLES), + result=Data(path=Path("result.nt"), format=DataFormat.RDF_NTRIPLES), 
+ ) + + plan_json = plan.model_dump_json() + plan_back = KgPipePlan(**json.loads(plan_json)) + + assert plan == plan_back + + +def test_data_accepts_basic_data_formats(): + data = Data(path=Path("a.nt"), format=BasicDataFormats.RDF_NTRIPLES) + assert data.format == BasicDataFormats.RDF_NTRIPLES + assert data.to_dict()["format"] == "nt" + + +def test_data_accepts_custom_data_formats(): + data = Data(path=Path("embed.json"), format=ProjectFormats.EMBEDDINGS_JSON) + assert data.format == ProjectFormats.EMBEDDINGS_JSON + assert data.to_dict()["format"] == "embeddings.json" + + +def test_data_rejects_foreign_string_enum_not_based_on_custom_catalog(): + with pytest.raises(ValueError): + Data(path=Path("x.raw"), format=ForeignFormats.MY_RAW) + -def test_kg_pipe_plan(): - plan = KgPipePlan( - steps=[ - KgPipePlanStep( - task="paris_entity_matching", - input=[Data(path=Path("data.nt"), format=DataFormat.RDF_NTRIPLES)], - output=[Data(path=Path("data.paris_csv"), format=DataFormat.PARIS_CSV)] - ), - KgPipePlanStep( - task="paris_csv_to_matching_format", - input=[Data(path=Path("data.paris_csv"), format=DataFormat.PARIS_CSV)], - output=[Data(path=Path("data.em_json"), format=DataFormat.ER_JSON)] - ), - ], - seed=Data(path=Path("seed.nt"), format=DataFormat.RDF_NTRIPLES), - source=Data(path=Path("source.nt"), format=DataFormat.RDF_NTRIPLES), - result=Data(path=Path("result.nt"), format=DataFormat.RDF_NTRIPLES), - ) - - plan_json = plan.model_dump_json() - print(plan_json) - - plan_back = KgPipePlan(**json.loads(plan_json)) - - assert plan == plan_back \ No newline at end of file +def test_data_rejects_unknown_string_format(): + with pytest.raises(ValueError, match="Unknown format: does-not-exist"): + Data(path=Path("x.any"), format="does-not-exist") \ No newline at end of file diff --git a/src/kgpipe/test/common/test_runtime_to_kg.py b/src/kgpipe/test/common/test_runtime_to_kg.py new file mode 100644 index 0000000..a716642 --- /dev/null +++ 
b/src/kgpipe/test/common/test_runtime_to_kg.py @@ -0,0 +1,83 @@ +from pathlib import Path + +from kgpipe.common.config import config +from kgpipe.common.models import Data, DataFormat, KgTaskReport +from kgpipe.common.model.task import KgTask +from kgpipe.common.runtime_to_kg import ( + data_to_handle, + reports_to_pipeline_run_entity, + task_to_task_entity, + task_report_to_task_run_entity, +) + + +def _make_report(name: str, start_ts: float, duration: float, status: str = "success") -> KgTaskReport: + return KgTaskReport( + task_name=name, + inputs=[Data(path=Path(f"{name}.in.nt"), format=DataFormat.RDF_NTRIPLES)], + outputs=[Data(path=Path(f"{name}.out.nt"), format=DataFormat.RDF_NTRIPLES)], + start_ts=start_ts, + duration=duration, + status=status, + ) + + +def test_data_to_handle_maps_path_and_format(): + data = Data(path=Path("test.nt"), format=DataFormat.RDF_NTRIPLES) + handle = data_to_handle(data) + assert handle.uri == "test.nt" + assert handle.type == DataFormat.RDF_NTRIPLES + + +def test_task_to_task_entity_maps_name_and_defaults(): + task = KgTask( + name="normalize", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=lambda _i, _o: None, + ) + entity = task_to_task_entity(task) + assert entity.name == "normalize" + assert entity.hasSubtask == [] + + +def test_task_report_to_task_run_entity_maps_core_fields(): + report = _make_report("normalize", start_ts=10.0, duration=2.5) + entity = task_report_to_task_run_entity(report, index=3) + + assert entity.number == 3 + assert entity.name == "normalize" + assert entity.status == "success" + assert entity.started_at == 10.0 + assert entity.ended_at == 12.5 + assert str(entity.executesTask) == f"{config.PIPEKG_PREFIX}normalize" + assert str(entity.usesImplementation) == f"{config.PIPEKG_PREFIX}normalizeImpl" + assert len(entity.input) == 1 + assert len(entity.output) == 1 + + +def test_reports_to_pipeline_run_entity_aggregates_times_and_runs(): + reports = 
[ + _make_report("step_a", start_ts=100.0, duration=10.0), + _make_report("step_b", start_ts=80.0, duration=5.0), + ] + + pipeline_entity = reports_to_pipeline_run_entity(reports, pipeline_name="demo_pipe") + + assert pipeline_entity.name == "demo_pipe" + assert pipeline_entity.status == "success" + assert pipeline_entity.started_at == 80.0 + assert pipeline_entity.ended_at == 110.0 + assert len(pipeline_entity.hasTaskRun) == 2 + assert pipeline_entity.hasTaskRun[0].number == 0 + assert pipeline_entity.hasTaskRun[1].number == 1 + + +def test_reports_to_pipeline_run_entity_handles_empty_reports(): + pipeline_entity = reports_to_pipeline_run_entity([], pipeline_name="empty_pipe") + + assert pipeline_entity.name == "empty_pipe" + assert pipeline_entity.status == "success" + assert pipeline_entity.started_at == 0.0 + assert pipeline_entity.ended_at == 0.0 + assert pipeline_entity.hasTaskRun == [] diff --git a/src/kgpipe/test/common/test_systemgraph.py b/src/kgpipe/test/common/test_systemgraph.py index 870bd75..2be56c7 100644 --- a/src/kgpipe/test/common/test_systemgraph.py +++ b/src/kgpipe/test/common/test_systemgraph.py @@ -1,72 +1,135 @@ -from kgpipe.common.systemgraph import kg_class, kg_function, SYS_KG, add_task, add_task_result, add_pipeline, add_pipeline_result -from kgpipe.common.definitions import Task, Eval, Pipeline, TaskResult, DataHandle, PipelineResult -import sys -from kgcore.backend.rdf.rdf_rdflib import RDFLibBackend - -task1 = Task( - name="test_task", - type="test_type", - description="test_description", - input=["test_input"], - output=["test_output"] -) -task2 = Task( - name="test_task2", - type="test_type2", - description="test_description2", - input=["test_input2"], - output=["test_output2"] -) -task_result1 = TaskResult( - task=task1, - config={"test_config": "test_config"}, - input=[DataHandle(uri="test_input", type="test_input_type")], - output=[DataHandle(uri="test_output", type="test_output_type")], - status="test_status", - duration=10.0 -) 
-task_result2 = TaskResult( - task=task2, - config={"test_config2": "test_config2"}, - input=[DataHandle(uri="test_input2", type="test_input_type2")], - output=[DataHandle(uri="test_output2", type="test_output_type2")], - status="test_status2", - duration=20.0 -) -pipeline = Pipeline( - tasks=[task1, task2], - input=["test_input"], - output=["test_output"] -) -pipeline_result = PipelineResult( - task_results=[task_result1, task_result2], - eval_results=[], - input=[DataHandle(uri="test_input", type="test_input_type")], - output=[DataHandle(uri="test_output", type="test_output_type")], - status="test_status", - duration=30.0 +from uuid import uuid4 + +from kgpipe.common.definitions import ( + DataHandle, + ImplementationEntity, + MethodEntity, + MetricEntity, + PipelineEntity, + TaskRunEntity, + ToolEntity, ) +from kgpipe.common.systemgraph import PipeKG + + +def _uid(prefix: str) -> str: + return f"{prefix}_{uuid4().hex[:8]}" + + +def test_core_layer_method_tool_and_implementation(): + method_name = _uid("method") + tool_name = _uid("tool") + impl_name = _uid("impl") + + method = MethodEntity(name=method_name, realizesTask=["task:a"]) + tool = ToolEntity(name=tool_name, providesMethods=["method:a"]) + implementation = ImplementationEntity( + name=impl_name, + input_spec=["text/csv"], + output_spec=["application/json"], + implementsMethod=["method:a"], + hasParameter=["param:a"], + usesTool=["tool:a"], + ) + + PipeKG.add_method(method) + PipeKG.add_tool(tool) + PipeKG.add_implementation(implementation) + + found_method = PipeKG.find_method(method_name) + found_tool = PipeKG.find_tool(tool_name) + found_implementation = PipeKG.find_implementation(impl_name) + + assert found_method is not None + assert found_method.name == method_name + assert "task:a" in found_method.realizesTask + + assert found_tool is not None + assert found_tool.name == tool_name + assert "method:a" in found_tool.providesMethods + + assert found_implementation is not None + assert 
found_implementation.name == impl_name + assert found_implementation.input_spec == ["text/csv"] + assert found_implementation.output_spec == ["application/json"] + + +def test_data_layer_artifact_type_and_spec(): + artifact_uri = f"file:///{_uid('artifact')}.csv" + artifact_type = _uid("artifact_type") + spec_name = _uid("spec") + specification = '{"type":"object","properties":{"name":{"type":"string"}}}' + data = DataHandle( + uri=artifact_uri, + type="text/csv", + version="1.0.0", + hash="abc123", + size=42, + ) + + PipeKG.add_data_artifact(data) + PipeKG.add_data_artifact_type(artifact_type) + PipeKG.add_data_artifact_spec(spec_name, specification) + + found_data = PipeKG.find_data_artifact(artifact_uri) + found_type = PipeKG.find_data_artifact_type(artifact_type) + found_spec = PipeKG.find_data_artifact_spec(spec_name) + + assert found_data is not None + assert found_data.uri == artifact_uri + assert found_data.type == "text/csv" + assert found_data.version == "1.0.0" + assert found_type == artifact_type + assert found_spec == specification + + +def test_pipeline_layer_pipeline_step_and_definition(): + pipeline_name = _uid("pipeline") + step_task = "task:clean" + definition_name = _uid("pipeline_def") + pipeline_id = f"pipeline:{pipeline_name}" + + pipeline = PipelineEntity(name=pipeline_name, tasks=[step_task], input=[], output=[]) + PipeKG.add_pipeline(pipeline) + PipeKG.add_pipeline_step(pipeline_name=pipeline_name, step_number=1, task_id=step_task) + PipeKG.add_pipeline_definition(name=definition_name, pipeline_id=pipeline_id) + + found_pipeline = PipeKG.find_pipeline(pipeline_name) + found_step = PipeKG.find_pipeline_step(pipeline_name, 1) + found_definition = PipeKG.find_pipeline_definition(definition_name) -model: RDFLibBackend = SYS_KG.backend + assert found_pipeline is not None + assert found_pipeline.name == pipeline_name + assert step_task in found_pipeline.tasks + assert found_step is not None + assert found_definition is not None -def 
test_task_entity(): - add_task(task1) - add_task(task2) - # print(model.get_rdflibgraph().serialize(format="turtle")) +def test_metrics_layer_add_and_find_metric(): + metric_name = _uid("metric") + metric = MetricEntity(name=metric_name, description="Accuracy metric", type="score") + PipeKG.add_metric(metric) -def test_task_result_entity(): - add_task_result(task_result1) - add_task_result(task_result2) + found_metric = PipeKG.find_metric(metric_name) - # print(model.get_rdflibgraph().serialize(format="turtle")) + assert found_metric is not None + assert found_metric.name == metric_name + assert found_metric.description == "Accuracy metric" + assert found_metric.type == "score" -def test_pipeline_entity(): - add_pipeline(pipeline) - # print(model.get_rdflibgraph().serialize(format="turtle")) +def test_run_layer_add_task_run(): + task_run = TaskRunEntity( + number=1, + name=_uid("task_run"), + status="success", + started_at=1.0, + ended_at=2.0, + input=[DataHandle(uri="file:///in.csv", type="text/csv")], + output=[DataHandle(uri="file:///out.csv", type="text/csv")], + executesTask="task:clean", + usesImplementation="impl:clean_v1", + hasParameterBinding=[], + ) -def test_pipeline_result_entity(): - add_pipeline_result(pipeline_result) - - print(model.get_rdflibgraph().serialize(format="turtle")) \ No newline at end of file + PipeKG.add_task_run(task_run) \ No newline at end of file diff --git a/src/kgpipe/test/common/test_task_category_catalog.py b/src/kgpipe/test/common/test_task_category_catalog.py new file mode 100644 index 0000000..605ac0f --- /dev/null +++ b/src/kgpipe/test/common/test_task_category_catalog.py @@ -0,0 +1,33 @@ +from kgpipe.common.models import TaskCategoryCatalog + + +def test_entity_resolution_children_include_expected_subtasks(): + children = TaskCategoryCatalog.get_children("EntityResolution") + assert "Blocking" in children + assert "Matching" in children + assert "EntityMatching" in children + assert "Clustering" in children + + +def 
test_subtask_relationships_for_entity_resolution(): + assert TaskCategoryCatalog.is_subtask_of("Blocking", "EntityResolution") + assert TaskCategoryCatalog.is_subtask_of("Matching", "EntityResolution") + assert TaskCategoryCatalog.is_subtask_of("Clustering", "EntityResolution") + assert not TaskCategoryCatalog.is_subtask_of("EntityResolution", "Blocking") + + +def test_ancestors_and_descendants_are_resolved(): + ancestors = TaskCategoryCatalog.get_ancestors("EntityMatching") + descendants = TaskCategoryCatalog.get_descendants("EntityResolution") + + assert ancestors[0] == "EntityResolution" + assert "TaskCategory" in ancestors + assert "Blocking" in descendants + assert "Clustering" in descendants + + +def test_register_custom_category_under_existing_parent(): + TaskCategoryCatalog.register("CandidateGeneration", parent="EntityResolution") + assert TaskCategoryCatalog.has("CandidateGeneration") + assert TaskCategoryCatalog.get_parent("CandidateGeneration") == "EntityResolution" + assert TaskCategoryCatalog.is_subtask_of("CandidateGeneration", "EntityResolution") diff --git a/src/kgpipe/test/common/test_task_model.py b/src/kgpipe/test/common/test_task_model.py new file mode 100644 index 0000000..cbe0e19 --- /dev/null +++ b/src/kgpipe/test/common/test_task_model.py @@ -0,0 +1,136 @@ +from pathlib import Path + +from kgpipe.common.models import Data, DataFormat, KgTask + + +def _write_output_task(inputs: dict[str, Data], outputs: dict[str, Data]) -> None: + _ = inputs["in"] + out_path = outputs["out"].path + out_path.parent.mkdir(parents=True, exist_ok=True) + out_path.write_text("generated") + + +def test_kgtask_run_success(tmp_path: Path): + in_file = tmp_path / "input.nt" + out_file = tmp_path / "output.nt" + in_file.write_text("seed") + + task = KgTask( + name="copy_like", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=_write_output_task, + ) + + report = task.run( + inputs=[Data(path=in_file, 
format=DataFormat.RDF_NTRIPLES)], + outputs=[Data(path=out_file, format=DataFormat.RDF_NTRIPLES)], + ) + + assert report.status == "success" + assert out_file.exists() + assert report.task_name == "copy_like" + assert len(report.inputs) == 1 + assert len(report.outputs) == 1 + + +def test_kgtask_run_failed_when_function_raises(tmp_path: Path): + def failing_task(_: dict[str, Data], __: dict[str, Data]) -> None: + raise RuntimeError("boom") + + in_file = tmp_path / "input.nt" + out_file = tmp_path / "output.nt" + in_file.write_text("seed") + + task = KgTask( + name="fails", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=failing_task, + ) + + report = task.run( + inputs=[Data(path=in_file, format=DataFormat.RDF_NTRIPLES)], + outputs=[Data(path=out_file, format=DataFormat.RDF_NTRIPLES)], + ) + + assert report.status == "failed" + assert report.error is not None + assert "boom" in report.error + + +def test_kgtask_run_skips_when_outputs_exist(tmp_path: Path): + called = {"count": 0} + + def should_not_run(_: dict[str, Data], __: dict[str, Data]) -> None: + called["count"] += 1 + + in_file = tmp_path / "input.nt" + out_file = tmp_path / "output.nt" + in_file.write_text("seed") + out_file.write_text("already-here") + + task = KgTask( + name="skip_if_present", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=should_not_run, + ) + + report = task.run( + inputs=[Data(path=in_file, format=DataFormat.RDF_NTRIPLES)], + outputs=[Data(path=out_file, format=DataFormat.RDF_NTRIPLES)], + ) + + assert report.status == "skipped" + assert called["count"] == 0 + + +def test_kgtask_stable_files_override_forces_run(tmp_path: Path): + called = {"count": 0} + out_file = tmp_path / "output.nt" + + def rewrite_output(_: dict[str, Data], outputs: dict[str, Data]) -> None: + called["count"] += 1 + outputs["out"].path.write_text("fresh") + + in_file = tmp_path / "input.nt" + 
in_file.write_text("seed") + out_file.write_text("stale") + + task = KgTask( + name="override_output", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=rewrite_output, + ) + + report = task.run( + inputs=[Data(path=in_file, format=DataFormat.RDF_NTRIPLES)], + outputs=[Data(path=out_file, format=DataFormat.RDF_NTRIPLES)], + stable_files_override=True, + ) + + assert report.status == "success" + assert called["count"] == 1 + assert out_file.read_text() == "fresh" + + +def test_kgtask_run_fails_for_missing_required_input(tmp_path: Path): + out_file = tmp_path / "output.nt" + + task = KgTask( + name="needs_input", + input_spec={"in": DataFormat.RDF_NTRIPLES}, + output_spec={"out": DataFormat.RDF_NTRIPLES}, + function=_write_output_task, + ) + + report = task.run( + inputs=[], + outputs=[Data(path=out_file, format=DataFormat.RDF_NTRIPLES)], + ) + + assert report.status == "failed" + assert report.error is not None + assert "Missing required inputs" in report.error diff --git a/src/kgpipe_eval/__init__.py b/src/kgpipe_eval/__init__.py new file mode 100644 index 0000000..04e3f84 --- /dev/null +++ b/src/kgpipe_eval/__init__.py @@ -0,0 +1,2 @@ +# Refactor of kgpipe.evaluation to be a standalone package + diff --git a/src/kgpipe_eval/api.py b/src/kgpipe_eval/api.py new file mode 100644 index 0000000..36c821a --- /dev/null +++ b/src/kgpipe_eval/api.py @@ -0,0 +1,48 @@ +from __future__ import annotations + +from abc import ABC, abstractmethod +from dataclasses import dataclass +from typing import Any + + +# MetricConfig (rich, typed, input) +# ↓ +# computation +# ↓ +# MetricResult +# ├── measurements (results) +# └── metadata (flattened config + context) + +@dataclass(frozen=True) +class MetricConfig: + pass + +@dataclass(frozen=True) +class Measurement: + name: str + value: Any + unit: str | None = None + +@dataclass(frozen=True) +class MetricResult: + metric: "Metric" + measurements: list[Measurement] + summary: 
str | None = None + # TODO metadata/properties: dict[str, int | float | str | bool] = field(default_factory=dict) + +class Metric(ABC): + """ + Minimal metric interface for the `kgpipe eval-new` CLI. + + Metrics are instantiated (usually with default config) and then run via `compute(...)`. + """ + + key: str + description: str + + @abstractmethod + def compute(self, *args: Any, **kwargs: Any) -> MetricResult: ... + + +# --- + diff --git a/src/kgpipe_eval/config/manager.py b/src/kgpipe_eval/config/manager.py new file mode 100644 index 0000000..d7d5d81 --- /dev/null +++ b/src/kgpipe_eval/config/manager.py @@ -0,0 +1,334 @@ +from __future__ import annotations + +from pathlib import Path +from typing import Any, Mapping +import re + +import yaml +from pydantic import BaseModel + +from kgpipe.common import KG +from kgpipe.common.model.data import DataFormat + +from kgpipe_eval.metrics.duplicates import DuplicateConfig +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentConfig +from kgpipe_eval.metrics.consistency_violations import ConsistencyViolationsConfig +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig + + +MetricConfigModel = BaseModel +REQUIRED = "" + +_VAR_PATTERN = re.compile(r"^\$(\w+)$|^\$\{(\w+)\}$") + + +def _interpolate_vars(obj: Any, vars_map: Mapping[str, Any]) -> Any: + """ + Recursively interpolate simple $var / ${var} references inside YAML-loaded data. + + Only replaces when the *entire* string is a reference token. + """ + if isinstance(obj, str): + m = _VAR_PATTERN.match(obj.strip()) + if not m: + return obj + name = m.group(1) or m.group(2) + if name in vars_map: + return vars_map[name] + return obj + if isinstance(obj, list): + return [_interpolate_vars(v, vars_map) for v in obj] + if isinstance(obj, dict): + return {k: _interpolate_vars(v, vars_map) for k, v in obj.items()} + return obj + + +def _resolve_paths(obj: Any, *, base_dir: Path) -> Any: + """ + Recursively resolve relative paths for common config keys. 
+ + - For keys ending with `_path` or `_kg_path`, if the value is a str/Path and + relative, make it absolute by joining with `base_dir`. + - For `reference_kg` when passed as str/Path, treat it as a path too. + """ + if isinstance(obj, list): + return [_resolve_paths(v, base_dir=base_dir) for v in obj] + if isinstance(obj, dict): + out: dict[str, Any] = {} + for k, v in obj.items(): + vv = _resolve_paths(v, base_dir=base_dir) + if isinstance(vv, (str, Path)): + if k == "reference_kg" or k.endswith("_path") or k.endswith("_kg_path"): + p = Path(vv) + if not p.is_absolute(): + vv = (base_dir / p).resolve() + out[k] = vv + return out + return obj + + +def _deep_merge_dict(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]: + """ + Merge override into base recursively (override wins). + """ + out: dict[str, Any] = dict(base) + for k, v in override.items(): + if ( + k in out + and isinstance(out[k], Mapping) + and isinstance(v, Mapping) + ): + out[k] = _deep_merge_dict(out[k], v) + else: + out[k] = v + return out + + +def _kg_from_path(path: Path, *, name: str | None = None) -> KG: + """ + Build a minimal `kgpipe.common.KG` from a filesystem path. + + Notes: + - We infer `format` from the file suffix when possible, otherwise fall back to JSON. + - The KG object lazily parses the graph when `get_graph()` is called. + """ + suffix = path.suffix.lower().lstrip(".") + try: + fmt = DataFormat(suffix) + except Exception: + fmt = DataFormat.JSON + + return KG( + id=str(path), + name=(name or path.stem), + path=path, + format=fmt, + ) + + +def _resolve_entity_alignment_config( + metric_cfg: Mapping[str, Any], + named: Mapping[str, Mapping[str, Any]], +) -> dict[str, Any]: + """ + Resolve an entity alignment config from either: + - inline: `entity_alignment_config: {...}` + - ref: `entity_alignment_config_ref: name` + Optionally supports both; inline values override the referenced dict. 
+ """ + inline = metric_cfg.get("entity_alignment_config") or {} + ref_name = metric_cfg.get("entity_alignment_config_ref") + if ref_name is None: + if not isinstance(inline, Mapping): + raise TypeError("`entity_alignment_config` must be a mapping if provided.") + return dict(inline) + + if not isinstance(ref_name, str) or not ref_name: + raise TypeError("`entity_alignment_config_ref` must be a non-empty string.") + if ref_name not in named: + raise KeyError(f"Unknown entity alignment config ref: {ref_name!r}") + + if not isinstance(inline, Mapping): + raise TypeError("`entity_alignment_config` must be a mapping if provided.") + return _deep_merge_dict(named[ref_name], inline) + + +def load_metric_configs(config_path: str | Path) -> dict[str, MetricConfigModel]: + """ + Load a single YAML file that defines metric configs and optional shared sub-configs. + + Expected YAML structure (minimal): + + ```yaml + entity_alignment_configs: + default: + method: label_embedding + verified_entities_path: path/to/entities.csv + entity_sim_threshold: 0.95 + + metrics: + entity_align: + entity_alignment_config_ref: default + + duplicates: + entity_alignment_config_ref: default + + triple_alignment: + reference_kg_path: path/to/reference.nt + entity_alignment_config_ref: default + value_sim_threshold: 0.5 + ``` + + Returned dict keys are metric keys (e.g. "duplicates") and values are instantiated + Pydantic config objects (e.g. `DuplicateConfig`). + """ + path = Path(config_path) + raw = yaml.safe_load(path.read_text(encoding="utf-8")) or {} + if not isinstance(raw, Mapping): + raise TypeError("Top-level YAML must be a mapping/dict.") + + # Allow simple variable indirection like: + # reference_kg: test.ttl + # ... 
reference_kg: $reference_kg + vars_map = {k: v for k, v in raw.items() if isinstance(k, str)} + raw = _interpolate_vars(raw, vars_map) + raw = _resolve_paths(raw, base_dir=path.parent) + + named_entity_alignment: dict[str, dict[str, Any]] = {} + raw_named = raw.get("entity_alignment_configs") or {} + if raw_named: + if not isinstance(raw_named, Mapping): + raise TypeError("`entity_alignment_configs` must be a mapping/dict.") + for k, v in raw_named.items(): + if not isinstance(k, str) or not k: + raise TypeError("`entity_alignment_configs` keys must be non-empty strings.") + if not isinstance(v, Mapping): + raise TypeError(f"`entity_alignment_configs.{k}` must be a mapping/dict.") + named_entity_alignment[k] = dict(v) + + metrics_raw = raw.get("metrics") or {} + if not isinstance(metrics_raw, Mapping): + raise TypeError("`metrics` must be a mapping/dict.") + + out: dict[str, MetricConfigModel] = {} + for metric_key, metric_cfg_any in metrics_raw.items(): + if not isinstance(metric_key, str) or not metric_key: + raise TypeError("Metric keys in `metrics` must be non-empty strings.") + if metric_cfg_any is None: + metric_cfg: dict[str, Any] = {} + elif isinstance(metric_cfg_any, Mapping): + metric_cfg = dict(metric_cfg_any) + else: + raise TypeError(f"`metrics.{metric_key}` must be a mapping/dict.") + + # --- Metric-specific instantiation rules + if metric_key in {"entity_align", "entity_alignment"}: + entity_cfg_dict = _resolve_entity_alignment_config(metric_cfg, named_entity_alignment) + # Allow `reference_kg_path` convenience here too + if "reference_kg_path" in entity_cfg_dict and "reference_kg" not in entity_cfg_dict: + ref_path = Path(entity_cfg_dict.pop("reference_kg_path")) + entity_cfg_dict["reference_kg"] = _kg_from_path(ref_path) + # Backward compatible: accept `reference_kg: "/path/to/file.nt"` in YAML + if isinstance(entity_cfg_dict.get("reference_kg"), (str, Path)): + entity_cfg_dict["reference_kg"] = _kg_from_path(Path(entity_cfg_dict["reference_kg"])) 
+ out[metric_key] = EntityAlignmentConfig.model_validate(entity_cfg_dict) + continue + + if metric_key in {"duplicates", "duplicate"}: + entity_cfg_dict = _resolve_entity_alignment_config(metric_cfg, named_entity_alignment) + if "reference_kg_path" in entity_cfg_dict and "reference_kg" not in entity_cfg_dict: + ref_path = Path(entity_cfg_dict.pop("reference_kg_path")) + entity_cfg_dict["reference_kg"] = _kg_from_path(ref_path) + if isinstance(entity_cfg_dict.get("reference_kg"), (str, Path)): + entity_cfg_dict["reference_kg"] = _kg_from_path(Path(entity_cfg_dict["reference_kg"])) + out[metric_key] = DuplicateConfig.model_validate( + { + "entity_alignment_config": EntityAlignmentConfig.model_validate(entity_cfg_dict), + } + ) + continue + + if metric_key in {"triple_alignment", "triple_align"}: + cfg_dict: dict[str, Any] = dict(metric_cfg) + entity_cfg_dict = _resolve_entity_alignment_config(metric_cfg, named_entity_alignment) + if "reference_kg_path" in entity_cfg_dict and "reference_kg" not in entity_cfg_dict: + ref_path = Path(entity_cfg_dict.pop("reference_kg_path")) + entity_cfg_dict["reference_kg"] = _kg_from_path(ref_path) + if isinstance(entity_cfg_dict.get("reference_kg"), (str, Path)): + entity_cfg_dict["reference_kg"] = _kg_from_path(Path(entity_cfg_dict["reference_kg"])) + cfg_dict["entity_alignment_config"] = EntityAlignmentConfig.model_validate(entity_cfg_dict) + + # Allow YAML to specify a path rather than an in-memory KG object + if "reference_kg_path" in cfg_dict and "reference_kg" not in cfg_dict: + ref_path = Path(cfg_dict.pop("reference_kg_path")) + cfg_dict["reference_kg"] = _kg_from_path(ref_path) + + out[metric_key] = TripleAlignmentConfig.model_validate(cfg_dict) + continue + + if metric_key in { + "consistency_violations", + "disjoint_domain", + "domain", + "range", + "relation_direction", + "datatype", + "datatype_format", + }: + cfg_dict = dict(metric_cfg) + if "reference_kg_path" in cfg_dict and "reference_kg" not in cfg_dict: + ref_path 
= Path(cfg_dict.pop("reference_kg_path")) + cfg_dict["reference_kg"] = _kg_from_path(ref_path) + if isinstance(cfg_dict.get("reference_kg"), (str, Path)): + cfg_dict["reference_kg"] = _kg_from_path(Path(cfg_dict["reference_kg"])) + out[metric_key] = ConsistencyViolationsConfig.model_validate(cfg_dict) + continue + + raise KeyError( + f"Unknown metric key {metric_key!r} in config. " + "Add it to `kgpipe_eval.config.manager.load_metric_configs`." + ) + + return out + + +def generate_default_config_dict() -> dict[str, Any]: + """ + Generate a complete default YAML config structure for all supported metric configs. + + This is intended as a *template* for users. Required values are filled with the + placeholder string `""`. + """ + # Shared sub-config defaults + entity_alignment_default = { + "method": "label_embedding", + # Prefer a path-based template: avoids embedding runtime `KG` objects into YAML. + "verified_entities_path": REQUIRED, + "verified_entities_delimiter": EntityAlignmentConfig.model_fields["verified_entities_delimiter"].default, + "entity_sim_threshold": EntityAlignmentConfig.model_fields["entity_sim_threshold"].default, + } + + return { + "entity_alignment_configs": { + "default": entity_alignment_default, + }, + "metrics": { + # Standalone metric uses EntityAlignmentConfig directly via a ref. + "entity_align": { + "entity_alignment_config_ref": "default", + }, + "duplicates": { + "entity_alignment_config_ref": "default", + }, + "triple_alignment": { + "reference_kg_path": REQUIRED, + "entity_alignment_config_ref": "default", + "value_sim_threshold": TripleAlignmentConfig.model_fields["value_sim_threshold"].default, + }, + # Consistency config currently requires both fields at type-level; + # template includes both so users can fill in one/both. 
+ "consistency_violations": { + "reference_kg_path": REQUIRED, + "ontology_path": REQUIRED, + }, + }, + } + + +def generate_default_config_yaml() -> str: + """ + Return a YAML string (template) for `load_metric_configs`. + """ + cfg = generate_default_config_dict() + # Keep output stable and readable. + return yaml.safe_dump(cfg, sort_keys=False, default_flow_style=False) + + +def write_default_config_yaml(path: str | Path) -> Path: + """ + Write a default template YAML to disk and return the written path. + """ + out_path = Path(path) + out_path.write_text(generate_default_config_yaml(), encoding="utf-8") + return out_path + diff --git a/src/kgpipe_eval/evaluator.py b/src/kgpipe_eval/evaluator.py new file mode 100644 index 0000000..ec2f045 --- /dev/null +++ b/src/kgpipe_eval/evaluator.py @@ -0,0 +1,68 @@ +from __future__ import annotations + +import inspect +from dataclasses import dataclass +from typing import Any, Dict, Iterable, List, Mapping, Sequence +import traceback + +from kgpipe_eval.api import Metric, MetricResult +from kgpipe_eval.utils.kg_utils import TripleGraph + + +def _metric_key(metric: Metric) -> str: + return getattr(metric, "key", metric.__class__.__name__) + + +@dataclass +class Evaluator: + """ + Execute multiple metrics against a KG and pass the right config (if any). 
+ """ + + def run( + self, + kg: TripleGraph, + metrics: Sequence[Metric], + confs: Mapping[str, Any] | None = None, + ) -> List[MetricResult]: + confs = dict(confs or {}) + results: List[MetricResult] = [] + + for metric in metrics: + key = _metric_key(metric) + cfg = confs.get(key, confs.get(key.lower())) + + compute = getattr(metric, "compute", None) + if compute is None: + raise TypeError(f"Metric {key!r} has no compute() method") + + sig = inspect.signature(compute) + # Bound method: typically (kg) or (kg, config) + params = [ + p for p in sig.parameters.values() + if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD) + ] + + try: + if len(params) <= 1: + # compute(self) or compute(self, kg) -- call without config + res = compute(kg) if len(params) == 1 else compute() + else: + # compute(self, kg, config, ...) + if cfg is None: + raise KeyError( + f"Missing config for metric {key!r}. " + f"Provide `confs[{key!r}]`." + ) + res = compute(kg, cfg) + except Exception as e: + print(f"Failed running metric {key!r}: {e}") + print(traceback.format_exc()) + raise RuntimeError(f"Failed running metric {key!r}") from e + + if not isinstance(res, MetricResult): + raise TypeError(f"Metric {key!r} returned {type(res)!r}, expected MetricResult") + results.append(res) + + return results + diff --git a/src/kgpipe_eval/metrics/__init__.py b/src/kgpipe_eval/metrics/__init__.py new file mode 100644 index 0000000..fb29624 --- /dev/null +++ b/src/kgpipe_eval/metrics/__init__.py @@ -0,0 +1,132 @@ +from .statistics import CountMetric +from .triple_alignment import TripleAlignmentMetric +from .entity_alignment import EntityAlignmentMetric +from .duplicates import DuplicateMetric +from .consistency_violations import ( + DisjointDomainMetric, + DomainMetric, + RangeMetric, + RelationDirectionMetric, + DatatypeMetric, + DatatypeFormatMetric, +) + +__all__ = [ + "CountMetric", + "TripleAlignmentMetric", + "EntityAlignmentMetric", + "DuplicateMetric", + "DisjointDomainMetric", + 
"DomainMetric", + "RangeMetric", + "RelationDirectionMetric", + "DatatypeMetric", + "DatatypeFormatMetric", +] + +# @dataclass(frozen=True) +# class BinaryClassificationStats: +# tp: int +# fp: int +# tn: int +# fn: int + +# def recall(self) -> float: +# d = self.tp + self.fn +# return self.tp / d if d else 0.0 + +# def precision(self) -> float: +# d = self.tp + self.fp +# return self.tp / d if d else 0.0 + +# def f1(self) -> float: +# p = self.precision() +# r = self.recall() +# return 2 * p * r / (p + r) if (p + r) else 0.0 + +# def accuracy(self) -> float: +# d = self.tp + self.fp + self.tn + self.fn +# return (self.tp + self.tn) / d if d else 0.0 + +# def reference_binary_classification(kg: KgKg, config: MetricConfig) -> MetricResult: +# stats = BinaryClassificationStats(tp=10, fp=5, tn=15, fn=3) +# return MetricResult( +# metric_key="reference_binary_classification", +# summary="Reference comparison computed", +# measurements=[ +# Measurement("tp", stats.tp), +# Measurement("fp", stats.fp), +# Measurement("tn", stats.tn), +# Measurement("fn", stats.fn), +# Measurement("precision", stats.precision(), "ratio"), +# Measurement("recall", stats.recall(), "ratio"), +# Measurement("f1", stats.f1(), "ratio"), +# Measurement("accuracy", stats.accuracy(), "ratio"), +# ], +# ) + +# def graph_size(kg: KgKg, config: MetricConfig) -> MetricResult: +# size = 1532 +# return MetricResult( +# metric_key="graph_size", +# measurements=[ +# Measurement("triple_count", size, "triples") +# ], +# summary=f"Graph contains {size} triples", +# ) + +# def entity_duplication_rate(kg: KgKg, config: MetricConfig) -> MetricResult: +# duplicates = 7 +# total = 100 +# rate = duplicates / total if total else 0.0 +# return MetricResult( +# metric_key="entity_duplication_rate", +# measurements=[ +# Measurement("duplication_rate", rate, "ratio"), +# Measurement("duplicate_entities", duplicates, "entities"), +# Measurement("total_entities", total, "entities"), +# ], +# summary=f"Entity duplication 
rate: {rate:.2%}", +# ) +# --- + +# class BinaryClassifier(): +# tp: int +# fp: int +# tn: int +# fn: int + +# def recall(self) -> float: +# return self.tp / (self.tp + self.fn) + +# def precision(self) -> float: +# return self.tp / (self.tp + self.fp) + +# def f1(self) -> float: +# return 2 * self.precision() * self.recall() / (self.precision() + self.recall()) + +# def accuracy(self) -> float: +# return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn) + +# @lru_cache +# def compute_binary_classifier(kg: KgKg) -> BinaryClassifier: +# return BinaryClassifier(tp=10, fp=5, tn=15, fn=3) + + +# # Option 1 the metrics are recall, precision, f1, accuracy +# def reference_recall(kg: KgKg, reference: KgKg) -> KgMetricResult: +# binary_classifier = compute_binary_classifier(kg, reference) +# return KgMetricResult(summary=f"Reference recall: {binary_classifier.recall()}") + +# def reference_precision(kg: KgKg, reference: KgKg) -> KgMetricResult: +# binary_classifier = compute_binary_classifier(kg, reference) +# return KgMetricResult(summary=f"Reference precision: {binary_classifier.precision()}") + +# def reference_f1(kg: KgKg, reference: KgKg) -> KgMetricResult: +# binary_classifier = compute_binary_classifier(kg, reference) +# return KgMetricResult(summary=f"Reference F1: {binary_classifier.f1()}") + +# #Option 2 the metrics are Binary Classification which allows for more detailed analysis +# def reference_binary_classification(kg: KgKg, reference: KgKg) -> KgMetricResult: +# binary_classifier = compute_binary_classifier(kg, reference) +# return KgMetricResult(summary=f"Reference binary classification: {binary_classifier.tp}, {binary_classifier.fp}, {binary_classifier.tn}, {binary_classifier.fn}") \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/consistency_violations.py b/src/kgpipe_eval/metrics/consistency_violations.py new file mode 100644 index 0000000..7bbc1ae --- /dev/null +++ b/src/kgpipe_eval/metrics/consistency_violations.py @@ -0,0 
+1,664 @@ +from kgpipe_eval.api import Metric, MetricResult, Measurement + +from pydantic import BaseModel, model_validator, ConfigDict +from kgpipe.common import KG +from pathlib import Path +from kgpipe_eval.utils.kg_utils import TripleGraph +from typing import Dict, Set, Optional + +from rdflib import URIRef, RDF, Literal, Graph, XSD +from rdflib.query import Result, ResultRow + +from kgcore.api.ontology import Ontology, OntologyUtil +from tqdm import tqdm + +def get_ontology_graph(ontology_path: Optional[Path], kg: KG) -> Graph: + if ontology_path is not None: + return Graph().parse(ontology_path) + elif kg is not None: + return kg.get_ontology_graph() + + +def enrich_type_information(graph: Graph, ontology: Ontology, type_property: URIRef = RDF.type) -> Graph: + type_dict = {} + + new_graph = Graph() + + for s, p, o in graph: + domain, range = ontology.get_domain_range(str(p)) + if domain and isinstance(s, URIRef): + if str(s) not in type_dict: + type_dict[str(s)] = [] + type_dict[str(s)].append(str(domain)) + if range and isinstance(o, URIRef): + if str(o) not in type_dict: + type_dict[str(o)] = [] + type_dict[str(o)].append(str(range)) + new_graph.add((s, p, o)) + + for uri, types in type_dict.items(): + for type in types: + new_graph.add((URIRef(uri), type_property, URIRef(type))) + return new_graph + +class ConsistencyViolationsConfig(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + reference_kg: Optional[KG] = None + ontology_path: Optional[Path] = None + + @model_validator(mode="after") + def _require_reference_kg_or_ontology_path(self): + if self.reference_kg is None and self.ontology_path is None: + raise ValueError("Provide either `reference_kg` or `ontology_path`.") + return self + +class DisjointDomainMetric(Metric): + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute disjoint domain score.""" + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = 
get_ontology_graph(config.ontology_path, config.reference_kg) + ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + for s, p, o in ontology_graph.triples((None, None, None)): + graph.add((s, p, o)) + + # Get all disjoint domains + disjoint_domains_qr: Result = graph.query( + """ + SELECT DISTINCT ?subject + WHERE { + ?subject a ?disjointDomain1 . + ?subject a ?disjointDomain2 . + ?disjointDomain1 owl:disjointWith ?disjointDomain2 . + } + """ + ) + subjects_with_disjoint_domains = set([row["subject"] for row in disjoint_domains_qr if isinstance(row, ResultRow)]) + + subjects = set([str(s) for s in graph.subjects()]) + + return MetricResult( + metric=self, + measurements=[ + Measurement(name="subjects_with_disjoint_domains", value=len(subjects_with_disjoint_domains), unit="number"), + Measurement(name="subjects", value=len(subjects), unit="number"), + Measurement(name="normalized_score", value=1.0 - (len(subjects_with_disjoint_domains) / len(subjects)), unit="ratio"), + ], + summary=f"Number of subjects with disjoint domains: {len(subjects_with_disjoint_domains)}", + ) + +class DomainMetric(Metric): + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute incorrect relation domain score. 
+ + TODO: check if this is correct for increment eval if namespace changes to former generic namespace not seed + """ + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = get_ontology_graph(config.ontology_path, config.reference_kg) + ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + # disjoint class by class + disjoint_class_by_class : Dict[str, Set[str]] = {} + for class_ in ontology.classes: + if class_.disjointWith is not None: + disjoint_class_by_class[class_.uri] = class_.disjointWith + else: + disjoint_class_by_class[class_.uri] = set() + + + def is_subject_type(o, type): + # print(o, type) + if isinstance(o, URIRef): + types = [str(t) for _, _, t in graph.triples((o, RDF.type, None))] + return type in types and not any(str(other_type) in disjoint_class_by_class.get(str(type), set()) for other_type in types) + elif isinstance(o, Literal): + return o.datatype == type + else: + return False + + domain_by_property = {} + for property in ontology.properties: + if property.domain is not None: + domain_by_property[property.uri] = property.domain.uri + else: + print(f"Property {property.uri} has no domain") + domain_by_property[property.uri] = "TODO" + + incorrect_relation_domain = 0 + correct_relation_domain = 0 + + for s, p, o in graph.triples((None, None, None)): + if str(p) in domain_by_property: + if is_subject_type(s, domain_by_property[str(p)]): + correct_relation_domain += 1 + else: + incorrect_relation_domain += 1 + + if incorrect_relation_domain + correct_relation_domain > 0: + normalized_score = 1.0 - (incorrect_relation_domain / (incorrect_relation_domain + correct_relation_domain)) + else: + normalized_score = 0.0 + + return MetricResult( + metric=self, + measurements=[ + Measurement(name="incorrect_relation_domain", value=incorrect_relation_domain, unit="number"), + Measurement(name="correct_relation_domain", value=correct_relation_domain, unit="number"), + 
Measurement(name="normalized_score", value=normalized_score, unit="ratio"), + ], + summary=f"Number of incorrect relation domain: {incorrect_relation_domain}", + # name=self.name, + # value=incorrect_relation_domain, + # normalized_score=normalized_score, + # details={"incorrect_relation_domain": incorrect_relation_domain, "correct_relation_domain": correct_relation_domain}, + # aspect=self.aspect + ) + +class RangeMetric(Metric): + + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute incorrect relation range score.""" + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = get_ontology_graph(config.ontology_path, config.reference_kg) + ontology : Ontology= OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + # disjoint class by class + disjoint_class_by_class : Dict[str, Set[str]] = {} + for class_ in ontology.classes: + if class_.disjointWith is not None: + disjoint_class_by_class[class_.uri] = class_.disjointWith + else: + disjoint_class_by_class[class_.uri] = set() + + def is_object_type(o, type): + # print(o, type) + if isinstance(o, URIRef): + types = [str(t) for s, p, t in graph.triples((o, RDF.type, None))] + # if str(type) not in types: + # print(f"Incorrect relation range {types} of {o} for property {p} with range {types}") + return str(type) in types and not any(str(other_type) in disjoint_class_by_class.get(str(type), set()) for other_type in types) + elif isinstance(o, Literal): + datatype = o.datatype + if not datatype: + datatype = str(XSD.string) + return str(datatype) == str(type) + else: + return False + + + range_by_property = {} + for property in ontology.properties: + if property.range is not None: + range_by_property[property.uri] = property.range.uri + else: + # print(f"Property {property.uri} has no range") + range_by_property[property.uri] = None + + incorrect_relation_range = 0 + correct_relation_range = 0 + + for s, p, o in 
graph.triples((None, None, None)): + if str(p) in range_by_property: + if is_object_type(o, range_by_property[str(p)]): + correct_relation_range += 1 + else: + # print(f"Incorrect relation range {o if isinstance(o, URIRef) else o.datatype} for property {p} with range {range_by_property[str(p)]}") + incorrect_relation_range += 1 + + normalized_score = 1.0 - (incorrect_relation_range / (incorrect_relation_range + correct_relation_range)) if incorrect_relation_range + correct_relation_range > 0 else 1.0 + """Compute incorrect relation range score.""" + return MetricResult( + metric=self, + measurements=[ + Measurement(name="incorrect_relation_range", value=incorrect_relation_range, unit="number"), + Measurement(name="correct_relation_range", value=correct_relation_range, unit="number"), + Measurement(name="normalized_score", value=normalized_score, unit="ratio"), + ], + summary=f"Number of incorrect relation range: {incorrect_relation_range}", + # name=self.name, + # value=incorrect_relation_range, + # normalized_score=normalized_score, + # details={"incorrect_relation_range": incorrect_relation_range, "correct_relation_range": correct_relation_range}, + # aspect=self.aspect + ) + +class RelationDirectionMetric(Metric): + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute incorrect relation direction score.""" + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = get_ontology_graph(config.ontology_path, config.reference_kg) + ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + if len(ontology_graph) == 0: + ontology_graph = graph + print(f"INFO: ontology_graph is empty, using graph instead") + + # TODO use ontology implementation from framework + predicate_defs_sr = ontology_graph.query( + """ + SELECT DISTINCT ?predicate ?domain ?range + WHERE { + ?predicate rdfs:domain ?domain . + ?predicate rdfs:range ?range . 
+ } + """ + ) + + # def check_type(uri, type): + # result = graph.query( + # """ + # SELECT ?uri + # WHERE { + # ?uri a ?type . + # } + # """, + # initBindings={"uri": uri, "type": type} + # ) + # return len(result) > 0 + + predicate_defs = {} + for row in predicate_defs_sr: + predicate_defs[str(row["predicate"])] = (str(row["domain"]), str(row["range"])) + + incorrect_relation_direction = 0 + correct_relation_direction = 0 + + entity_types = {} + for s, p, o in graph.triples((None, RDF.type, None)): + if str(s) not in entity_types: + entity_types[str(s)] = [] + entity_types[str(s)].append(str(o)) + + for s, p, o in tqdm(graph, desc="Checking relation direction"): + if str(s) not in entity_types: + continue + if str(p) in predicate_defs: + domain, range = predicate_defs[str(p)] + + if isinstance(o, URIRef): + if not str(s) in entity_types: + # print(f"Skipping s {s} because it is not in entity_types") + continue + if not str(o) in entity_types: + # print(f"Skipping o {o} because it is not in entity_types") + continue + if domain in entity_types[str(s)] and range in entity_types[str(o)]: + correct_relation_direction += 1 + if domain in entity_types[str(o)] and range in entity_types[str(s)]: + incorrect_relation_direction += 1 + + # print("incorrect_relation_direction", incorrect_relation_direction) + # print("correct_relation_direction", correct_relation_direction) + + if incorrect_relation_direction + correct_relation_direction > 0: + normalized_score = incorrect_relation_direction / (incorrect_relation_direction + correct_relation_direction) + normalized_score = 1.0 - normalized_score + else: + normalized_score = 0.0 + + return MetricResult( + metric=self, + measurements=[ + Measurement(name="incorrect_relation_direction", value=incorrect_relation_direction, unit="number"), + Measurement(name="correct_relation_direction", value=correct_relation_direction, unit="number"), + Measurement(name="normalized_score", value=normalized_score, unit="ratio"), + ], + 
summary=f"Number of incorrect relation direction: {incorrect_relation_direction}", + # name=self.name, + # value=incorrect_relation_direction, + # normalized_score=normalized_score, + # details={ + # "incorrect_relation_direction": incorrect_relation_direction, + # "correct_relation_direction": correct_relation_direction, + # "possible_relations": predicate_defs, + # "size_ontology_graph": len(ontology_graph) + # }, + # aspect=self.aspect + ) + +class DatatypeMetric(Metric): + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute incorrect datatype score.""" + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = get_ontology_graph(config.ontology_path, config.reference_kg) + ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + def is_object_type(o, type): + # print(o, type) + if isinstance(o, URIRef): + types = [str(t) for s, p, t in graph.triples((o, RDF.type, None))] + # if str(type) not in types: + # print(f"Incorrect relation range {types} of {o} for property {p} with range {types}") + return str(type) in types + elif isinstance(o, Literal): + datatype = o.datatype + if not datatype: + datatype = str(XSD.string) + return str(datatype) == str(type) + else: + return False + + # def is_object_type(o, type): + # # print(o, type) + # if isinstance(o, URIRef): + # types = [str(t) for s, p, t in graph.triples((o, RDF.type, None))] + # return type in types + # elif isinstance(o, Literal): + # return str(o.datatype) == type + # else: + # return False + + range_by_property = {} + for property in ontology.properties: + if property.range is not None: + range_by_property[property.uri] = property.range.uri + else: + print(f"Property {property.uri} has no range") + range_by_property[property.uri] = "TODO" + + incorrect_datatype = 0 + correct_datatype = 0 + + for s, p, o in graph.triples((None, None, None)): + if str(p) in range_by_property: + if isinstance(o, 
Literal): + if not str(p) in range_by_property or is_object_type(o, range_by_property[str(p)]): + correct_datatype += 1 + else: + incorrect_datatype += 1 + # print(f"Incorrect datatype {o.datatype} for property {p} with range {range_by_property[str(p)]}") + + normalized_score = 1.0 - (incorrect_datatype / (incorrect_datatype + correct_datatype)) if incorrect_datatype + correct_datatype > 0 else 0.0 + + return MetricResult( + metric=self, + measurements=[ + Measurement(name="incorrect_datatype", value=incorrect_datatype, unit="number"), + Measurement(name="correct_datatype", value=correct_datatype, unit="number"), + Measurement(name="normalized_score", value=normalized_score, unit="ratio"), + ], + summary=f"Number of incorrect datatype: {incorrect_datatype}", + # name=self.name, + # value=incorrect_datatype, + # normalized_score=1.0 - (incorrect_datatype / (incorrect_datatype + correct_datatype)) if incorrect_datatype + correct_datatype > 0 else 0.0, + # details={"incorrect_datatype": incorrect_datatype, "correct_datatype": correct_datatype}, + # aspect=self.aspect + ) + +class DatatypeFormatMetric(Metric): + def compute(self, kg: TripleGraph, config: ConsistencyViolationsConfig): + """Compute incorrect datatype format score.""" + + from kgpipe.evaluation.aspects.func.datatype_validator import validate_datatype + + raw_graph: Graph = kg.get_graph() + ontology_graph: Graph = get_ontology_graph(config.ontology_path, config.reference_kg) + ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + graph = enrich_type_information(raw_graph, ontology) + + def is_object_type(o, type): + # print(o, type) + if isinstance(o, URIRef): + types = [str(t) for s, p, t in graph.triples((o, RDF.type, None))] + return type in types + elif isinstance(o, Literal): + return str(o.datatype) == type + else: + return False + + range_by_property = {} + for property in ontology.properties: + if property.range is not None: + range_by_property[property.uri] = property.range.uri + 
else: + print(f"Property {property.uri} has no range") + range_by_property[property.uri] = "TODO" + + incorrect_datatype = 0 + correct_datatype = 0 + + for s, p, o in graph.triples((None, None, None)): + if str(p) in range_by_property: + if isinstance(o, Literal): + if str(p) in range_by_property: + if validate_datatype(str(o), range_by_property[str(p)]): + # print(f"Correct datatype {o.datatype} for property {p} and value {o} with range {range_by_property[str(p)]}") + correct_datatype += 1 + else: + # print(f"Incorrect datatype {p} \'{o}\' {range_by_property[str(p)]}") + incorrect_datatype += 1 + else: + print(f"Property {p} has no range") + # if not str(p) in range_by_property: + # print(f"Property {p} has no range") + # # or validate_datatype(str(o), range_by_property[str(p)]): + # # print(f"Correct datatype {o.datatype} for property {p} and value {o} with range {range_by_property[str(p)]}") + # correct_datatype += 1 + # else: + # incorrect_datatype += 1 + + if incorrect_datatype + correct_datatype > 0: + normalized_score = 1.0 - (incorrect_datatype / (incorrect_datatype + correct_datatype)) + else: + normalized_score = 0.0 + + return MetricResult( + metric=self, + measurements=[ + Measurement(name="incorrect_datatype", value=incorrect_datatype, unit="number"), + Measurement(name="correct_datatype", value=correct_datatype, unit="number"), + Measurement(name="normalized_score", value=normalized_score, unit="ratio"), + ], + summary=f"Number of incorrect datatype: {incorrect_datatype}", + # name=self.name, + # value=incorrect_datatype, + # normalized_score=normalized_score, + # details={"incorrect_datatype": incorrect_datatype, "correct_datatype": correct_datatype}, + # aspect=self.aspect + ) + + +# @Registry.metric() +# class OntologyClassCoverageMetric(Metric): +# """Check if the KG has correct class coverage.""" +# def __init__(self): +# super().__init__( +# name="ontology_class_coverage", +# description="Check if the KG has correct class coverage", +# 
aspect=EvaluationAspect.SEMANTIC +# ) + +# def compute(self, kg: KG, config: SemanticConfig, **kwargs) -> MetricResult: +# """Compute ontology class coverage score.""" + +# raw_graph: Graph = kg.get_graph() +# ontology_graph: Graph = kg.get_ontology_graph() +# ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) +# graph = enrich_type_information(raw_graph, ontology) + +# expected_classes = set([c.uri for c in ontology.classes if not c.uri.startswith(str(OWL))]) + +# found_classes = set(str(o) for s, p, o in graph.triples((None, RDF.type, None)) if not str(o).startswith(str(OWL))) + +# true_positive = len(expected_classes & found_classes) +# false_positive = len(found_classes - expected_classes) +# false_negative = len(expected_classes - found_classes) + +# precision = true_positive / (true_positive + false_positive) if true_positive + false_positive > 0 else 0.0 +# recall = true_positive / (true_positive + false_negative) if true_positive + false_negative > 0 else 0.0 +# f1_score = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + +# return MetricResult( +# name=self.name, +# value=true_positive, +# normalized_score=f1_score, +# details={"true_positive": true_positive, "false_positive": false_positive, "false_negative": false_negative}, +# aspect=self.aspect +# ) + +# @Registry.metric() +# class OntologyRelationCoverageMetric(Metric): +# """Check if the KG has correct relation coverage.""" +# def __init__(self): +# super().__init__( +# name="ontology_relation_coverage", +# description="Check if the KG has correct relation coverage", +# aspect=EvaluationAspect.SEMANTIC +# ) + +# def compute(self, kg: KG, config: SemanticConfig, **kwargs) -> MetricResult: +# """Compute ontology relation coverage score.""" + +# raw_graph: Graph = kg.get_graph() +# ontology_graph: Graph = kg.get_ontology_graph() +# ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) +# graph = enrich_type_information(raw_graph, ontology) + +# 
NOT_FILTER: List[str] = [str(OWL), str(RDF), str(RDFS)] + +# expected_relations = set([r.uri for r in ontology.properties]) +# expected_relations = set([r for r in expected_relations if not any(filter(lambda x: r.startswith(x), NOT_FILTER))]) + +# # print(expected_relations) + +# found_relations = set(str(p) for _, p, _ in graph.triples((None, None, None))) +# def filter_relation(r): +# return any(filter(lambda x: r.startswith(x), NOT_FILTER)) +# found_relations = set([r for r in found_relations if not filter_relation(r)]) + +# # print(found_relations) + +# true_positive = len(expected_relations & found_relations) +# false_positive = len(found_relations - expected_relations) +# false_negative = len(expected_relations - found_relations) + +# precision = true_positive / (true_positive + false_positive) if true_positive + false_positive > 0 else 0.0 +# recall = true_positive / (true_positive + false_negative) if true_positive + false_negative > 0 else 0.0 +# f1_score = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + +# return MetricResult( +# name=self.name, +# value=true_positive, +# normalized_score=f1_score, +# details={"true_positive": true_positive, "false_positive": false_positive, "false_negative": false_negative, "missing": (expected_relations - found_relations)}, +# aspect=self.aspect +# ) + +# @Registry.metric() +# class OntologyPropertyCoverageMetric(Metric): +# """Check if the KG has correct property coverage.""" +# def __init__(self): +# super().__init__( +# name="ontology_property_coverage", +# description="Check if the KG has correct property coverage", +# aspect=EvaluationAspect.SEMANTIC +# ) + +# def compute(self, kg: KG, config: SemanticConfig, **kwargs) -> MetricResult: +# """Compute ontology property coverage score.""" +# return MetricResult( +# name=self.name, +# value=0.0, +# normalized_score=1.0, +# details={"error": "Not implemented"}, +# aspect=self.aspect +# ) + +# @Registry.metric() +# class 
OntologyNamespaceCoverageMetric(Metric): +# """Check if the KG has correct namespace coverage.""" +# def __init__(self): +# super().__init__( +# name="ontology_namespace_coverage", +# description="Check if the KG has correct namespace coverage", +# aspect=EvaluationAspect.SEMANTIC +# ) + +# def compute(self, kg: KG, config: SemanticConfig, **kwargs) -> MetricResult: +# """Compute ontology namespace coverage score.""" + +# # graph = kg.get_graph() +# # ontology_graph = kg.get_ontology_graph() +# # if len(ontology_graph) == 0: +# # ontology_graph = graph + +# # ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + + +# return MetricResult( +# name=self.name, +# value=0.0, +# normalized_score=1.0, +# details={"error": "Not implemented"}, +# aspect=self.aspect +# ) + +# class OntologyClassCoverageMetric(): +# pass + +# class OntologyRelationCoverageMetric(): +# pass + +# class OntologyNamespaceCoverageMetric(): +# pass + +# Cardinality Metric + # """Compute incorrect relation cardinality score.""" + + # raw_graph: Graph = kg.get_graph() + # ontology_graph: Graph = kg.get_ontology_graph() + # ontology = OntologyUtil.load_ontology_from_graph(ontology_graph) + # graph = enrich_type_information(raw_graph, ontology) + # if len(ontology_graph) == 0: + # ontology_graph = graph + + # cardinality_by_property = {} + # property_cardinalities: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int)) + # properties_in_graph = set() + + # for s, p, o in graph.triples((None, None, None)): + # properties_in_graph.add(str(p)) + + # for property in properties_in_graph: + # cardinality_by_property[property] = get_property_cardinality(ontology_graph, property) + + # # print(cardinality_by_property) + # # print(property_cardinalities) + + # for s, p, o in graph.triples((None, None, None)): + # if str(p) in cardinality_by_property: + # if str(s) in property_cardinalities[str(p)]: + # property_cardinalities[str(p)][str(s)] += 1 + # else: + # 
property_cardinalities[str(p)][str(s)] = 1 + + # incorrect_cardinality = 0 + # correct_cardinality = 0 + + # for property, cardinality in property_cardinalities.items(): + # min, max = cardinality_by_property[property] + # for subject, count in cardinality.items(): + # if count > max: + # incorrect_cardinality += 1 + # elif count < min: + # incorrect_cardinality += 1 + # else: + # correct_cardinality += 1 + + # return MetricResult( + # name=self.name, + # value=incorrect_cardinality, + # normalized_score=1.0 - (incorrect_cardinality / (incorrect_cardinality + correct_cardinality)) if incorrect_cardinality + correct_cardinality > 0 else 0.0, + # details={"incorrect_cardinality": incorrect_cardinality, "correct_cardinality": correct_cardinality}, + # aspect=self.aspect + # ) \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/duplicates.py b/src/kgpipe_eval/metrics/duplicates.py new file mode 100644 index 0000000..869f3ff --- /dev/null +++ b/src/kgpipe_eval/metrics/duplicates.py @@ -0,0 +1,71 @@ +from kgpipe_eval.utils.alignment_utils import EntityAlignment, align_entities_by_label_embedding, EntityAlignmentConfig +from kgpipe_eval.api import Metric, MetricResult, Measurement +from kgpipe_eval.utils.kg_utils import Term, TripleGraph + +from pydantic import BaseModel, ConfigDict +from kgpipe.common import KG +import numpy as np + +DEBUG = False + +class DuplicateConfig(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + entity_alignment_config: EntityAlignmentConfig + +class DuplicateMeasures(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + duplicates: int + total_references: int + already_matched_references: set[Term] + +def eval_duplicates(kg: TripleGraph, config: DuplicateConfig): + """ + checks expected & integrated source entity overlap using label embeddings + """ + + alignments : list[EntityAlignment] = align_entities_by_label_embedding(kg, config.entity_alignment_config) + + duplicates = set() + 
+ already_matched_references = set() + + for alignment in alignments: + if alignment.target in already_matched_references: + duplicates.add(alignment.target) + already_matched_references.add(alignment.target) + + if DEBUG: + print("Duplicates:") + for alignment in alignments: + if alignment.target in duplicates: + print(alignment.target, alignment.source, alignment.score) + + return duplicates + +class DuplicateMetric(Metric): + def compute(self, kg: TripleGraph, config: DuplicateConfig): + duplicates = eval_duplicates(kg, config) + entity_count = len(list(kg.entities())) + return MetricResult( + metric=self, + measurements=[ + Measurement(name="duplicates", value=len(duplicates), unit="number"), + Measurement(name="entity_count", value=entity_count, unit="number"), + Measurement(name="duplicates_ratio", value=len(duplicates) / entity_count if entity_count > 0 else 0.0, unit="percentage"), + ], + summary=f"Duplicates in the KG" + ) + +# find all duplicate entities in the KG +# using +# - reference KG +# - fuzzy matching +# - exact matching +# - semantic matching +# - clustering +# - ... 
+# return a list of duplicate entities +# return a list of duplicate entities with the matching score +# return a list of duplicate entities with the matching score and the matching type +# return a list of duplicate entities with the matching score and the matching type and the matching details +# return a list of duplicate entities with the matching score and the matching type and the matching details and the matching details +# return a list of duplicate entities with the matching score and the matching type and the matching details and the matching details and the matching details \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/entity_alignment.py b/src/kgpipe_eval/metrics/entity_alignment.py new file mode 100644 index 0000000..d6237cc --- /dev/null +++ b/src/kgpipe_eval/metrics/entity_alignment.py @@ -0,0 +1,163 @@ +from kgpipe.common import KG + +from kgpipe_eval.api import Metric, Measurement, MetricResult + +from kgpipe_eval.utils.measurement_utils import BCMeasurement +from kgpipe_eval.utils.alignment_utils import align_entities_by_label_embedding, EntityAlignmentConfig, load_entity_uri_label_type_pairs, get_entity_uri_label_typeset_pairs, get_entity_uri_label_type_pairs + +# Core Interface + +def eval_entity_alignment(kg: KG, config: EntityAlignmentConfig): + if config.method == "label_embedding": + alignments = eval_entity_alignment_by_label_embedding(kg, config) + elif config.method == "label_alias_embedding": + alignments = eval_entity_alignment_by_label_alias_embedding(kg, config) + elif config.method == "label_embedding_and_type": + alignments = eval_entity_alignment_by_label_embedding_and_type(kg, config) + elif config.method == "label_embedding_and_intersecting_type": + alignments = eval_entity_alignment_by_label_embedding_and_intersecting_type(kg, config) + else: + raise ValueError(f"Invalid method: {config.method}") + return alignments + +# Specific Implementations + +def eval_entity_alignment_by_label_embedding_and_type(kg: KG, 
config: EntityAlignmentConfig): + alignments = align_entities_by_label_embedding(kg, config) + + ref_entity_uri_label_type_pairs = load_entity_uri_label_type_pairs(config) + gen_entity_uri_label_type_pairs = list(get_entity_uri_label_type_pairs(kg, config.ignored_entities)) + + # print ref and gen pairs for testing + # print("--------------------------------") + # print("ref_entity_uri_label_type_pairs") + # for pair in ref_entity_uri_label_type_pairs: + # print(pair) + # print("--------------------------------") + # print("gen_entity_uri_label_type_pairs") + # for pair in gen_entity_uri_label_type_pairs: + # print(pair) + # print("--------------------------------") + # print("alignments") + # for alignment in alignments: + # print(alignment) + + ref_types = {pair.uri: pair.type for pair in ref_entity_uri_label_type_pairs if pair.type is not None} + # TODO gen_types can be multiple types, we need to handle this + gen_types = {pair.uri: pair.type for pair in gen_entity_uri_label_type_pairs if pair.type is not None} + + filtered_alignments = [] + for alignment in alignments: + if alignment.target in ref_types and alignment.source in gen_types: + if ref_types[alignment.target] == gen_types[alignment.source]: + filtered_alignments.append(alignment) + + ref_uris = set(pair.uri for pair in ref_entity_uri_label_type_pairs) + gen_uris = set(pair.uri for pair in gen_entity_uri_label_type_pairs) + aligned_ref_uris = set(alignment.target for alignment in filtered_alignments) # alignment targets are reference entities + aligned_gen_uris = set(alignment.source for alignment in filtered_alignments) # alignment sources are generated entities + + tp = len(ref_uris & aligned_ref_uris) # reference entities matched by a generated entity + fp = len(gen_uris - aligned_gen_uris) # generated entities not aligned to any reference entity + tn = 0 + fn = len(ref_uris - aligned_ref_uris) # reference entities with no aligned generated entity + + return BCMeasurement( + tp=tp, + fp=fp, + tn=tn, + fn=fn + ) + +def eval_entity_alignment_by_label_embedding_and_intersecting_type(kg: 
KG, config: EntityAlignmentConfig): + # Debugging: print some information about the config + print("--------------------------------") + print("ignored_entities") + print(len(config.ignored_entities)) + print("--------------------------------") + + alignments = align_entities_by_label_embedding(kg, config) + + ref_entity_uri_label_type_pairs = load_entity_uri_label_type_pairs(config) + gen_entity_uri_label_type_pairs = list(get_entity_uri_label_typeset_pairs(kg, config.ignored_entities)) + + ref_types = {pair.uri: set([pair.type]) for pair in ref_entity_uri_label_type_pairs if pair.type is not None} + # TODO gen_types can be multiple types, we need to handle this + gen_types = {pair.uri: pair.type_set for pair in gen_entity_uri_label_type_pairs if pair.type_set is not None} + + filtered_alignments = [] + for alignment in alignments: + if alignment.target in ref_types and alignment.source in gen_types: + # Debugging: print the intersection of the reference and generated types + # print("---") + # print("alignment.target", alignment.target) + # print("alignment.source", alignment.source) + # print("ref_types[alignment.target]", ref_types[alignment.target]) + # print("gen_types[alignment.source]", gen_types[alignment.source]) + # print("intersection", ref_types[alignment.target] & gen_types[alignment.source]) + # print("---") + if len(ref_types[alignment.target] & gen_types[alignment.source]) > 0: + filtered_alignments.append(alignment) + + ref_uris = set(pair.uri for pair in ref_entity_uri_label_type_pairs) + gen_uris = set(pair.uri for pair in gen_entity_uri_label_type_pairs) + aligned_ref_uris = set(alignment.target for alignment in filtered_alignments) # alignment targets are reference entities + aligned_gen_uris = set(alignment.source for alignment in filtered_alignments) # alignment sources are generated entities + + tp = len(ref_uris & aligned_ref_uris) # reference entities matched by a generated entity + fp = len(gen_uris - aligned_gen_uris) # generated entities not aligned to any reference entity + tn = 0 + fn = len(ref_uris - aligned_ref_uris) # 
reference entities with no aligned generated entity + + return BCMeasurement( + tp=tp, + fp=fp, + tn=tn, + fn=fn + ) + + +def eval_entity_alignment_by_label_embedding(kg: KG, config: EntityAlignmentConfig): + alignments = align_entities_by_label_embedding(kg, config) + + ref_entity_uri_label_type_pairs = load_entity_uri_label_type_pairs(config) + gen_entity_uri_label_type_pairs = list(get_entity_uri_label_type_pairs(kg)) + + ref_uris = set(pair.uri for pair in ref_entity_uri_label_type_pairs) + gen_uris = set(pair.uri for pair in gen_entity_uri_label_type_pairs) + aligned_ref_uris = set(alignment.target for alignment in alignments) # alignment targets are reference entities + aligned_gen_uris = set(alignment.source for alignment in alignments) # alignment sources are generated entities + + tp = len(ref_uris & aligned_ref_uris) # reference entities matched by a generated entity + fp = len(gen_uris - aligned_gen_uris) # generated entities not aligned to any reference entity + tn = 0 + fn = len(ref_uris - aligned_ref_uris) # reference entities with no aligned generated entity + + return BCMeasurement( + tp=tp, + fp=fp, + tn=tn, + fn=fn + ) + +def eval_entity_alignment_by_label_alias_embedding(kg: KG, config: EntityAlignmentConfig): + raise NotImplementedError("Label alias embedding alignment is not implemented yet") + +# Metric Implementation + +class EntityAlignmentMetric(Metric): + def compute(self, kg: KG, config: EntityAlignmentConfig): + alignments: BCMeasurement = eval_entity_alignment(kg, config) + return MetricResult( + metric=self, + measurements=[ + Measurement(name="tp", value=alignments.tp, unit="number"), + Measurement(name="fp", value=alignments.fp, unit="number"), + Measurement(name="tn", value=alignments.tn, unit="number"), + Measurement(name="fn", value=alignments.fn, unit="number"), + Measurement(name="precision", value=alignments.precision(), unit="percentage"), + Measurement(name="recall", value=alignments.recall(), unit="percentage"), + Measurement(name="f1_score", value=alignments.f1_score(), unit="percentage"), + ], + 
summary=f"Entity alignment by {config.method}" + ) \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/llm_annotation.py b/src/kgpipe_eval/metrics/llm_annotation.py new file mode 100644 index 0000000..49b099a --- /dev/null +++ b/src/kgpipe_eval/metrics/llm_annotation.py @@ -0,0 +1,4 @@ +from kgpipe_eval.api import Metric + +class LLM_KgAccuracyMetric(Metric): + pass \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/statistics.py b/src/kgpipe_eval/metrics/statistics.py new file mode 100644 index 0000000..191776d --- /dev/null +++ b/src/kgpipe_eval/metrics/statistics.py @@ -0,0 +1,76 @@ +from kgpipe_eval.utils.kg_utils import TripleGraph +from kgpipe_eval.api import Metric, MetricResult, Measurement +from functools import lru_cache + +from pydantic import BaseModel +from typing import Mapping +from collections import defaultdict + +from rdflib import RDF, RDFS +from rdflib.term import URIRef, Literal + +class CountMeasures(BaseModel): + entity_count: int + triple_count: int + property_count: int + class_count: int + property_occurrence: Mapping[str, int] + class_occurrence: Mapping[str, int] + +# @lru_cache(maxsize=1) +def count_measures(kg: TripleGraph) -> CountMeasures: + + triple_count = 0 + subject_count = 0 # TODO misses shallow object entities + + class_occurrence = defaultdict(int) + property_occurrence = defaultdict(int) + + for _ in kg.subjects(): + subject_count += 1 + + for s, p, o in kg.triples((None, None, None)): + triple_count += 1 + if p == RDF.type: + class_occurrence[str(o)] += 1 + property_occurrence[str(p)] += 1 + + return CountMeasures( + entity_count=subject_count, + property_count=len(property_occurrence.keys()), + triple_count=triple_count, + class_count=len(class_occurrence.keys()), + class_occurrence=class_occurrence, + property_occurrence=property_occurrence, + ) + +class CountMetric(Metric): + key = "CountMetric" + description = "Counts triples/classes/properties (basic statistics)." 
+ + def compute(self, kg: TripleGraph) -> MetricResult: + counts = count_measures(kg) + return MetricResult( + metric=self, + measurements=[ + Measurement(name="entity_count", value=counts.entity_count, unit="number"), + Measurement(name="triple_count", value=counts.triple_count, unit="number"), + Measurement(name="property_count", value=counts.property_count, unit="number"), + Measurement(name="class_count", value=counts.class_count, unit="number"), + Measurement(name="property_occurrence", value=counts.property_occurrence, unit="dictionary"), + Measurement(name="class_occurrence", value=counts.class_occurrence, unit="dictionary"), + ], + summary=f"Measures of entities, triples, properties, classes, property occurrences, and class occurrences" + ) + +class DegreeMetric(Metric): + # def compute(self, kg: TripleGraph) -> MetricResult: + # degrees = degree_measures(kg) + # return MetricResult( + # metric=self, + # measurements=[ + # Measurement(name="degree", value=degrees.degree, unit="number"), + # ], + # summary=f"Measures of degrees" + # ) + pass \ No newline at end of file diff --git a/src/kgpipe_eval/metrics/triple_alignment.py b/src/kgpipe_eval/metrics/triple_alignment.py new file mode 100644 index 0000000..5114ddb --- /dev/null +++ b/src/kgpipe_eval/metrics/triple_alignment.py @@ -0,0 +1,74 @@ +from pydantic import BaseModel, ConfigDict +from typing import Literal + +from kgpipe.common import KG +from kgpipe_eval.metrics.entity_alignment import EntityAlignmentConfig +from kgpipe_eval.utils.kg_utils import KgLike, KgManager, TripleGraph +from kgpipe_eval.utils.alignment_utils import align_triples_by_value_embedding +from kgpipe_eval.utils.measurement_utils import BCMeasurement +from kgpipe_eval.api import Measurement, Metric, MetricResult + +# measures precision, recall, f1 score, etc. 
+ +class TripleAlignmentConfig(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + reference_kg: KgLike + method: Literal["value_embedding", "exact"] = "value_embedding" + entity_alignment_config: EntityAlignmentConfig + value_sim_threshold: float = 0.5 + cache_literal_embeddings: bool = False + cache_ref_literal_embeddings: bool = True + +def eval_triple_alignment(tg: TripleGraph, config: TripleAlignmentConfig): + if config.method == "value_embedding": + alignments = align_triples_by_value_embedding(tg, config) + elif config.method == "exact": + raise NotImplementedError("Exact triple alignment is not implemented yet") + # alignments = align_triples_by_exact_match(tg, config) + else: + raise ValueError(f"Invalid method: {config.method}") + + print("Triple alignments: ", len(alignments)) + + ref_tg = KgManager.load_kg(config.reference_kg) + ref_triples = set(ref_tg.triples((None, None, None))) + gen_triples = set(tg.triples((None, None, None))) + + aligned_ref_triples = set(a.target for a in alignments) + aligned_gen_triples = set(a.source for a in alignments) + + tp = len(aligned_ref_triples) # aligned reference triples + fp = len(gen_triples - aligned_gen_triples) # generated triples not aligned to any reference triple + tn = 0 + fn = len(ref_triples - aligned_ref_triples) # reference triples missing in generation + + return BCMeasurement(tp=tp, fp=fp, tn=tn, fn=fn) + +# def eval_triple_alignment_by_label_embedding(method: Literal["exact", "fuzzy", "semantic"] = "exact"): +# pass + + +# def eval_triple_alignment_by_label_embedding_soft_literals(method: Literal["exact", "fuzzy", "semantic"] = "exact"): +# pass + +class TripleAlignmentMetric(Metric): + + def compute(self, kg: KG, config: TripleAlignmentConfig): + m: BCMeasurement = eval_triple_alignment(kg, config) + return MetricResult( + metric=self, + measurements=[ + Measurement(name="tp", value=m.tp, unit="number"), + Measurement(name="fp", value=m.fp, unit="number"), + Measurement(name="tn", value=m.tn, unit="number"), + Measurement(name="fn", value=m.fn, 
unit="number"), + Measurement(name="precision", value=m.precision(), unit="percentage"), + Measurement(name="recall", value=m.recall(), unit="percentage"), + Measurement(name="f1_score", value=m.f1_score(), unit="percentage"), + ], + summary=f"Triple alignment by {config.method}", + ) + + +# Backward-compatibility alias (imported by `kgpipe_eval.metrics.__init__`). +# TripleAlignmentMetric = TripleAlignmentMetric \ No newline at end of file diff --git a/src/kgpipe_eval/test/__init__.py b/src/kgpipe_eval/test/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/src/kgpipe_eval/test/examples.py b/src/kgpipe_eval/test/examples.py new file mode 100644 index 0000000..06ad14d --- /dev/null +++ b/src/kgpipe_eval/test/examples.py @@ -0,0 +1,257 @@ +SEED_TURTLE_TRIPLES = """ +@prefix : . +@prefix o: . +@prefix rdf: . +@prefix rdfs: . +@prefix xsd: . + +:store1 rdf:type o:BookStore ; + rdfs:label "Example Books (Downtown)"@en ; + :countryCode "US" ; + :hasInventory :itemA, :itemB, :itemC . + +:publisherHC rdf:type o:Publisher ; + rdfs:label "HarperCollins" ; + :countryCode "GB" . +""" +TEST_TURTLE_TRIPLES = """ +@prefix : . +@prefix o: . +@prefix rdf: . +@prefix rdfs: . +@prefix xsd: . + +# Entities designed to exercise alignment corner-cases: +# - multiple entities per type (Book/Author/Publisher/Store) +# - missing / extra attributes across graphs +# - literal variations (lang tags, datatypes, different lexical forms) +# - ambiguous labels (near-duplicates, casing differences) +# - multi-valued properties + +:store1 rdf:type o:BookStore ; + rdfs:label "Example Books (Downtown)"@en ; + :countryCode "US" ; + :hasInventory :itemA, :itemB, :itemC . + +:publisherHC rdf:type o:Publisher ; + rdfs:label "HarperCollins" ; + :countryCode "GB" . + +# different wrong type +:publisherPenguin rdf:type o:Author ; + rdfs:label "Penguin Books"@en ; + :countryCode "GB" . + +:authorTolkien rdf:type o:Author ; + rdfs:label "J. R. R. 
Tolkien" ; + :born "1892-01-03"^^xsd:date ; + :died "1973-09-02"^^xsd:date ; + :sameAs . + +:authorRowling rdf:type o:Author ; + rdfs:label "J.K. Rowling" ; + :born "1965-07-31"^^xsd:date . + +:itemA rdf:type o:Book ; + rdfs:label "The Hobbit"@en ; + :bookTitle "The Hobbit, or There and Back Again"@en ; + :bookAuthor :authorTolkien ; + :publisher :publisherHC ; + :isbn13 "9780261102217" ; + :pageCount "310"^^xsd:integer ; + :tags "fantasy", "classic" ; + :inSeries :seriesMiddleEarth . + +:itemB rdf:type o:Book ; + rdfs:label "The Hobbit (Illustrated)"@en ; + :bookTitle "The Hobbit"@en ; + :bookAuthor :authorTolkien ; + :publisher :publisherHC ; + :isbn13 "978-0-261-10221-7" ; # lexical variation + :pageCount 320 ; # integer without explicit datatype + :publicationYear "1997"^^xsd:gYear . + +:itemC rdf:type o:Book ; + rdfs:label "Harry Potter and the Philosopher's Stone"@en ; + :bookTitle "Harry Potter and the Philosopher's Stone"@en ; + :bookAuthor :authorRowling ; + :publisher :publisherPenguin ; + :isbn13 "9780747532699" ; + :pageCount "223"^^xsd:integer . + +# Same label, different type (common edge case for label-only alignment) +:hobbit rdf:type o:Film ; + rdfs:label "The Hobbit"@en ; + :releaseYear "2012"^^xsd:gYear . + +# Missing rdf:type but has label (edge case for type-aware matching) +:unknownEntity rdfs:label "HarperCollins" . + +:seriesMiddleEarth rdf:type o:Series ; + rdfs:label "Middle-earth Legendarium"@en . + +# false positive unexpected entity +:unexpectedEntity rdf:type o:Book ; + rdfs:label "Unexpected Entity"@en . +""" + +GENERATED_TURTLE_TRIPLES = """ +@prefix : . +@prefix o: . +@prefix rdf: . +@prefix rdfs: . +@prefix xsd: . 
+ +# Entities designed to exercise alignment corner-cases: +# - multiple entities per type (Book/Author/Publisher/Store) +# - missing / extra attributes across graphs +# - literal variations (lang tags, datatypes, different lexical forms) +# - ambiguous labels (near-duplicates, casing differences) +# - multi-valued properties + +:store1 rdf:type o:BookStore ; + rdfs:label "Example Books (Downtown)"@en ; + :countryCode "US" ; + :hasInventory :itemA, :itemB, :itemC . + +:publisherHC rdf:type o:Publisher ; + rdfs:label "HarperCollins" ; + :countryCode "GB" . + +# different wrong type +:publisherPenguin rdf:type o:Author ; + rdfs:label "Penguin Books"@en ; + :countryCode "GB" . + +:authorTolkien rdf:type o:Author ; + rdfs:label "J. R. R. Tolkien" ; + :born "1892-01-03"^^xsd:date ; + :died "1973-09-02"^^xsd:date ; + :sameAs . + +:authorRowling rdf:type o:Author ; + rdfs:label "J.K. Rowling" ; + :born "1965-07-31"^^xsd:date . + +:itemA rdf:type o:Book ; + rdfs:label "The Hobbit"@en ; + :bookTitle "The Hobbit, or There and Back Again"@en ; + :bookAuthor :authorTolkien ; + :publisher :publisherHC ; + :isbn13 "9780261102217" ; + :pageCount "310"^^xsd:integer ; + :tags "fantasy", "classic" ; + :inSeries :seriesMiddleEarth . + +:itemB rdf:type o:Book ; + rdfs:label "The Hobbit (Illustrated)"@en ; + :bookTitle "The Hobbit"@en ; + :bookAuthor :authorTolkien ; + :publisher :publisherHC ; + :isbn13 "978-0-261-10221-7" ; # lexical variation + :pageCount 320 ; # integer without explicit datatype + :publicationYear "1997"^^xsd:gYear . + +:itemC rdf:type o:Book ; + rdfs:label "Harry Potter and the Philosopher's Stone"@en ; + :bookTitle "Harry Potter and the Philosopher's Stone"@en ; + :bookAuthor :authorRowling ; + :publisher :publisherPenguin ; + :isbn13 "9780747532699" ; + :pageCount "223"^^xsd:integer . + +# Same label, different type (common edge case for label-only alignment) +:hobbit rdf:type o:Film ; + rdfs:label "The Hobbit"@en ; + :releaseYear "2012"^^xsd:gYear . 
+ +# Missing rdf:type but has label (edge case for type-aware matching) +:unknownEntity rdfs:label "HarperCollins" . + +:seriesMiddleEarth rdf:type o:Series ; + rdfs:label "Middle-earth Legendarium"@en . + +# false positive unexpected entity +:unexpectedEntity rdf:type o:Book ; + rdfs:label "Unexpected Entity"@en . +""" + +REFERENCE_TURTLE_TRIPLES = """ +@prefix : . +@prefix o: . +@prefix rdf: . +@prefix rdfs: . +@prefix xsd: . + +# Reference graph intentionally differs from TEST_TURTLE_TRIPLES: +# - different labels / casing / punctuation +# - extra / missing properties +# - alternate modeling (blank nodes, different predicates) +# - near-duplicate entities to test ambiguity + +:storeMain rdf:type o:BookStore ; + rdfs:label "Example Books - Downtown"@en ; + :countryCode "USA" ; # lexical variation + :hasInventory :refItemA, :refItemC . + +:publisherHC rdf:type o:Publisher ; + rdfs:label "Harper Collins"@en ; # spacing difference + :countryCode "UK" . + +:publisherPenguin rdf:type o:Publisher ; + rdfs:label "Penguin"@en ; + :countryCode "GB" . + +:authorTolkien rdf:type o:Author ; + rdfs:label "J.R.R. Tolkien" ; # punctuation difference + :born "1892-01-03"^^xsd:date ; + :sameAs ; + :nameParts [ :given "John" ; :middle "Ronald Reuel" ; :family "Tolkien" ] . + +:authorRowling rdf:type o:Author ; + rdfs:label "Joanne Rowling"@en ; # alias-ish label + :born "1965-07-31"^^xsd:date . + +:refItemA rdf:type o:Book ; + rdfs:label "The Hobbit"@en ; + :title "The Hobbit, or There and Back Again"@en ; # different predicate + :bookAuthor :authorTolkien ; + :publisher :publisherHC ; + :isbn13 "9780261102217" ; + :pageCount "310"^^xsd:integer ; + :tags "classic" . # missing one tag compared to test + +# Same-work but modeled as a separate edition entity +:refItemA_Edition1 rdf:type o:Edition ; + rdfs:label "The Hobbit (1st edition)"@en ; + :about :refItemA ; + :publicationYear "1937"^^xsd:gYear . 
+ +:refItemC rdf:type o:Book ; + rdfs:label "Harry Potter and the Philosopher’s Stone"@en ; # curly apostrophe + :bookTitle "Harry Potter and the Philosopher's Stone"@en ; + :bookAuthor :authorRowling ; + :publisher :publisherPenguin ; + :isbn13 "9780747532699" ; + :pageCount "223"^^xsd:integer ; + :tags "fantasy"@en . + +# Near-duplicate label (to trigger ambiguity in label similarity) +:refItemC_US rdf:type o:Book ; + rdfs:label "Harry Potter and the Sorcerer's Stone"@en ; + :sameAs :refItemC . +""" + +VERIFIED_ENTITIES = """ +dataset,entity_id,entity_label,entity_type +test,http://example.org/reference_bookstore/itemA,The Hobbit,o:Book +test,http://example.org/reference_bookstore/itemB,The Hobbit (Illustrated),o:Book +test,http://example.org/reference_bookstore/itemC,Harry Potter and the Philosopher's Stone,o:Book +test,http://example.org/reference_bookstore/authorTolkien,J. R. R. Tolkien,o:Author +test,http://example.org/reference_bookstore/authorRowling,J.K. Rowling,o:Author +test,http://example.org/reference_bookstore/publisherHC,HarperCollins,o:Publisher +test,http://example.org/reference_bookstore/publisherPenguin,Penguin Books,o:Publisher +test,http://example.org/reference_bookstore/store1,Example Books (Downtown),o:BookStore +test,http://example.org/reference_bookstore/seriesMiddleEarth,Middle-earth Legendarium,o:Series +test,http://example.org/reference_bookstore/missingEntity,Gone with the Wind,o:Book +""" \ No newline at end of file diff --git a/src/kgpipe_eval/test/test_alignment_eval.py b/src/kgpipe_eval/test/test_alignment_eval.py new file mode 100644 index 0000000..9d10a93 --- /dev/null +++ b/src/kgpipe_eval/test/test_alignment_eval.py @@ -0,0 +1,62 @@ +import json + +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig +from kgpipe_eval.metrics.entity_alignment import EntityAlignmentMetric +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentMetric, TripleAlignmentConfig +from kgpipe_eval.test.utils import get_test_kg, 
get_verified_entities_path, render_metric_result, get_reference_kg, get_generated_kg +from kgpipe_eval.utils.kg_utils import KgManager +from kgpipe_eval.api import MetricResult + + +def test_align_entities_by_label_embedding(): + config = EntityAlignmentConfig( + method="label_embedding", + reference_kg=None, + verified_entities_path=get_verified_entities_path(), + verified_entities_delimiter=",", + entity_sim_threshold=0.95 + ) + tg = KgManager.load_kg(get_test_kg()) + metric_result : MetricResult = EntityAlignmentMetric().compute(tg, config) + print(render_metric_result(metric_result)) + +def test_align_entities_by_label_embedding_and_type(): + config = EntityAlignmentConfig( + method="label_embedding_and_type", + reference_kg=None, + verified_entities_path=get_verified_entities_path(), + verified_entities_delimiter=",", + entity_sim_threshold=0.95 + ) + tg = KgManager.load_kg(get_test_kg()) + metric_result : MetricResult = EntityAlignmentMetric().compute(tg, config) + print(render_metric_result(metric_result)) + +def test_align_entities_by_label_embedding_and_type_ref_kg(): + config = EntityAlignmentConfig( + method="label_embedding", + reference_kg=get_reference_kg(), + verified_entities_path=None, + verified_entities_delimiter="\t", + entity_sim_threshold=0.95 + ) + tg = KgManager.load_kg(get_test_kg()) + metric_result : MetricResult = EntityAlignmentMetric().compute(tg, config) + print(render_metric_result(metric_result)) + +def test_align_triples_by_value_embedding(): + config = TripleAlignmentConfig( + reference_kg=get_reference_kg(), + entity_alignment_config=EntityAlignmentConfig( + method="label_embedding", + reference_kg=get_reference_kg(), + verified_entities_path=None, + verified_entities_delimiter="\t", + entity_sim_threshold=0.95 + ), + value_sim_threshold=0.5 + ) + tg = KgManager.load_kg(get_generated_kg()) + metric_result : MetricResult = TripleAlignmentMetric().compute(tg, config) + print(render_metric_result(metric_result)) + diff --git 
a/src/kgpipe_eval/test/test_config_manager.py b/src/kgpipe_eval/test/test_config_manager.py new file mode 100644 index 0000000..37af30d --- /dev/null +++ b/src/kgpipe_eval/test/test_config_manager.py @@ -0,0 +1,102 @@ +from __future__ import annotations + +from pathlib import Path + +from kgpipe_eval.config.manager import load_metric_configs, generate_default_config_dict +from kgpipe_eval.metrics.duplicates import DuplicateConfig +from kgpipe_eval.metrics.triple_alignment import TripleAlignmentConfig +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig + + +def test_load_metric_configs_resolves_entity_alignment_refs(tmp_path: Path) -> None: + cfg = tmp_path / "eval.yaml" + cfg.write_text( + """ +entity_alignment_configs: + default: + method: label_embedding + verified_entities_path: tmp_test_data/verified_entities.csv + verified_entities_delimiter: "," + entity_sim_threshold: 0.95 + +metrics: + entity_align: + entity_alignment_config_ref: default + + duplicates: + entity_alignment_config_ref: default + + triple_alignment: + reference_kg_path: tmp_test_data/reference.nt + entity_alignment_config_ref: default + value_sim_threshold: 0.6 +""".lstrip(), + encoding="utf-8", + ) + + loaded = load_metric_configs(cfg) + assert "entity_align" in loaded + assert "duplicates" in loaded + assert "triple_alignment" in loaded + + assert isinstance(loaded["entity_align"], EntityAlignmentConfig) + assert isinstance(loaded["duplicates"], DuplicateConfig) + assert isinstance(loaded["triple_alignment"], TripleAlignmentConfig) + + assert loaded["entity_align"].verified_entities_delimiter == "," + assert loaded["duplicates"].entity_alignment_config.verified_entities_delimiter == "," + assert loaded["triple_alignment"].entity_alignment_config.verified_entities_delimiter == "," + + # reference_kg is constructed from reference_kg_path + assert loaded["triple_alignment"].reference_kg.path.as_posix().endswith("tmp_test_data/reference.nt") + + +def 
test_generate_default_config_dict_has_all_sections() -> None: + cfg = generate_default_config_dict() + assert "entity_alignment_configs" in cfg + assert "metrics" in cfg + assert "default" in cfg["entity_alignment_configs"] + assert "verified_entities_path" in cfg["entity_alignment_configs"]["default"] + + metrics = cfg["metrics"] + assert "entity_align" in metrics + assert "duplicates" in metrics + assert "triple_alignment" in metrics + assert "consistency_violations" in metrics + + +def test_load_metric_configs_interpolates_vars_and_resolves_paths(tmp_path: Path) -> None: + # mirror the style used in experiments/examples/scripts/run_eval.yaml + cfg = tmp_path / "run_eval.yaml" + (tmp_path / "test.ttl").write_text( + """ +@prefix : . +@prefix rdfs: . +:a rdfs:label "A" . +""".lstrip(), + encoding="utf-8", + ) + cfg.write_text( + """ +reference_kg: test.ttl + +entity_alignment_configs: + default: + method: label_embedding + reference_kg: $reference_kg + entity_sim_threshold: 0.95 + +metrics: + duplicates: + entity_alignment_config_ref: default +""".lstrip(), + encoding="utf-8", + ) + + loaded = load_metric_configs(cfg) + assert isinstance(loaded["duplicates"], DuplicateConfig) + # reference_kg should be a KG whose path resolves relative to cfg location + kg = loaded["duplicates"].entity_alignment_config.reference_kg + assert kg is not None + assert kg.path == (tmp_path / "test.ttl").resolve() + diff --git a/src/kgpipe_eval/test/test_consistency_eval.py b/src/kgpipe_eval/test/test_consistency_eval.py new file mode 100644 index 0000000..e69de29 diff --git a/src/kgpipe_eval/test/test_duplicates_eval.py b/src/kgpipe_eval/test/test_duplicates_eval.py new file mode 100644 index 0000000..1c21d89 --- /dev/null +++ b/src/kgpipe_eval/test/test_duplicates_eval.py @@ -0,0 +1,20 @@ +from kgpipe_eval.metrics.duplicates import DuplicateConfig +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig +from kgpipe_eval.test.utils import get_verified_entities_path +from 
kgpipe_eval.api import MetricResult +from kgpipe_eval.metrics.duplicates import DuplicateMetric +from kgpipe_eval.test.utils import get_test_kg, render_metric_result +from kgpipe_eval.utils.kg_utils import KgManager + +def test_duplicates_eval(): + config = DuplicateConfig( + entity_alignment_config=EntityAlignmentConfig( + method="label_embedding", + reference_kg=None, + verified_entities_path=get_verified_entities_path(), + verified_entities_delimiter=",", + entity_sim_threshold=0.95 + ) + ) + metric_result : MetricResult = DuplicateMetric().compute(KgManager.load_kg(get_test_kg()), config) + print(render_metric_result(metric_result)) diff --git a/src/kgpipe_eval/test/test_evaluator.py b/src/kgpipe_eval/test/test_evaluator.py new file mode 100644 index 0000000..2e2c03b --- /dev/null +++ b/src/kgpipe_eval/test/test_evaluator.py @@ -0,0 +1,32 @@ +from __future__ import annotations + +from kgpipe_eval.evaluator import Evaluator +from kgpipe_eval.metrics.statistics import CountMetric +from kgpipe_eval.metrics.duplicates import DuplicateMetric, DuplicateConfig +from kgpipe_eval.utils.alignment_utils import EntityAlignmentConfig +from kgpipe_eval.test.utils import get_test_kg, get_verified_entities_path +from kgpipe_eval.utils.kg_utils import KgManager + + +def test_evaluator_runs_metrics_with_and_without_config() -> None: + kg = KgManager.load_kg(get_test_kg()) + try: + dup_cfg = DuplicateConfig( + entity_alignment_config=EntityAlignmentConfig( + method="label_embedding", + verified_entities_path=get_verified_entities_path(), + verified_entities_delimiter=",", + entity_sim_threshold=0.95, + ) + ) + + metrics = [CountMetric(), DuplicateMetric()] + confs = {"DuplicateMetric": dup_cfg} + + results = Evaluator().run(kg=kg, metrics=metrics, confs=confs) + assert len(results) == 2 + assert results[0].metric.__class__.__name__ == "CountMetric" + assert results[1].metric.__class__.__name__ == "DuplicateMetric" + finally: + KgManager.unload_kg(kg) + diff --git 
a/src/kgpipe_eval/test/test_kg_utils.py b/src/kgpipe_eval/test/test_kg_utils.py new file mode 100644 index 0000000..39f10f0 --- /dev/null +++ b/src/kgpipe_eval/test/test_kg_utils.py @@ -0,0 +1,29 @@ +from kgpipe_eval.utils.kg_utils import KgManager +from kgpipe_eval.test.utils import get_test_kg, get_reference_kg +from pathlib import Path + +tmp_dir = Path("tmp_test_data") + +def test_substract_kg(): + # TODO test can be improved / cleaned up + kg = get_test_kg() + kg_graph = KgManager.load_kg(kg) + kg_path = kg.path + + # read kg + with open(kg_path, "r") as f: + triples = f.readlines() + sample_triples = triples[:10] + other_kg_path = tmp_dir / "other_kg.nt" + with open(other_kg_path, "w") as f: + f.write("\n".join(sample_triples)) + other_kg_graph = KgManager.load_kg(other_kg_path) + + substracted_kg_graph = KgManager.substract_kg(kg_graph, other_kg_graph) + len_kg_triples = len(list(kg_graph.triples((None, None, None)))) + len_other_kg_triples = len(list(other_kg_graph.triples((None, None, None)))) + len_substracted_kg_triples = len(list(substracted_kg_graph.triples((None, None, None)))) + # print(f"len_kg_triples: {len_kg_triples}") + # print(f"len_other_kg_triples: {len_other_kg_triples}") + # print(f"len_substracted_kg_triples: {len_substracted_kg_triples}") + assert len_substracted_kg_triples == len_kg_triples - len_other_kg_triples \ No newline at end of file diff --git a/src/kgpipe_eval/test/test_llm_eval.py b/src/kgpipe_eval/test/test_llm_eval.py new file mode 100644 index 0000000..b02ced2 --- /dev/null +++ b/src/kgpipe_eval/test/test_llm_eval.py @@ -0,0 +1,6 @@ +import pytest + +# @pytest.skip(reason="Long running test") +# def test_llm_eval(): +# pass + diff --git a/src/kgpipe_eval/test/test_metric_utils.py b/src/kgpipe_eval/test/test_metric_utils.py new file mode 100644 index 0000000..6ddc8c8 --- /dev/null +++ b/src/kgpipe_eval/test/test_metric_utils.py @@ -0,0 +1,64 @@ +from __future__ import annotations + +import json +from pathlib import Path + 
+from kgpipe_eval.utils.metric_utils import eval_results_jsons_to_rows, write_eval_csv + + +def test_eval_results_json_to_rows_and_csv(tmp_path: Path) -> None: + # Create a fake output structure: //stage_1/eval_results.json + p = tmp_path / "rdf_a" / "stage_1" + p.mkdir(parents=True) + + (p / "eval_results.json").write_text( + json.dumps( + [ + { + "metric": "DuplicateMetric", + "summary": "Duplicates in the KG", + "measurements": [ + {"name": "duplicates", "value": 3, "unit": "number"}, + {"name": "entity_count", "value": 10, "unit": "number"}, + {"name": "duplicates_ratio", "value": 0.3, "unit": "percentage"}, + ], + } + ] + ) + ) + + allowlist = { + "DuplicateMetric": { + "duplicates": "number", + "entity_count": "number", + "duplicates_ratio": "percentage", + } + } + + rows = eval_results_jsons_to_rows([p / "eval_results.json"], allowlist=allowlist) + assert rows == [ + { + "pipeline": "rdf_a", + "stage": "stage_1", + "DuplicateMetric__duplicates__number": 3, + "DuplicateMetric__entity_count__number": 10, + "DuplicateMetric__duplicates_ratio__percentage": 0.3, + } + ] + + out_csv = tmp_path / "out.csv" + write_eval_csv([p / "eval_results.json"], out_path=out_csv, allowlist=allowlist) + txt = out_csv.read_text() + + # Header + one row, with stable columns including pipeline/stage and allowlist columns. 
+ lines = [l for l in txt.splitlines() if l.strip()] + assert len(lines) == 2 + assert lines[0].split(",") == [ + "pipeline", + "stage", + "DuplicateMetric__duplicates__number", + "DuplicateMetric__duplicates_ratio__percentage", + "DuplicateMetric__entity_count__number", + ] + assert lines[1].split(",") == ["rdf_a", "stage_1", "3", "0.3", "10"] + diff --git a/src/kgpipe_eval/test/test_source_eval.py b/src/kgpipe_eval/test/test_source_eval.py new file mode 100644 index 0000000..e69de29 diff --git a/src/kgpipe_eval/test/test_statistics_eval.py b/src/kgpipe_eval/test/test_statistics_eval.py new file mode 100644 index 0000000..67d803d --- /dev/null +++ b/src/kgpipe_eval/test/test_statistics_eval.py @@ -0,0 +1,8 @@ +from kgpipe_eval.metrics.statistics import CountMetric +from kgpipe_eval.test.utils import get_test_kg +from kgpipe_eval.utils.kg_utils import KgManager + +def test_count_metric(): + metric = CountMetric() + report = metric.compute(KgManager.load_kg(get_test_kg())) + print(report) \ No newline at end of file diff --git a/src/kgpipe_eval/test/utils.py b/src/kgpipe_eval/test/utils.py new file mode 100644 index 0000000..855b957 --- /dev/null +++ b/src/kgpipe_eval/test/utils.py @@ -0,0 +1,52 @@ +from pathlib import Path +from kgpipe.common import KG +from kgpipe.common.model.data import DataFormat +from kgpipe_eval.test.examples import * +from kgpipe_eval.api import MetricResult +from kgpipe_eval.utils.metric_utils import render_metric_result +from rdflib import Graph +import json +from collections.abc import Mapping, Sequence + +tmp_dir = Path("tmp_test_data") + +if not tmp_dir.exists(): + tmp_dir.mkdir(parents=True, exist_ok=True) + + +def get_test_kg(sample_size: int = -1) -> KG: + test_triples = TEST_TURTLE_TRIPLES + if sample_size > 0: + test_triples = test_triples[:sample_size] + # write test_triples to a file + g = Graph() + g.parse(data=test_triples, format="turtle") + g.serialize(destination=tmp_dir / "test.nt", format="ntriples") + return KG("test", 
name="test", path=tmp_dir / "test.nt", format=DataFormat.RDF_NTRIPLES) + +def get_generated_kg(sample_size: int = -1) -> KG: + generated_triples = GENERATED_TURTLE_TRIPLES + if sample_size > 0: + generated_triples = generated_triples[:sample_size] + # write generated_triples to a file + g = Graph() + g.parse(data=generated_triples, format="turtle") + g.serialize(destination=tmp_dir / "generated.nt", format="ntriples") + return KG("generated", name="generated", path=tmp_dir / "generated.nt", format=DataFormat.RDF_NTRIPLES) + +def get_reference_kg(sample_size: int = -1) -> KG: + reference_triples = REFERENCE_TURTLE_TRIPLES + if sample_size > 0: + reference_triples = reference_triples[:sample_size] + # write reference_triples to a file + g = Graph() + g.parse(data=reference_triples, format="turtle") + g.serialize(destination=tmp_dir / "reference.nt", format="ntriples") + return KG("reference", name="reference", path=tmp_dir / "reference.nt", format=DataFormat.RDF_NTRIPLES) + +def get_verified_entities_path() -> Path: + path = tmp_dir / "verified_entities.csv" + with open(path, "w") as f: + # Avoid a leading blank line which breaks csv.DictReader header parsing + f.write(VERIFIED_ENTITIES.lstrip().replace("o:", "http://example.org/ontology/")) + return path \ No newline at end of file diff --git a/src/kgpipe_eval/utils/__init__.py b/src/kgpipe_eval/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/src/kgpipe_eval/utils/alignment_utils.py b/src/kgpipe_eval/utils/alignment_utils.py new file mode 100644 index 0000000..0f9359d --- /dev/null +++ b/src/kgpipe_eval/utils/alignment_utils.py @@ -0,0 +1,388 @@ +from transformers.models.t5gemma2.modeling_t5gemma2 import T5Gemma2ClassificationHead +from kgpipe.common import KG +from typing import TYPE_CHECKING, Literal, NamedTuple, Optional +from functools import lru_cache +from pydantic import BaseModel, ConfigDict, model_validator + +from kgpipe_eval.utils.kg_utils import TripleGraph, Term, Triple, 
KgLike, KgManager +from kgpipe.util.embeddings.st_emb import get_model + +from rdflib import RDFS, RDF +from rdflib.term import BNode +from rdflib.term import Literal as RdLiteral +from kgpipe.datasets.multipart_multisource import read_entities_csv, EntitiesRow +import numpy as np +from pathlib import Path +from tqdm import tqdm +from tqdm import tqdm +from typing import Set + +# TODO source entities csv to label only graph + +DEBUG = True + +class EntityAlignmentConfig(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + method: Literal["label_embedding", "label_alias_embedding", "label_embedding_and_type", "label_embedding_and_intersecting_type"] = "label_embedding" + reference_kg: Optional[KgLike] = None + verified_entities_path: Optional[Path] = None + verified_entities_delimiter: str = "\t" + entity_sim_threshold: float = 0.95 + ignored_entities: Optional[Set[Term]] = None + + # value_sim_threshold: float = 0.5 + + @model_validator(mode="after") + def _require_reference_source(self): + if self.reference_kg is None and self.verified_entities_path is None: + raise ValueError("Provide either `reference_kg` or `verified_entities_path`.") + return self + + +EntityAlignment = NamedTuple("EntityAlignment", [("source", Term), ("target", Term), ("score", float)]) +TripleAlignment = NamedTuple("TripleAlignment", [("source", Triple), ("target", Triple)]) + +# Core alignment method interfaces + +@lru_cache(maxsize=1000) +def get_aligned_entities(kg: KG, reference_kg: KG, method: Literal["exact", "fuzzy", "semantic"] = "exact") -> list[EntityAlignment]: + return kg.entities.intersection(reference_kg.entities) + +def get_aligned_triples(kg: KG, reference_kg: KG, method: Literal["exact", "fuzzy", "semantic"] = "exact") -> list[TripleAlignment]: + return kg.triples.intersection(reference_kg.triples) + +# Helper methods + +# def get_entity_uri_label_pairs(triple_graph: TripleGraph) -> list[tuple[Term, Term]]: +# return [(s, label) for s, _, label in 
triple_graph.triples((None, RDFS.label, None))] + +UriLabelTypePair = NamedTuple("UriLabelTypePair", [("uri", Term), ("label", Term), ("type", Term)]) +UriLabelTypeSetPair = NamedTuple("UriLabelTypeSetPair", [("uri", Term), ("label", Term), ("type_set", set[Term])]) + +def get_entity_uri_label_type_pairs(kg: KG, ignored_entities: Optional[Set[Term]] = None) -> list[UriLabelTypePair]: + label_by_uri = {} + type_by_uri = {} + for s, p, o in kg.triples((None, RDFS.label, None)): + label_by_uri[str(s)] = str(o) + for s, p, o in kg.triples((None, RDF.type, None)): + type_by_uri[str(s)] = str(o) + for uri in label_by_uri: + if ignored_entities and str(uri) in ignored_entities: + continue + if uri in type_by_uri: + yield UriLabelTypePair(uri=uri, label=label_by_uri[uri], type=type_by_uri[uri]) + else: + yield UriLabelTypePair(uri=uri, label=label_by_uri[uri], type=None) + +def get_entity_uri_label_typeset_pairs(kg: KG, ignored_entities: Optional[Set[Term]] = None) -> list[UriLabelTypeSetPair]: + label_by_uri = {} + types_by_uri = {} + for s, p, o in kg.triples((None, RDFS.label, None)): + label_by_uri[str(s)] = str(o) + for s, p, o in kg.triples((None, RDF.type, None)): + if str(s) not in types_by_uri: + types_by_uri[str(s)] = set() + types_by_uri[str(s)].add(str(o)) + for uri in label_by_uri: + if ignored_entities and str(uri) in ignored_entities: + continue + if uri in types_by_uri: + yield UriLabelTypeSetPair(uri=uri, label=label_by_uri[uri], type_set=types_by_uri[uri]) + else: + yield UriLabelTypeSetPair(uri=uri, label=label_by_uri[uri], type_set=set()) + +def load_verified_entities(path: Path, delimiter: str = "\t") -> list[UriLabelTypePair]: + """ + """ + if path.name.endswith(".json"): + raise ValueError("JSON format not supported for verified entities") + elif path.name.endswith(".csv"): + return [UriLabelTypePair(uri=entity.entity_id, label=entity.entity_label, type=entity.entity_type) for entity in read_entities_csv(path=path, delimiter=delimiter)] + else: + 
raise ValueError(f"Unsupported file type: {path}") + +def load_entity_uri_label_type_pairs(config: EntityAlignmentConfig) -> list[UriLabelTypePair]: + if config.verified_entities_path is not None: + return load_verified_entities(config.verified_entities_path, delimiter=config.verified_entities_delimiter) + elif config.reference_kg is not None: + # `get_entity_uri_label_type_pairs` is a generator; downstream alignment uses indexing. + return list(get_entity_uri_label_type_pairs(KgManager.load_kg(config.reference_kg))) + else: + raise ValueError("No verified entities path or reference KG provided") + +# Specific alignment methods + +def align_entities_by_label_embedding(tg: TripleGraph, config: EntityAlignmentConfig) -> list[EntityAlignment]: + model = get_model() + ref_entity_uri_label_type_pairs = load_entity_uri_label_type_pairs(config) + ref_labels = [pair.label for pair in ref_entity_uri_label_type_pairs] + ref_labels_embeddings = model.encode(ref_labels, convert_to_numpy=True, show_progress_bar=False) + + gen_entity_uri_label_type_pairs = list(get_entity_uri_label_type_pairs(tg, config.ignored_entities)) + gen_labels = [pair.label for pair in gen_entity_uri_label_type_pairs] + gen_labels_embeddings = model.encode(gen_labels, convert_to_numpy=True, show_progress_bar=False) + + + sims = np.dot(gen_labels_embeddings, ref_labels_embeddings.T) + + alignments = [] + for i in range(sims.shape[0]): + best_j = np.argmax(sims[i]) + if sims[i][best_j] >= config.entity_sim_threshold: + alignments.append(EntityAlignment(source=gen_entity_uri_label_type_pairs[i].uri, target=ref_entity_uri_label_type_pairs[best_j].uri, score=sims[i][best_j])) + return alignments + +def align_by_label_alias_embedding(triple_graph: TripleGraph, model="", similarity="cosine", threshold=0.5): + pass + + +if TYPE_CHECKING: # avoid circular import at runtime + from kgpipe_eval.metrics.triple_alignment import TripleAlignmentConfig + + +def _is_literal(term: Term) -> bool: + return isinstance(term, 
RdLiteral) + + +def _literal_text(lit: RdLiteral) -> str: + # Prefer lexical form; fall back to python value string. + try: + return str(lit) + except Exception: + return str(lit.toPython()) + + +def align_triples_by_value_embedding(tg: TripleGraph, config: "TripleAlignmentConfig") -> list[TripleAlignment]: + """ + Align generated triples in `tg` to reference triples using: + - entity alignment (for URI/BNode subjects/objects) + - embedding similarity for literal object values (for same subject+predicate) + """ + ref_tg = KgManager.load_kg(config.reference_kg) + + # 0) Blank node mapping. + # + # rdflib assigns fresh IDs to BNodes on parse/load, so loading the "same" KG + # twice will not preserve BNode identifiers. We map BNodes by an outgoing-edge + # signature (predicate + object lexical form) to make exact-equal graphs align. + def _term_key(t: Term) -> str: + return str(t) + + def _bnode_signature(g: TripleGraph, b: BNode) -> tuple[tuple[str, str], ...]: + pairs: list[tuple[str, str]] = [] + for _, p, o in g.triples((b, None, None)): + if _is_literal(o): + ok = _literal_text(o) + else: + ok = _term_key(o) + pairs.append((_term_key(p), ok)) + pairs.sort() + return tuple(pairs) + + def _build_bnode_map(gen_g: TripleGraph, ref_g: TripleGraph) -> dict[str, Term]: + ref_by_sig: dict[tuple[tuple[str, str], ...], list[BNode]] = {} + for s, _, _ in ref_g.triples((None, None, None)): + if isinstance(s, BNode): + print(f"Ref bnode: {s}") + sig = _bnode_signature(ref_g, s) + ref_by_sig.setdefault(sig, []).append(s) + + gen_by_sig: dict[tuple[tuple[str, str], ...], list[BNode]] = {} + for s, _, _ in gen_g.triples((None, None, None)): + if isinstance(s, BNode): + print(f"Gen bnode: {s}") + sig = _bnode_signature(gen_g, s) + gen_by_sig.setdefault(sig, []).append(s) + + # Accept signature matches. If a signature occurs multiple times in both graphs, + # map deterministically by sorting node IDs and zipping. 
This makes identical KGs + # align even when they contain repeated blank-node structures. + out: dict[str, Term] = {} + for sig, gen_nodes in gen_by_sig.items(): + ref_nodes = ref_by_sig.get(sig, []) + if not ref_nodes: + continue + if len(gen_nodes) != len(ref_nodes): + continue + for gnode, rnode in zip(sorted(gen_nodes, key=_term_key), sorted(ref_nodes, key=_term_key)): + out[_term_key(gnode)] = rnode + return out + + gen_bnode_to_ref: dict[str, Term] = _build_bnode_map(tg, ref_tg) + + # 1) Entity alignments (generated -> reference) + ent_cfg = config.entity_alignment_config + if getattr(ent_cfg, "reference_kg", None) is None and getattr(ent_cfg, "verified_entities_path", None) is None: + # Ensure validator requirements are met; default to using the reference KG. + ent_cfg = ent_cfg.model_copy(update={"reference_kg": config.reference_kg}) + + entity_alignments = align_entities_by_label_embedding(tg, ent_cfg) + + if DEBUG: print("Entity alignments: ", len(entity_alignments)) + + def _as_term(t: Term | str) -> Term: + # Entity alignment currently carries string IDs; convert to rdflib Terms so + # aligned triples are comparable to `ref_tg.triples(...)` output. + if isinstance(t, str): + try: + from rdflib import URIRef + return URIRef(t) + except Exception: + # Fall back to raw string (will likely not match ref triples, but + # avoids crashing on non-URI identifiers). 
+ return t # type: ignore[return-value] + return t + + gen_to_ref_entity: dict[str, Term] = {} + best_score_by_gen: dict[str, float] = {} + for a in entity_alignments: + gen_key = str(a.source) + if gen_key not in best_score_by_gen or a.score > best_score_by_gen[gen_key]: + best_score_by_gen[gen_key] = float(a.score) + gen_to_ref_entity[gen_key] = _as_term(a.target) + + # 2) Index generated triples, both raw and entity-mapped + mapped_gen_triples: list[tuple[Triple, Triple]] = [] # (raw_gen, mapped_to_ref_space) + gen_by_sp_literal: dict[tuple[Term, Term], list[tuple[Triple, str]]] = {} + gen_by_sp_entity: dict[tuple[Term, Term], set[Triple]] = {} + + if DEBUG: print("Gen by sp literal: ", len(gen_by_sp_literal)) + if DEBUG: print("Gen by sp entity: ", len(gen_by_sp_entity)) + + sp_iter = getattr(tg, "iter_sp_groups", None) + if callable(sp_iter): + sp_groups = sp_iter() + for s, p, os in sp_groups: + for o in os: + mapped_s = gen_to_ref_entity.get(str(s), gen_bnode_to_ref.get(str(s), s)) + mapped_o = gen_to_ref_entity.get(str(o), gen_bnode_to_ref.get(str(o), o)) if not _is_literal(o) else o + mapped = (mapped_s, p, mapped_o) + raw = (s, p, o) + mapped_gen_triples.append((raw, mapped)) + + # Normalize keys to string form to avoid rdflib Term vs string mismatches. + sp = (_term_key(mapped_s), _term_key(p)) + if _is_literal(o): + gen_by_sp_literal.setdefault(sp, []).append((raw, _literal_text(o))) + else: + gen_by_sp_entity.setdefault(sp, set()).add(raw) + else: + for s, p, o in tg.triples((None, None, None)): + mapped_s = gen_to_ref_entity.get(str(s), gen_bnode_to_ref.get(str(s), s)) + mapped_o = gen_to_ref_entity.get(str(o), gen_bnode_to_ref.get(str(o), o)) if not _is_literal(o) else o + mapped = (mapped_s, p, mapped_o) + raw = (s, p, o) + mapped_gen_triples.append((raw, mapped)) + + # Normalize keys to string form to avoid rdflib Term vs string mismatches. 
+ sp = (_term_key(mapped_s), _term_key(p)) + if _is_literal(o): + gen_by_sp_literal.setdefault(sp, []).append((raw, _literal_text(o))) + else: + gen_by_sp_entity.setdefault(sp, set()).add(raw) + + if DEBUG: print("Mapped gen triples: ", len(mapped_gen_triples)) + + # 3) Prepare literal embedding caches (optional) + model = get_model() + alignments: list[TripleAlignment] = [] + + # Encode generated literal texts once, cache by text. + cache_gen_literals = bool(getattr(config, "cache_literal_embeddings", True)) + gen_lit_emb_by_text: dict[str, np.ndarray] = {} + if cache_gen_literals and gen_by_sp_literal: + unique_texts = sorted({txt for candidates in gen_by_sp_literal.values() for _, txt in candidates}) + if unique_texts: + emb = model.encode(unique_texts, convert_to_numpy=True, show_progress_bar=True) + gen_lit_emb_by_text = {t: emb[i : i + 1] for i, t in enumerate(unique_texts)} + + # Reference literal embedding cache (by text); only literal objects are ever looked up. + cache_ref_literals = bool(getattr(config, "cache_ref_literal_embeddings", True)) + ref_lit_emb_by_text: dict[str, np.ndarray] = {} + if cache_ref_literals: + unique_texts = sorted({_literal_text(ro) for _, _, ro in ref_tg.triples((None, None, None)) if _is_literal(ro)}) + if unique_texts: + emb = model.encode(unique_texts, convert_to_numpy=True, show_progress_bar=True) + ref_lit_emb_by_text = {t: emb[i : i + 1] for i, t in enumerate(unique_texts)} + + def get_ref_literal_embedding(texts: list[str]) -> np.ndarray: + if cache_ref_literals and ref_lit_emb_by_text: + return np.concatenate([ref_lit_emb_by_text[t] for t in texts], axis=0) + else: + return model.encode(texts, convert_to_numpy=True, show_progress_bar=True) + + + def get_gen_literal_embedding(texts: list[str]) -> np.ndarray: + if cache_gen_literals and gen_lit_emb_by_text: + return np.concatenate([gen_lit_emb_by_text[t] for t in texts], axis=0) + else: + return model.encode(texts, convert_to_numpy=True, show_progress_bar=True) + + if DEBUG: print("gen_lit_emb_by_text: ", len(gen_lit_emb_by_text))
+ if DEBUG: print("ref_lit_emb_by_text: ", len(ref_lit_emb_by_text)) + + from rdflib import Graph + gen_graph : Graph = tg._graph() + ref_graph : Graph = ref_tg._graph() + + if DEBUG: print("Gen graph: ", len(list(gen_graph.triples((None, None, None))))) + if DEBUG: print("Ref graph: ", len(list(ref_graph.triples((None, None, None))))) + + for gs, gp in tqdm(gen_graph.subject_predicates(unique=True), desc="Aligning triples by value embedding"): + # sp = (_term_key(gs), _term_key(gp)) + + # check for s mapping in reference space + ref_s = gen_to_ref_entity.get(str(gs), gen_bnode_to_ref.get(str(gs), gs)) + if ref_s is None: + continue # s is not mapped to reference space + + gen_objects = list(gen_graph.objects(gs, gp)) + gen_literal_objs = [o for o in gen_objects if _is_literal(o)] + + # IMPORTANT: query reference objects in reference-space subject + ref_objects = list(ref_graph.objects(ref_s, gp)) + ref_literal_objs = [o for o in ref_objects if _is_literal(o)] + + # print("gs: ", gs, "gp: ", gp) + # print("ref_s: ", ref_s) + # print("gen_literal_objs: ", len(gen_literal_objs)) + # print("ref_literal_objs: ", len(ref_literal_objs)) + # print("gen_objects: ", len(gen_objects)) + # print("ref_objects: ", len(ref_objects)) + + if len(gen_literal_objs) > 0 and len(ref_literal_objs) > 0: + + gen_object_texts = [_literal_text(o) for o in gen_literal_objs] + ref_object_texts = [_literal_text(o) for o in ref_literal_objs] + + gen_object_embeddings = get_gen_literal_embedding(gen_object_texts) + ref_object_embeddings = get_ref_literal_embedding(ref_object_texts) + + sims = np.dot(gen_object_embeddings, ref_object_embeddings.T) # shape (n_gen, n_ref) + best_flat = int(np.argmax(sims)) + best_i, best_j = np.unravel_index(best_flat, sims.shape) + + if float(sims[best_i, best_j]) >= float(config.value_sim_threshold): + alignments.append( + TripleAlignment( + source=(gs, gp, gen_literal_objs[best_i]), + target=(ref_s, gp, ref_literal_objs[best_j]), + ) + ) + + # get all 
non-literal objects mapped to reference space + gen_object_non_literal = [o for o in gen_objects if not _is_literal(o)] + ref_object_non_literal = [o for o in ref_objects if not _is_literal(o)] + + # find if any of the non-literal objects in the generated graph are mapped to the same object in the reference graph + for gen_obj in gen_object_non_literal: + ref_o = gen_to_ref_entity.get(str(gen_obj), gen_bnode_to_ref.get(str(gen_obj), gen_obj)) + if ref_o in ref_object_non_literal: + alignments.append( + TripleAlignment( + source=(gs, gp, gen_obj), + target=(ref_s, gp, ref_o), + ) + ) + + return alignments \ No newline at end of file diff --git a/src/kgpipe_eval/utils/annotation_utils.py b/src/kgpipe_eval/utils/annotation_utils.py new file mode 100644 index 0000000..c196a77 --- /dev/null +++ b/src/kgpipe_eval/utils/annotation_utils.py @@ -0,0 +1,51 @@ +from typing import Literal + +# Labels + +def get_labeled_entities(kg: KG, reference_kg: KG, method: Literal["exact", "fuzzy", "semantic"] = "exact") -> list[Entity]: + return kg.entities.intersection(reference_kg.entities) + +def get_labeled_triples(kg: KG, reference_kg: KG, method: Literal["exact", "fuzzy", "semantic"] = "exact") -> list[Triple]: + return kg.triples.intersection(reference_kg.triples) + + +def label_triples_with_llm(Triple): + """ +You are validating RDF triples. + +Task 1: +For each triple, decide whether it is: +- plausible in isolation +- implausible in isolation +- unclear + +Task 2: +Considering that all triples refer to the same subject node, decide whether the set is: +- coherent +- ambiguous +- conflated +- temporally inconsistent +- geographically inconsistent + +Task 3: +Explain which triples are mutually incompatible and why. 
+{ + "triple_labels": [ + { + "triple": ":Paris :locatedIn :France .", + "label": "plausible_in_isolation" + }, + { + "triple": ":Paris :population \"2,100,000\" .", + "label": "plausible_in_isolation" + }, + { + "triple": ":Paris :locatedIn :Texas .", + "label": "plausible_in_isolation" + } + ], + "entity_label": "conflated", + "graph_label": "contextually_incompatible", + "explanation": "The subject :Paris appears to merge Paris, France and Paris, Texas." +} + """ \ No newline at end of file diff --git a/src/kgpipe_eval/utils/entailment_utils.py b/src/kgpipe_eval/utils/entailment_utils.py new file mode 100644 index 0000000..19e8cb0 --- /dev/null +++ b/src/kgpipe_eval/utils/entailment_utils.py @@ -0,0 +1,7 @@ + + +def check_entailment(): + pass + +def check_entailment_by_llm(): + pass \ No newline at end of file diff --git a/src/kgpipe_eval/utils/kg_utils.py b/src/kgpipe_eval/utils/kg_utils.py new file mode 100644 index 0000000..8d0317d --- /dev/null +++ b/src/kgpipe_eval/utils/kg_utils.py @@ -0,0 +1,205 @@ +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +from typing import Iterable, Protocol, Union, runtime_checkable, Optional, Tuple, Literal +from collections import defaultdict + +from rdflib import RDF, Graph, RDFS +from rdflib.term import Identifier, Literal, URIRef + +from kgpipe.common import KG + +KgLike = Union[KG, Graph, str, Path] + +Term = Union[Identifier, str, URIRef, Literal] + +Triple = tuple[Term, Term, Term] + +TriplePattern = Tuple[ + Optional[Term], Optional[Term], Optional[Term] +] + +@runtime_checkable +class TripleGraph(Protocol): + """ + TripleGraph is a protocol that defines the interface for a graph that can be used to evaluate metrics. + It is used to abstract the underlying graph implementation and allow for different graph implementations to be used. 
+ + This is intentionally small: metrics should depend on *these* operations, + not on a specific in-memory representation (RDFLib Graph today; Spark later). + """ + + def triples(self, triple_pattern: TriplePattern) -> Iterable[Triple]: + pass + + def subjects(self) -> Iterable[Term]: + pass + + def entities(self) -> Iterable[Term]: + pass + + def labels(self, term: Term) -> Literal: + pass + + def types(self, term: Term) -> Iterable[Term]: + pass + + def close(self) -> None: + pass + + def cache(self) -> None: + pass + +# def iter_triples(self) -> Iterable[Triple]: +# """Iterate (s, p, o) triples.""" + +# @property +# def triples(self) -> frozenset[Triple]: +# """Materialized triple set (may be expensive).""" + +# @property +# def entities(self) -> frozenset[Term]: +# """All subjects/objects that are IRIs or blank nodes (no literals).""" + +# @property +# def relations(self) -> frozenset[Term]: +# """All predicates.""" + +# @property +# def classes(self) -> frozenset[Term]: +# """All classes used in rdf:type assertions.""" + +# @property +# def class_occurrences(self) -> Mapping[Term, int]: +# """Class → number of rdf:type occurrences.""" + +@dataclass(frozen=True) +class SparkTripleGraph(TripleGraph): + """ + KG backend that exposes evaluation-friendly views derived from a Spark DataFrame. + """ + # df: SparkDataFrame + + def triples(self, triple_pattern: TriplePattern) -> Iterable[Triple]: + # return self.df.filter(triple_pattern).collect() + pass + + def close(self) -> None: + pass + + def cache(self) -> None: + pass + +@dataclass(frozen=True) +class RdfLibTripleGraph(TripleGraph): + """ + KG backend that exposes evaluation-friendly views derived from an RDFLib `Graph`.
+ + Accepts: + - `kgpipe.common.KG` (uses `get_graph()`) + - an RDFLib `Graph` + - a path/str (parsed by RDFLib) + """ + kg: KgLike + + def _graph(self) -> Graph: + if isinstance(self.kg, Graph): + return self.kg + elif isinstance(self.kg, KG): + return self.kg.get_graph() + elif isinstance(self.kg, (str, Path)): + return Graph().parse(str(self.kg)) + else: + raise ValueError(f"Unsupported KG type: {type(self.kg)}") + + def get_graph(self) -> Graph: + return self._graph() + + def get_ontology_graph(self) -> Graph: + if isinstance(self.kg, KG): + return self.kg.get_ontology_graph() + else: + raise ValueError(f"Unsupported KG type: {type(self.kg)}") + + def triples(self, triple_pattern: TriplePattern) -> Iterable[Triple]: + g = self._graph() + # RDFLib yields (s, p, o) as Identifiers + return g.triples(triple_pattern) + + def subjects(self) -> Iterable[Term]: + g = self._graph() + return g.subjects(unique=True) + + def iter_sp_groups(self) -> Iterable[tuple[Term, Term, list[Term]]]: + """Yield (s, p, [o1, o2, ...]) for all subjects/predicates.""" + g = self._graph() + by_sp: dict[tuple[Term, Term], list[Term]] = defaultdict(list) + for s, p, o in g.triples((None, None, None)): + by_sp[(s, p)].append(o) + for (s, p), objs in by_sp.items(): + yield (s, p, objs) + + def subject_predicate_pairs(self) -> Iterable[tuple[Term, Term]]: + """Yield (s, p) for all subjects/predicates.""" + g = self._graph() + return g.subject_predicates(unique=True) + + def objects(self, subject: Term, predicate: Term) -> Iterable[Term]: + """Yield (o) for all objects of (s, p).""" + g = self._graph() + return g.objects(subject, predicate) + + def entities(self) -> Iterable[Term]: + return self.subjects() # TODO include objects that are not subjects + + def labels(self, term: Term) -> Literal: + g = self._graph() + return g.triples((term, RDFS.label, None)) + + def types(self, term: Term) -> Iterable[Term]: + g = self._graph() + return g.triples((term, RDF.type, None)) + +class KgManager: + """ +
KgManager is a class that manages the loading and unloading of KGs. + It is used to abstract the underlying graph implementation and allow for different graph implementations to be used. + """ + + @staticmethod + def load_kg(kg: KgLike, backend: Literal["rdflib", "spark"] = "rdflib") -> TripleGraph: + if backend == "rdflib": + return RdfLibTripleGraph(kg=kg) + else: + raise ValueError(f"Unsupported backend: {backend}") + + @staticmethod + def load_kg_from_path(path: Path, backend: Literal["rdflib", "spark"] = "rdflib") -> TripleGraph: + if backend == "rdflib": + return RdfLibTripleGraph(kg=path) + else: + raise ValueError(f"Unsupported backend: {backend}") + + @staticmethod + def cache_kg(kg: TripleGraph) -> None: + kg.cache() + + @staticmethod + def unload_kg(kg: TripleGraph) -> None: + kg.close() + + + @staticmethod + def substract_kg(kg: TripleGraph, other_kg: TripleGraph) -> TripleGraph: + """ + Subtract other_kg from kg: return a graph with the triples of kg that are not in other_kg. + """ + # TODO can be improved later by using a more efficient algorithm + triples = kg._graph().triples((None, None, None)) + other_triples = other_kg._graph() + new_graph = Graph() + for triple in triples: + if triple not in other_triples: + new_graph.add(triple) + return RdfLibTripleGraph(kg=new_graph) \ No newline at end of file diff --git a/src/kgpipe_eval/utils/measurement_utils.py b/src/kgpipe_eval/utils/measurement_utils.py new file mode 100644 index 0000000..065338b --- /dev/null +++ b/src/kgpipe_eval/utils/measurement_utils.py @@ -0,0 +1,47 @@ +from pydantic import BaseModel + +class BinaryClassificationMeasurement(BaseModel): + tp: int + fp: int + tn: int + fn: int + + def accuracy(self) -> float: + denom = (self.tp + self.tn + self.fp + self.fn) + return (self.tp + self.tn) / denom if denom else 0.0 + + def precision(self) -> float: + denom = (self.tp + self.fp) + return self.tp / denom if denom else 0.0 + + def recall(self) -> float: + denom = (self.tp + self.fn) + return self.tp / denom if denom else 0.0 + + def 
f1_score(self) -> float: + p = self.precision() + r = self.recall() + denom = (p + r) + return 2 * p * r / denom if denom else 0.0 + + def __str__(self): + return f"tp: {self.tp}, fp: {self.fp}, tn: {self.tn}, fn: {self.fn}, accuracy: {self.accuracy()}, precision: {self.precision()}, recall: {self.recall()}, f1_score: {self.f1_score()}" + + def to_dict(self) -> dict: + """ + Convenience export including derived measures. + + Note: do not override BaseModel internals like `__dict__`. + """ + return { + "tp": self.tp, + "fp": self.fp, + "tn": self.tn, + "fn": self.fn, + "accuracy": self.accuracy(), + "precision": self.precision(), + "recall": self.recall(), + "f1_score": self.f1_score(), + } + +BCMeasurement = BinaryClassificationMeasurement \ No newline at end of file diff --git a/src/kgpipe_eval/utils/metric_utils.py b/src/kgpipe_eval/utils/metric_utils.py new file mode 100644 index 0000000..71e0c71 --- /dev/null +++ b/src/kgpipe_eval/utils/metric_utils.py @@ -0,0 +1,197 @@ +from __future__ import annotations + +import csv +import json +from dataclasses import dataclass +from pathlib import Path +from collections.abc import Mapping, Sequence +from typing import Any, Iterable + +JsonValue = Any + + +@dataclass(frozen=True) +class MeasurementKey: + metric: str + measurement: str + unit: str + + +Allowlist = Mapping[str, Mapping[str, str]] + +from kgpipe_eval.api import MetricResult + + +def render_metric_result(metric_result: MetricResult, truncate: bool = False, truncate_value: int = 5) -> str: + """ + Render a MetricResult into a human-readable table-like string. + + This is intended for CLI/test output (not machine-parseable export). 
+ """ + + def _metric_key(mr: MetricResult) -> str: + metric = mr.metric + return getattr(metric, "key", metric.__class__.__name__) + + def _fmt_value(v: Any) -> str: + if isinstance(v, float): + # stable, compact representation for test output + return f"{v:.6g}" + if isinstance(v, (int, bool)) or v is None: + return str(v) + if isinstance(v, str): + if truncate: + lines = v.splitlines()[:truncate_value] + return "\n".join(lines) + "\n..." + return v + if isinstance(v, Mapping): + rendered = json.dumps(v, indent=2, sort_keys=True, default=str) + if truncate: + return "\n".join(rendered.splitlines()[:truncate_value]) + "\n..." + return rendered + if isinstance(v, Sequence) and not isinstance(v, (str, bytes, bytearray)): + rendered = json.dumps(v, indent=2, sort_keys=True, default=str) + if truncate: + return "\n".join(rendered.splitlines()[:truncate_value]) + "\n..." + return rendered + return str(v) + + key = _metric_key(metric_result) + summary = metric_result.summary or "" + + ms = sorted(metric_result.measurements, key=lambda m: m.name) + name_w = max([len("measurement"), *(len(m.name) for m in ms)] or [len("measurement")]) + unit_w = max([len("unit"), *(len(m.unit or "") for m in ms)] or [len("unit")]) + + lines: list[str] = [] + lines.append("=" * 80) + lines.append(f"metric: {key}") + if summary: + lines.append(f"summary: {summary}") + if not ms: + lines.append("(no measurements)") + return "\n".join(lines) + + lines.append("") + lines.append(f"{'measurement':<{name_w}} {'value'}{' ' * max(1, 2)}{'unit':<{unit_w}}") + lines.append(f"{'-' * name_w} {'-' * 20} {'-' * unit_w}") + + for m in ms: + unit = m.unit or "" + rendered = _fmt_value(m.value) + rendered_lines = rendered.splitlines() or [""] + lines.append(f"{m.name:<{name_w}} {rendered_lines[0]:<20} {unit:<{unit_w}}") + for cont in rendered_lines[1:]: + lines.append(f"{'':<{name_w}} {cont}") + + return "\n".join(lines) + + +def parse_eval_results(path: Path) -> dict[MeasurementKey, JsonValue]: + """ + 
Parse a single `eval_results.json` and return a flattened mapping. + + Expected file schema (per entry): + - metric: str + - measurements: [{name: str, value: any-json, unit: str|null}, ...] + """ + raw = json.loads(path.read_text()) + if not isinstance(raw, list): + raise ValueError(f"{path} must contain a JSON list, got {type(raw).__name__}") + + out: dict[MeasurementKey, JsonValue] = {} + for entry in raw: + if not isinstance(entry, Mapping): + raise ValueError(f"{path} entries must be objects, got {type(entry).__name__}") + + metric = entry.get("metric") + if not isinstance(metric, str) or not metric: + raise ValueError(f"{path} entry missing 'metric' string") + + measurements = entry.get("measurements", []) + if not isinstance(measurements, list): + raise ValueError(f"{path} entry 'measurements' must be a list") + + for m in measurements: + if not isinstance(m, Mapping): + continue + name = m.get("name") + unit = m.get("unit") + if not isinstance(name, str) or not name: + continue + if unit is None: + unit = "" + if not isinstance(unit, str): + unit = str(unit) + out[MeasurementKey(metric=metric, measurement=name, unit=unit)] = m.get("value") + + return out + + +def allowlist_to_columns(allowlist: Allowlist) -> list[str]: + cols: list[str] = [] + for metric in sorted(allowlist.keys()): + for measurement in sorted(allowlist[metric].keys()): + unit = allowlist[metric][measurement] + cols.append(f"{metric}__{measurement}__{unit}") + return cols + + +def eval_results_jsons_to_rows( + paths: Sequence[Path], + *, + allowlist: Allowlist, +) -> list[dict[str, Any]]: + rows: list[dict[str, Any]] = [] + + for path in paths: + if path.name != "eval_results.json": + raise ValueError(f"Expected eval_results.json file, got {path}") + stage_dir = path.parent + stage = stage_dir.name + if not stage.startswith("stage_"): + raise ValueError(f"Expected stage directory named stage_*, got {stage_dir}") + + pipeline_dir = stage_dir.parent + pipeline = pipeline_dir.name + if not 
pipeline: + raise ValueError(f"Could not derive pipeline name from {path}") + + flat = parse_eval_results(path) + + row: dict[str, Any] = {"pipeline": pipeline, "stage": stage} + for metric, measurements in allowlist.items(): + for measurement, unit in measurements.items(): + key = MeasurementKey(metric=metric, measurement=measurement, unit=unit) + col = f"{metric}__{measurement}__{unit}" + row[col] = flat.get(key, "") + + rows.append(row) + + return rows + + +def write_eval_csv( + paths: Sequence[Path], + *, + out_path: Path, + allowlist: Allowlist, + delimiter: str = ",", + round_ndigits: int | None = None, +) -> None: + rows = eval_results_jsons_to_rows(paths, allowlist=allowlist) + columns = ["pipeline", "stage", *allowlist_to_columns(allowlist)] + + out_path.parent.mkdir(parents=True, exist_ok=True) + with out_path.open("w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore", delimiter=delimiter) + writer.writeheader() + for r in rows: + # Ensure blanks for missing keys + row = {k: r.get(k, "") for k in columns} + if round_ndigits is not None: + for k, v in list(row.items()): + if isinstance(v, float): + row[k] = round(v, round_ndigits) + writer.writerow(row) + diff --git a/src/kgpipe_eval/utils/verbalize_utils.py b/src/kgpipe_eval/utils/verbalize_utils.py new file mode 100644 index 0000000..7b31cfc --- /dev/null +++ b/src/kgpipe_eval/utils/verbalize_utils.py @@ -0,0 +1,16 @@ +from kgpipe_eval.utils.kg_utils import Triple, TripleGraph, TriplePattern + +def verbalize_triple_simple(triple: Triple, triple_graph: TripleGraph) -> str: + """ + Verbalize the triple by joining subject, predicate, and object; labels are not resolved from the graph yet. + """ + return f"{triple[0]} {triple[1]} {triple[2]}" + +def verbalize_triples(triples: list[Triple]) -> list[str]: + pass + +def verbalize_triple_graph(triple_graph: TripleGraph) -> list[str]: + pass + +def verbalize_triple_graph_subject_groups(triple_graph: TripleGraph) -> list[list[str]]: + pass \ No newline at end of file diff --git 
a/src/kgpipe_llm/any_extraction.py b/src/kgpipe_llm/any_extraction.py new file mode 100644 index 0000000..9509030 --- /dev/null +++ b/src/kgpipe_llm/any_extraction.py @@ -0,0 +1,198 @@ +# Generalized variant of RDF triple generation + +from kgpipe.common import Registry, DataFormat, Data, TaskInput, TaskOutput +from kgpipe.common.model.configuration import ConfigurationDefinition, Parameter, ParameterType, ConfigurationProfile +from kgpipe_llm.common.snippets import generate_ontology_snippet_v3 +from kgcore.api.ontology import OntologyUtil +from pathlib import Path +from kgpipe_llm.common.core import LLMClient + +from shutil import RegistryError +from pydantic import BaseModel, AnyUrl + +# class OntologyGroundedSurfaceTriple(BaseModel): +# subject_label: str +# predicate_uri: AnyUrl +# object_label: str + +from pydantic import BaseModel, Field, AnyUrl + + +class SurfaceTriple(BaseModel): + subject: str = Field( + description="Surface-form subject label. Not a URI." + ) + predicate_uri: AnyUrl = Field( + description="Ontology property URI." + ) + object: str = Field( + description="Surface-form object label or literal. Not a URI." + ) + + +class SurfaceTripleExtractionResult(BaseModel): + triples: list[SurfaceTriple] + +# ontology-guided semantic triple extraction. +# surface semantic triples +# ontology-grounded surface triples + + +def get_ontology_grounded_surface_triples_prompt_template(ontology: str, input_data: str) -> str: + return """ +You are an ontology-guided semantic triple extraction system. + +Your task is to extract ontology-grounded surface triples from the provided input data. + +A valid triple has the form: + + + +Where: +- subject is a surface-form string, label, name, or textual identifier. +- predicate_uri is a URI from the provided ontology vocabulary. +- object is a surface-form string, label, value, literal, or textual identifier. +- subject and object MUST NOT be converted into URIs. 
+- predicate_uri MUST be selected only from the ontology vocabulary. +- Do not invent ontology properties. +- Do not invent facts not supported by the input. +- Prefer the most specific ontology property that correctly matches the input. +- If a relation or attribute is present in the input but cannot be mapped to the ontology, place it in unmapped_candidates. +- Preserve meaningful entity names as they appear in the input, normalizing only whitespace and obvious formatting artifacts. +- Extract both attributes and relations when they can be represented with an ontology property. +- Return only valid structured output matching the provided schema. + +Ontology vocabulary: + +{ontology} + +Input data: + +{input_data} + +Extraction guidance: +1. Identify named entities, records, rows, objects, or document subjects. +2. Identify attributes and relations expressed in the input. +3. Map each attribute or relation to the best matching ontology property URI. +4. Emit triples using string labels for subject and object. +5. Include evidence when possible. +6. Include confidence between 0.0 and 1.0. +7. Report unmapped relation or attribute candidates. 
+""".format(ontology=ontology, input_data=input_data) + +def extract_ontology_surface_triples(data: str, ontology: Path, client: LLMClient) -> SurfaceTripleExtractionResult: + + ontology_snippet = generate_ontology_snippet_v3(OntologyUtil.load_ontology_from_file(ontology)) + + prompt = get_ontology_grounded_surface_triples_prompt_template(ontology_snippet, data) + response = client.send_prompt(prompt, SurfaceTripleExtractionResult) + + return response + +@Registry.task( + input_spec={"input": DataFormat.ANY}, + output_spec={"output": DataFormat.RDF_NTRIPLES}, + description="Generate RDF triples for a schema", + config_spec=ConfigurationDefinition( + name="extract_ontology_surface_triples", + parameters=[ + Parameter( + name="ontology", + datatype=ParameterType.string, + description="The schema to generate RDF triples for" + ), + Parameter( + name="prompt_template", + datatype=ParameterType.string, + description="The prompt template to use for the LLM" + ), + ] + ) +) +def extract_ontology_surface_triples_task(input: TaskInput, output: TaskOutput, config: ConfigurationProfile): + pass + + + +# from typing import Any, Literal +# from pydantic import BaseModel, Field, AnyUrl + + +# class OntologyTerm(BaseModel): +# uri: AnyUrl = Field( +# description="The ontology URI identifying a class, attribute, or relation." +# ) +# label: str | None = Field( +# default=None, +# description="Optional human-readable label for the ontology term." +# ) +# description: str | None = Field( +# default=None, +# description="Optional description or definition of the ontology term." +# ) + + +# class OntologyGroundedSurfaceTriple(BaseModel): +# subject: str = Field( +# description="Surface-form name or label of the subject entity. This is not a URI." +# ) + +# predicate_uri: AnyUrl = Field( +# description="URI of the ontology property, attribute, or relation used as the predicate." 
+# ) + +# object: str = Field( +# description="Surface-form value, entity name, label, literal, or textual object. This is not a URI." +# ) + +# subject_type_uri: AnyUrl | None = Field( +# default=None, +# description="Optional ontology class URI for the subject, if inferable from the input and ontology." +# ) + +# object_type_uri: AnyUrl | None = Field( +# default=None, +# description="Optional ontology class URI for the object, if inferable from the input and ontology." +# ) + +# evidence: str | None = Field( +# default=None, +# description="Short quote or compact excerpt from the input that supports this triple." +# ) + +# confidence: float = Field( +# ge=0.0, +# le=1.0, +# description="Model confidence that the triple is correct and uses the appropriate ontology predicate." +# ) + + +# class TripleExtractionIssue(BaseModel): +# message: str = Field( +# description="Description of an ambiguity, missing ontology term, or extraction problem." +# ) + +# severity: Literal["info", "warning", "error"] = Field( +# description="Severity of the issue." +# ) + +# related_text: str | None = Field( +# default=None, +# description="Optional source text related to the issue." +# ) + + +# class OntologySurfaceTripleExtractionResult(BaseModel): +# triples: list[OntologyGroundedSurfaceTriple] = Field( +# description="Extracted ontology-grounded surface triples." +# ) + +# unmapped_candidates: list[str] = Field( +# default_factory=list, +# description="Candidate relations or attributes found in the input that could not be mapped to the ontology." +# ) + +# issues: list[TripleExtractionIssue] = Field( +# default_factory=list, +# description="Warnings or errors encountered during extraction." 
+# ) \ No newline at end of file diff --git a/src/kgpipe_llm/common/core.py b/src/kgpipe_llm/common/core.py index d1ee620..7c3af9f 100644 --- a/src/kgpipe_llm/common/core.py +++ b/src/kgpipe_llm/common/core.py @@ -3,7 +3,7 @@ """ import os -from typing import Any, Dict, Generic, Optional, TypeVar +from typing import Any, Dict, Generic, Optional, TypeVar, cast from dotenv import load_dotenv from pydantic import BaseModel @@ -92,7 +92,7 @@ def send_prompt( ) print(f"INFO: {self.api_type}_call_with_tool {type(schema_class)}") - dict_val, _model_val = tool_call( + dict_val, model_val = tool_call( endpoint_url=endpoint, api_key=self.token, model_name=self.model_name, @@ -101,10 +101,16 @@ def send_prompt( system_prompt=system_prompt, seed=self.seed, ) + # Prefer returning the validated Pydantic instance when possible. + if isinstance(schema_class, type) and issubclass(schema_class, BaseModel): + if isinstance(model_val, BaseModel): + return model_val + if isinstance(dict_val, dict): + return cast(type[T], schema_class).model_validate(dict_val) return dict_val print(f"INFO: ollama_call with {type(schema_class)}") - return ollama_call( + raw_val = ollama_call( endpoint_url=self.endpoint_url, api_key=self.token, model_name=self.model_name, @@ -113,6 +119,10 @@ def send_prompt( system_prompt=system_prompt, seed=self.seed, ) + # Ollama path currently returns raw JSON; upgrade to a validated model when requested. 
+ if isinstance(schema_class, type) and issubclass(schema_class, BaseModel) and isinstance(raw_val, dict): + return cast(type[T], schema_class).model_validate(raw_val) + return raw_val class BaseTask(Generic[T]): diff --git a/src/kgpipe_llm/test/test_any_extraction.py b/src/kgpipe_llm/test/test_any_extraction.py new file mode 100644 index 0000000..ab7ded6 --- /dev/null +++ b/src/kgpipe_llm/test/test_any_extraction.py @@ -0,0 +1,24 @@ +from kgpipe_llm.any_extraction import extract_ontology_surface_triples +from pathlib import Path +from kgpipe_llm.common.core import LLMClient +import os + +TEXT=""" +Titanic is a 1997 American epic historical romance film written and directed by James Cameron. Incorporating both historical and fictional aspects, it is based on accounts of the sinking of RMS Titanic in 1912. Leonardo DiCaprio and Kate Winslet star as members of different social classes who fall in love during the ship's ill-fated maiden voyage. The ensemble cast includes Billy Zane, Kathy Bates, Frances Fisher, Bernard Hill, Jonathan Hyde, Danny Nucci, David Warner and Bill Paxton. Cameron's inspiration came from his fascination with shipwrecks. He felt a love story interspersed with human loss would be essential to convey the emotional impact of the disaster. Production began on September 1, 1995, when Cameron shot footage of the Titanic wreck. The modern scenes were shot on board the Shirshov Institute of Oceanology research vessel Akademik Mstislav Keldysh, which Cameron had used as a base when filming the wreck. Scale models, computer-generated imagery (CGI), and a reconstruction of the Titanic built at Baja Studios were used to recreate the sinking. Titanic was initially in development at 20th Century Fox, but delays and a mounting budget resulted in Fox partnering with Paramount Pictures for financial help. It was the most expensive film ever made at the time, with a production budget of $200 million. Filming took place from July 1996 to March 1997. 
Titanic premiered at the Tokyo International Film Festival on November 1, 1997, and was released in the United States on December 19. It was distributed by Paramount Pictures in the United States and Canada and by 20th Century Fox in other territories. It was praised for its visual effects, performances (particularly those of DiCaprio, Winslet, and Gloria Stuart), production values, direction, score, cinematography, story, and emotional depth. Among other awards, the film received fourteen nominations at the 70th Academy Awards and won eleven, including Best Picture and Best Director. In doing so, it tied both All About Eve (1950) for the record for the most Academy Award nominations, and Ben-Hur (1959) for the most Academy Awards won by a film, making Titanic the most successful individual film in Academy Award history (these records would be matched by 2016's La La Land and 2003's The Lord of the Rings: The Return of the King respectively, although the nomination record was surpassed by 2025's Sinners in 2026). With an initial worldwide gross of over $1.84 billion, Titanic was the first film to reach the billion-dollar mark (1993's Jurassic Park would later become the earliest-released film to achieve this feat, via subsequent re-releases), and was the highest-grossing film of all time until Cameron's next film, Avatar (2009), surpassed it in 2010. Income from the initial theatrical release, retail video, and soundtrack sales and US broadcast rights exceeded $3.2 billion. Releases pushed the worldwide theatrical total to $2.264 billion, making Titanic the second film to gross more than $2 billion worldwide after Avatar; as of 2023, it is the fourth-highest-grossing film. 
In 2017, the Library of Congress selected it for preservation in the United States National Film Registry as "culturally, historically, or aesthetically significant". +""" + +API_KEY = os.getenv("OPENAI_API_KEY") +if not API_KEY: + raise ValueError("OPENAI_API_KEY is not set") + +model_name="o4-mini" +ontology_path = Path("/home/marvin/phd/data/moviekg/datasets/film_10k/ontology.ttl") + +def test_extract_ontology_surface_triples(): + client = LLMClient( + model_name=model_name, + token=API_KEY, + api_type="openai", + ) + result = extract_ontology_surface_triples(TEXT, ontology_path, client) + print(result.model_dump_json(indent=2)) \ No newline at end of file diff --git a/src/kgpipe_parameters/config_mapper.py b/src/kgpipe_parameters/config_mapper.py new file mode 100644 index 0000000..a9d854b --- /dev/null +++ b/src/kgpipe_parameters/config_mapper.py @@ -0,0 +1,20 @@ + +""" +Maps a GLOBAL configuration to a local Parameter of a task implementation. +""" + +from kgpipe.common.model.configuration import Parameter, ConfigurationProfile +from kgpipe.common.model.task import KgTask, Data, TaskInput, TaskOutput + +class ConfigMapper: + def __init__(self, task: KgTask): + self.task = task + + def map_config(self, config: ConfigurationProfile): + return self.task.config + + + + +def example_task(i: TaskInput, o: TaskOutput, p: ConfigurationProfile): + pass \ No newline at end of file diff --git a/src/kgpipe_parameters/tests/test_visualization.py b/src/kgpipe_parameters/tests/test_visualization.py index 0b19589..187b826 100644 --- a/src/kgpipe_parameters/tests/test_visualization.py +++ b/src/kgpipe_parameters/tests/test_visualization.py @@ -174,3 +174,6 @@ def test_scatter_too_few_points(self, tmp_path): path = viz.plot_embedding_scatter() assert path.exists() + + + diff --git a/src/kgpipe_parameters/visualization/__init__.py b/src/kgpipe_parameters/visualization/__init__.py index 0b7baa9..0bb436b 100644 --- a/src/kgpipe_parameters/visualization/__init__.py +++
b/src/kgpipe_parameters/visualization/__init__.py @@ -4,3 +4,6 @@ __all__ = ["ParameterVisualizer"] + + + diff --git a/src/kgpipe_tasks/entity_resolution/fusion/union.py b/src/kgpipe_tasks/entity_resolution/fusion/union.py index 3af3e01..f31f49d 100644 --- a/src/kgpipe_tasks/entity_resolution/fusion/union.py +++ b/src/kgpipe_tasks/entity_resolution/fusion/union.py @@ -7,8 +7,9 @@ import json from kgpipe.common.registry import Registry import os -from kgcore.model.ontology import OntologyUtil -from kgpipe.execution.config import SOURCE_NAMESPACE, TARGET_ONTOLOGY_NAMESPACE, TARGET_RESOURCE_NAMESPACE + +from kgcore.api.ontology import OntologyUtil +from kgpipe.common.config import SOURCE_NAMESPACE, TARGET_ONTOLOGY_NAMESPACE, TARGET_RESOURCE_NAMESPACE def fuse_rdf_files(f1,f2,er): diff --git a/src/kgpipe_view/kgpipe.owl.ttl b/src/kgpipe_view/kgpipe.owl.ttl index aea0dbb..a05c816 100644 --- a/src/kgpipe_view/kgpipe.owl.ttl +++ b/src/kgpipe_view/kgpipe.owl.ttl @@ -29,9 +29,10 @@ :TaskRun a owl:Class, :RunLayer . :PipelineRun a owl:Class, :RunLayer . -:Artifact a owl:Class, :DataLayer . -:ArtifactType a owl:Class, :DataLayer . -:Schema a owl:Class, :DataLayer . +:DataArtifact a owl:Class, :DataLayer . +:DataDataArtifactSpec a owl:Class, :DataLayer . +:DataDataArtifactType a owl:Class, :DataLayer . +#:Schema a owl:Class, :DataLayer . :Parameter a owl:Class, :ParameterLayer . :ParameterBinding a owl:Class, :ParameterLayer . @@ -95,9 +96,9 @@ rdfs:domain :PipelineDefinition ; rdfs:range :Tool . -:hasSourceArtifact a owl:ObjectProperty ; +:hasSourceDataArtifact a owl:ObjectProperty ; rdfs:domain :PipelineDefinition ; - rdfs:range :Artifact . + rdfs:range :DataArtifact . ### Execution / runs :executesTask a owl:ObjectProperty ; @@ -121,31 +122,35 @@ rdfs:range :TaskRun . ### Data flow (runtime) -:hasInputArtifact a owl:ObjectProperty ; +:hasInputDataArtifact a owl:ObjectProperty ; rdfs:domain :TaskRun ; - rdfs:range :Artifact . + rdfs:range :DataArtifact . 
-:hasOutputArtifact a owl:ObjectProperty ; +:hasOutputDataArtifact a owl:ObjectProperty ; rdfs:domain :TaskRun ; - rdfs:range :Artifact . + rdfs:range :DataArtifact . ### Data flow typing (design-time) -:expectsInputType a owl:ObjectProperty ; +:expectsInputSpec a owl:ObjectProperty ; rdfs:domain :Implementation ; - rdfs:range :ArtifactType . + rdfs:range :DataDataArtifactSpec . -:producesOutputType a owl:ObjectProperty ; +:producesOutputSpec a owl:ObjectProperty ; rdfs:domain :Implementation ; - rdfs:range :ArtifactType . + rdfs:range :DataDataArtifactSpec . -### Artifact typing / schema -:hasArtifactType a owl:ObjectProperty ; - rdfs:domain :Artifact ; - rdfs:range :ArtifactType . +:requiresType a owl:ObjectProperty ; + rdfs:domain :DataDataArtifactSpec ; + rdfs:range :DataDataArtifactType . -:conformsToSchema a owl:ObjectProperty ; - rdfs:domain :Artifact ; - rdfs:range :Schema . +### DataArtifact typing / schema +:hasDataDataArtifactType a owl:ObjectProperty ; + rdfs:domain :DataArtifact ; + rdfs:range :DataDataArtifactType . + +#:conformsToSchema a owl:ObjectProperty ; +# rdfs:domain :DataArtifact ; +# rdfs:range :Schema . ### Parameters :hasParameter a owl:ObjectProperty ; @@ -165,6 +170,10 @@ ################################################################# ### Implementation +:implementationName a owl:DatatypeProperty ; + rdfs:domain :Implementation ; + rdfs:range xsd:string . + :commandTemplate a owl:DatatypeProperty ; rdfs:domain :Implementation ; rdfs:range xsd:string . @@ -177,11 +186,47 @@ rdfs:domain :Implementation ; rdfs:range xsd:string . +### Method +:methodName a owl:DatatypeProperty ; + rdfs:domain :Method ; + rdfs:range xsd:string . + ### Tool :toolVersion a owl:DatatypeProperty ; rdfs:domain :Tool ; rdfs:range xsd:string . +:toolName a owl:DatatypeProperty ; + rdfs:domain :Tool ; + rdfs:range xsd:string . + +:toolPage a owl:DatatypeProperty ; + rdfs:domain :Tool ; + rdfs:range xsd:string . 
+ +:toolName a owl:DatatypeProperty ; + rdfs:domain :Tool ; + rdfs:range xsd:string . + +### DataArtifact +:location a owl:DatatypeProperty ; + rdfs:domain :DataArtifact ; + rdfs:range xsd:anyURI . + +### DataDataArtifactSpec +:dataType a owl:DatatypeProperty ; + rdfs:domain :DataDataArtifactSpec ; + rdfs:range xsd:string . + +### DataDataArtifactType +:dataFormat a owl:DatatypeProperty ; + rdfs:domain :DataDataArtifactType ; + rdfs:range xsd:string . + +:dataSchema a owl:DatatypeProperty ; + rdfs:domain :DataDataArtifactType ; + rdfs:range xsd:string . + ### Parameter :paramName a owl:DatatypeProperty ; rdfs:domain :Parameter ; @@ -238,12 +283,3 @@ rdfs:domain :PipelineRun ; rdfs:range xsd:string . -### Artifact -:location a owl:DatatypeProperty ; - rdfs:domain :Artifact ; - rdfs:range xsd:anyURI . - -### ArtifactType -:format a owl:DatatypeProperty ; - rdfs:domain :ArtifactType ; - rdfs:range xsd:string . \ No newline at end of file