Merged
50 commits
7f98468
chore: gitignore
Vehnem Apr 2, 2026
58899ab
exp(ontologies): experiments on infering or building target ontologies
Vehnem Mar 19, 2026
3b3d408
stash
Vehnem Mar 22, 2026
b06cad5
refactor(common): move graph/systemgraph into common.graph and reshap…
Vehnem Mar 26, 2026
86a0e81
feat(eval): init eval api changes
Vehnem Mar 31, 2026
37f67e4
exp(examples): keep up examples with new api
Vehnem Apr 2, 2026
fa406d3
exp(moviekg): added dup rate and entity precision to table
Vehnem Apr 2, 2026
442ef68
feat(eval): init refactor
Vehnem Apr 2, 2026
34b4927
feat(eval): stash
Vehnem Apr 6, 2026
9454790
feat(eval): added refactored metric impls
Vehnem Apr 8, 2026
4bb2c61
feat(eval): yaml config loader and evaluator impl
Vehnem Apr 8, 2026
584da5a
feat(eval): tests for refactor
Vehnem Apr 8, 2026
67d9647
feat(eval): commited missing util functions
Vehnem Apr 8, 2026
eba5063
exp(moviekg): new eval api implementation
Vehnem Apr 9, 2026
ce1c09d
feat(eval): finished structure and api of new eval, addded some worki…
Vehnem Apr 10, 2026
a223d83
feat(eval): changes to core, for new eval
Vehnem Apr 10, 2026
32b5790
exp(moviekg): new eval for duprate and entity count using new eval api
Vehnem Apr 10, 2026
c9213b9
Merge pull request #7 from ScaDS/refactor-eval
Vehnem Apr 10, 2026
aa29208
Merge pull request #8 from ScaDS/refactor-syskg
Vehnem Apr 10, 2026
cd69c4e
feature: parameter: vis scatter plot
Vehnem Feb 27, 2026
faa8cd2
exp(params): added agreement-maker-light
Vehnem Mar 19, 2026
a46f7c6
feat(params): init config_mapper idea (global to local tool specific …
Vehnem Mar 19, 2026
86945dc
stash
Vehnem Apr 2, 2026
41e4d88
exp(conf): added mockup experiments for paper
Vehnem Apr 10, 2026
28811be
exp(conf): missing mockup code
Vehnem Apr 10, 2026
ec08d89
fix: fusion task imports
Vehnem Apr 10, 2026
52072ac
feat(eval): add ignored-entity filtering and intersecting-type alignment
Vehnem Apr 14, 2026
ea02c15
stash
Vehnem Apr 16, 2026
1ff7cbd
stash
Vehnem Apr 16, 2026
5ff6c95
exp(params): draft config experiments; KgPipe consume configs
Vehnem Apr 21, 2026
719afd0
feat(llm): added first version of any_extract llm task
Vehnem Apr 25, 2026
2c8265c
Merge pull request #10 from ScaDS/parameters
Vehnem Apr 25, 2026
72860dd
changes to kgi-bench mov eval
Vehnem May 12, 2026
657ba84
Merge pull request #11 from ScaDS/refactor-cleanup
Vehnem May 12, 2026
a9cef34
stash
Vehnem Apr 27, 2026
71b0db8
exp(params): final selection of pipelines
Vehnem May 12, 2026
6dc3e40
fix(eval): renamed triple align metric
Vehnem May 13, 2026
9987cb2
exp(moviekg): changed paths
Vehnem May 13, 2026
36e0489
feat(genie-task)
Vehnem May 13, 2026
a32377f
exp(params): ignore test data, READE note to run sge pipelines
Vehnem Jun 1, 2026
5a5b022
docu: changed docs to mkdocs and workflow
Vehnem Jun 2, 2026
41520b5
Merge pull request #12 from ScaDS/cleanup-eval
Vehnem Jun 2, 2026
ccdf436
cleanup: docu refactor; consitency metrics update
Vehnem Jun 3, 2026
4c89c76
Revise README.md with updated KGI-Bench links
Vehnem Jun 6, 2026
a45df9d
Merge pull request #13 from ScaDS/cleanup-eval
Vehnem Jun 6, 2026
d76e982
docu: gh-page workflow
Vehnem Jun 6, 2026
61c74e2
Merge branch 'cleanup-eval'
Vehnem Jun 6, 2026
ff6360d
docs + moviekg: mkdocs fix; updated zenodo download links; Makefile fix
Vehnem Jun 8, 2026
3bae476
Merge pull request #14 from ScaDS/cleanup-eval
Vehnem Jun 8, 2026
2bda809
update mkdocs
Vehnem Jun 8, 2026
49 changes: 49 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,49 @@
name: docs

on:
  push:
    branches: [ "main" ]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: "pages"
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install docs dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[docs]"

      - name: Build site
        run: mkdocs build --strict

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: site

  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
3 changes: 3 additions & 0 deletions .gitignore
@@ -18,6 +18,9 @@ poetry.lock
.idea/
target/

# agents
.cursor/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
73 changes: 70 additions & 3 deletions README.md
@@ -1,18 +1,50 @@
# KGpipe: A Framework for Knowledge Graph Integration Pipelines

- 📊 [Benchmark Datasets](https://doi.org/10.5281/zenodo.17246357)
## Related benchmarks & datasets

- **KGI-Bench**: benchmark specification + tooling for KG integration evaluation. See `https://github.com/ScaDS/KGI-Bench`.
- **KGI-Bench (Movies)**: Movie-domain benchmark dataset release (Zenodo). See `https://doi.org/10.5281/zenodo.17246357`.


KGpipe is an open-source framework for defining, executing, and evaluating knowledge graph (KG) integration pipelines.
It enables the reuse and composition of existing tools (e.g., OpenIE, PARIS, JedAI) and Large Language Models (LLMs) into modular pipelines that integrate heterogeneous data sources into a unified KG.

![KGpipe workflow](docs/workflow.png)

**Who is this for?**
- You have multiple heterogeneous sources (RDF/JSON/text) and want a **reproducible, modular pipeline**.
- You want to **reuse existing tooling** (Python libs, Dockerized CLIs, remote APIs/LLMs) without rewriting everything.
- You want to **evaluate** generated KGs with a growing set of metrics (`kgpipe_eval`).

**Key features:**
- Modular and extensible pipeline specification.
- Support for multiple execution backends (Python, Docker, HTTP services).
- Standardized I/O between tasks for reproducibility and interoperability.
- Novel benchmark for systematic evaluation of pipelines across RDF, JSON, and text sources.
- Metrics covering structural, semantic, and reference-based evaluation.

## Quickstart (5 minutes)

Install from source (editable):

```bash
pip install -e .
kgpipe --help
```

Bootstrap a minimal example project and discover its tasks:

```bash
cd experiments/examples
./init.sh

cd "<your-new-experiment-dir>"
pip install -e .

kgpipe discover --package <your_python_package> --show-results
kgpipe list --type tasks
```

## Architecture

Each pipeline is a sequence of tasks with well-defined input/output contracts.
@@ -49,7 +81,42 @@ KGpipe provides Single-Source Pipelines (SSPs) and Multi-Source Pipelines (MSPs)

## Usage

For documentation see the [docs](docs/reproduce.md)
Documentation lives in `docs/`:
- **Start here**: `docs/index.md` and `docs/quickstart.md`
- **Adopting KGpipe / wrapping existing tools**: `docs/adoption.md`
- **Evaluation (new API)**: `docs/evaluation.md` (uses `kgpipe_eval`)
- **MovieKG reproduction**: `docs/reproduce.md`

### Documentation site (GitHub Pages)

This repo is set up to build docs with **MkDocs + Material**:
- config: `mkdocs.yml`
- local build instructions: `docs/README.md`
- deploy workflow: `.github/workflows/docs.yml` (GitHub Pages via Actions)

## Installation notes (CPU vs CUDA)

Some optional ML dependencies (e.g. `sentence_transformers`) pull in PyTorch (`torch`). Depending on which PyTorch wheel gets selected, you may see large downloads like `nvidia-*` and `triton`.

KGpipe keeps the ML stack out of the default install; install it explicitly when needed. For `uv`, PyTorch is pinned to the official PyTorch wheel indexes to avoid accidentally pulling CUDA wheels from PyPI.

### Base install (fast, no torch)

```bash
uv pip install .
```

### ML install with CPU-only PyTorch (no `nvidia-*`)

```bash
uv pip install ".[ml,cpu]"
```

### ML install with CUDA-enabled PyTorch (will download `nvidia-*`)

```bash
uv pip install ".[ml,cuda]"
```

## Experiments
- **[moviekg](experiments/moviekg/README.md)** evalaution of a pipelines, building a Movie KG from three sources (rdf,json,text).
- **[moviekg](experiments/moviekg/README.md)** evaluation of pipelines, building a Movie KG from three sources (rdf, json, text).
85 changes: 85 additions & 0 deletions docs/adoption.md
@@ -0,0 +1,85 @@
# Adopting KGpipe (integrating existing pipelines/tools)

This page explains how to **adopt KGpipe** when you already have:
- an existing KG pipeline (e.g., DBpedia-style multi-step workflows), and/or
- existing implementations you want to reuse (Python code, Dockerized tools, external APIs).

The goal is to map “what you already have” onto KGpipe’s building blocks:
- **Tasks**: reusable steps with typed inputs/outputs (`input_spec` / `output_spec`)
- **Pipelines**: ordered task graphs (`KgPipe`) that transform `Data` from seed → result
- **Configuration**: parameters passed into tasks (often via env/config profiles)

## 1) Convert an existing pipeline into a KGpipe pipeline

When you have a pipeline described elsewhere (scripts, Airflow, Makefile, DBpedia extraction steps, etc.), do this:

1. **List pipeline steps** (one row per step): name, inputs, outputs, and “how it runs” (Python/Docker/API).
2. **Define formats** for each boundary artifact (RDF formats, CSV, JSON, text). If needed, extend formats.
3. **Wrap each step as a KGpipe task** (see sections below).
4. **Compose tasks into a `KgPipe`** and verify the input/output formats connect.

Practical tip: start by wrapping a *single* step and run it via `kgpipe task ...`, then grow into a pipeline.
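
Before writing any wrapper, the step inventory from the list above can be kept as plain data and checked for format gaps. The following is only a sketch under stated assumptions: `Step` and `find_gaps` are hypothetical helpers, not part of KGpipe, and the step names and format labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One row of the step inventory (hypothetical helper, not a KGpipe class)."""
    name: str
    inputs: frozenset   # formats the step consumes
    outputs: frozenset  # formats the step produces
    runs_as: str        # "python", "docker", or "api"


def find_gaps(steps, seed_formats):
    """Return (step name, missing formats) for each step whose inputs are not yet produced."""
    available = set(seed_formats)
    gaps = []
    for step in steps:
        missing = step.inputs - available
        if missing:
            gaps.append((step.name, sorted(missing)))
        # Everything a step produces is available to all later steps.
        available |= step.outputs
    return gaps


# Illustrative inventory; names and formats are assumptions.
steps = [
    Step("extract", frozenset({"text"}), frozenset({"json"}), "api"),
    Step("map_to_rdf", frozenset({"json"}), frozenset({"nt"}), "python"),
    Step("match", frozenset({"nt", "ttl"}), frozenset({"csv"}), "docker"),
]

gaps = find_gaps(steps, seed_formats={"text"})
print(gaps)
```

Here the `match` step expects a `ttl` input that no earlier step produces, so it is reported as a gap. Such a gap usually means a seed source or a conversion step is missing from the inventory.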

## 2) Wrap existing tasks (three common patterns)

### A) Wrap a Dockerized CLI tool

Use this when the tool is a command-line program and can run inside a container.

Reference example:
- `src/kgpipe_tasks/entity_resolution/matcher/paris_rdf_matcher.py`

What to document for each wrapper:
- Docker image name + how to build/pull it
- command template (mapping KGpipe input/output keys to CLI args)
- volume mounts / working dir assumptions
- required environment variables
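
The checklist above can be made concrete with a small command builder. This is a minimal sketch, not the approach used in `paris_rdf_matcher.py`: the function `build_docker_command`, the `in_`/`out_` placeholder convention, the `/data` mount point, and the image name are all assumptions made for illustration.

```python
import shlex
from pathlib import Path


def build_docker_command(image, template, inputs, outputs, workdir, env=None):
    """Build a `docker run` argument list (hypothetical helper, not a KGpipe API).

    Assumes every input/output file lives under `workdir`, which is mounted at /data.
    """
    workdir = Path(workdir).resolve()

    def in_container(path):
        # Translate a host path into the path the tool sees inside the container.
        relative = Path(path).resolve().relative_to(workdir)
        return f"/data/{relative.as_posix()}"

    args = {f"in_{key}": in_container(path) for key, path in inputs.items()}
    args.update({f"out_{key}": in_container(path) for key, path in outputs.items()})

    cmd = ["docker", "run", "--rm", "-v", f"{workdir}:/data", "-w", "/data"]
    for name, value in (env or {}).items():
        cmd += ["-e", f"{name}={value}"]
    cmd.append(image)
    # Split the template first so that paths containing spaces stay one argument.
    cmd += [token.format(**args) for token in shlex.split(template)]
    return cmd


cmd = build_docker_command(
    image="example/matcher:latest",  # hypothetical image name
    template="match --source {in_source} --out {out_matches}",
    inputs={"source": "/tmp/kg-work/in/source.nt"},
    outputs={"matches": "/tmp/kg-work/out/matches.csv"},
    workdir="/tmp/kg-work",
    env={"JAVA_OPTS": "-Xmx2g"},
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with unusual file names.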

### B) Wrap existing Python code

Use this when you have Python functions/classes you want to call directly.

Reference example:
- `experiments/param-opti/src/param_opti/tasks/base_linker.py`

What to document for each wrapper:
- the function/class you call
- how you read from `inputs[...]` and write to `outputs[...]`
- how you map configuration parameters into function args (or config objects)
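
As a rough illustration of the three points above, a wrapper can read from `inputs[...]`, call the existing function, and write to `outputs[...]`. This sketch does not use the KGpipe task API or mirror `base_linker.py`: `link_records` stands in for your existing code, and the `run_linker` signature, the `"source"`/`"output"` keys, and the `threshold` parameter are assumptions.

```python
import json
import tempfile
from pathlib import Path


def link_records(records, threshold=0.5):
    """Stand-in for existing code: keep records whose score reaches the threshold."""
    return [record for record in records if record["score"] >= threshold]


def run_linker(inputs, outputs, config):
    """Hypothetical wrapper: file paths in, file paths out, config mapped to arguments."""
    records = json.loads(Path(inputs["source"]).read_text(encoding="utf-8"))
    # Map a configuration parameter onto the wrapped function's argument.
    linked = link_records(records, threshold=float(config.get("threshold", 0.5)))
    target = Path(outputs["output"])
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(linked), encoding="utf-8")
    return len(linked)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "in.json"
    target = Path(tmp) / "out" / "linked.json"
    source.write_text(
        json.dumps([{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}]),
        encoding="utf-8",
    )
    kept = run_linker({"source": source}, {"output": target}, {"threshold": 0.5})
    result = json.loads(target.read_text(encoding="utf-8"))

print(kept, result)
```

Keeping the wrapped function free of file handling means it stays testable on its own, while the wrapper owns the mapping between paths, configuration, and arguments.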

### C) Wrap an external API (HTTP service)

Use this when the implementation is “some service endpoint” (DBpedia Spotlight, LLM providers, etc.).

Reference examples:
- `experiments/param-opti/src/param_opti/tasks/spotlight_lib.py`
- `experiments/param-opti/src/param_opti/tasks/spotlight.py`

What to document for each wrapper:
- endpoint URL + auth
- request/response format
- retry/timeouts and caching
- how you handle rate limits and partial failures
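
Retries, timeouts, and caching from the list above can be sketched with the standard library alone. Nothing here is taken from `spotlight_lib.py` or `spotlight.py`: `fetch`, `call_with_retry`, the in-memory cache, and the endpoint URL are assumptions, and a real wrapper would likely persist the cache and honor provider-specific rate-limit headers.

```python
import time
import urllib.error
import urllib.request


def fetch(url, payload, timeout=10.0):
    """POST `payload` (bytes) to `url` and return the raw response body."""
    request = urllib.request.Request(url, data=payload, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def call_with_retry(send, url, payload, cache, retries=3, backoff=0.5):
    """Call `send(url, payload)` with caching and exponential backoff (hypothetical helper)."""
    key = (url, payload)
    if key in cache:
        return cache[key]
    last_error = None
    for attempt in range(retries):
        try:
            cache[key] = send(url, payload)
            return cache[key]
        except (urllib.error.URLError, TimeoutError) as error:
            # HTTPError (including 429 responses) is a subclass of URLError.
            last_error = error
            if attempt < retries - 1:
                time.sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Demo with a fake transport that fails twice, so no network access is needed.
calls = {"n": 0}


def flaky_send(url, payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("temporary failure")
    return b'{"ok": true}'


cache = {}
endpoint = "http://localhost:2222/annotate"  # hypothetical endpoint
first = call_with_retry(flaky_send, endpoint, b"text=Berlin", cache, backoff=0.0)
second = call_with_retry(flaky_send, endpoint, b"text=Berlin", cache, backoff=0.0)
print(first, second, calls["n"])
```

Against a real service, `fetch` would be passed as `send`. Injecting the transport keeps the retry logic testable without a running endpoint.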

## 3) Discovery (making your tasks available)

Once tasks exist in a Python package, KGpipe can discover them (they register when imported).

```bash
kgpipe discover --package <your_package> --show-results
kgpipe list --type tasks
```
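
"Register when imported" generally means that a decorator records each task in a shared registry as a side effect of the module being loaded. The sketch below shows that general pattern only. It is not KGpipe's actual registration code, and `TASK_REGISTRY`, `register_task`, and the example task are hypothetical names.

```python
TASK_REGISTRY = {}  # hypothetical registry, filled as modules are imported


def register_task(name):
    """Decorator that records a function under `name` at import time."""
    def decorator(func):
        if name in TASK_REGISTRY:
            raise ValueError(f"task {name!r} is already registered")
        TASK_REGISTRY[name] = func
        return func
    return decorator


@register_task("lowercase_labels")
def lowercase_labels(labels):
    """Example task body: normalize labels to lower case."""
    return [label.lower() for label in labels]


# The decorator ran when this module was loaded, so the task is already listed.
print(sorted(TASK_REGISTRY))
```

Under this pattern a discovery command only has to import the package's modules. A task in a module that is never imported would not show up in the listing.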

## 4) Recommended structure for “adopted” pipelines

A maintainable layout usually separates:
- `tasks/`: wrappers (Python/Docker/API)
- `pipelines/`: composition (KgPipe builders or pipeline configs)
- `configs/`: pipeline/task configuration profiles
- `docker/`: Dockerfiles and wrapper scripts (if needed)

## Status

This page is the intended replacement for `migration.md` (which was a misleading name). It will be expanded with
copy-pastable code snippets for each wrapper type using the referenced files above as canonical examples.

45 changes: 45 additions & 0 deletions docs/create-docs.md
@@ -0,0 +1,45 @@
# Docs (MkDocs)

This repository uses **MkDocs + Material** to build the documentation site from the Markdown files in `docs/`.

## Local preview

### Option A: pip

```bash
python -m pip install -e ".[docs]"
mkdocs serve
```

Then open the URL shown in the terminal (usually `http://127.0.0.1:8000/`).

### Option B: uv (recommended if you use uv)

```bash
uv pip install -e ".[docs]"
mkdocs serve
```

## Build

```bash
mkdocs build --strict
```

The static site is written to `site/`.

## Navigation / sidebar

Edit `mkdocs.yml` (`nav:` section) to control:
- sidebar structure
- ordering
- page titles

## Deployment (GitHub Pages)

Deployment is handled by the GitHub Actions workflow:
- `.github/workflows/docs.yml`

In your GitHub repo settings, set:
- **Settings → Pages → Source**: **GitHub Actions**
