A scalable GPU-based implementation of RDF2Vec embeddings for large and dense knowledge graphs.
Important
This package is under active development and in the beta phase. The overall class and method design will most likely change, introducing breaking changes between releases.
The content of this repository readme can be found here:
- gpuRDF2Vec
- Table of contents
- Package installation
- Repository setup
- gpuRDF2Vec overview
- Quick start
- Implementation Details
- License
- Roadmap
- Report issues and bugs
Install the package rdf2vecgpu by running the following command:
pip install rdf2vecgpu
Important
Make sure to install the matching CUDA version as outlined in the following section.
The repository setup builds on two major libraries: PyTorch Lightning and the RAPIDS libraries cuDF and cuGraph. We provide exemplary installation details for CUDA 12.6:
- PyTorch installation page: CUDA 12.6 installation
pip install torch torchvision torchaudio
- Detailed cuDF installation instructions can be found here: cuDF CUDA 12 install
pip install \
--extra-index-url=https://pypi.nvidia.com \
"cudf-cu12==25.4.*" "dask-cudf-cu12==25.4.*" \
"cugraph-cu12==25.4.*" "nx-cugraph-cu12==25.4.*"
The requirement files and conda environment files can be found in the performance/env_files folder.
RDF2Vec is a powerful technique to generate vector embeddings of entities in RDF graphs via random walks and Word2Vec. This repository provides a GPU-optimized reimplementation, enabling:
- Speedups on dense graphs with millions of nodes
- Scalability to industrial-scale knowledge bases
- Reproducible experiments to test and qualify the overall implementation details
.
├── README.md
├── data
├── data_preparation
│   ├── converstion_to_ttl.py
│   └── merge_text_file.py
├── img
│   └── github_repo_header.png
├── jrdf2vec-1.3-SNAPSHOT.jar
├── performance
│   ├── env_files
│   │   ├── jrdf2vec_environment.yml
│   │   ├── jrdf2vec_requirements.txt
│   │   ├── pyrdf2vec_environment.yml
│   │   ├── pyrdf2vec_requirements.txt
│   │   ├── rdf2vecgpu_environment.yml
│   │   ├── rdf2vecgpu_requirements.txt
│   │   ├── sparkrdf2vec_environment.yml
│   │   └── sparkrdf2vec_requirements.txt
│   ├── evaluation_parameters.py
│   ├── gpu_rdf2vec_performance.py
│   ├── graph_creation.py
│   ├── graph_statistics.py
│   ├── jrdf2vec_based_performance.py
│   ├── pyrdf2vec_based_performance.py
│   ├── spark_rdf2vec_performance.py
│   └── wandb_analysis.py
├── src
│   ├── __init__.py
│   ├── corpus
│   │   ├── __init__.py
│   │   └── walk_corpus.py
│   ├── cpu_based_rdf2vec_approach.py
│   ├── embedders
│   │   ├── __init__.py
│   │   ├── word2vec.py
│   │   └── word2vec_loader.py
│   ├── gpu_rdf2vec.py
│   ├── helper
│   │   ├── __init__.py
│   │   └── functions.py
│   └── reader
│       ├── __init__.py
│       └── kg_reader.py
└── test
    ├── helper
    └── reader
        ├── functions_test.py
        └── kg_reader_test.py
- GPU-backed walk generation over CUDA kernels
- Batched Word2Vec training with PyTorch Lightning
- Pluggable RDF loaders with Parquet, CSV, and TXT integration
- A performance comparison can be found in the performance folder
from rdf2vecgpu import GPU_RDF2Vec, RDF2VecConfig
# Bundle all hyperparameters in a config object
config = RDF2VecConfig(
    walk_strategy="random",
    walk_depth=4,
    walk_number=100,
    embedding_model="skipgram",
    epochs=5,
    batch_size=None,
    vector_size=100,
    window_size=5,
    min_count=1,
    learning_rate=0.01,
    negative_samples=5,
    random_state=42,
    reproducible=False,
    multi_gpu=False,
    generate_artifact=False,
    cpu_count=20,
)
# Instantiate the pipeline
gpu_rdf2vec_model = GPU_RDF2Vec(config=config)
# Path to the triple dataset
path = "data/wikidata5m/wikidata5m_kg.parquet"
# Read data and receive edge data
edge_data = gpu_rdf2vec_model.read_data(path)
# Fit the Word2Vec model and transform the dataset to an embedding
embeddings = gpu_rdf2vec_model.fit_transform(edge_df=edge_data, walk_vertices=None)
# Write embedding to file format. Return format is a cuDF dataframe
embeddings.to_parquet("data/wikidata5m/wikidata5m_embeddings.parquet", index=False)
- Supported file formats:
  - .csv
  - .parquet
  - .orc
  - .nt, .nq (all supported RDFlib file formats)
- Core RDF2VecConfig parameters (see Configuration reference for the full list):
  - walk_strategy: ["random", "bfs"]
  - walk_depth: int
  - walk_number: int
  - walk_weighted: bool (uses cuGraph biased random walks; requires a weights column)
  - embedding_model: ["skipgram", "cbow"]
  - epochs: int
  - batch_size: int | None. If None, a heuristic batch size is picked based on the data loader and the available GPU memory
  - vector_size: int
  - window_size: int
  - min_count: int
  - learning_rate: float
  - negative_samples: int
  - random_state: int
  - reproducible: bool
  - multi_gpu: bool
  - generate_artifact: bool
  - cpu_count: int
  - literal_predicates, literal_strategy, literal_n_bins, literal_bin_strategy: see the literal handling section of the docs
  - tracker: ["none", "mlflow", "wandb"], the pluggable experiment tracking backend
The experiment tracking backends and the test suite are opt-in:
pip install "rdf2vecgpu[mlflow]"
pip install "rdf2vecgpu[wandb]"
pip install "rdf2vecgpu[test]"
We achieve order-of-magnitude speedups over CPU-bound RDF2Vec on large and dense graphs by engineering both the walk extraction and the Word2Vec training pipelines:
- GPU-Native Walk Extraction
  - All random-walk and BFS operations leverage cuDF/cuGraph kernels to avoid CPU–GPU data transfers and minimize latency.
  - To generate k walks per node in one pass, we replicate node indices in a single cuDF DataFrame rather than looping. This fully utilizes GPU parallelism and eliminates Python-loop overhead (∼15× speedup).
  - BFS walks currently use GPU-side recursive joins; future work will reconstruct walks entirely in CUDA to remove join overhead.
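The replication idea above can be illustrated without a GPU. The following is a minimal CPU-only sketch, not the package's implementation: the function names replicate_starts and random_walks are hypothetical, plain Python lists stand in for cuDF columns, and edge labels (predicates) are left out for brevity. The point is the data layout: one row per (vertex, walk) pair, so that a single batched call can produce all walks.

```python
import random


def replicate_starts(vertices, walk_number):
    """Repeat every start vertex walk_number times (one row per walk)."""
    return [v for v in vertices for _ in range(walk_number)]


def random_walks(adjacency, starts, walk_depth, seed=42):
    """Run one random walk of at most walk_depth hops per start row.

    The Python loop stands in for a single batched GPU call.
    """
    rng = random.Random(seed)
    walks = []
    for start in starts:
        walk = [start]
        for _ in range(walk_depth):
            neighbours = adjacency.get(walk[-1], [])
            if not neighbours:  # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbours))
        walks.append(walk)
    return walks


# Toy graph: a -> b, a -> c, b -> c; c has no outgoing edges.
adjacency = {"a": ["b", "c"], "b": ["c"], "c": []}
starts = replicate_starts(["a", "b", "c"], walk_number=3)
walks = random_walks(adjacency, starts, walk_depth=4)
```

Because the start column already contains every vertex walk_number times, no outer loop over "walk 1, walk 2, ..." is needed; the walk generator sees one flat batch.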
- cuDF→PyTorch Lightning Handoff
  - We replaced Lightning’s default CPU-based DataLoader with a cuDF-backed pipeline: context/center columns live on GPU as DLPack tensors.
  - Initial deep-copy loads incur extra VRAM, but thereafter all sampling/preprocessing occurs on-device, eliminating PCIe stalls.
  - An “index-only” strategy (workers pull tensor indices instead of slices) uses CUDA’s pointer arithmetic for constant-time access, collapsing DataLoader overhead from ~85% of epoch time to near parity with model compute.
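To make the center/context columns and the index-only access pattern concrete, here is a small CPU-only sketch. It is an illustration under assumptions, not the package's loader: skipgram_pairs, index_batches and gather are hypothetical names, and Python lists stand in for the GPU-resident tensors. In the real pipeline the gather step would be a tensor indexing operation on device.

```python
def skipgram_pairs(walks, window_size):
    """Build parallel center/context columns from tokenised walks."""
    center, context = [], []
    for walk in walks:
        for i, token in enumerate(walk):
            lo = max(0, i - window_size)
            hi = min(len(walk), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    center.append(token)
                    context.append(walk[j])
    return center, context


def index_batches(n_rows, batch_size):
    """Yield batches of row indices instead of row slices."""
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))


def gather(center, context, indices):
    """Look the indexed rows up in the resident columns."""
    return [center[i] for i in indices], [context[i] for i in indices]


walks = [[0, 1, 2], [3, 4]]
center, context = skipgram_pairs(walks, window_size=1)
batches = [
    gather(center, context, idx)
    for idx in index_batches(len(center), batch_size=4)
]
```

Only the small index lists move between the batching step and the lookup; the two columns are built once and never copied per batch.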
- Optimized Word2Vec Training
  - Batch-Size Heuristic: Estimate per-sample GPU footprint from cuDF loader, then set initial batch = (total VRAM) / (4 × footprint). This “divide-by-four” rule quickly homes in on a viable batch size, reducing tuning runs.
  - Kernel Fusion: All sampling and tensor transforms migrated into PyTorch’s C++ back end, removing Python loops and the GIL, for consistent high throughput.
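The divide-by-four rule above is simple enough to write down directly. This is a sketch of the formula only; the function name initial_batch_size, the lower bound of 1, and the example numbers are assumptions for illustration, and how the package actually measures the per-sample footprint is not shown here.

```python
def initial_batch_size(total_vram_bytes, per_sample_bytes, divisor=4):
    """Return floor(total VRAM / (divisor * per-sample footprint)), at least 1."""
    if total_vram_bytes <= 0 or per_sample_bytes <= 0:
        raise ValueError("memory sizes must be positive")
    return max(1, total_vram_bytes // (divisor * per_sample_bytes))


# Made-up numbers: a 24 GiB card and a 1 KiB footprint per training sample.
batch = initial_batch_size(24 * 1024**3, 1024)
```

With these numbers the heuristic proposes 6,291,456 samples per batch, leaving three quarters of the VRAM as headroom for model weights, gradients and intermediate buffers.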
- Scalable Data-Parallel Training
  - We use PyTorch Distributed + NCCL: each GPU holds the same graph shard but a unique walk corpus.
  - Gradients are synchronized via all_reduce at regular intervals (~500 ms), amortizing PCIe/NVLink costs and ensuring linear scaling across nodes.
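The data-parallel scheme above has two ingredients: every rank trains on its own part of the walk corpus, and gradients are averaged across ranks. The single-process simulation below sketches both. It is not the package's code: shard_walks and all_reduce_mean are hypothetical helpers, the strided sharding is an assumption, and neither NCCL nor the time-based synchronization interval is modelled.

```python
def shard_walks(walks, rank, world_size):
    """Give each rank a disjoint, strided slice of the walk corpus."""
    return walks[rank::world_size]


def all_reduce_mean(per_rank_grads):
    """Average gradient vectors element-wise, as a mean all-reduce would."""
    world_size = len(per_rank_grads)
    return [sum(values) / world_size for values in zip(*per_rank_grads)]


walks = [f"walk_{i}" for i in range(7)]
shards = [shard_walks(walks, rank, world_size=3) for rank in range(3)]

# One toy gradient vector per rank; after the reduce every rank holds the mean.
averaged = all_reduce_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The shards are disjoint and together cover the whole corpus, so no walk is trained on twice per epoch, while the averaged gradient keeps the model replicas in step.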
- Distributed correctness: when to persist
  The multi-GPU code paths use dask_cudf for the vocab build, the broadcast-encode of edge triples, and the random-walk reshape. Two patterns are easy to get wrong and produce silent data drift rather than visible crashes; both are documented inline at the relevant call sites and re-stated here for contributors.
  (a) Any non-deterministic dask output that is read more than once must be .persist()-ed before the second consumer. Examples:
  - MultiGPUWalkCorpus.random_walk persists the cuGraph walk output before it computes per-partition sizes for the reshape (see corpus/walk_corpus.py around the walks_s.persist() call).
  - _generate_vocab persists the post-shuffle deduplicated vocabulary before reading partition sizes for the cumsum offsets that drive the per-partition cupy.arange token assignment (see helper/functions.py around the vocabulary_df.persist() call).
  Without persist, every consumer triggers a re-roll of the underlying hash-shuffle / random-walk generator. Because drop_duplicates over a non-deterministic shuffle is itself non-deterministic, partition assignments drift between rolls. We observed 23 M → 15 M row loss between a partition-sizes pass and the eventual write on the standalone tool that pioneered this pattern.
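The failure mode in (a) can be reproduced in miniature without dask. The toy class below is an analogy under stated assumptions, not dask's API: LazyShuffle is a hypothetical stand-in for a lazy collection whose partition assignment is re-rolled on every evaluation, and its persist method mimics materialising the result once.

```python
import random


class LazyShuffle:
    """Toy stand-in for a lazy, non-deterministic partitioned collection."""

    def __init__(self, rows, n_partitions):
        self.rows = rows
        self.n_partitions = n_partitions
        self._cached = None
        self._calls = 0  # how many times the shuffle was actually rolled

    def compute(self):
        """Re-roll the partition assignment unless a persisted copy exists."""
        if self._cached is not None:
            return self._cached
        self._calls += 1
        rng = random.Random(self._calls)  # a different roll on every call
        parts = [[] for _ in range(self.n_partitions)]
        for row in self.rows:
            parts[rng.randrange(self.n_partitions)].append(row)
        return parts

    def persist(self):
        """Materialise once; later compute() calls reuse the same result."""
        self._cached = self.compute()
        return self


# Without persist: two consumers trigger two independent rolls.
lazy = LazyShuffle(list(range(1000)), n_partitions=4)
lazy.compute()
lazy.compute()
rolls_without_persist = lazy._calls

# With persist: sizes and data are read from the same materialised roll.
persisted = LazyShuffle(list(range(1000)), n_partitions=4).persist()
sizes = [len(p) for p in persisted.compute()]  # first consumer
parts = persisted.compute()                    # second consumer
rolls_with_persist = persisted._calls
```

In the unpersisted case, partition sizes read by the first consumer need not match the partitions seen by the second, which is the drift described above. After persist, both consumers read the same roll.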
  (b) When using .merge() against a small table, prefer broadcast=True. Without it, dask defaults to a hash-shuffle join. With a small right-side partition count, the large left side gets funneled through a single worker; we observed 6+ hours at 100 % single-GPU utilization on Wikidata-scale _generate_vocab runs before this fix landed. broadcast=True replicates the small side to every worker so each large-side partition does a local hash join with no shuffle.
  Compatibility patches. _compat.apply_patches() (called from GPU_RDF2Vec.__init__ when multi_gpu=True) currently fixes two upstream issues that are still alive at the cuGraph 25.6 / dask >= 2025.5 versions we pin:
  - _patch_convert_to_cudf: cugraph.dask's random-walk convert_to_cudf helper crashes on renumber=False graphs because it unconditionally dereferences number_map.implementation.numbered.
  - _patch_dask_cudf_from_cudf: works around a dask-expr backend breakage in dask >= 2025.5 that affects dask_cudf.from_cudf.
  Both patches are self-contained workarounds; the long-term fix is upstream in cuGraph / dask-cudf. Track those there and remove the patches once they land.
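The broadcast join recommended in (b) boils down to copying the small table to every worker and joining each large partition locally. The pure-Python sketch below shows that shape. It is an illustration, not the dask implementation: broadcast_join is a hypothetical helper, dictionaries stand in for DataFrame rows, and keys in the small table are assumed to be unique.

```python
def broadcast_join(large_partitions, small_table, key):
    """Inner-join each large partition locally against a replicated lookup."""
    lookup = {row[key]: row for row in small_table}  # the "broadcast" copy
    joined = []
    for partition in large_partitions:
        out = []
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:  # inner join: drop rows without a match
                out.append({**row, **match})
        joined.append(out)
    return joined


vocab = [{"token": "a", "id": 0}, {"token": "b", "id": 1}]
edges = [
    [{"token": "a", "pos": 0}, {"token": "b", "pos": 1}],
    [{"token": "a", "pos": 2}, {"token": "z", "pos": 3}],
]
encoded = broadcast_join(edges, vocab, key="token")
```

Every partition of the large side is processed where it already lives, and the number of output partitions equals the number of input partitions, so no rows of the large table have to be shuffled between workers.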
An overview of the MIT license used by this project can be found here
- Order-aware Word2Vec following Ling, Wang, et al., "Two/too simple adaptations of word2vec for syntax problems." Issue item
- Provide spilling for single-GPU training to work around potential OOM issues during RDF2Vec training. Issue item
- Provide weighted walks for spatial datasets. Issue item
- Provide logging capabilities for the complete Word2Vec pipeline with Wandb and MLflow. Issue item
- Optional gensim Word2Vec trainer backend (backend="gensim") alongside the default PyTorch Lightning trainer. The gensim C path is 5–10× faster on CPU for very large corpora (~390 M-token vocab, multi-billion walks) and supports constant-memory streaming via pyarrow.parquet.ParquetFile.iter_batches. This is useful when running on a CPU-rich, GPU-light machine where the PyTorch trainer is bottlenecked on data movement rather than compute. Shipped in v0.4.0; opt in via pip install rdf2vecgpu[gensim] and RDF2VecConfig(backend="gensim").
In case you have found a bug or unexpected behaviour, please reach out by opening an issue:
- When opening an issue, please tag the issue with the label Bug. Please include the following information:
  - Environment: OS, Python/CUDA/PyTorch/RAPIDS versions (cuDF, cuGraph)
  - Reproduction steps: Exact commands or small code snippet
  - Input data: graph format and size (attach a minimal sample if possible)
  - Observed vs. expected behavior
  - Error messages/stack traces (copy-paste or attach logs)
- We aim to respond to open issues within 3 business days.
- If you have identified a fix, fork the repo, branch off main, implement and test, then open a PR referencing the issue.
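To gather the environment details requested above, a small standard-library script can help. This is a suggested sketch rather than a tool shipped with the package: environment_report is a hypothetical helper, the distribution names are taken from the pip commands in the installation section, and the CUDA toolkit or driver version is not covered and still has to be added by hand.

```python
import platform
import sys
from importlib import metadata

# Distribution names as used in the installation commands of this README.
PACKAGES = ["rdf2vecgpu", "torch", "cudf-cu12", "cugraph-cu12"]


def environment_report(packages=PACKAGES):
    """Collect OS, Python and package versions for a bug report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the OS, Python, PyTorch and RAPIDS entries of the checklist.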
If you use gpuRDF2Vec in your research, please cite the following paper:
@InProceedings{10.1007/978-3-032-09530-5_14,
author="B{\"o}ckling, Martin and Paulheim, Heiko",
editor="Garijo, Daniel
and Kirrane, Sabrina
and Salatino, Angelo
and Shimizu, Cogan
and Acosta, Maribel
and Nuzzolese, Andrea Giovanni
and Ferrada, Sebasti{\'a}n
and Soulard, Thibaut
and Kozaki, Kouji
and Takeda, Hideaki
and Gentile, Anna Lisa",
title="gpuRDF2vec -- Scalable GPU-Based RDF2vec",
booktitle="The Semantic Web -- ISWC 2025",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="240--257",
isbn="978-3-032-09530-5"
}
