Refactor: Dataset Structure, Triplet Processing, and Script Updates for MultiHopKG Compatibility by HernandezEduin · Pull Request #40 · HalcyonSolutions/MultiHopDatasetConstruction

HernandezEduin · 2025-10-02T17:49:58Z

Pull Request Summary

Overview

This PR introduces a major refactor and upgrade to the data and code structure of the repository, aligning with best practices for multi-hop Knowledge Graph Question Answering (KGQA) and enabling compatibility with the HalcyonSolutions/MultiHopKG project.

Key Changes

1. Dataset and Directory Reorganization

New Data Subdirectories:
- embeddings/ for all embedding files
- metadata/ for node and relationship data/info
- mappings/ for mapping files (e.g., mid2name)
- vocabs/ for vocabularies/sets (entities, relationships)
- qa/ for question-answer datasets (FreebaseQA, Jeopardy, KinshipHinton, MetaQA, etc.)
- link_prediction/ for link prediction datasets (FB15k-237, FB15k, FJ-Wiki, etc.)
- source/ for raw source datasets
Files Moved & Renamed:
- All major data and QA files have been moved into their respective subfolders and often renamed for clarity and consistency. @ottersome WITH THE SOLE EXCEPTION OF MQuAKE!! This will be done later.

2. Expanded Dataset Coverage

Link Prediction Datasets:
- Added support and scripts for FB15k-237, FB15k, FJ-Wiki, Fb-Wiki-V2, Fb-Wiki, FamilyBodon (plus multiple subsets), KinshipHinton, MetaQA, WN18RR, and more.
QA Data:
- Integrated new and expanded QA splits for KinshipHinton, MetaQA, and others.
Neo4j Dumps:
- Included new and renamed dumps for graph database import.

3. Script Refactor & Functional Upgrades

Script Arguments:
- All scripts now default to the new data paths, use modern argparse patterns (including action='store_true' and nargs), and validate their parameters more carefully.
Batch Processing:
- Enhanced or added batch processing utilities for entity/relationship metadata and triplet extraction.
Loading/Saving:
- Utilities now accept either a single file or a list of files, and merge the resulting DataFrames robustly.
Consistent API:
- Triplet processing utilities now operate on DataFrames for better extensibility.
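
A minimal sketch of the two argparse patterns named above (action='store_true' and nargs). The flag names `--triplet-paths` and `--remove-duplicates` and the default path are hypothetical and are not taken from the repository's scripts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that uses a nargs list and a store_true flag (illustrative names)."""
    parser = argparse.ArgumentParser(description="Hypothetical triplet script")
    # nargs='+' accepts one or more values and always yields a list.
    parser.add_argument(
        "--triplet-paths",
        nargs="+",
        default=["./data/link_prediction/example/train.txt"],  # hypothetical path
        help="One or more triplet files.",
    )
    # store_true flags default to False and become True when the flag is present.
    parser.add_argument(
        "--remove-duplicates",
        action="store_true",
        help="Drop duplicate triplets.",
    )
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--triplet-paths", "a.txt", "b.txt", "--remove-duplicates"]
)
print(args.triplet_paths, args.remove_duplicates)
```

With store_true there is no value to pass, so `--remove-duplicates True` would be rejected as an unexpected argument; the flag is simply present or absent.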

4. New Functionality

Subgraph & Neighborhood Extraction:
- Scripts for exporting subgraphs and neighborhoods from triplet data for sampling/analysis.
Random Walk Statistics:
- Scripts for generating random walk statistics over knowledge graphs.
MetaQA Conversion:
- New converters for MetaQA format to multi-hop KGQA format.
Graph Builder:
- Simplified Neo4j graph builder for rapid node/link loading.
Test Coverage:
- Added test scripts for Wikidata utilities and triplet sanity checks.

5. Bugfixes & API Consistency

Entity/Relationship Pruning:
- Improved logic for vocabulary and relationship extraction, duplicate removal, and inverse mapping.
Argument Consistency:
- All scripts now use consistent argument names and paths, and support the updated data structure.

6. Documentation & Examples

Script Comments & Docstrings:
- Expanded inline documentation for functions and scripts.
Usage Examples:
- Added shell (.sh) and batch (.bat) scripts for typical workflows.
Test Suite:
- Sample test scripts for functional verification.

7. Removals

Legacy Scripts & Data:
- Deprecated scripts and data files have been removed or superseded.

Wikidata Compliance Enhancements

Wikidata scraping, entity, and property processing utilities have been updated to be fully Wikimedia-compliant:

Custom User-Agent:
- All network requests (including API, SPARQL, and HTML scraping) now use a user-configurable User-Agent string as required by Wikimedia/Wikidata policies.
- The User-Agent is loaded from a config file (config_wiki.ini) and applied globally to all requests.
```
[Wikimedia]
project = <project_name>
repo = <github repo>
mail = <email>
```
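
One way such a config section could be turned into a User-Agent header is sketched below. The section and key names come from the snippet above; the function name `build_user_agent`, the sample values, and the exact format of the resulting string are assumptions, since the PR description does not show how the repository assembles it.

```python
import configparser


def build_user_agent(config_text: str) -> str:
    """Assemble a User-Agent string from a [Wikimedia] config section.

    The 'project (repo; mail)' layout is an assumption made for illustration.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["Wikimedia"]
    return f"{section['project']} ({section['repo']}; {section['mail']})"


# Sample values standing in for the placeholders in config_wiki.ini.
SAMPLE_CONFIG = """
[Wikimedia]
project = ExampleProject
repo = https://example.org/repo
mail = user@example.org
"""

user_agent = build_user_agent(SAMPLE_CONFIG)
# The header dict can then be attached to every outgoing request.
headers = {"User-Agent": user_agent}
print(headers)
```
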
Rate Limiting and Backoff:
- Thread-local sessions, retries, and exponential backoff are implemented to respect API rate limits and Retry-After headers.
- SPARQL queries include maxlag and agent information.
Thread Safety:
- Thread-local clients and sessions are used for concurrent entity/relationship fetching, ensuring robust and respectful network use.
API Error Handling:
- Improved error reporting and retry logic for HTTP errors, maxlag, and backend throttling.
Documentation:
- Inline docstrings and comments specify Wikimedia compliance and expected configuration.

This ensures all data collection and scraping can be performed without risking blocks or violating Wikimedia usage rules, in accordance with the repository's new standards.
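
The retry behaviour described above can be sketched roughly as follows. This is not the repository's implementation: `backoff_delay`, `fetch_with_retries`, and the `send` callable are hypothetical names, the treated status codes (429 and 503) are an assumption, and thread-local sessions are left out to keep the sketch free of network dependencies.

```python
import time
from typing import Callable, Optional, Tuple


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        # A server-provided Retry-After value takes precedence over our own schedule.
        return min(float(retry_after), cap)
    # Otherwise double the wait on every attempt, up to the cap.
    return min(base * (2 ** attempt), cap)


def fetch_with_retries(
    send: Callable[[], Tuple[int, Optional[float], str]],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `send` until it returns a non-throttled status or retries run out.

    `send` is a hypothetical callable returning (status, retry_after, body).
    """
    for attempt in range(max_retries + 1):
        status, retry_after, body = send()
        if status not in (429, 503):
            return body
        if attempt == max_retries:
            break
        sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError("request still throttled after retries")
```

Passing `sleep` as a parameter makes the schedule easy to inspect in a test without actually waiting.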

Example: How to Use

Most scripts now expect inputs in the new directory structure, e.g.:

python fbwiki_triplet_creation.py \
    --entity-list-path ./data/vocabs/nodes_fb15k.txt \
    --triplet-output-path ./data/temp/triplet_creation_fb15k_wiki.txt \
    --forwarding-output-path ./data/temp/forwarding_creation_fb15k_wiki.txt

See the script_samples/ folder for more shell/batch examples.

Impact

- Unified and clear file structure for easier navigation and maintenance.
- Expanded support for datasets and tasks: link prediction, QA, graph construction, and analysis.
- Cleaner, more reliable code with improved parameter handling and error resilience.
- Ready for integration with downstream projects such as HalcyonSolutions/MultiHopKG.
- Wikidata scraping is now fully compliant with Wikimedia's terms of service, user-agent, and rate limitations, ensuring long-term maintainability.

For Reviewers

- Please check that your workflows and data pipelines point to the new folder and file locations.
- All scripts and data organization have been updated to be modular and extensible for future datasets and tasks.
- Let us know if you encounter any missing links or backward compatibility issues.

Only the triplet set is now necessary to compute the dataset statistics. If the entity set and relationship set are in Wikidata format, additional information (i.e., categories) is provided automatically. If node_data and relation_data are provided, the plots can also show the rankings and names.

…me instead of manually loading files inside. The functions now receive a pd.DataFrame instead of the path to the triplets, and the files affected by this change have been updated accordingly. VERY IMPORTANT: KEEP AN EYE OUT FOR POSSIBLE ERRORS CAUSED BY MODIFYING DATAFRAMES IN PLACE INSTEAD OF COPYING THEM!!!
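
The in-place modification risk mentioned above can be illustrated with a small sketch. The function name `strip_relation_whitespace` and the column names `head`, `relation`, and `tail` are assumptions for illustration; they are not taken from process_triplets.py.

```python
import pandas as pd


def strip_relation_whitespace(triplets: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy so the caller's DataFrame is left untouched."""
    # Copy first: assigning to a column of `triplets` directly would also
    # change the object the caller still holds.
    cleaned = triplets.copy()
    cleaned["relation"] = cleaned["relation"].str.strip()
    return cleaned


original = pd.DataFrame({"head": ["Q1"], "relation": [" P31 "], "tail": ["Q5"]})
cleaned = strip_relation_whitespace(original)
print(repr(original.loc[0, "relation"]), repr(cleaned.loc[0, "relation"]))
```

Without the `.copy()` call, the column assignment would silently rewrite `original` as well, which is the kind of error the warning refers to.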

Copilot

Pull Request Overview

This PR introduces a major refactor of the repository structure and codebase to enhance compatibility with multi-hop Knowledge Graph Question Answering (KGQA) systems, particularly the HalcyonSolutions/MultiHopKG project. The refactor reorganizes data directories, expands dataset support, and ensures Wikidata compliance.

- Comprehensive directory restructuring with dedicated folders for embeddings, metadata, mappings, vocabs, QA datasets, and link prediction datasets
- Expanded dataset coverage including FB15k-237, MetaQA, KinshipHinton, and FamilyBodon with proper train/test/validation splits
- Script modernization with improved argument parsing, batch processing utilities, and consistent API design

Reviewed Changes

Copilot reviewed 181 out of 251 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| triplet_creations/utils/statistics_triplets.py | Added Wikidata format detection and category calculation logic |
| triplet_creations/utils/simple_graph.py | New SimpleGraph class for Neo4j graph operations with path finding and neighborhood extraction |
| triplet_creations/utils/process_triplets.py | Refactored functions to accept DataFrames instead of file paths for better modularity |
| triplet_creations/utils/basic.py | Enhanced file loading functions to support multiple file paths and additional parameters |
| Multiple test files | New test scripts for Wikidata utilities and triplet validation |
| Script samples | Added shell and batch script examples for common workflows |
| Various other scripts | Updated argument parsing, modernized parameter handling, and aligned with new directory structure |

Copilot · 2025-10-02T17:51:59Z

    return counts[counts > 1].index

-def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> (dict, set):
+def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> Tuple[dict, set]:


The function signature uses Tuple[dict, set] but should use specific types for better type safety. Consider using Tuple[Dict[str, str], Set[int]] to indicate the dictionary maps strings to strings and the set contains integer indices.
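
To make the suggested annotation concrete, here is a toy function that returns a `Tuple[Dict[str, str], Set[int]]`. It is not the repository's `_process_duplicate_inverse_relations`; the name `collect_inverse_pairs`, its input format, and its logic are hypothetical and only serve to show the type shapes.

```python
from typing import Dict, List, Set, Tuple


def collect_inverse_pairs(
    rows: List[Tuple[int, str, str]],
) -> Tuple[Dict[str, str], Set[int]]:
    """Map each relation to its inverse and collect indices of redundant rows.

    Each row is (row_index, relation, inverse_relation).
    """
    mapping: Dict[str, str] = {}
    redundant: Set[int] = set()
    for index, relation, inverse in rows:
        if relation in mapping or inverse in mapping:
            # The pair was already recorded in one direction; mark the row.
            redundant.add(index)
        else:
            mapping[relation] = inverse
    return mapping, redundant


mapping, redundant = collect_inverse_pairs(
    [(0, "P1", "P2"), (1, "P2", "P1"), (2, "P3", "P4")]
)
print(mapping, redundant)
```

The precise annotation lets a type checker flag callers that, for example, treat the set elements as strings.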

Copilot · 2025-10-02T17:52:00Z


    args = parse_args()

+    assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."


The assertion message refers to setting flags 'to True', but these are action='store_true' flags that don't require explicit True values. Consider updating the message to 'must be specified' or 'must be provided' for clarity.

Suggested change

-    assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."
+    assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be specified."
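
As an alternative to the assertion, the same check could be expressed with argparse's own error reporting, sketched below. The three flag names come from the assertion message above; everything else about the script's parser is assumed. Unlike `assert`, `parser.error` is not skipped when Python runs with `-O`, and it prints the usage text before exiting with status 2.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the three action flags and require that at least one is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scrape-list", action="store_true")
    parser.add_argument("--scrape-data", action="store_true")
    parser.add_argument("--create-hierarchy", action="store_true")
    args = parser.parse_args(argv)
    if not (args.scrape_list or args.scrape_data or args.create_hierarchy):
        # parser.error prints the usage message and exits with status 2.
        parser.error(
            "at least one of --scrape-list, --scrape-data, "
            "or --create-hierarchy must be specified"
        )
    return args


# Explicit argument list so the sketch does not read sys.argv.
example = parse_args(["--scrape-data"])
print(example.scrape_data)
```
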

HernandezEduin · 2026-02-27T11:16:30Z

@ottersome Bump

HernandezEduin added 30 commits October 1, 2025 15:41

WIKIDATA V2: Update in Compliance with Wikimedia Restrictions

43f46d0

WIKIDATA V2: Updating Summary and Reordering Imports

fb68d31

WIKIDATA V2: Reordering and Categorizing of Functions (Part I)

8d9292c

WIKIDATA V2: Reordering and Categorizing of Functions (Part II)

e6e8ecb

No code was edited; functions were only moved around.

WIKIDATA V2: Reordering and Categorizing of Functions (Part III)

b28330a

WIKIDATA V2: Reordering and Categorizing of Functions (Part IV - Final)

e859ff4

WIKIDATA V2: Minor Corrections

54961cb

- Instead of continuing after the limit has been reached, the inner and outer loops now break to avoid wasting time.
- Process Entity Triplets now requires forwarding_file_path and checks whether the file already exists. If it doesn't, a blank file is created.

WIKIDATA V2: Improving Function Syntax

09e30b4

If a function takes more than 2 arguments, each argument is placed on its own line to improve readability.

Sample Scripts

31e4c2d

Sample scripts are provided under `script_samples` that showcase how to run the corresponding code for each dataset listed.

Minor Code Updates

53d324d

Basic: Updating Loading functions for Multiple Loading

173843b

Modified load_pandas, load_to_set, and load_to_dict so they can handle loading of multiple files at once.
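
The idea of accepting either one path or several can be sketched as follows. This is not the repository's `load_pandas`; the name `load_triplets`, the tab separator, and the column names are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Union

import pandas as pd

PathLike = Union[str, Path]


def load_triplets(paths: Union[PathLike, List[PathLike]], sep: str = "\t") -> pd.DataFrame:
    """Load one or many headerless triplet files into a single DataFrame."""
    # Normalise a single path into a one-element list.
    if isinstance(paths, (str, Path)):
        paths = [paths]
    frames = [
        pd.read_csv(p, sep=sep, header=None, names=["head", "relation", "tail"], dtype=str)
        for p in paths
    ]
    # ignore_index gives the merged frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_triplets(["train.txt", "valid.txt", "test.txt"])` would then return the concatenation of the three splits, while `load_triplets("train.txt")` keeps working as before.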

WIKIDATA V2: Removing load_to_set from functions

c1c01f3

Removed load_to_set from the functions that load the entities/relations; these sets must now be passed in by the caller. Updated the affected functions.

FreebaseQA 2 Fb15k: Modification for MultiHopKG Compatibility

a682b3a

Improvement to make the QA Dataset compatible with the MultiHopKG (ours)

Kinship Code

8231a75

Includes code meant to be used with the Kinship dataset (Geoffrey Hinton).

Export Subset Graph

1a7c0b8

Added code to export the subset of a graph given the original graph, a seed node, and n-hop values
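
A rough sketch of n-hop subgraph extraction from a triplet list is given below. It is not the script added in this commit: the function name `n_hop_subgraph` is hypothetical, edges are treated as undirected when measuring hop distance, and a triplet is kept only when both of its endpoints lie within the hop limit. All three choices are assumptions.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def n_hop_subgraph(triplets: Iterable[Triplet], seed: str, n_hops: int) -> List[Triplet]:
    """Return the triplets whose endpoints are both within n_hops of the seed."""
    triplets = list(triplets)
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for head, _, tail in triplets:
        # Assumption: direction is ignored for reachability.
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    # Breadth-first search that stops expanding at depth n_hops.
    depth: Dict[str, int] = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == n_hops:
            continue
        for nxt in neighbors[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    return [t for t in triplets if t[0] in depth and t[2] in depth]


chain = [("A", "r", "B"), ("B", "r", "C"), ("C", "r", "D")]
print(n_hop_subgraph(chain, seed="A", n_hops=2))
```
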

Random Walk Stats

0ac985d

Added a file that calculates the random walk statistics for a given KG, QA set, and hop size.

WIKIDATA V2: Test File

64e0cc4

Simple test file to inspect whether the name search, detail extraction, and triplet collection work.

Link Prediction Datasets: Subfolder for each Dataset

ca0a46a

Created a subfolder for each of the triplet datasets (Fb-Wiki, Fb15k, KinshipHinton), excluding MQuAKE (for now). Each folder should contain (but is not limited to):
- test.txt
- train.txt
- valid.txt
- triplets.txt (entire KG)

QA Datasets: Each QA dataset has their own folder

ce215d9

- Create a subfolder in question for each QA dataset
- Added MetaQA and KinshipHinton Processed Questions

Link Prediction: Adding/Moving Triplets file to Corresponding Folder

1c1a1d7

Auto Authentication Fix (WIP)

6f4df5e

Modify folder Names: questions --> qa

35988df

Neo4j: Additional Dumps

072f954

Added MetaQA and KinshipHinton

Moving Node and Relation Data into Metadata folder

acc311f

Source: Moving Original Raw/Source file use to create triplets/qa to …

d1e448f

…'Source'

Embeddings: Moved Embeddings files into Embedding Folder

62573b7

Consider converting the embedding files from csv to some better format or deleting them altogether.

Mappings: Moved files used mainly for mapping between IDs to Mapping

5d9e7d5

Vocabs: Moved Entity and Relationship set files (txt) into Vocabs folder

7d780bf

HernandezEduin added 10 commits October 2, 2025 17:52

Removing unused/problematic Data

3a8908b

Git ignore: Adding .vscode

323c00d

Git attributes: Adding *.tsv and *.data into GCS-LFS

329c220

Dataset: Relationship_hierarchy swap path from vocabs to mappings

847b5e2

Modified code according to dataset renaming and folder moving

38d99dc

- Modified affected code according to the new data names and locations
- Additional minor changes
- Replaced str2bool with action='store_true' wherever possible
- Added TODOs to some files
- Minor quick fixes

Removing unused codes

2d9675e

MetaQA: Added file that converts MetaQA into a compatible format for …

1daf76f

…QA-KG Navigation

Moving Triplet Sanity Check into Test Folder

79134ee

Script Samples: Updates according to Datasets Changes

cc98da3

Minute Changes

afe9a28

HernandezEduin requested review from Copilot and ottersome October 2, 2025 17:49

Copilot AI reviewed Oct 2, 2025

HernandezEduin and others added 3 commits October 3, 2025 01:53

Update triplet_creations/utils/simple_graph.py

e0de6f5

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update triplet_creations/jeopardy_2_wikidata_bert.py

946d598

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

KG Integrity Checker: Verify KG triplets and metadata

421c6d4

Added a file that will check if the triplets & splits are consistent and whether there are any missing entities/relationships in the metadata.
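
The kind of consistency check described here could look roughly like the sketch below. It is not the checker added in this commit; the function name `check_kg`, the report keys, and the in-memory inputs are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def check_kg(
    triplets: Iterable[Triplet],
    splits: Dict[str, Iterable[Triplet]],
    known_entities: Set[str],
    known_relations: Set[str],
) -> Dict[str, set]:
    """Compare the full KG against its splits and against the metadata vocabularies."""
    full = set(triplets)
    split_union: Set[Triplet] = set()
    for rows in splits.values():
        split_union |= set(rows)

    entities = {h for h, _, _ in full} | {t for _, _, t in full}
    relations = {r for _, r, _ in full}
    return {
        "in_splits_not_in_kg": split_union - full,
        "in_kg_not_in_splits": full - split_union,
        "missing_entities": entities - known_entities,
        "missing_relations": relations - known_relations,
    }


report = check_kg(
    triplets=[("Q1", "P31", "Q5"), ("Q2", "P31", "Q5")],
    splits={"train": [("Q1", "P31", "Q5")], "test": [("Q3", "P31", "Q5")]},
    known_entities={"Q1", "Q5"},
    known_relations={"P31"},
)
print(report)
```

An empty set under every key would indicate that the splits and metadata agree with the full triplet file.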

HernandezEduin force-pushed the master branch from d7d03a6 to 421c6d4 on October 3, 2025 09:33

Wikidata Basic: Verify if core functionality works

eabd0db

Added a script that verifies the local wikidata utility package is able to - Initialize a Client - Search Entity by name - Entity/Relation Metadata Retrieval - Triplet Extraction (Head only)

HernandezEduin wants to merge 44 commits into HalcyonSolutions:master from HernandezEduin:master
