Refactor: Dataset Structure, Triplet Processing, and Script Updates for MultiHopKG Compatibility#40

Open
HernandezEduin wants to merge 44 commits into HalcyonSolutions:master from HernandezEduin:master
@HernandezEduin

Contributor

Pull Request Summary

Overview

This PR introduces a major refactor and upgrade to the data and code structure of the repository, aligning with best practices for multi-hop Knowledge Graph Question Answering (KGQA) and enabling compatibility with the HalcyonSolutions/MultiHopKG project.


Key Changes

1. Dataset and Directory Reorganization

  • New Data Subdirectories:
    • embeddings/ for all embedding files
    • metadata/ for node and relationship data/info
    • mappings/ for mapping files (e.g., mid2name)
    • vocabs/ for vocabularies/sets (entities, relationships)
    • qa/ for question-answer datasets (FreebaseQA, Jeopardy, KinshipHinton, MetaQA, etc.)
    • link_prediction/ for link prediction datasets (FB15k-237, FB15k, FJ-Wiki, etc.)
    • source/ for raw source datasets
  • Files Moved & Renamed:
    • All major data and QA files have been moved into their respective subfolders and often renamed for clarity and consistency. @ottersome WITH THE SOLE EXCEPTION OF MQuAKE!! This will be done later.

2. Expanded Dataset Coverage

  • Link Prediction Datasets:
    • Added support and scripts for FB15k-237, FB15k, FJ-Wiki, Fb-Wiki-V2, Fb-Wiki, FamilyBodon (plus multiple subsets), KinshipHinton, MetaQA, WN18RR, and more.
  • QA Data:
    • Integrated new and expanded QA splits for KinshipHinton, MetaQA, and others.
  • Neo4j Dumps:
    • Included new and renamed dumps for graph database import.

3. Script Refactor & Functional Upgrades

  • Script Arguments:
    • All scripts now default to new data paths, use modern argparse patterns (including action='store_true' and nargs), and improved parameter validation.
  • Batch Processing:
    • Enhanced or added batch processing utilities for entity/relationship metadata and triplet extraction.
  • Loading/Saving:
    • Utilities now support both single files and lists, and robust DataFrame merging.
  • Consistent API:
    • Triplet processing utilities now operate on DataFrames for better extensibility.
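
The argparse patterns mentioned in this section can be sketched roughly as follows. This is a minimal illustration, not the PR's actual script code: the flag names are borrowed from commands and messages quoted elsewhere in this PR, while the default path, the use of nargs on --entity-list-path, and the helper names build_parser and parse_and_validate are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; option semantics are assumptions, not the PR's scripts."""
    parser = argparse.ArgumentParser(description="Sketch of the argument style")
    # nargs="+" collects one or more values into a list.
    parser.add_argument(
        "--entity-list-path",
        nargs="+",
        default=["./data/vocabs/nodes_fb15k.txt"],
        help="One or more entity list files",
    )
    # store_true flags default to False and become True when present.
    parser.add_argument("--scrape-list", action="store_true")
    parser.add_argument("--scrape-data", action="store_true")
    return parser


def parse_and_validate(argv):
    """Parse argv and require that at least one action flag was given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.scrape_list or args.scrape_data):
        # parser.error prints the usage message and exits with status 2.
        parser.error("at least one of --scrape-list or --scrape-data must be specified")
    return args
```

Calling parse_and_validate(["--scrape-list", "--entity-list-path", "a.txt", "b.txt"]) yields a namespace whose entity_list_path is a list of two paths.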

4. New Functionality

  • Subgraph & Neighborhood Extraction:
    • Scripts for exporting subgraphs and neighborhoods from triplet data for sampling/analysis.
  • Random Walk Statistics:
    • Scripts for generating random walk statistics over knowledge graphs.
  • MetaQA Conversion:
    • New converters for MetaQA format to multi-hop KGQA format.
  • Graph Builder:
    • Simplified Neo4j graph builder for rapid node/link loading.
  • Test Coverage:
    • Added test scripts for Wikidata utilities and triplet sanity checks.
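
To make the subgraph and neighborhood extraction concrete, here is a hedged sketch of n-hop extraction around a seed node. It assumes triplets are plain (head, relation, tail) tuples and that edges are traversed in both directions when measuring hop distance. The function name n_hop_subgraph is hypothetical, and the PR's own scripts may work differently (for example, on DataFrames).

```python
from collections import defaultdict, deque


def n_hop_subgraph(triplets, seed, n_hops):
    """Return triplets whose head and tail both lie within n_hops of seed.

    Assumption: edges are followed in both directions when measuring distance.
    """
    neighbors = defaultdict(set)
    for head, _rel, tail in triplets:
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    # Breadth-first search recording the hop distance of each reached node.
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == n_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    # Keep only triplets fully contained in the reached node set.
    return [t for t in triplets if t[0] in distance and t[2] in distance]
```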

5. Bugfixes & API Consistency

  • Entity/Relationship Pruning:
    • Improved logic for vocabulary and relationship extraction, duplicate removal, and inverse mapping.
  • Argument Consistency:
    • All scripts now use consistent argument names and paths, and support the updated data structure.

6. Documentation & Examples

  • Script Comments & Docstrings:
    • Expanded inline documentation for functions and scripts.
  • Usage Examples:
    • Added shell (.sh) and batch (.bat) scripts for typical workflows.
  • Test Suite:
    • Sample test scripts for functional verification.

7. Removals

  • Legacy Scripts & Data:
    • Deprecated scripts and data files have been removed or superseded.

Wikidata Compliance Enhancements

Wikidata scraping, entity, and property processing utilities have been updated to be fully Wikimedia-compliant:

  • Custom User-Agent:
    • All network requests (including API, SPARQL, and HTML scraping) now use a user-configurable User-Agent string as required by Wikimedia/Wikidata policies.
    • The User-Agent is loaded from a config file (config_wiki.ini) and applied globally to all requests.
    [Wikimedia]
    project = <project_name>
    repo = <github repo>
    mail = <email>
  • Rate Limiting and Backoff:
    • Thread-local sessions, retries, and exponential backoff are implemented to respect API rate limits and Retry-After headers.
    • SPARQL queries include maxlag and agent information.
  • Thread Safety:
    • Thread-local clients and sessions are used for concurrent entity/relationship fetching, ensuring robust and respectful network use.
  • API Error Handling:
    • Improved error reporting and retry logic for HTTP errors, maxlag, and backend throttling.
  • Documentation:
    • Inline docstrings and comments specify Wikimedia compliance and expected configuration.

This ensures all data collection and scraping can be performed without risking blocks or violating Wikimedia usage rules, in accordance with the repository's new standards.
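
As a rough illustration of two of the pieces described above, the sketch below builds a User-Agent string from a [Wikimedia] config section and computes a retry delay that honors a numeric Retry-After value. The section and key names come from the config fragment shown earlier; the exact User-Agent format, the function names, and the base and cap defaults are assumptions, and the thread-local session handling is left out.

```python
import configparser


def build_user_agent(config_text: str) -> str:
    """Compose a User-Agent from the [Wikimedia] section (format is an assumption)."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    section = config["Wikimedia"]
    return f"{section['project']} ({section['repo']}; {section['mail']})"


def backoff_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After value takes precedence; otherwise the delay grows
    exponentially with the attempt number and is capped at `cap`.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; this sketch ignores that form.
    return min(cap, base * (2 ** attempt))
```

In a real client the returned string would be sent as the User-Agent header on every request, and the delay would be passed to a sleep call between retries.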


Example: How to Use

Most scripts now expect inputs in the new directory structure, e.g.:

python fbwiki_triplet_creation.py \
    --entity-list-path ./data/vocabs/nodes_fb15k.txt \
    --triplet-output-path ./data/temp/triplet_creation_fb15k_wiki.txt \
    --forwarding-output-path ./data/temp/forwarding_creation_fb15k_wiki.txt

See the script_samples/ folder for more shell/batch examples.


Impact

  • Unified and clear file structure for easier navigation and maintenance.
  • Expanded support for datasets and tasks: link prediction, QA, graph construction, and analysis.
  • Cleaner, more reliable code with improved parameter handling and error resilience.
  • Ready for integration with downstream projects such as HalcyonSolutions/MultiHopKG.
  • Wikidata scraping is now fully compliant with Wikimedia's terms of service, user-agent, and rate limitations, ensuring long-term maintainability.

For Reviewers

  • Please check that your workflows and data pipelines point to the new folder and file locations.
  • All scripts and data organization have been updated to be modular and extensible for future datasets and tasks.
  • Let us know if you encounter any missing links or backward compatibility issues.

No code was edited; functions were only moved around.
- Instead of continuing after the limit has been reached, the inner and
outer loops are now broken to avoid wasting time.
- Process Entity Triplets now requires forwarding_file_path and checks
whether the file already exists. If it doesn't, a blank file is
created.
If a function takes more than two arguments, they are placed one per
line to improve readability.
Now only the triplet set is necessary to provide the statistics of the
dataset. If the entity set and relationship set are in Wikidata format,
additional information will be provided automatically (i.e.,
categories). If node_data and relation_data are provided, the
plotting can be used to show the rankings and names.
Sample scripts are provided under `script_samples` that showcase how to
run the respective code for the particular dataset listed.
…me instead of manually loading files inside

Modified the functions so that they receive a pd.DataFrame instead of
the path to the triplets, and updated the files affected by this
change.
VERY IMPORTANT: KEEP AN EYE OUT FOR POSSIBLE ERRORS DUE TO DATAFRAMES
BEING MODIFIED IN PLACE INSTEAD OF COPIED!!!
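
The in-place modification risk flagged in this commit message can be illustrated with a small sketch. The column names head, relation, and tail and the function flag_self_loops are hypothetical; the point is only that a function receiving a DataFrame should copy it before adding or changing columns, so that the caller's object stays untouched.

```python
import pandas as pd


def flag_self_loops(triplets: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the triplets with an extra boolean column.

    Working on a copy means the caller's DataFrame is not mutated.
    """
    result = triplets.copy()
    result["is_self_loop"] = result["head"] == result["tail"]
    return result
```

Without the copy() call, the new column would also appear on the DataFrame held by the caller, which is the kind of side effect the warning refers to.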
Modified load_pandas, load_to_set, and load_to_dict so they can handle
loading of multiple files at once.
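
A loader that accepts either one path or several might look like the sketch below. It is not the PR's load_to_set; the name load_lines_to_set, the one-item-per-line file format, and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Set, Union

PathLike = Union[str, Path]


def load_lines_to_set(paths: Union[PathLike, Iterable[PathLike]]) -> Set[str]:
    """Union of the non-empty, stripped lines from one file or several files."""
    if isinstance(paths, (str, Path)):
        paths = [paths]  # normalize the single-file case to a list
    items: Set[str] = set()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            items.update(line.strip() for line in handle if line.strip())
    return items
```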
Removed load_to_set from the functions that load the entities/relations;
these sets must now be passed in. Updated the affected functions.
Improvement to make the QA dataset compatible with MultiHopKG (ours).
Included code meant to be used with Kinship (Geoffrey
Hinton).
Added code to export the subset of a graph given the original graph, a
seed node, and n-hop values
Added a file that calculates the random walk statistics for a given KG,
QA set, and hop size.
Simple test file to check whether the name search, detail extraction,
and triplet collection work.
Created a subfolder for each of the triplet datasets (Fb-Wiki, Fb15k,
KinshipHinton), excluding MQuAKE (for now).
Each folder should contain (but is not limited to):
- test.txt
- train.txt
- valid.txt
- triplets.txt (entire KG)
- Create a subfolder in question for each QA dataset
- Added MetaQA and KinshipHinton Processed Questions
Added MetaQA and KinshipHinton
Consider converting the embedding files from csv to some better format
or deleting them altogether.

Copilot AI left a comment

Pull Request Overview

This PR introduces a major refactor of the repository structure and codebase to enhance compatibility with multi-hop Knowledge Graph Question Answering (KGQA) systems, particularly the HalcyonSolutions/MultiHopKG project. The refactor reorganizes data directories, expands dataset support, and ensures Wikidata compliance.

  • Comprehensive directory restructuring with dedicated folders for embeddings, metadata, mappings, vocabs, QA datasets, and link prediction datasets
  • Expanded dataset coverage including FB15k-237, MetaQA, KinshipHinton, and FamilyBodon with proper train/test/validation splits
  • Script modernization with improved argument parsing, batch processing utilities, and consistent API design

Reviewed Changes

Copilot reviewed 181 out of 251 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| triplet_creations/utils/statistics_triplets.py | Added Wikidata format detection and category calculation logic |
| triplet_creations/utils/simple_graph.py | New SimpleGraph class for Neo4j graph operations with path finding and neighborhood extraction |
| triplet_creations/utils/process_triplets.py | Refactored functions to accept DataFrames instead of file paths for better modularity |
| triplet_creations/utils/basic.py | Enhanced file loading functions to support multiple file paths and additional parameters |
| Multiple test files | New test scripts for Wikidata utilities and triplet validation |
| Script samples | Added shell and batch script examples for common workflows |
| Various other scripts | Updated argument parsing, modernized parameter handling, and aligned with new directory structure |

  return counts[counts > 1].index

- def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> (dict, set):
+ def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> Tuple[dict, set]:

Copilot AI Oct 2, 2025

The function signature uses Tuple[dict, set] but should use specific types for better type safety. Consider using Tuple[Dict[str, str], Set[int]] to indicate the dictionary maps strings to strings and the set contains integer indices.

Comment thread triplet_creations/utils/simple_graph.py Outdated
Comment thread triplet_creations/utils/simple_graph.py
Comment thread triplet_creations/jeopardy_2_wikidata_bert.py Outdated

args = parse_args()

assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."

Copilot AI Oct 2, 2025

The assertion message refers to setting flags 'to True', but these are action='store_true' flags that don't require explicit True values. Consider updating the message to 'must be specified' or 'must be provided' for clarity.

Suggested change
- assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."
+ assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be specified."

HernandezEduin and others added 3 commits October 3, 2025 01:53
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Added a file that will check if the triplets & splits are consistent and
whether there are any missing entities/relationships in the metadata.
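
One way such a consistency check could be structured is sketched below. It assumes "consistent" means that every triplet in a split also appears in the full triplet set, and that metadata coverage means every entity and relation in the KG has an entry. The function name, the argument layout, and the report format are hypothetical and not taken from the PR.

```python
def check_split_consistency(triplets, splits, known_entities, known_relations):
    """Report split rows missing from the full KG and symbols missing from metadata.

    triplets: iterable of (head, relation, tail) tuples for the whole KG
    splits: dict mapping a split name to an iterable of (head, relation, tail)
    known_entities / known_relations: identifiers present in the metadata
    """
    all_triplets = set(triplets)
    report = {"not_in_kg": {}, "missing_entities": set(), "missing_relations": set()}

    # Every split row should also appear in the full triplet set.
    for name, rows in splits.items():
        extra = set(rows) - all_triplets
        if extra:
            report["not_in_kg"][name] = extra

    # Every entity and relation in the KG should have a metadata entry.
    for head, rel, tail in all_triplets:
        for entity in (head, tail):
            if entity not in known_entities:
                report["missing_entities"].add(entity)
        if rel not in known_relations:
            report["missing_relations"].add(rel)
    return report
```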
Added a script that verifies the local Wikidata utility package is able to:
- Initialize a Client
- Search Entity by name
- Entity/Relation Metadata Retrieval
- Triplet Extraction (Head only)
@HernandezEduin

Contributor Author

@ottersome Bump

2 participants