Refactor: Dataset Structure, Triplet Processing, and Script Updates for MultiHopKG Compatibility #40
HernandezEduin wants to merge 44 commits into
No code was edited; functions were only moved around.
- Instead of continuing after the limit has been reached, the inner and outer loops are broken to avoid wasting time.
- Process Entity Triplets now requires forwarding_file_path and checks whether the file already exists. If it doesn't, a blank file is created.
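For reference, the early exit and the blank-file check described above could look roughly like this. It is a minimal sketch, not the code from this PR: `ensure_file`, `collect_until_limit`, and their parameters are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def ensure_file(path: str) -> Path:
    """Create an empty file (and its parent folders) if it does not exist yet."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch(exist_ok=True)  # an existing file keeps its contents
    return p


def collect_until_limit(
    neighbors: Dict[str, List[Tuple[str, str]]], limit: int
) -> List[Tuple[str, str, str]]:
    """Gather (head, relation, tail) triplets, leaving both loops at `limit`."""
    triplets: List[Tuple[str, str, str]] = []
    for head, edges in neighbors.items():
        for relation, tail in edges:
            triplets.append((head, relation, tail))
            if len(triplets) >= limit:
                break  # leave the inner loop
        else:
            continue  # inner loop ended normally: go on to the next head
        break  # inner loop was broken: leave the outer loop as well
    return triplets
```

The `for ... else` form avoids a separate "limit reached" flag; a flag variable or an early `return` would work just as well.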
If a function takes more than two arguments, they are placed on separate lines to improve readability.
Now only the triplet set is necessary to provide the statistics of the dataset. If the entity set and relationship set are in Wikidata format, additional information will be provided automatically (i.e., categories). If node_data and relation_data are provided, the plotting can be used to show the rankings and names.
Sample scripts are provided under `script_samples` that showcase how to run the respective code for each dataset listed.
…me instead of manually loading files inside
Modified the functions so they receive a pd.DataFrame instead of the path to the triplets, and updated the files affected by this change. VERY IMPORTANT: KEEP AN EYE OUT FOR POSSIBLE ERRORS DUE TO MODIFICATIONS OF DATAFRAMES INSTEAD OF COPYING THEM!!!
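The hazard flagged in the warning above can be reproduced in a few lines. This is only an illustrative sketch that assumes pandas is installed; `add_flag_inplace`, `add_flag_safe`, and the column names are hypothetical and are not taken from the repository.

```python
import pandas as pd


def add_flag_inplace(triplets: pd.DataFrame) -> pd.DataFrame:
    """Risky: the new column is added to the caller's DataFrame."""
    triplets["is_loop"] = triplets["head"] == triplets["tail"]
    return triplets


def add_flag_safe(triplets: pd.DataFrame) -> pd.DataFrame:
    """Safer: work on a copy so the caller's DataFrame stays unchanged."""
    out = triplets.copy()
    out["is_loop"] = out["head"] == out["tail"]
    return out


df = pd.DataFrame(
    {"head": ["Q1", "Q2"], "relation": ["P1", "P2"], "tail": ["Q1", "Q3"]}
)
safe = add_flag_safe(df)
columns_after_safe = list(df.columns)  # still the three original columns
add_flag_inplace(df)
columns_after_inplace = list(df.columns)  # the caller's frame gained a column
```

When a function only reads the DataFrame, no copy is needed; the copy matters once columns are added, dropped, or reassigned inside the function.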
Modified load_pandas, load_to_set, and load_to_dict so they can handle loading of multiple files at once.
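One way such a loader can accept either a single path or several is sketched below. The name `load_lines_as_set` and its signature are assumptions made for illustration; the actual `load_to_set` in the repository may differ.

```python
from pathlib import Path
from typing import Iterable, Set, Union

PathLike = Union[str, Path]


def load_lines_as_set(paths: Union[PathLike, Iterable[PathLike]]) -> Set[str]:
    """Read one or several text files and return the union of their non-empty lines."""
    if isinstance(paths, (str, Path)):
        paths = [paths]  # normalise a single path to a one-element list
    items: Set[str] = set()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            items.update(line.strip() for line in handle if line.strip())
    return items
```

Normalising the argument up front keeps existing single-path callers working unchanged.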
Removed load_to_set from the functions that load the entities/relations; these sets must now be passed in. Updated affected functions.
Improvement to make the QA Dataset compatible with the MultiHopKG (ours)
Included code meant to be used with the Kinship dataset (Geoffrey Hinton).
Added code to export the subset of a graph given the original graph, a seed node, and n-hop values
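A subgraph export of this kind can be sketched as a breadth-first expansion over the triplets. The function below is a hypothetical illustration, not the code added in this commit: it treats edges as undirected for reachability and keeps every triplet touched while expanding `n_hops` times from the seed, which is one of several reasonable definitions.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]


def n_hop_subgraph(triplets: Iterable[Triplet], seed: str, n_hops: int) -> List[Triplet]:
    """Return the triplets reached by expanding `n_hops` times from `seed`."""
    triplets = list(triplets)
    adjacency = defaultdict(list)  # node -> indices of incident triplets
    for idx, (head, _, tail) in enumerate(triplets):
        adjacency[head].append(idx)
        adjacency[tail].append(idx)

    visited = {seed}
    frontier = {seed}
    kept = set()
    for _ in range(n_hops):
        next_frontier = set()
        for node in frontier:
            for idx in adjacency[node]:
                kept.add(idx)
                head, _, tail = triplets[idx]
                for other in (head, tail):
                    if other not in visited:
                        visited.add(other)
                        next_frontier.add(other)
        frontier = next_frontier
    # Preserve the original triplet order in the output.
    return [triplets[i] for i in sorted(kept)]
```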
Added a file that calculates the random walk statistics for a given KG, QA set, and hop size.
Simple test file to inspect whether the name search, detail extraction, and triplet collection work.
Created a subfolder for each of the triplet datasets (Fb-Wiki, Fb15k, KinshipHinton), excluding MQuAKE (for now). Each folder should contain (but is not limited to):
- test.txt
- train.txt
- valid.txt
- triplets.txt (entire KG)
- Create a subfolder in question for each QA dataset
- Added MetaQA and KinshipHinton Processed Questions
Added MetaQA and KinshipHinton
Consider converting the embedding files from csv to some better format or deleting them altogether.
- Modified affected code according to the new data names and location
- Additional minor changes
- Replaced str2bool with action='store_true' wherever possible
- Added TODO to some files
- Minor quick fixes
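The `str2bool` to `action='store_true'` change mentioned in the list above follows a common argparse pattern. The sketch below contrasts the two styles; the `str2bool` body and the `--shuffle` flag names are hypothetical stand-ins, not the repository's code.

```python
import argparse


def str2bool(value: str) -> bool:
    """Older style: the flag needs an explicit value, e.g. `--shuffle-old true`."""
    lowered = value.lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--shuffle-old", type=str2bool, default=False)
# Newer style: the flag is False by default and True when it is present.
parser.add_argument("--shuffle", action="store_true")

old_style = parser.parse_args(["--shuffle-old", "true"])
new_style = parser.parse_args(["--shuffle"])
```

A `store_true` flag can only switch a default of False to True, which is probably why the replacement was made "wherever possible" and not everywhere.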
Pull Request Overview
This PR introduces a major refactor of the repository structure and codebase to enhance compatibility with multi-hop Knowledge Graph Question Answering (KGQA) systems, particularly the HalcyonSolutions/MultiHopKG project. The refactor reorganizes data directories, expands dataset support, and ensures Wikidata compliance.
- Comprehensive directory restructuring with dedicated folders for embeddings, metadata, mappings, vocabs, QA datasets, and link prediction datasets
- Expanded dataset coverage including FB15k-237, MetaQA, KinshipHinton, and FamilyBodon with proper train/test/validation splits
- Script modernization with improved argument parsing, batch processing utilities, and consistent API design
Reviewed Changes
Copilot reviewed 181 out of 251 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| triplet_creations/utils/statistics_triplets.py | Added Wikidata format detection and category calculation logic |
| triplet_creations/utils/simple_graph.py | New SimpleGraph class for Neo4j graph operations with path finding and neighborhood extraction |
| triplet_creations/utils/process_triplets.py | Refactored functions to accept DataFrames instead of file paths for better modularity |
| triplet_creations/utils/basic.py | Enhanced file loading functions to support multiple file paths and additional parameters |
| Multiple test files | New test scripts for Wikidata utilities and triplet validation |
| Script samples | Added shell and batch script examples for common workflows |
| Various other scripts | Updated argument parsing, modernized parameter handling, and aligned with new directory structure |
        return counts[counts > 1].index


    - def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> (dict, set):
    + def _process_duplicate_inverse_relations(df: pd.DataFrame, rel_subprop: pd.DataFrame, duplicate_values: pd.Index, column: str) -> Tuple[dict, set]:
The function signature uses Tuple[dict, set] but should use specific types for better type safety. Consider using Tuple[Dict[str, str], Set[int]] to indicate the dictionary maps strings to strings and the set contains integer indices.
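To show what the more specific annotation communicates, here is a small stand-alone sketch. `split_duplicates` is a hypothetical function invented for illustration; it is not `_process_duplicate_inverse_relations`, and its logic is unrelated to the repository's.

```python
from typing import Dict, List, Set, Tuple


def split_duplicates(pairs: List[Tuple[str, str]]) -> Tuple[Dict[str, str], Set[int]]:
    """Keep the first value seen per key and report the row indices of later repeats."""
    mapping: Dict[str, str] = {}
    dropped: Set[int] = set()
    for index, (key, value) in enumerate(pairs):
        if key in mapping:
            dropped.add(index)  # integer row index of the duplicate
        else:
            mapping[key] = value  # string key mapped to string value
    return mapping, dropped
```

With the element types spelled out, a type checker can flag a caller that treats the returned set as a set of strings.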
    args = parse_args()

    assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."
The assertion message refers to setting flags 'to True', but these are action='store_true' flags that don't require explicit True values. Consider updating the message to 'must be specified' or 'must be provided' for clarity.
    - assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be set to True."
    + assert args.scrape_list or args.scrape_data or args.create_hierarchy, "Error: At least one of --scrape-list, --scrape-data, or --create-hierarchy must be specified."
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Added a file that will check if the triplets & splits are consistent and whether there are any missing entities/relationships in the metadata.
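A consistency check along these lines can be expressed with plain set operations. The sketch below is an assumption about what such a check might cover; `check_consistency`, its arguments, and the keys of the returned report are hypothetical names.

```python
from typing import Dict, Iterable, Set, Tuple

Triplet = Tuple[str, str, str]


def check_consistency(
    full: Iterable[Triplet],
    splits: Dict[str, Iterable[Triplet]],
    entities: Iterable[str],
    relations: Iterable[str],
) -> Dict[str, Set]:
    """Compare the full KG with its splits and with the entity/relation metadata."""
    full_set = set(full)
    split_union: Set[Triplet] = set()
    for part in splits.values():
        split_union.update(part)

    used_entities = {h for h, _, _ in full_set} | {t for _, _, t in full_set}
    used_relations = {r for _, r, _ in full_set}
    return {
        "missing_from_splits": full_set - split_union,
        "not_in_full_graph": split_union - full_set,
        "unknown_entities": used_entities - set(entities),
        "unknown_relations": used_relations - set(relations),
    }
```

Four empty sets in the report would mean the splits cover the full graph exactly and every entity and relation has a metadata entry.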
Added a script that verifies the local wikidata utility package is able to:
- Initialize a Client
- Search Entity by name
- Entity/Relation Metadata Retrieval
- Triplet Extraction (Head only)
@ottersome Bump
Pull Request Summary
Overview
This PR introduces a major refactor and upgrade to the data and code structure of the repository, aligning with best practices for multi-hop Knowledge Graph Question Answering (KGQA) and enabling compatibility with the HalcyonSolutions/MultiHopKG project.
Key Changes
1. Dataset and Directory Reorganization
- `embeddings/` for all embedding files
- `metadata/` for node and relationship data/info
- `mappings/` for mapping files (e.g., mid2name)
- `vocabs/` for vocabularies/sets (entities, relationships)
- `qa/` for question-answer datasets (FreebaseQA, Jeopardy, KinshipHinton, MetaQA, etc.)
- `link_prediction/` for link prediction datasets (FB15k-237, FB15k, FJ-Wiki, etc.)
- `source/` for raw source datasets
2. Expanded Dataset Coverage
3. Script Refactor & Functional Upgrades
`argparse` patterns (including `action='store_true'` and `nargs`), and improved parameter validation.
4. New Functionality
5. Bugfixes & API Consistency
6. Documentation & Examples
`.sh`) and batch (`.bat`) scripts for typical workflows.
7. Removals
Wikidata Compliance Enhancements
Wikidata scraping, entity, and property processing utilities have been updated to be fully Wikimedia-compliant:
- `config_wiki.ini`) and applied globally to all requests.
- `Retry-After` headers.
- `maxlag` and agent information.

This ensures all data collection and scraping can be performed without risking blocks or violating Wikimedia usage rules, in accordance with the repository's new standards.
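As a rough illustration of the compliance measures named above, the sketch below shows how a `Retry-After` header can be turned into a wait time and how `maxlag` and a User-Agent can be attached to a request. It uses only the standard library; `retry_after_seconds`, `build_request`, and the default values are assumptions for illustration and do not reflect how the repository's utilities or `config_wiki.ini` are actually implemented.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Optional, Tuple


def retry_after_seconds(header: Optional[str], default: float = 5.0) -> float:
    """Convert a Retry-After header (delay in seconds or HTTP date) to a wait time."""
    if not header:
        return default
    header = header.strip()
    if header.isdigit():
        return float(header)
    try:
        when = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return default  # unparsable value: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def build_request(
    params: Dict[str, str], user_agent: str, maxlag: int = 5
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Add the MediaWiki `maxlag` parameter and a descriptive User-Agent header."""
    query = dict(params)  # copy so the caller's dict is not modified
    query["maxlag"] = str(maxlag)
    headers = {"User-Agent": user_agent}
    return query, headers
```

A caller would sleep for `retry_after_seconds(...)` before retrying a throttled request and pass `query` and `headers` to whichever HTTP client is in use.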
Example: How to Use
Most scripts now expect inputs in the new directory structure, e.g.:

    python fbwiki_triplet_creation.py \
        --entity-list-path ./data/vocabs/nodes_fb15k.txt \
        --triplet-output-path ./data/temp/triplet_creation_fb15k_wiki.txt \
        --forwarding-output-path ./data/temp/forwarding_creation_fb15k_wiki.txt

See the `script_samples/` folder for more shell/batch examples.
Impact
For Reviewers