Refactor sqlite toolchain: build_db pipeline, argparse CLI, README #702
Merged
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract the insertion logic into a reusable insert_one_sample() function so build_db.py can import it directly instead of duplicating the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thanks for your contribution!
- init_db: compute migrates_dir from script location instead of CWD-relative path
- build_db: use main(args), add --op_names_path_prefix as required arg, auto-create db via migrate()
- Remove unused GRAPH_NET_ROOT and graph_net import from build_db

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
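A minimal sketch of the two ideas in this commit: resolving a directory relative to the script file rather than the current working directory, and passing parsed arguments into `main(args)`. The directory name `migrates`, the `--db_path` option, and the helper names are assumptions for illustration; only `--op_names_path_prefix` being required comes from the commit message.

```python
import argparse
from pathlib import Path


def migrates_dir_for(script_file: str) -> Path:
    """Return a 'migrates' directory next to the given script file.

    The directory name is assumed; the point is that the result does not
    depend on the directory the script is launched from.
    """
    return Path(script_file).resolve().parent / "migrates"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Build the sample database (sketch).")
    # Hypothetical option, included only to make the example complete.
    parser.add_argument("--db_path", default="samples.db")
    # Named in the commit message as a required argument.
    parser.add_argument("--op_names_path_prefix", required=True)
    return parser


def main(args: argparse.Namespace, script_file: str) -> Path:
    """Entry point that receives already-parsed args instead of reading sys.argv."""
    migrates_dir = migrates_dir_for(script_file)
    print(f"db_path={args.db_path} migrates_dir={migrates_dir}")
    return migrates_dir
```

In a real script this would typically be invoked as `main(build_parser().parse_args(), __file__)` under an `if __name__ == "__main__":` guard, which keeps `main` importable and testable.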
- Auto-collect sample paths by scanning for model.py when list file is missing
- Use loop over sample_types instead of repeated code blocks
- Track and print success/fail counts and order range per type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Skip non-full_graph types when directory is missing
- Print sample dir and list file paths before processing each type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
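The behavior described in the last two commits could be sketched roughly as follows. This is not the PR's code: the list-file naming (`<sample_type>_list.txt`), the choice to raise when the `full_graph` directory is missing, and the `insert_fn` callback are assumptions made so the example is self-contained.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def collect_sample_paths(sample_dir: Path, list_file: Path) -> List[str]:
    """Read relative sample paths from list_file, or scan for model.py if it is missing."""
    if list_file.is_file():
        lines = list_file.read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    # Fallback: every directory containing a model.py counts as one sample.
    return sorted(str(p.parent.relative_to(sample_dir)) for p in sample_dir.rglob("model.py"))


def process_all(
    root: Path,
    sample_types: Iterable[str],
    insert_fn: Callable[[str, Path], None],
) -> Dict[str, Tuple[int, int]]:
    """Loop over sample types and return {sample_type: (ok_count, fail_count)}."""
    summary: Dict[str, Tuple[int, int]] = {}
    for sample_type in sample_types:
        sample_dir = root / sample_type
        list_file = root / f"{sample_type}_list.txt"  # hypothetical naming
        print(f"[{sample_type}] sample dir: {sample_dir}, list file: {list_file}")
        if not sample_dir.is_dir():
            if sample_type != "full_graph":
                print(f"[{sample_type}] directory missing, skipping")
                continue
            # Assumption: a missing full_graph directory is treated as an error.
            raise FileNotFoundError(sample_dir)
        ok = fail = 0
        for rel_path in collect_sample_paths(sample_dir, list_file):
            try:
                insert_fn(sample_type, sample_dir / rel_path)
                ok += 1
            except Exception as exc:  # keep going; report the failure in the counts
                print(f"[{sample_type}] failed on {rel_path}: {exc}")
                fail += 1
        print(f"[{sample_type}] success={ok} fail={fail}")
        summary[sample_type] = (ok, fail)
    return summary
```

Replacing per-type code blocks with one loop keeps the skip rule and the counters in a single place, so a new sample type only needs a new entry in `sample_types`.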
fangfangssj approved these changes May 11, 2026
Comment on lines +459 to +461

    print(
        "insert {sample_type} failed: integrity error (possible duplicate uuid or graph_hash)"
    )
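For context, a hedged sketch of how such a message is usually produced with the standard `sqlite3` module. The table name, columns, and uniqueness constraints here are assumptions, not the PR's schema. Note that the sketch uses an f-string: if the quoted string has no `f` prefix, Python prints the text `{sample_type}` literally instead of substituting the value.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: uuid and graph_hash are both unique.
    conn.execute(
        "CREATE TABLE graph_sample ("
        "uuid TEXT PRIMARY KEY, "
        "sample_type TEXT NOT NULL, "
        "graph_hash TEXT NOT NULL UNIQUE)"
    )


def insert_sample(conn: sqlite3.Connection, sample_type: str, uuid: str, graph_hash: str) -> bool:
    """Insert one row; return False instead of raising on a uniqueness violation."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO graph_sample (uuid, sample_type, graph_hash) VALUES (?, ?, ?)",
                (uuid, sample_type, graph_hash),
            )
        return True
    except sqlite3.IntegrityError:
        print(
            f"insert {sample_type} failed: integrity error "
            "(possible duplicate uuid or graph_hash)"
        )
        return False
```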
PR Category
Other
Description
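What this PR does:
- build_db.py: a one-stop script for building the database in bulk. It initializes the DB automatically, iterates over the 4 sample_type values to insert samples, and then runs bucketing (generate_buckets) and grouping (generate_groups).
- graphsample_insert.sh: batch-inserting samples from a shell script was too slow; the overhead was mainly process startup.
- generate_subgraph_dataset.sh: remove the insert_graph_sample() and generate_database() functions, which build_db.py now replaces.

To enable code reuse, other components are wrapped in functions:
- graphsample_insert.py refactor: extract a reusable insert_one_sample() function that supports an op_names_path_prefix parameter, so that operator names and input tensor meta are written at the same time a subgraph sample is inserted.
- graph_sample_bucket_generator.py and graph_sample_groups_insert.py: add public generate_buckets() / generate_groups() interfaces for build_db.py to call in sequence.

Some code cleanup:
- upload_dataset.py / download_dataset.py: renamed from upload.py / download.py; hard-coded variables are removed in favor of argparse command-line arguments.
- graph_sample_groups_insert.py: rename gid→group_id, c→candidate, seen_dtypes→picked_dtypes, v1/v2→v1_stats/v2_stats, and so on.
- README (Readme.md→README.md): add an overview of the table schema and usage instructions for all scripts; all paths are changed to relative paths.

Result: compared with generating the database via graphsample_insert.sh, the time drops from 10+ hours to under 1 hour.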
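The overall shape of such a pipeline can be sketched as below: one process and one connection handle the schema setup, every insert, and the follow-up steps, rather than launching a new interpreter per sample. Everything here is a stand-in under stated assumptions: the table schema, the bodies of `migrate()` and `insert_one_sample()`, and the `post_steps` parameter are hypothetical, and the real generate_buckets() / generate_groups() logic is not shown in this PR description.

```python
import sqlite3
from typing import Callable, Iterable, Sequence, Tuple


def migrate(conn: sqlite3.Connection) -> None:
    # Placeholder schema; the real migrations are not part of this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample (uuid TEXT PRIMARY KEY, sample_type TEXT NOT NULL)"
    )


def insert_one_sample(conn: sqlite3.Connection, sample_type: str, uuid: str) -> None:
    # Importable function: callers reuse it in-process instead of spawning a script.
    conn.execute("INSERT INTO sample (uuid, sample_type) VALUES (?, ?)", (uuid, sample_type))


def build_db(
    db_path: str,
    samples: Iterable[Tuple[str, str]],
    post_steps: Sequence[Callable[[sqlite3.Connection], None]] = (),
) -> int:
    """Create the DB, insert all samples, run follow-up steps, return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        migrate(conn)
        with conn:  # a single transaction for the whole batch
            for sample_type, uuid in samples:
                insert_one_sample(conn, sample_type, uuid)
        # In the PR these steps would correspond to generate_buckets() and generate_groups().
        for step in post_steps:
            step(conn)
        return conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
    finally:
        conn.close()
```

A call might look like `build_db("samples.db", [("full_graph", "u1")], post_steps=[...])`. The sketch only illustrates the structure; it makes no claim about how much of the reported speedup comes from avoiding process startup versus other changes.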