perf(datasets): cache OSS bucket, add server-side pagination and --filter for tasks#1064
perf(datasets): cache OSS bucket, add server-side pagination and --filter for tasks#1064xdlkc wants to merge 4 commits into
Conversation
…lter for tasks - Cache oss2.Bucket instance to avoid creating 25+ connections per list_all_datasets call (5.0s → 2.7s) - Add _PaginationCache with continuation_token to resume sequential page access in O(1) - Push offset/limit down to _extract_tasks_from_split with early termination (19.4s → 1.5s for --limit 10) - Adapt max_keys to actual limit needed instead of always 1000 - Add --filter flag to tasks command, pushed down as OSS prefix for server-side filtering - Update BaseDatasetRegistry and DatasetClient interfaces with offset/limit/task_filter params Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-6 AI-Contributed/Feature: 120/540 AI-Contributed/UT: 17/448
The client.py imports TaskFile but its definition was not included in the previous commit, causing ImportError in CI. Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-6 AI-Contributed/Feature: 6/6 AI-Contributed/UT: 0/0
alibabarock
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Comment(无 blocking 问题,有几点建议)
⚠️ Warnings
-
_fs_download目录下载时get_task_file被调了两次 — 先调一次探测"单文件还是目录",失败后再 list + 逐个 get。对大目录来说浪费了一次 RTT。建议先 list 再判断,避免无意义的 get。 -
_PaginationCache的tasks字段会缓存全量已扫描的 task_ids — 34k tasks 场景下内存里存了 34k 字符串。如果反复切不同 dataset/split 查询,前一个的 cache 会一直占着。建议加个 LRU 或至少 cache 过期。 -
--ouputtypo 作为 alias 被保留 —add_argument("-o", "--output", "--ouput", ...)。理解是兼容拼写错误,但不太常规,建议注明或直接去掉。
💡 Suggestions
-
_extract_tasks_from_split的max_keys自适应 — OSS 返回的 key 可能有重复(task 目录本身 + 子文件),去重后可能不够。建议min(1000, items_still_needed + 10)作为 buffer。 -
_upload_task_file返回int | None—None表示 skipped,1表示 uploaded,和_upload_task返回的int混在一起语义模糊。建议统一用 enum 或 dataclass 表示状态。 -
_normalize_task_path路径安全校验 — 防..做得好。.和空段被过滤是正确行为,建议在 docstring 里说明。
✅ Looks Good
- Bucket 缓存 + 分页下推 + filter prefix 下推,性能提升显著(13x)
_PaginationCache设计合理,顺序翻页 O(1)datasets fs子命令结构清晰,路径安全校验到位- 隐藏文件过滤覆盖了 org/dataset/split 三层
- JSON 输出统一用
_print_json,contextlib.redirect_stdout隔离 upload 噪音 - 99 个单测全过,覆盖充分
The new pagination loop in _extract_tasks_from_split used `while True` and read is_truncated/next_continuation_token via getattr, which on a MagicMock always returns a truthy value. Tests using `return_value` (and any real OSS response with a non-advancing token) spun forever, starving CPU/memory and crashing the self-hosted CI runner and local machines. Add a hard page budget (_MAX_PAGINATION_PAGES) and a non-advancing-token guard to both pagination paths; make the test helper set explicit pagination fields so mocks terminate correctly. refs alibaba#1063 AI-Model: claude-opus-4-8 AI-Contributed/Feature: 39/39 AI-Contributed/UT: 31/31
| return ivalue | ||
|
|
||
|
|
||
| def _is_json_output(args: argparse.Namespace) -> bool: |
fixes #1063
Summary
oss2.Bucketinstance — avoids creating 25+ HTTP sessions perlist_all_datasetscall (datasets list: 5.0s → 2.7s)offset/limitdown to_extract_tasks_from_splitwith early termination andcontinuation_tokencaching (datasets tasks --limit 10: 19.4s → 1.5s)--filterflag — prefix-based task search pushed down to OSS query prefix (datasets tasks --filter 0xerr0r: 1.4s)Design
Bucket caching
_build_bucket()now lazily initializes and caches theoss2.Bucketinstance. All methods on the sameOssDatasetRegistryshare one HTTP session.Pagination cache
A
_PaginationCachedataclass stores the last query's(query_prefix, tasks, continuation_token, is_exhausted). When the next query matches (same prefix, sequential page access), it resumes from the cached token instead of re-scanning from page 1. This turns sequential pagination (page 1 → 2 → 3) into O(1) per page.max_keysis adapted per request:min(1000, items_still_needed)instead of always 1000.Filter push-down
--filteris appended to the OSS query prefix (split_prefix + filter), so OSS only returns matching keys. Task names are extracted relative to the originalsplit_prefixto preserve full names.Interface changes
list_dataset_tasksgains keyword-onlyoffset,limit,task_filterparameters across the full chain:BaseDatasetRegistry→OssDatasetRegistry→DatasetClient→ CLI. All default to their zero-value, so existing callers are unaffected.Benchmark
34,732 tasks,
oss-ap-southeast-1, ~0.5s RTT.datasets listdatasets tasks --limit 10datasets tasks(no limit)datasets tasks --filter prefixTest plan
pytest tests/unit/datasets/— 99 tests passdatasets list,datasets tasks --limit,datasets tasks --filteragainstrock-agent-prebucketCo-Authored-By: Claude Code noreply@anthropic.com