feat(datasets): make 'rock datasets list' fast on cross-region OSS (#1010) by xdlkc · Pull Request #1011 · alibaba/ROCK

xdlkc · 2026-05-25T11:27:25Z

Summary

Replace deep recursive walk in rock datasets list with two-step OSS prefix listing (delimiter='/'), bringing cross-region listing down from ~30s+ to a few seconds (live verification on cn-shanghai → us-east-1 bucket).
Add --depth {1,2} (default 2) and --org <name> flags so users can scope listing — --depth 1 lists organizations only (one HTTP round-trip), --org skips the top-level scan and lists datasets under one org. The two are mutually exclusive.
Add new rock datasets splits --org X --dataset Y subcommand to list available splits, parallel to tasks.
Touches: registry → client → CLI; full TDD with 64 unit tests (32 new), all passing. No behavior change for paths not exercised by the new flags — default list output format is unchanged.
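
To illustrate why a delimiter-scoped listing is cheaper than a recursive walk, here is a minimal in-memory sketch. It does not call OSS. The function `common_prefixes` and the sample keys are hypothetical stand-ins that mimic how a prefix listing with delimiter='/' collapses everything below the next '/' into a single entry. A real listing would typically return full prefixes with a trailing slash rather than bare names.

```python
def common_prefixes(keys, prefix, delimiter="/"):
    """Return the distinct child names directly under `prefix`.

    Mimics the "common prefixes" part of a delimiter-scoped listing:
    everything after the next delimiter is collapsed into one entry.
    """
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # only keys that continue past a delimiter form a prefix
            children.add(head)
    return sorted(children)


# Hypothetical bucket contents following <base>/<org>/<dataset>/<split>/...
KEYS = [
    "datasets/acme/math/train/task-0001.json",
    "datasets/acme/math/test/task-0001.json",
    "datasets/acme/code/train/task-0001.json",
    "datasets/globex/qa/train/task-0001.json",
]

orgs = common_prefixes(KEYS, "datasets/")                  # organization level
acme_datasets = common_prefixes(KEYS, "datasets/acme/")    # dataset level
math_splits = common_prefixes(KEYS, "datasets/acme/math/") # split level
```

Each call only looks one level down, so the cost tracks the number of children at that level rather than the number of task files underneath.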

Change

rock/sdk/envhub/datasets/registry/oss.py: four new fast-path methods built around prefix listing, with bounded concurrency for the cross-org fan-out:

+    def list_organizations(self) -> list[str]:
+        prefix = self._dataset_prefix()  # e.g. "datasets/"
+        return self._list_common_prefixes(prefix)
+
+    def list_org_datasets(self, organization: str) -> list[str]:
+        prefix = f"{self._dataset_prefix()}{organization}/"
+        return self._list_common_prefixes(prefix)
+
+    def list_all_datasets(self, concurrency: int = 10) -> list[tuple[str, str]]:
+        orgs = self.list_organizations()
+        with ThreadPoolExecutor(max_workers=concurrency) as ex:
+            futs = {ex.submit(self.list_org_datasets, o): o for o in orgs}
+            return sorted(
+                (futs[f], d) for f in as_completed(futs) for d in f.result()
+            )
+
+    def list_dataset_splits(self, organization: str, dataset: str) -> list[str]:
+        prefix = f"{self._dataset_prefix()}{organization}/{dataset}/"
+        return self._list_common_prefixes(prefix)
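
The fan-out pattern can be tried without an OSS bucket. The sketch below is an assumption-laden stand-in, not the registry code from this PR: `FakeRegistry` and its in-memory layout are hypothetical, and only the method names are borrowed from the diff. It shows one listing call per organization submitted to a bounded thread pool, with each future mapped back to its organization through the `futs` dictionary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeRegistry:
    """In-memory stand-in for the OSS registry (hypothetical, for illustration)."""

    def __init__(self, layout):
        self._layout = layout  # {organization: [dataset, ...]}

    def list_organizations(self):
        return sorted(self._layout)

    def list_org_datasets(self, organization):
        return sorted(self._layout[organization])

    def list_all_datasets(self, concurrency=10):
        orgs = self.list_organizations()
        with ThreadPoolExecutor(max_workers=concurrency) as ex:
            # Key each future by itself so its organization can be looked up
            # again once the future completes (completion order is arbitrary).
            futs = {ex.submit(self.list_org_datasets, o): o for o in orgs}
            return sorted(
                (futs[f], d) for f in as_completed(futs) for d in f.result()
            )


registry = FakeRegistry({"globex": ["qa"], "acme": ["math", "code"]})
pairs = registry.list_all_datasets(concurrency=2)
```

Sorting the final list makes the output independent of the order in which the futures finish.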

rock/cli/command/datasets.py — rewrite _list to dispatch by --org / --depth, plus a new _splits handler:

     async def _list(self, args: argparse.Namespace) -> None:
         registry_info = self._build_oss_registry_info(args)
         client = DatasetClient(registry_info)
-        # old: walk every dataset, then every split, then every task
-        datasets = client.list_datasets(getattr(args, 'org', None))
-        for spec in datasets:
-            ...
+        if getattr(args, "org", None):
+            datasets = client.list_org_datasets(args.org)
+            self._render_org_dataset_pairs([(args.org, d) for d in datasets])
+            return
+        if getattr(args, "depth", 2) == 1:
+            self._render_orgs(client.list_organizations())
+            return
+        self._render_org_dataset_pairs(client.list_all_datasets())

argparse: --depth {1,2} and --org are placed in a mutually-exclusive group on list; new splits subparser requires --org and --dataset.
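
A minimal sketch of a parser with this shape follows. It is not the parser from rock/cli/command/datasets.py: `build_parser`, `effective_depth`, and the program name are hypothetical. The --depth default is left as None and resolved after parsing, matching the approach described in the later fix commit ("Defer the default list depth until runtime").

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="rock-datasets-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    list_p = sub.add_parser("list")
    scope = list_p.add_mutually_exclusive_group()
    # default=None: the effective depth is resolved after parsing, so an
    # explicit "--depth 2" is never mistaken for the untouched default.
    scope.add_argument("--depth", type=int, choices=[1, 2], default=None)
    scope.add_argument("--org", default=None)

    splits_p = sub.add_parser("splits")
    splits_p.add_argument("--org", required=True)
    splits_p.add_argument("--dataset", required=True)
    return parser


def effective_depth(args, default=2):
    """Apply the default depth at runtime instead of in argparse."""
    return default if args.depth is None else args.depth


parser = build_parser()
plain = parser.parse_args(["list"])
scoped = parser.parse_args(["list", "--org", "acme"])
splits = parser.parse_args(["splits", "--org", "acme", "--dataset", "math"])

try:
    # Combining the two scoping flags should be rejected by argparse.
    parser.parse_args(["list", "--depth", "2", "--org", "acme"])
    conflict_rejected = False
except SystemExit:
    conflict_rejected = True
```

argparse reports the conflict on stderr and exits with a non-zero status, which is why the sketch catches SystemExit.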

File diff

File	Change
`rock/sdk/envhub/datasets/registry/oss.py`	+44 lines (4 new methods, shared `_list_common_prefixes` helper)
`rock/sdk/envhub/datasets/registry/base.py`	+18 lines (abstract method declarations)
`rock/sdk/envhub/datasets/client.py`	+11 lines (delegation)
`rock/cli/command/datasets.py`	+85 / -23 (new `_list` dispatch, `_splits`, render helpers, parser changes)
`tests/unit/datasets/test_oss_registry.py`	+185 lines
`tests/unit/datasets/test_client.py`	+39 lines
`tests/unit/datasets/test_datasets_command.py`	+172 lines

Test plan

uv run pytest tests/unit/datasets/ -v — 64 passed (32 existing + 32 new)
uv run ruff check rock/ tests/ — clean
uv run ruff format rock/ tests/ — clean
Live verification on cn-shanghai → us-east-1 cross-region OSS bucket: rock datasets list --depth 1 and rock datasets list --org <name> return in ~1s; rock datasets list (depth=2) returns in ~3s with 10-way concurrency.
Reviewer: confirm CLI default behavior — rock datasets list with no flags still shows 'Organization | Dataset' two-column output; only the underlying calls changed.
Reviewer: confirm --depth and --org mutual exclusion error message is acceptable.
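
One way to check that each flag reaches only its intended fast path is to drive the branch logic with a mock client. The sketch below is an illustration under assumptions, not one of the PR's tests: `dispatch_list` is a hypothetical pure-function mirror of the three-way branch in `_list`, returning data instead of rendering it.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def dispatch_list(args, client):
    """Mirror the three-way branch: --org, --depth 1, or the depth-2 default."""
    if getattr(args, "org", None):
        return [(args.org, d) for d in client.list_org_datasets(args.org)]
    if getattr(args, "depth", None) == 1:
        return client.list_organizations()
    return client.list_all_datasets()


# --org path: only the single-org listing should be invoked.
client = MagicMock()
client.list_org_datasets.return_value = ["code", "math"]
result = dispatch_list(SimpleNamespace(org="acme", depth=None), client)

# --depth 1 path: only the organization listing should be invoked.
depth1_client = MagicMock()
depth1_client.list_organizations.return_value = ["acme", "globex"]
depth1_result = dispatch_list(SimpleNamespace(org=None, depth=1), depth1_client)
```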

Notes

No protocol or schema change. The OSS layout assumed (<base>/<org>/<dataset>/<split>/) is the same one already used by tasks and upload; this PR just stops descending past the level you asked for.
Concurrency is bounded (default 10) and only kicks in for the depth-2 default path; depth-1 and --org paths are single-call.
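
As a rough way to reason about the note above, the following sketch counts listing calls per path. It is back-of-the-envelope arithmetic under stated assumptions, not a measurement: it assumes every listing fits in a single page, and `listing_calls` and `sequential_rounds` are hypothetical helper names. The "rounds" figure treats the thread pool as if it ran in lockstep batches, which is only an approximation.

```python
import math


def listing_calls(path, org_count=0):
    """Number of prefix-listing calls, assuming each listing fits in one page."""
    if path in ("depth1", "org"):
        return 1
    if path == "depth2":
        return 1 + org_count  # one top-level scan, then one call per org
    raise ValueError(f"unknown path: {path}")


def sequential_rounds(org_count, concurrency=10):
    """Approximate rounds of latency for depth 2: scan plus batches of org calls."""
    return 1 + math.ceil(org_count / concurrency)


calls_for_25_orgs = listing_calls("depth2", org_count=25)
rounds_for_25_orgs = sequential_rounds(25, concurrency=10)
```

With 25 organizations this gives 26 calls spread over roughly 4 rounds of latency, compared with 26 rounds if the per-org calls ran one after another.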

fix #1010

fix(datasets): preserve list parser exclusivity

Defer the default list depth until runtime so argparse still treats an explicit --depth value as present in the mutually exclusive group on Python 3.10. Update envhub CLI docs for the rewritten list output and splits command.

Co-Authored-By: Codex <noreply@openai.com>
AI-Model: gpt-5 AI-Contributed/Feature: 47/47 AI-Contributed/UT: 8/8

BCeZn

LGTM

…libaba#1010) (alibaba#1011)

* feat(datasets): add list_organizations to registry
* feat(datasets): add list_org_datasets to registry
* feat(datasets): add list_dataset_splits to registry
* feat(datasets): add list_all_datasets with bounded concurrency
* feat(datasets): add fast-path methods to DatasetClient
* feat(datasets): rewrite list with --depth and fast paths
  AI-Model: claude-opus-4-7 AI-Contributed/Feature: 48/48 AI-Contributed/UT: 107/107
* feat(datasets): add splits subcommand
  AI-Model: claude-opus-4-7 AI-Contributed/Feature: 24/24 AI-Contributed/UT: 63/63
* chore(datasets): apply lint and format
  AI-Model: claude-opus-4-7 AI-Contributed/Feature: 36/53 AI-Contributed/UT: 44/65
* fix(datasets): preserve list parser exclusivity
  Defer the default list depth until runtime so argparse still treats an explicit --depth value as present in the mutually exclusive group on Python 3.10. Update envhub CLI docs for the rewritten list output and splits command.
  Co-Authored-By: Codex <noreply@openai.com>
  AI-Model: gpt-5 AI-Contributed/Feature: 47/47 AI-Contributed/UT: 8/8

xdlkc added 8 commits May 25, 2026 19:30

feat(datasets): add list_organizations to registry

cd62b76

feat(datasets): add list_org_datasets to registry

fb9032f

feat(datasets): add list_dataset_splits to registry

f3bc241

feat(datasets): add list_all_datasets with bounded concurrency

8b438a8

feat(datasets): add fast-path methods to DatasetClient

2b71412

feat(datasets): rewrite list with --depth and fast paths

92f7221

AI-Model: claude-opus-4-7 AI-Contributed/Feature: 48/48 AI-Contributed/UT: 107/107

feat(datasets): add splits subcommand

49fa462

AI-Model: claude-opus-4-7 AI-Contributed/Feature: 24/24 AI-Contributed/UT: 63/63

chore(datasets): apply lint and format

a18d497

AI-Model: claude-opus-4-7 AI-Contributed/Feature: 36/53 AI-Contributed/UT: 44/65

xdlkc force-pushed the feat/datasets-list-fast branch from ef1dd52 to a18d497 on May 25, 2026 11:30

BCeZn approved these changes May 27, 2026

BCeZn merged commit cefdaa6 into alibaba:master May 27, 2026
7 of 8 checks passed

BCeZn merged 9 commits into alibaba:master from xdlkc:feat/datasets-list-fast

xdlkc commented May 25, 2026 •

edited

Loading

Uh oh!

BCeZn left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xdlkc commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change

File diff

Test plan

Notes

Uh oh!

BCeZn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xdlkc commented May 25, 2026 •

edited

Loading