feat(datasets): make 'rock datasets list' fast on cross-region OSS (#1010) #1011

Merged: BCeZn merged 9 commits into alibaba:master from xdlkc:feat/datasets-list-fast on May 27, 2026
xdlkc (Collaborator) commented May 25, 2026

Summary

  • Replace deep recursive walk in rock datasets list with two-step OSS prefix listing (delimiter='/'), bringing cross-region listing down from ~30s+ to a few seconds (live verification on cn-shanghai → us-east-1 bucket).
  • Add --depth {1,2} (default 2) and --org <name> flags so users can scope the listing: --depth 1 lists organizations only (one HTTP round-trip), while --org skips the top-level scan and lists datasets under one org. The two flags are mutually exclusive.
  • Add new rock datasets splits --org X --dataset Y subcommand to list available splits, parallel to tasks.
  • Touches: registry → client → CLI; full TDD with 64 unit tests (32 new), all passing. No behavior change for paths not exercised by the new flags — default list output format is unchanged.

Change

rock/sdk/envhub/datasets/registry/oss.py — four new fast-path methods built on a shared prefix-listing helper, one of which fans out with bounded concurrency:

+    def list_organizations(self) -> list[str]:
+        prefix = self._dataset_prefix()  # e.g. "datasets/"
+        return self._list_common_prefixes(prefix)
+
+    def list_org_datasets(self, organization: str) -> list[str]:
+        prefix = f"{self._dataset_prefix()}{organization}/"
+        return self._list_common_prefixes(prefix)
+
+    def list_all_datasets(self, concurrency: int = 10) -> list[tuple[str, str]]:
+        orgs = self.list_organizations()
+        with ThreadPoolExecutor(max_workers=concurrency) as ex:
+            futs = {ex.submit(self.list_org_datasets, o): o for o in orgs}
+            # Look the org up via the future: `o` is scoped to the dict comprehension.
+            return sorted(
+                (futs[f], d) for f in as_completed(futs) for d in f.result()
+            )
+
+    def list_dataset_splits(self, organization: str, dataset: str) -> list[str]:
+        prefix = f"{self._dataset_prefix()}{organization}/{dataset}/"
+        return self._list_common_prefixes(prefix)
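
The shared _list_common_prefixes helper is not shown in this description. As a rough illustration of the delimiter semantics it relies on, here is a minimal sketch over an in-memory key list. The function name list_common_prefixes, the KEYS sample, and the org/dataset names are all hypothetical; a real registry would send prefix and delimiter to OSS and let the server do the grouping (following pagination as needed) rather than scanning keys locally.

```python
from __future__ import annotations


def list_common_prefixes(keys: list[str], prefix: str, delimiter: str = "/") -> list[str]:
    """Return the immediate child "directory" names under ``prefix``.

    Mimics what an object store reports as common prefixes when a listing
    request carries a delimiter: keys are grouped by the first delimiter
    found after the prefix, so deeper levels are never enumerated.
    """
    children: set[str] = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep and head:  # plain objects directly under the prefix are skipped
            children.add(head)
    return sorted(children)


# Hypothetical bucket contents following <base>/<org>/<dataset>/<split>/<object>.
KEYS = [
    "datasets/acme/math/train/task-0001.json",
    "datasets/acme/math/test/task-0002.json",
    "datasets/acme/code/train/task-0003.json",
    "datasets/globex/qa/train/task-0004.json",
    "datasets/README.md",
]

orgs = list_common_prefixes(KEYS, "datasets/")
acme_datasets = list_common_prefixes(KEYS, "datasets/acme/")
math_splits = list_common_prefixes(KEYS, "datasets/acme/math/")
```

Each of the three calls corresponds to one level of the layout (organizations, datasets, splits), which is why none of them needs to touch the task objects below.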

rock/cli/command/datasets.py — rewrite _list to dispatch by --org / --depth, plus a new _splits handler:

     async def _list(self, args: argparse.Namespace) -> None:
         registry_info = self._build_oss_registry_info(args)
         client = DatasetClient(registry_info)
-        # old: walk every dataset, then every split, then every task
-        datasets = client.list_datasets(getattr(args, 'org', None))
-        for spec in datasets:
-            ...
+        if getattr(args, "org", None):
+            datasets = client.list_org_datasets(args.org)
+            self._render_org_dataset_pairs([(args.org, d) for d in datasets])
+            return
+        if getattr(args, "depth", 2) == 1:
+            self._render_orgs(client.list_organizations())
+            return
+        self._render_org_dataset_pairs(client.list_all_datasets())

argparse: --depth {1,2} and --org are placed in a mutually-exclusive group on list; new splits subparser requires --org and --dataset.
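
The parser wiring described above might look roughly like the sketch below. This is not the code from rock/cli/command/datasets.py: the build_parser and effective_depth names, the prog string, and the DEFAULT_DEPTH constant are assumptions made for illustration. It leaves the argparse default for --depth as None and resolves the documented default of 2 at runtime, in the spirit of the follow-up commit on this PR, so that an explicit --depth value always counts as "present" for the exclusivity check.

```python
import argparse

DEFAULT_DEPTH = 2  # applied at runtime instead of as the argparse default


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rock-datasets-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    # `list`: --depth and --org may not be combined.
    list_parser = sub.add_parser("list")
    scope = list_parser.add_mutually_exclusive_group()
    scope.add_argument("--depth", type=int, choices=(1, 2), default=None)
    scope.add_argument("--org", default=None)

    # `splits`: both identifiers are mandatory.
    splits_parser = sub.add_parser("splits")
    splits_parser.add_argument("--org", required=True)
    splits_parser.add_argument("--dataset", required=True)
    return parser


def effective_depth(args: argparse.Namespace) -> int:
    """Resolve the depth to use when the user did not pass --depth."""
    return DEFAULT_DEPTH if args.depth is None else args.depth
```

With this shape, `list --depth 2 --org acme` is rejected by argparse itself, while a bare `list` still behaves as depth 2.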

File diff

| File | Change |
| --- | --- |
| rock/sdk/envhub/datasets/registry/oss.py | +44 lines (4 new methods, shared _list_common_prefixes helper) |
| rock/sdk/envhub/datasets/registry/base.py | +18 lines (abstract method declarations) |
| rock/sdk/envhub/datasets/client.py | +11 lines (delegation) |
| rock/cli/command/datasets.py | +85 / -23 (new _list dispatch, _splits, render helpers, parser changes) |
| tests/unit/datasets/test_oss_registry.py | +185 lines |
| tests/unit/datasets/test_client.py | +39 lines |
| tests/unit/datasets/test_datasets_command.py | +172 lines |

Test plan

  • uv run pytest tests/unit/datasets/ -v — 64 passed (32 existing + 32 new)
  • uv run ruff check rock/ tests/ — clean
  • uv run ruff format rock/ tests/ — clean
  • Live verification on cn-shanghai → us-east-1 cross-region OSS bucket: rock datasets list --depth 1 and rock datasets list --org <name> return in ~1s; rock datasets list (depth=2) returns in ~3s with 10-way concurrency.
  • Reviewer: confirm CLI default behavior — rock datasets list with no flags still shows 'Organization | Dataset' two-column output; only the underlying calls changed.
  • Reviewer: confirm --depth and --org mutual exclusion error message is acceptable.
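
For reviewers who want to check the dispatch without a bucket, a stub along these lines may help. It is a sketch, not one of the tests in this PR: FakeClient and dispatch_list are hypothetical names, and dispatch_list returns rows instead of rendering them so that the branch taken is easy to assert. The branching mirrors the _list diff above.

```python
from __future__ import annotations

from types import SimpleNamespace


class FakeClient:
    """Hypothetical stand-in that records which listing method was called."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def list_organizations(self) -> list[str]:
        self.calls.append("list_organizations")
        return ["acme", "globex"]

    def list_org_datasets(self, organization: str) -> list[str]:
        self.calls.append(f"list_org_datasets:{organization}")
        return ["code", "math"]

    def list_all_datasets(self) -> list[tuple[str, str]]:
        self.calls.append("list_all_datasets")
        return [("acme", "code"), ("acme", "math"), ("globex", "qa")]


def dispatch_list(client: FakeClient, args: SimpleNamespace) -> list[tuple[str, ...]]:
    """Same branching as the rewritten _list, minus the rendering."""
    if getattr(args, "org", None):
        return [(args.org, d) for d in client.list_org_datasets(args.org)]
    if getattr(args, "depth", 2) == 1:
        return [(o,) for o in client.list_organizations()]
    return list(client.list_all_datasets())
```

Because depth is compared with 1 rather than tested for truthiness, a namespace whose depth is None (no flag given) falls through to the two-column default path.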

Notes

  • No protocol or schema change. The OSS layout assumed (<base>/<org>/<dataset>/<split>/) is the same one already used by tasks and upload; this PR just stops descending past the level you asked for.
  • Concurrency is bounded (default 10) and only kicks in for the depth-2 default path; depth-1 and --org paths are single-call.
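
To make the bounded fan-out concrete, the sketch below runs the same ThreadPoolExecutor pattern against fake data and records how many per-org listings are in flight at once. FAKE_ORGS, the sleep that stands in for network latency, and the peak_in_flight counter are all assumptions added for illustration; only the executor pattern comes from the PR.

```python
from __future__ import annotations

import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical data: 25 organizations with 3 datasets each.
FAKE_ORGS = {f"org{i:02d}": [f"ds{j}" for j in range(3)] for i in range(25)}

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0


def list_org_datasets(org: str) -> list[str]:
    """Pretend to list one org, tracking how many calls overlap."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    try:
        time.sleep(0.01)  # simulated cross-region latency
        return FAKE_ORGS[org]
    finally:
        with _lock:
            _in_flight -= 1


def list_all_datasets(concurrency: int = 10) -> list[tuple[str, str]]:
    """Fan out one listing per org, never running more than `concurrency` at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        futs = {ex.submit(list_org_datasets, o): o for o in FAKE_ORGS}
        # Results arrive in completion order, so sort for stable output.
        return sorted((futs[f], d) for f in as_completed(futs) for d in f.result())


pairs = list_all_datasets(concurrency=10)
```

The pool size caps simultaneous requests regardless of how many organizations exist, and the final sort keeps the output deterministic even though completion order is not.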

fix #1010

@xdlkc xdlkc force-pushed the feat/datasets-list-fast branch from ef1dd52 to a18d497 Compare May 25, 2026 11:30
Defer the default list depth until runtime so argparse still treats an explicit --depth value as present in the mutually exclusive group on Python 3.10.

Update envhub CLI docs for the rewritten list output and splits command.

Co-Authored-By: Codex <noreply@openai.com>
AI-Model: gpt-5
AI-Contributed/Feature: 47/47
AI-Contributed/UT: 8/8

@BCeZn (Collaborator) left a comment

LGTM

@BCeZn BCeZn merged commit cefdaa6 into alibaba:master May 27, 2026
7 of 8 checks passed
jake11-oho pushed a commit to jake11-oho/ROCK that referenced this pull request Jun 10, 2026
…libaba#1010) (alibaba#1011)

* feat(datasets): add list_organizations to registry

* feat(datasets): add list_org_datasets to registry

* feat(datasets): add list_dataset_splits to registry

* feat(datasets): add list_all_datasets with bounded concurrency

* feat(datasets): add fast-path methods to DatasetClient

* feat(datasets): rewrite list with --depth and fast paths

AI-Model: claude-opus-4-7
AI-Contributed/Feature: 48/48
AI-Contributed/UT: 107/107

* feat(datasets): add splits subcommand

AI-Model: claude-opus-4-7
AI-Contributed/Feature: 24/24
AI-Contributed/UT: 63/63

* chore(datasets): apply lint and format

AI-Model: claude-opus-4-7
AI-Contributed/Feature: 36/53
AI-Contributed/UT: 44/65

* fix(datasets): preserve list parser exclusivity

Defer the default list depth until runtime so argparse still treats an explicit --depth value as present in the mutually exclusive group on Python 3.10.

Update envhub CLI docs for the rewritten list output and splits command.

Co-Authored-By: Codex <noreply@openai.com>
AI-Model: gpt-5
AI-Contributed/Feature: 47/47
AI-Contributed/UT: 8/8
Development

Successfully merging this pull request may close these issues.

[Feature] feat(datasets): make rock datasets list fast on cross-region OSS
