Feature Category
Problem Statement
There is currently no way to inspect or fetch individual files inside a single
task from the OSS dataset registry. Users can list task IDs (datasets tasks),
but to look at or download a task's contents they must fall back to raw ossutil
and reconstruct the datasets/{org}/{dataset}/{split}/{task}/ key layout by
hand. Two concrete gaps:
- No file listing per task: no command answers "what files does task X
contain?"
- No file read / download: no command prints a single task file to stdout
or downloads a task file/directory to a local path.
Proposed Solution
Add a datasets fs subcommand group (alias files) with three operations,
backed by new registry/client methods.
- Registry layer (OssDatasetRegistry / BaseDatasetRegistry / DatasetClient):
  - list_task_files(org, dataset, split, task_id, path="") -> list[TaskFile]:
    list files under a task, with paths relative to the task root.
  - get_task_file(org, dataset, split, task_id, path) -> bytes | None:
    read one task file by relative path; returns None when the object is
    absent.
  - New TaskFile dataclass (path: str, size: int | None).
  - Support both layouts: directory-style tasks ({task}/...) and
    single-file tasks ({task}.json directly under the split).
- CLI layer (datasets fs ...):
  - datasets fs ls --org --dataset --split --task [--path]: list files.
  - datasets fs get --org --dataset --split --task [--path]: print one
    file to stdout (when the task has exactly one file, it is resolved
    automatically).
  - datasets fs download --org --dataset --split --task --path --dest:
    download a single file or a whole directory subtree to a local path.
  - JSON output (-o json) for ls/get/download.
- Path safety: relative task paths are normalized and validated (absolute
  paths and .. traversal are rejected; "." and empty segments are dropped),
  so a crafted --path cannot escape the task root.
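To make the proposal concrete, here is a minimal sketch of the path-safety rule and of listing across both task layouts. It is illustrative only, not the intended implementation: the in-memory objects mapping stands in for the OSS bucket, normalize_task_path is a hypothetical helper name, the org/dataset names are made up, and reporting a single-file task's relative path as {task}.json is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskFile:
    path: str          # relative to the task root
    size: int | None   # None when the size is unknown


def normalize_task_path(path: str) -> str:
    """Return a clean relative path; raise ValueError if it could escape the task root."""
    if path.startswith("/"):
        raise ValueError(f"absolute path not allowed: {path!r}")
    parts = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue  # drop empty and "." segments
        if segment == "..":
            raise ValueError(f"path traversal not allowed: {path!r}")
        parts.append(segment)
    return "/".join(parts)


def list_task_files(objects, org, dataset, split, task_id, path=""):
    """List one task's files from a {key: bytes} mapping (stand-in for a bucket)."""
    split_prefix = f"datasets/{org}/{dataset}/{split}/"
    sub = normalize_task_path(path)

    # Single-file layout: {task}.json directly under the split.
    single_key = f"{split_prefix}{task_id}.json"
    if single_key in objects:
        name = f"{task_id}.json"
        if sub in ("", name):
            return [TaskFile(name, len(objects[single_key]))]
        return []

    # Directory layout: every object under {task}/.
    task_root = f"{split_prefix}{task_id}/"
    files = []
    for key in sorted(objects):
        if not key.startswith(task_root):
            continue
        rel = key[len(task_root):]
        if not rel:
            continue  # skip a bare directory-marker object
        if sub == "" or rel == sub or rel.startswith(sub + "/"):
            files.append(TaskFile(rel, len(objects[key])))
    return files


# Hypothetical bucket contents covering both layouts.
bucket = {
    "datasets/acme/demo/train/t1/task.json": b"{}",
    "datasets/acme/demo/train/t1/tests/test_a.py": b"pass",
    "datasets/acme/demo/train/t2.json": b"{}",
}
for task_file in list_task_files(bucket, "acme", "demo", "train", "t1"):
    print(task_file.size, task_file.path)
```

Rejecting every ".." segment outright, rather than resolving it, is the stricter of the two possible readings of "reject .. traversal"; it keeps the check trivially auditable at the cost of refusing paths such as a/../b that would stay inside the task root.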
Notes
- This is separate from the datasets tasks pagination/cache/--filter work;
  this issue tracks the independent datasets fs file-access feature.
- The CLI module (rock/cli/command/datasets.py) will be refactored toward a
  clearer object-oriented structure as part of this change (per review
  feedback on PR #1064, "perf(datasets): cache OSS bucket, add server-side
  pagination and --filter for tasks"): path normalization and output handling
  will be extracted into dedicated TaskPath/OutputWriter helpers, and the
  module-level helper functions folded into the command class.
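As a rough illustration of what an extracted output helper could look like, the sketch below renders an ls result either as plain text or as JSON. Only the OutputWriter name and the -o json option come from this issue; the constructor arguments, the write_files method, the text column layout, and the JSON shape are all assumptions made for the example.

```python
from __future__ import annotations

import json
import sys
from dataclasses import asdict, dataclass
from typing import TextIO


@dataclass(frozen=True)
class TaskFile:
    path: str
    size: int | None


class OutputWriter:
    """Hypothetical helper: render command results as plain text or JSON."""

    def __init__(self, fmt: str = "text", stream: TextIO | None = None) -> None:
        if fmt not in ("text", "json"):
            raise ValueError(f"unsupported output format: {fmt!r}")
        self.fmt = fmt
        self.stream = stream if stream is not None else sys.stdout

    def write_files(self, files: list[TaskFile]) -> None:
        if self.fmt == "json":
            # Assumed JSON shape: a list of {"path": ..., "size": ...} objects.
            json.dump([asdict(f) for f in files], self.stream, indent=2)
            self.stream.write("\n")
            return
        for f in files:
            size = "-" if f.size is None else str(f.size)
            self.stream.write(f"{size:>10}  {f.path}\n")


OutputWriter("json").write_files([TaskFile("task.json", 2)])
```

Keeping the format switch in one class means the ls, get, and download handlers would not each need their own -o json branch.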