v0.8.2: CLI cleanup — push/pull + drop reward/init#17
Merged
After the v0.8.1 e2e runs, the actual user workflow is

    repo2rlenv generate → repo2rlenv validate → repo2rlenv push → harbor run

so the existing CLI had three pieces of dead weight:

- `repo2rlenv reward`: diff-similarity scoring as a CLI command is misleading. The real reward signal comes from `harbor run`. Diff similarity is useful for RL training loops, but those should call `repo2rlenv.reward.calculate_diff_similarity_reward()` directly from Python, not shell out per rollout.
- `repo2rlenv init`: wrote a stale 30-line YAML template nobody used. Same job done better by a README example.
- `generate --out hf://...` magic: hidden push baked into the destination string; you couldn't re-push or push an existing local dir without re-running generation.

This commit:

- Adds `repo2rlenv push <local-dir> <hf://owner/dataset>`: explicit publish, supports `--private` and `--message`. Wraps the existing `hub.push_to_hub`.
- Adds `repo2rlenv pull <hf://owner/dataset> [<local-dir>]`: fetch from Hub. Supports `--task <name>` (single task) and `--force`. New `hub.pull_from_hub` helper wraps `huggingface_hub.snapshot_download` and flattens the staged `tasks/<id>/` layout back to `<dir>/<id>/`.
- Removes `repo2rlenv reward` (the Python function stays for training loops; only the CLI wrapper goes away).
- Removes `repo2rlenv init` + the `_SAMPLE_CONFIG` template string.
- Soft-deprecates `generate --out hf://...` with a one-version warning; to be removed in v0.9.

Tests: 19 new in `tests/test_cli_push_pull.py` covering URI parsing (both `hf://owner/name` and bare `owner/name`, plus malformed-input rejection) + cmd_push / cmd_pull argument plumbing with mocked Hub I/O. 461/461 pass; ruff + format clean.

Docs: README, CLAUDE.md, docs/quickstart.md, docs/pipelines/pr_diff.md, docs/reference/API.md, docs/reference/SPEC.md all updated. New flow documented end-to-end (generate → validate → push → pull → harbor).
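For readers who want a feel for the URI handling those tests cover, here is a minimal sketch. It is not the project's parser: the names `HubRef` and `parse_hub_uri` and the exact character rules in the regex are assumptions made for illustration. It accepts both `hf://owner/name` and bare `owner/name`, and rejects anything else.

```python
import re
from typing import NamedTuple


class HubRef(NamedTuple):
    """Hypothetical container for a parsed dataset reference."""
    owner: str
    name: str


# Assumed grammar: exactly one "owner/name" pair, optionally prefixed
# with "hf://". The allowed characters are a guess, not the Hub's rules.
_REPO_RE = re.compile(r"^(?:hf://)?([A-Za-z0-9][\w.-]*)/([A-Za-z0-9][\w.-]*)$")


def parse_hub_uri(uri: str) -> HubRef:
    """Parse 'hf://owner/name' or bare 'owner/name'; raise ValueError otherwise."""
    match = _REPO_RE.match(uri.strip())
    if match is None:
        raise ValueError(f"expected hf://owner/name or owner/name, got {uri!r}")
    return HubRef(owner=match.group(1), name=match.group(2))
```

Both accepted spellings resolve to the same `(owner, name)` pair, so `push` and `pull` could share one code path after parsing; malformed input fails early with a `ValueError` instead of reaching any Hub call.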
Live e2e on pallets/click surfaced a real bug: the bootstrap agent
sometimes saves test_cmds with a tail-truncator pipe like

    python -m pytest -q 2>&1 | head -50

and `targeted_test_cmds_for_pr` then appended PR test files at the end,
landing them AFTER the pipe and breaking the shell:

    cd /workspace && python -m pytest -q 2>&1 | head -50 -v tests/...
                                                         ^^^^^^^^^^^^ args land here

Side effect of v0.8.1's STOP CONDITION prompt: telling the agent to
SAVE_SETUP early led some agents to embed the same truncator they were
using to keep their own diagnostic output short.
Fix in `normalize_test_cmds_for_runtime`: strip `| head/tail [...]`,
`2>&1`, `> /dev/null`, `&> /dev/null` BEFORE any per-runner normalization.
4 regression tests added.
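
The stripping step described above can be sketched as follows. This is an illustration only, not the body of `normalize_test_cmds_for_runtime`: the function name `strip_output_plumbing` and both regexes are assumptions, and the sketch handles only the four forms listed in this message.

```python
import re

# Assumed patterns, covering only the forms named in the commit message.
# Redirections: "&> /dev/null", "2>&1", and a plain "> /dev/null".
# The lookbehind keeps "2> /dev/null" intact instead of leaving a stray "2".
_REDIRECT_RE = re.compile(r"\s*(?:&>\s*/dev/null|2>&1|(?<![0-9&])>\s*/dev/null)")
# A trailing "| head ..." or "| tail ..." stage at the end of the command.
_TRUNCATOR_RE = re.compile(r"\s*\|\s*(?:head|tail)\b[^|;&]*$")


def strip_output_plumbing(cmd: str) -> str:
    """Remove output redirections and a trailing head/tail pipe from cmd."""
    # Redirections go first so that "... | head -50 2>&1" also ends in the pipe.
    cmd = _REDIRECT_RE.sub("", cmd)
    cmd = _TRUNCATOR_RE.sub("", cmd)
    return cmd.strip()


if __name__ == "__main__":
    saved = "python -m pytest -q 2>&1 | head -50"
    cleaned = strip_output_plumbing(saved)
    # Appended test paths now land on pytest, not on head.
    # The test path below is a made-up example.
    print(f"{cleaned} -v tests/test_example.py")
```

Running the cleanup before any test files are appended is what keeps the extra arguments attached to the test runner rather than to `head`.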
Confirmed live: harbor run --agent oracle on the pallets/click task that
was returning Mean=0.0 now returns Mean=1.000 in 5s.
Closes #16.

End-to-end-validated via a live HF Hub round-trip on `pallets/click` (no E2B / cloud sandbox).

What's changing

- Added: `repo2rlenv push <local-dir> <hf://owner/dataset>`, supports `--private`, `--message`
- Added: `repo2rlenv pull <hf://owner/dataset> [<local-dir>]`, supports `--task`, `--force`
- Removed: `repo2rlenv reward` (Python function `repo2rlenv.reward.calculate_diff_similarity_reward` stays for training-loop users)
- Removed: `repo2rlenv init` (replaced by README example)
- Deprecated: `--out hf://...` magic inside `generate`; emits a warning, to be removed in v0.9

Resulting CLI: 5 commands, every one load-bearing.

Live round-trip evidence (Phase 4, all green)

Real Hub round-trip captured in `plans/v0.8.2_e2e.md`:

- `generate` (pr_runtime on `pallets/click`)
- `validate` (local)
- `push hf://AdithyaSK/click-r2e-v082`: 82de5c57, registry.json published
- `pull hf://AdithyaSK/click-r2e-v082 ... --force`
- `validate` (pulled copy)
- `harbor run --agent oracle --path <pulled task>`

Harness bug surfaced + fixed during e2e

Initial Harbor oracle attempt returned Mean = 0.0. The root cause was the bootstrap agent saving `test_cmds` with a trailing `| head -50` pipe, so `targeted_test_cmds_for_pr` appended test files after the pipe → broken shell. Fix: `normalize_test_cmds_for_runtime` now strips `| head/tail`, `2>&1`, `> /dev/null` before per-runner normalization. 4 regression tests added.

Test plan

- `ruff check` + `ruff format --check` clean
- Live Hub round-trip on `AdithyaSK/click-r2e-v082`
- `README.md`, `CLAUDE.md`, `docs/quickstart.md`, `docs/pipelines/pr_diff.md`, `docs/reference/API.md`, `docs/reference/SPEC.md` all updated
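
As a companion to the `pull` behaviour described in this PR (flattening the staged `tasks/<id>/` layout back to `<dir>/<id>/`), here is a rough sketch of the flattening step on its own, using only the standard library. The function name `flatten_tasks_layout` is hypothetical, and treating `force` as "overwrite an existing task directory" is an assumption; the PR does not spell out what `--force` does.

```python
import shutil
from pathlib import Path


def flatten_tasks_layout(snapshot_dir: Path, dest_dir: Path, force: bool = False) -> list[str]:
    """Move snapshot_dir/tasks/<id>/ to dest_dir/<id>/ and return the task ids.

    Hypothetical helper: 'force' is assumed to mean "replace an existing
    dest_dir/<id>/"; without it, an existing target raises FileExistsError.
    """
    tasks_root = snapshot_dir / "tasks"
    if not tasks_root.is_dir():
        raise FileNotFoundError(f"no tasks/ directory under {snapshot_dir}")

    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Sort for a deterministic order regardless of filesystem listing order.
    for task_dir in sorted(p for p in tasks_root.iterdir() if p.is_dir()):
        target = dest_dir / task_dir.name
        if target.exists():
            if not force:
                raise FileExistsError(f"{target} exists; pass force=True to overwrite")
            shutil.rmtree(target)
        shutil.move(str(task_dir), str(target))
        moved.append(task_dir.name)
    return moved
```

In a real `pull`, `snapshot_dir` would presumably be the directory returned by `huggingface_hub.snapshot_download`; the sketch takes it as a plain path so it can be exercised without network access.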