Skip to content

Self-review: gaps, bugs, and feature ideas (P0/P1/P2 + architecture) #1

Description

@narugo1992

Self-review: gaps, bugs, and feature ideas

A critical agent-perspective audit of the repo after the initial flurry of features. Grouped by priority. Each item names the file/area and a concrete next step. Tick boxes as we land fixes.

Summary

Priority Count Rough effort
🟥 P0 — must fix 3 ~6 h
🟧 P1 — should fix 5 ~12 h
🟨 P2 — nice to have 7 ~30 h
🟦 Architecture thoughts 3 open-ended

🟥 P0 — must fix

These risk data loss, broken end-user experience, or future bugs we can't catch with current tests.

  • apply.py is 870 lines — split it

    • Currently mixes docker prefix detection, image build/probe, MIG mgmt, clock/power/persistence mgmt, up(), down(), reset(), list_cosplay_containers(), host_gpu_status().
    • Suggested split:
      gpu_cosplay/
        _docker.py     # _docker(), _container_exists, _image_*, list_cosplay_containers
        _gpu.py        # _enable_mig, _lock_clock, _set_power, _query_persistence_mode
        apply.py       # up() / down() / reset() — orchestration only
        status.py      # host_gpu_status, list_cosplay_containers reports
      
    • Each file < 200 lines, single responsibility.
  • state.json has no file lock — concurrent writes corrupt it

    • state.add() does load → modify → save without serialization. Two parallel gpu-cosplay up (very likely in cluster scheduling scenarios) race → silent entry loss → reset can't find the orphan.
    • Fix: wrap _load + modify + _save in fcntl.flock(LOCK_EX). ~15 lines.
  • Zero unit tests for gpu_cosplay_runtime.py and nvidia-smi shim

    • Runtime hook (269 lines) + shim (326 lines) = 600 lines of regex + monkey-patch code, entirely uncovered. The column-alignment bug, the pynvml mid-load bug, and the fraction-set-too-early bug were each caught only by E2E on a real H200.
    • Fix: add two test files:
      • tests/test_shim_rewrite.py — feed real nvidia-smi output snapshots in, assert name / MiB / column widths come out right.
      • tests/test_runtime_patches.py — mock sys.modules['torch'] / sys.modules['pynvml'], verify _patch_torch / _patch_pynvml behavior.
    • Both runnable on a GPU-less CI host.

🟧 P1 — should fix

Friction and maintainability problems. Not blocking but the kind of debt that compounds.

  • Half-migrated naming: Card / card_key / find_card vs user-facing "GPU"

    • User strings all say "GPU"; internals still say "Card". Confusing for future agents reading the code.
    • Two options: (a) document the convention in AGENTS.md and accept it; (b) one-shot rename with state.json migration (we have test_session_compat.py to cover legacy state).
  • plan.py has hardcoded host arch / SM / BW tables

    • _full_sm_count, _host_total_bw, _host_arch are dicts inside Python. Every new host generation (B100/B200/GB200/R200) requires editing three places.
    • Fix: move to gpu_cosplay/data/host_gpus.yaml, mirror the structure of gpus.yaml.
  • apply.up() is a monolithic 150-line try/except

    • Forward flow and rollback flow are interleaved. Adding a new mutation requires editing both branches.
    • Suggested refactor: _RollbackStack context manager.
      with _RollbackStack() as rb:
          _set_persistence(gpu.index, True)
          rb.push(lambda: _set_persistence(gpu.index, original_persistence))
          _set_power(...)
          rb.push(lambda: _set_power(..., original_power))
          # ...
          rb.commit()  # success — don't roll back
    • Forward logic drops to ~30 lines; rollback mirrors it automatically.
  • Runtime hook mutates builtins.__import__ globally

    • If user code / IPython / debugpy also patches __import__, last writer wins. We'd win but break theirs.
    • Fix: preserve the previous hook, chain it. Or use sys.meta_path finder which is the proper extension point.
  • No integration test runs gpu-cosplay verify end-to-end

    • CI runs lint + unit + docker build smoke. Nothing exercises the runtime hooks or the shim against a real GPU.
    • Options:
      • Self-hosted GH runner with NVIDIA T4/A10. Nightly workflow: spin up → gpu-cosplay verify → all 18 checks must PASS.
      • Stopgap: tests/test_verify_script.py that mocks subprocess.run to validate the verify script's PASS/FAIL/SKIP logic.

🟨 P2 — nice to have

Polish, ergonomics, and ecosystem.

  • No structured logging

    • Scattered print(...) everywhere. No timestamps, no level filtering. Use stdlib logging; expose --verbose to drop to DEBUG.
  • No pre-built image on ghcr.io/deepghs/gpu-cosplay

    • First gpu-cosplay up builds locally for ~8 min. CI already builds the image; add a push step + GHCR login + tag with git SHA and :latest.
  • No multi-GPU cosplay

    • Many research questions need a 3090×2-style setup. Single-process multi-device is feasible via multiple MIG slices in the same container (NCCL P2P won't work — that's a known limit).
    • Surface: gpu-cosplay up 3090 --num 2; container sees CUDA_VISIBLE_DEVICES=uuid1,uuid2; runtime hook applies memory_fraction per device.
  • No gpu-cosplay benchmark

    • verify checks identity disguise. We also need a command that runs a standard FP32/BF16/bandwidth microbench, compares to the target's datasheet, and prints the delta table. Quantifies "this cosplay is 33% slow on BF16 vs real 3090".
    • Most of the code exists in examples/bench.py; just needs a CLI wrapper.
  • Missing CONTRIBUTING.md and SECURITY.md

    • AGENTS.md serves AI agents but human contributors who fork the repo don't find the conventions. Extract the "how to add a new card / a new host family / a new verify check" sections from AGENTS.md into CONTRIBUTING.md.
  • examples/ only contains bench.py

    • Add at least:
      • examples/sd_inference.py — Stable Diffusion inference (3090 vs 4090 cosplay comparison)
      • examples/llm_oom.py — load a 13B model and assert OOM behavior matches target VRAM
      • examples/lora_train.py — LoRA training on a consumer-card cosplay
  • "Hide MIG" mode incomplete

    • torch.cuda.get_device_properties(0).multi_processor_count still returns the MIG slice's real SM count (32 for 2g.35gb) instead of the target's (e.g. 82 for RTX 3090). NSight and other profilers read this.
    • Fix: extend _CosplayDeviceProperties proxy to lie about multi_processor_count, major/minor, etc.
  • No GPU spec auto-fetch

    • When a new GPU launches, someone manually scrapes the whitepaper into gpus.yaml. Could implement gpu-cosplay scrape rtx_5090 --from epoch-ai to automate.

🟦 Architecture thoughts (not bugs, but worth discussing)

  • Cosplay is a single-host concept. Cluster-scale deployment isn't designed.

    • Picture an 8-GPU H200 node with 56 × 1g.18gb slices serving 20 team members. Current CLI is single-user; no reservations / quotas / queues / GC.
    • Natural evolution: SLURM integration, K8s operator, central scheduler, web dashboard, 24h idle-session GC.
    • Not the repo's core mission, but a likely next layer.
  • Fidelity loss is undocumented quantitatively

    • README has prose like "BF16 TC will be too strong". No numbers.
    • Should ship docs/fidelity.md: per (target × host) pair, run standard workloads (GEMM, SD inference, LLM forward), table real vs datasheet delta. Most convincing "honest claim".
  • Reverse direction: simulate stronger features on weaker hardware?

    • Currently we throttle strong hardware to look weak. Could we go the other way? "I have an RTX 4090, want to develop FP8 training as if on H100" — NVIDIA's PTX JIT lets FP8 software-emulate via FP16. If supported, the tool becomes "cosplay any direction".
    • Research value is high. Not P0/P1 but worth scoping.

Process notes

  • All numbers above are from wc -l of the current main (commit b6fe07d) plus a manual read-through.
  • Priority is the reviewer's call, not a contract. Argue if you disagree.
  • Anything marked 🟥 should land before the next round of feature additions; otherwise we keep digging the hole.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions