Release 0.3.0 #204

Merged
guillaumejaume merged 13 commits into main from gja/feature/slidereaders on May 5, 2026

@guillaumejaume guillaumejaume commented Apr 15, 2026


Multi-format readers, multi-GPU pipeline, run tracking, and lock hardening

TL;DR

This branch promotes TRIDENT from a single-GPU CLI into a production-grade WSI
processing pipeline:

  • Two new WSI readers (Zeiss CZI, OpenSlide-DICOM) and a corruption-tolerant
    fallback in the OpenSlide path.
  • Multi-GPU + multi-CPU sharding via --gpus 0 1 2 3 (and --gpus -1 -1 …),
    in both standard and cache-pipeline modes.
  • Self-describing .lock files + safe stale-lock cleanup (--clear_dead_locks),
    replacing the previous "delete every lock at startup" behavior.
  • Per-run / per-slide state and reports: summary.md, runs/<id>.json,
    wsi_states/<slide>__<hash>.json. Re-running on the same --job_dir is now
    idempotent by design.
  • New encoders: KEEP (patch) is added, and GenBio-PathFM, which was already
    in main, is now properly surfaced; offline support is extended to gigapath.
  • Big test bump: 14 new test files (~2.6k lines), including real-data
    end-to-end integration tests and a real multi-GPU equivalence stress test.
  • Docs overhaul: index / quickstart / installation / tutorials / api / FAQ
    rewritten to surface the actually-useful features (multi-GPU, resume, cache,
    lock cleanup, run reports), with corrected output-directory paths.

No breaking CLI changes. --gpu N (singular) still works; --gpus is preferred.


What's new

1. New WSI readers

  • Zeiss CZI (trident/wsi_objects/CZIWSI.py, --reader_type czi,
    pip install -e ".[czi]").
  • DICOM through OpenSlide (--reader_type openslide on .dcm).
  • Corruption-tolerant level reads in OpenSlideWSI: when a single pyramid
    level is corrupt, TRIDENT now falls back to the next level instead of failing
    the whole slide.
  • WSIFactory updated to dispatch CZI and DCM cleanly.
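
The level fallback described above can be illustrated with a minimal sketch. This is not the code in OpenSlideWSI: the function name `read_with_level_fallback` and the `read_level` callable are hypothetical, and "next level" is assumed to mean the next index in the pyramid.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def read_with_level_fallback(
    read_level: Callable[[int], T],
    start_level: int,
    level_count: int,
) -> Tuple[int, T]:
    """Try start_level, then each following level, until one read succeeds.

    Returns (level_used, data). Raises RuntimeError if every level fails.
    """
    last_error: Optional[Exception] = None
    for level in range(start_level, level_count):
        try:
            return level, read_level(level)
        except Exception as err:  # a real reader would catch its own error type
            last_error = err
    raise RuntimeError(
        f"no readable level in [{start_level}, {level_count})"
    ) from last_error
```

The caller gets back the level that was actually read, so it can rescale coordinates if a fallback happened.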

2. Multi-GPU / multi-CPU pipeline

  • New --gpus flag (nargs='+'): pending slides are sharded round-robin across
    the listed device IDs.
  • Multi-CPU fallback: --gpus -1 -1 runs N independent CPU workers (useful for
    segmentation-only or otsu pipelines on machines without a GPU).
  • Smart dedup: duplicate positive GPU IDs are deduplicated (running two
    workers on the same CUDA device wastes memory), but -1 entries are kept
    (each is an independent CPU worker).
  • Cross-platform multiprocessing context: prefers forkserver on POSIX,
    spawn on Windows / when CUDA is in use, so the pipeline works on Linux,
    macOS, and Windows.
  • DataLoader pickling fallback in WSI.py: when the chosen mp context can't
    pickle a complex object (e.g. WSIPatcher), TRIDENT transparently retries
    with a different context, then with single-process loading.
  • Backward-compatible: --gpu N still works; if both are given, --gpus wins
    and a one-line warning is printed.
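
As a rough illustration of the flag precedence, dedup rule, and round-robin sharding listed above, here is a self-contained sketch. The helper names (`build_parser`, `resolve_devices`, `shard_round_robin`) and the default of device 0 when neither flag is given are assumptions made for the example, not TRIDENT's actual implementation.

```python
import argparse
import warnings
from typing import List, Optional, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two device flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=int, default=None)
    parser.add_argument("--gpus", type=int, nargs="+", default=None)
    return parser


def resolve_devices(gpu: Optional[int], gpus: Optional[Sequence[int]]) -> List[int]:
    """Pick the device list; --gpus wins over --gpu when both are given."""
    if gpus is not None:
        if gpu is not None:
            warnings.warn("--gpu is ignored because --gpus was given")
        requested = list(gpus)
    elif gpu is not None:
        requested = [gpu]
    else:
        requested = [0]  # assumed default, not taken from the PR

    seen = set()
    devices: List[int] = []
    for device in requested:
        if device >= 0:
            if device in seen:
                continue  # a repeated CUDA device ID is dropped
            seen.add(device)
        devices.append(device)  # every -1 is kept as its own CPU worker
    return devices


def shard_round_robin(
    slides: Sequence[str], devices: Sequence[int]
) -> List[Tuple[int, List[str]]]:
    """Assign slide i to worker i % len(devices); one shard per worker."""
    shards: List[Tuple[int, List[str]]] = [(device, []) for device in devices]
    for index, slide in enumerate(slides):
        shards[index % len(shards)][1].append(slide)
    return shards
```

Shards are returned as a list of (device, slides) pairs rather than a dict, because several CPU workers share the same ID of -1.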

3. Lock hardening

  • Self-describing locks: .lock files are now JSON containing pid,
    hostname, created_at. This lets TRIDENT (and operators) tell whose lock
    this is.

  • Safe stale-lock cleanup: new --clear_dead_locks flag (and
    trident.IO.clear_dead_locks(...) API) removes locks only when one of:

    1. the target output already exists,
    2. the writer PID is dead on this host, or
    3. the lock is unreadable / legacy and older than --dead_lock_max_age_hours
      (default 24h).

    Active locks from running jobs are never removed.

  • The previous startup-time cleanup_files() (which deleted all .lock
    files unconditionally — dangerous for multi-user / multi-job dirs) has been
    split: cache cleanup is now its own opt-in step (cleanup_cache), and lock
    cleanup is gated by --clear_dead_locks.
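
The three removal conditions above can be sketched as a single predicate. This is an illustrative, POSIX-only sketch and not the body of trident.IO.clear_dead_locks: the function names, the epoch-seconds format of created_at, and the use of file modification time as the age of a legacy lock are all assumptions.

```python
import json
import os
import socket
import time
from pathlib import Path


def write_lock(lock_path: Path) -> None:
    """Write a self-describing JSON lock (created_at format is assumed)."""
    payload = {
        "pid": os.getpid(),
        "hostname": socket.gethostname(),
        "created_at": time.time(),
    }
    lock_path.write_text(json.dumps(payload))


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_dead_lock(lock_path: Path, target_path: Path, max_age_hours: float = 24.0) -> bool:
    """Return True only when one of the three removal conditions holds."""
    if target_path.exists():
        return True  # 1. the output is already there
    try:
        info = json.loads(lock_path.read_text())
        pid, host = int(info["pid"]), info["hostname"]
    except (OSError, ValueError, KeyError, TypeError):
        # 3. unreadable or legacy lock: fall back to an age check
        age_hours = (time.time() - lock_path.stat().st_mtime) / 3600
        return age_hours > max_age_hours
    if host == socket.gethostname():
        return not pid_alive(pid)  # 2. writer PID is dead on this host
    return False  # a readable lock from another host is left alone
```

Under these assumptions, a readable lock written on a different host is never removed unless its output exists, since the PID cannot be checked remotely.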

4. Run tracking and per-slide state

New trident/State.py and trident/Summary.py give every run:

  • summary.md: appended once per run; counts (completed / skipped /
    errored), per-encoder breakdown, and a short error list.
  • runs/<run_id>.json: per-run manifest (CLI args, timestamps, status).
  • wsi_states/<slide>__<hash>.json: per-slide machine-readable state
    with task-level status, attempts (timings), outputs, and resume info.

Re-running on the same --job_dir skips already-completed (and unlocked)
work. This makes long jobs tolerant to wall-time cutoffs, node failures, and
SIGKILL-by-scheduler.
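
To make the resume rule concrete, here is a small sketch of the "completed and unlocked" check and of the per-slide state file naming. The PR does not say what the `<hash>` is computed from or how lock files are named; a truncated SHA-1 of the slide path and an `<output>.lock` sibling file are assumptions used purely for illustration.

```python
import hashlib
from pathlib import Path


def state_file(job_dir: Path, slide_path: str) -> Path:
    """Build wsi_states/<slide>__<hash>.json (hash input is an assumption)."""
    stem = Path(slide_path).stem
    digest = hashlib.sha1(slide_path.encode("utf-8")).hexdigest()[:8]
    return job_dir / "wsi_states" / f"{stem}__{digest}.json"


def should_skip(output_path: Path) -> bool:
    """Skip work whose output exists and is not currently locked."""
    lock_path = output_path.with_name(output_path.name + ".lock")  # assumed naming
    return output_path.exists() and not lock_path.exists()
```

A hash suffix of this kind keeps two slides with the same file name in different folders from sharing one state file.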

5. Patch / slide encoders

  • KEEP patch encoder added (768-d, Astaxanthin/KEEP), with both online
    and local-directory loading.
  • gigapath slide encoder now honors local_ckpts.json for offline
    clusters.
  • BasePatchEncoder.ensure_valid_weights_path accepts either a checkpoint
    file or a model directory (needed by HF-style local mirrors like KEEP).
  • trident-doctor no longer flags missing HF token unless --check-gated is
    passed (eliminates false alarms for users who only run non-gated models).
  • Removed dataclasses / pydantic from trident/: zero runtime dependency
    on either.

6. Processor API

  • New selected_wsi_paths= kwarg lets a worker process a pre-sharded slice of
    slides without re-running discovery.
  • Processor now uses ExitStack for slide context management, so a failure
    during init releases the slides that were already opened.
  • mpp lookup from --custom_list_of_wsis now respects per-slide ordering via
    the wsi column (previously it was a positional list and could mismatch
    mpp to the wrong slide).
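
The ExitStack pattern mentioned above can be shown with a stand-in slide class. `FakeSlide` and `open_all` are hypothetical names for this sketch; only `contextlib.ExitStack` and its `enter_context` / `pop_all` methods are real standard-library API.

```python
from contextlib import ExitStack
from typing import List, Tuple


class FakeSlide:
    """Stand-in for a slide handle; records the order in which it is released."""

    closed: List[str] = []

    def __init__(self, name: str) -> None:
        if name.startswith("bad"):
            raise OSError(f"cannot open {name}")
        self.name = name

    def __enter__(self) -> "FakeSlide":
        return self

    def __exit__(self, *exc_info) -> bool:
        FakeSlide.closed.append(self.name)
        return False  # never swallow exceptions


def open_all(names: List[str]) -> Tuple[List[FakeSlide], ExitStack]:
    """Open every slide, or release the ones already opened if any fails."""
    with ExitStack() as stack:
        slides = [stack.enter_context(FakeSlide(name)) for name in names]
        # Success: move the cleanup callbacks to a new stack owned by the caller.
        return slides, stack.pop_all()
```

On success the caller is responsible for calling `close()` on the returned stack; on failure the slides opened so far are released in reverse order before the error propagates.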

7. Documentation

A full pass to surface the actually-useful features instead of generic bullet
points:

  • docs/index.rst: new "Highlights" section organized as Pipeline / Scale /
    Reliability / Models / Formats / Operability.
  • docs/quickstart.rst: rewritten as a working reference with sections on
    outputs, resume / skip behavior, multi-GPU + multi-worker, caching, stage-only
    examples, common failure modes, and a tight cheat-sheet table; the
    auto-generated parser help is still included verbatim at the bottom.
  • docs/installation.rst: documents .[czi] and .[omezarr] extras,
    expands trident-doctor usage (profiles, --check-gated, --format json for
    CI), and clarifies what .[full] does and doesn't include.
  • docs/tutorials.rst: new recipes for multi-GPU production runs and
    resuming after a crash.
  • docs/api.rst: adds KEEP and GenBio-PathFM rows; new "Notes for
    power users" section on resume / lock cleanup / multi-GPU from Python.
  • docs/faq.rst: documents --clear_dead_locks and adds a multi-GPU FAQ
    entry.
  • docs/generated/run_batch_of_slides_help.txt regenerated (now includes
    --gpus, --clear_dead_locks, --dead_lock_max_age_hours, keep).
  • README.md: rewritten Key Features list (specific encoder names, multi-GPU
    details, cache pipeline, run reports, lock cleanup); fixed wrong output paths
    (./trident_processed/20x_256px/... → ./trident_processed/20x_256px_0px_overlap/...).

Sphinx build is clean (no warnings).


Test coverage

14 new test files, ~2.6k lines added.

| Tier | New tests |
|---|---|
| Fast unit | test_run_batch_of_slides.py (lock cleanup, dead-lock cleanup, GPU dedup, parser); test_processor_selected_wsi_paths.py; test_processor_czi_discovery.py; test_summary_md.py; test_wsi_states_v2.py |
| CZI reader | test_czi_reader.py; test_czi_huggingface_feature_extraction.py |
| Multi-GPU equivalence (mocked) | test_multi_gpu_equivalence_patch_encoders.py (uni_v1, conch_v1) |
| Multi-GPU equivalence (real) | test_real_multi_gpu_equivalence.py: launches real subprocesses on cuda:0 and cuda:1, processes 2 real WSIs end-to-end, and asserts content-identical coords + features against the single-GPU run |
| Heavy real-data integration | test_run_batch_of_slides_integration_outputs.py: 7 tests covering exact UNI v1 first-embedding values, idempotent re-runs, --task allseg → coords → feat, single-worker ≡ multi-worker (CPU), coords determinism, --wsi_cache ≡ non-cache, --dump_patches count |

Verified locally

  • Fast tier: 64 passed, 45 skipped (intentionally gated).
  • Integration tier (TRIDENT_RUN_INTEGRATION_TESTS=1): 69 passed.
  • GPU tier (TRIDENT_RUN_GPU_TESTS=1): 15 passed.
  • Heavy real-data tier: 7 passed (~2.5 min on CPU).
  • Real multi-GPU stress test (2× CUDA): 1 passed (~30 s).
  • Combined full sweep: 80 passed, 0 failures, 0 flakes.

Sphinx build: clean.


Migration notes

  • --gpu is still supported but --gpus is preferred. If both are passed,
    --gpus wins and TRIDENT prints a one-line deprecation warning.
  • .lock cleanup is no longer automatic on startup. If you previously
    relied on TRIDENT wiping locks for you, add --clear_dead_locks to the next
    run. Existing scripts that don't pass the flag are now safer (they won't
    step on another job's locks) but may need this one-time migration.
  • --wsi_cache directory is still wiped at startup (unchanged).
  • Output directory name unchanged: features still land under
    <job_dir>/<mag>x_<patch>px_<overlap>px_overlap/features_<encoder>/<slide>.h5
    — the README previously documented the wrong path; the actual code is
    unchanged.
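
For scripts that need to locate feature files, the path template above can be written as a small helper. `features_path` is a hypothetical convenience function for this example and is not part of the TRIDENT API; the example values (uni_v1, slide_001) are placeholders.

```python
from pathlib import Path
from typing import Union


def features_path(
    job_dir: Union[str, Path],
    mag: int,
    patch_size: int,
    overlap: int,
    encoder: str,
    slide: str,
) -> Path:
    """Build <job_dir>/<mag>x_<patch>px_<overlap>px_overlap/features_<encoder>/<slide>.h5."""
    run_dir = f"{mag}x_{patch_size}px_{overlap}px_overlap"
    return Path(job_dir) / run_dir / f"features_{encoder}" / f"{slide}.h5"


example = features_path("./trident_processed", 20, 256, 0, "uni_v1", "slide_001")
```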

@guillaumejaume guillaumejaume marked this pull request as draft April 15, 2026 16:12
@guillaumejaume

@copilot resolve the merge conflicts in this pull request

@winglet0996 winglet0996 mentioned this pull request Apr 20, 2026
@guillaumejaume

@winglet0996, could you check this PR? Let me know if critical things are missing. thx!

@winglet0996

@guillaumejaume Thanks for the excellent engineering! I think everything works quite well and I just noticed the docs need update accordingly? Much appreciated!

@guillaumejaume guillaumejaume changed the title from Gja/feature/slidereaders to Release 0.3.0 May 5, 2026
@guillaumejaume guillaumejaume marked this pull request as ready for review May 5, 2026 13:47
@guillaumejaume guillaumejaume merged commit e0dbde6 into main May 5, 2026
2 checks passed
@guillaumejaume guillaumejaume deleted the gja/feature/slidereaders branch May 29, 2026 12:07