Release 0.3.0 #204
Merged
@copilot resolve the merge conflicts in this pull request
@winglet0996, could you check this PR? Let me know if critical things are missing. thx!
@guillaumejaume Thanks for the excellent engineering! I think everything works quite well; I just noticed the docs need to be updated accordingly. Much appreciated!
Multi-format readers, multi-GPU pipeline, run tracking, and lock hardening
TL;DR
This branch promotes TRIDENT from a single-GPU CLI into a production-grade WSI
processing pipeline:
- New WSI readers, plus a corrupt-level fallback in the OpenSlide path.
- Multi-GPU / multi-CPU pipeline: `--gpus 0 1 2 3` (and `--gpus -1 -1 …`), in both standard and cache-pipeline modes.
- Lock hardening: self-describing `.lock` files + safe stale-lock cleanup (`--clear_dead_locks`), replacing the previous "delete every lock at startup" behavior.
- Run tracking: `summary.md`, `runs/<id>.json`, `wsi_states/<slide>__<hash>.json`. Re-running on the same `--job_dir` is now idempotent by design.
- Patch / slide encoders: `KEEP` (patch) added; `GenBio-PathFM`, already in `main`, is now properly surfaced; offline support extended for `gigapath`.
- Test coverage: end-to-end integration tests and a real multi-GPU equivalence stress test.
- Documentation: rewritten to surface the actually-useful features (multi-GPU, resume, cache, lock cleanup, run reports), with corrected output-directory paths.
No breaking CLI changes. `--gpu N` (singular) still works; `--gpus` is preferred.
What's new
1. New WSI readers
- CZI reader (`trident/wsi_objects/CZIWSI.py`, `--reader_type czi`, `pip install -e ".[czi]"`).
- DICOM (`--reader_type openslide` on `.dcm`).
- `OpenSlideWSI`: when a single pyramid level is corrupt, TRIDENT now falls back to the next level instead of failing the whole slide.
- `WSIFactory` updated to dispatch CZI and DCM cleanly.
2. Multi-GPU / multi-CPU pipeline
- `--gpus` flag (`nargs='+'`): pending slides are sharded round-robin across the listed device IDs.
- `--gpus -1 -1` runs N independent CPU workers (useful for segmentation-only or otsu pipelines on machines without a GPU).
- Duplicate GPU IDs are dropped (running several workers on the same CUDA device wastes memory), but `-1` entries are kept (each is an independent CPU worker).
- Multiprocessing context: `forkserver` on POSIX, `spawn` on Windows / when CUDA is in use, so the pipeline works on Linux, macOS, and Windows.
- `WSI.py`: when the chosen mp context can't pickle a complex object (e.g. `WSIPatcher`), TRIDENT transparently retries with a different context, then with single-process loading.
- `--gpu N` still works; if both are given, `--gpus` wins and a one-line warning is printed.
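
The sharding and deduplication rules above can be sketched in a few lines. This is an illustration only, not TRIDENT's actual code: the function names `dedupe_devices` and `shard_round_robin` and the returned dictionary layout are hypothetical, chosen just to show the behavior described in this section.

```python
from typing import Dict, List, Sequence


def dedupe_devices(device_ids: Sequence[int]) -> List[int]:
    """Drop repeated GPU IDs but keep every -1 (each -1 is its own CPU worker)."""
    seen = set()
    workers: List[int] = []
    for dev in device_ids:
        if dev == -1:
            workers.append(dev)   # CPU workers are never merged
        elif dev not in seen:
            seen.add(dev)         # first time this GPU ID is listed
            workers.append(dev)
    return workers


def shard_round_robin(slides: Sequence[str], device_ids: Sequence[int]) -> List[Dict]:
    """Assign slide i to worker i % n_workers (hypothetical helper, not a TRIDENT API)."""
    workers = dedupe_devices(device_ids)
    if not workers:
        raise ValueError("at least one device ID is required")
    n = len(workers)
    return [
        {"device": dev, "slides": list(slides[i::n])}
        for i, dev in enumerate(workers)
    ]


if __name__ == "__main__":
    pending = ["a.svs", "b.svs", "c.svs", "d.svs", "e.svs"]
    for shard in shard_round_robin(pending, [0, 1, 1, -1, -1]):
        print(shard)
```

With `--gpus 0 1 1 -1 -1`, this sketch ends up with four workers (GPU 0, GPU 1, and two CPU workers); the repeated `1` is dropped while both `-1` entries survive.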
3. Lock hardening
- Self-describing locks: `.lock` files are now JSON containing `pid`, `hostname`, `created_at`. This lets TRIDENT (and operators) tell whose lock this is.
- Safe stale-lock cleanup: the new `--clear_dead_locks` flag (and `trident.IO.clear_dead_locks(...)` API) removes a lock only when one of a set of conditions holds, including the lock exceeding `--dead_lock_max_age_hours` (default 24h). Active locks from running jobs are never removed.
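
A minimal sketch of a self-describing lock and the age-based staleness check follows. It is not the code behind `trident.IO.clear_dead_locks(...)`: the helper names `write_lock` and `lock_is_expired` are hypothetical, and storing `created_at` as epoch seconds is an assumption (the PR text does not state the format). Only the age condition is shown, since it is the one spelled out above.

```python
import json
import os
import socket
import time
from pathlib import Path
from typing import Optional


def write_lock(path: Path) -> None:
    """Write a JSON lock recording who holds it (hypothetical helper)."""
    payload = {
        "pid": os.getpid(),
        "hostname": socket.gethostname(),
        "created_at": time.time(),  # assumption: epoch seconds
    }
    path.write_text(json.dumps(payload))


def lock_is_expired(path: Path, max_age_hours: float = 24.0,
                    now: Optional[float] = None) -> bool:
    """Return True only if the lock is readable and older than max_age_hours."""
    try:
        info = json.loads(path.read_text())
        created_at = float(info["created_at"])
    except (OSError, ValueError, KeyError, TypeError):
        # Unreadable or malformed lock: this sketch leaves it alone.
        return False
    current = time.time() if now is None else now
    return (current - created_at) > max_age_hours * 3600
```

Treating an unreadable lock as "not expired" is a deliberately conservative choice for the sketch: when in doubt, the lock stays.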
The previous startup-time `cleanup_files()` (which deleted all `.lock` files unconditionally, which is dangerous for multi-user / multi-job dirs) has been split: cache cleanup is now its own opt-in step (`cleanup_cache`), and lock cleanup is gated by `--clear_dead_locks`.
4. Run tracking and per-slide state
New `trident/State.py` and `trident/Summary.py` give every run:
- `summary.md`: appended once per run; counts (completed / skipped / errored), per-encoder breakdown, and a short error list.
- `runs/<run_id>.json`: per-run manifest (CLI args, timestamps, status).
- `wsi_states/<slide>__<hash>.json`: per-slide machine-readable state with task-level `status`, `attempts` (timings), `outputs`, and `resume` info.
Re-running on the same `--job_dir` skips already-completed (and unlocked) work. This makes long jobs tolerant to wall-time cutoffs, node failures, and SIGKILL-by-scheduler.
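
The resume behavior can be pictured with a small sketch that filters out slides whose state file already marks a task as completed. The state schema used here (`{"tasks": {<task>: {"status": "completed"}}}`) and the helper names are assumptions made for illustration; the real layout written by `trident/State.py` may differ. The hash in the file name is matched with a wildcard because its derivation is not described in this PR.

```python
import glob
import json
from pathlib import Path
from typing import Iterable, List


def task_completed(state_dir: Path, slide_stem: str, task: str) -> bool:
    """Check <slide>__<hash>.json files for a completed task (assumed schema)."""
    pattern = f"{glob.escape(slide_stem)}__*.json"
    for state_file in state_dir.glob(pattern):
        try:
            state = json.loads(state_file.read_text())
        except (OSError, ValueError):
            continue  # unreadable state: treat as not completed
        if not isinstance(state, dict):
            continue
        tasks = state.get("tasks", {})
        entry = tasks.get(task, {}) if isinstance(tasks, dict) else {}
        if isinstance(entry, dict) and entry.get("status") == "completed":
            return True
    return False


def pending_slides(state_dir: Path, slide_stems: Iterable[str], task: str) -> List[str]:
    """Return the slides that still need `task`, preserving input order."""
    return [s for s in slide_stems if not task_completed(state_dir, s, task)]
```

For example, `pending_slides(Path(job_dir) / "wsi_states", ["slideA", "slideB"], "feat")` would return only the slides without a completed `feat` entry, which is the set a re-run would actually process under these assumptions.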
5. Patch / slide encoders
- `KEEP` patch encoder added (768-d, `Astaxanthin/KEEP`), with both online and local-directory loading.
- `gigapath` slide encoder now honors `local_ckpts.json` for offline clusters.
- `BasePatchEncoder.ensure_valid_weights_path` accepts either a checkpoint file or a model directory (needed by HF-style local mirrors like KEEP).
- `trident-doctor` no longer flags a missing HF token unless `--check-gated` is passed (eliminates false alarms for users who only run non-gated models).
- Removed `dataclasses`/`pydantic` from `trident/`: zero runtime dependency on either.
6. `Processor` API
- `selected_wsi_paths=` kwarg lets a worker process a pre-sharded slice of slides without re-running discovery.
- `Processor` now uses `ExitStack` for slide context management, so a failure during init releases the slides that were already opened.
- `mpp` lookup from `--custom_list_of_wsis` now respects per-slide ordering via the `wsi` column (previously it was a positional list and could mismatch `mpp` to the wrong slide).
7. Documentation
A full pass to surface the actually-useful features instead of generic bullet
points:
- `docs/index.rst`: new "Highlights" section organized as Pipeline / Scale / Reliability / Models / Formats / Operability.
- `docs/quickstart.rst`: rewritten into a working reference with sections on outputs, resume / skip behavior, multi-GPU + multi-worker, caching, stage-only examples, common failure modes, and a tight cheat-sheet table; the auto-generated parser help is still included verbatim at the bottom.
- `docs/installation.rst`: documents `.[czi]` and `.[omezarr]` extras, expands `trident-doctor` usage (profiles, `--check-gated`, `--format json` for CI), and clarifies what `.[full]` does and doesn't include.
- `docs/tutorials.rst`: new recipes for multi-GPU production runs and resuming after a crash.
- `docs/api.rst`: adds `KEEP` and `GenBio-PathFM` rows; new "Notes for power users" section on resume / lock cleanup / multi-GPU from Python.
- `docs/faq.rst`: documents `--clear_dead_locks` and adds a multi-GPU FAQ entry.
- `docs/generated/run_batch_of_slides_help.txt` regenerated (now includes `--gpus`, `--clear_dead_locks`, `--dead_lock_max_age_hours`, `keep`).
- `README.md`: rewritten Key Features list (specific encoder names, multi-GPU details, cache pipeline, run reports, lock cleanup); fixed wrong output paths (`./trident_processed/20x_256px/...` → `./trident_processed/20x_256px_0px_overlap/...`).
Sphinx build is clean (no warnings).
Test coverage
14 new test files, ~2.6k lines added.
- `test_run_batch_of_slides.py` (lock cleanup, dead-lock cleanup, GPU dedup, parser); `test_processor_selected_wsi_paths.py`; `test_processor_czi_discovery.py`; `test_summary_md.py`; `test_wsi_states_v2.py`
- `test_czi_reader.py`; `test_czi_huggingface_feature_extraction.py`
- `test_multi_gpu_equivalence_patch_encoders.py` (`uni_v1`, `conch_v1`)
- `test_real_multi_gpu_equivalence.py`: actually launches subprocesses on `cuda:0` and `cuda:1`, processes 2 real WSIs end-to-end, and asserts content-identical coords + features against the single-GPU run
- `test_run_batch_of_slides_integration_outputs.py`: 7 tests covering exact UNI v1 first-embedding values, idempotent re-runs, `--task all` ≡ `seg → coords → feat`, single-worker ≡ multi-worker (CPU), coords determinism, `--wsi_cache` ≡ non-cache, and `--dump_patches` count
Verified locally
- Integration tests (`TRIDENT_RUN_INTEGRATION_TESTS=1`): 69 passed.
- GPU tests (`TRIDENT_RUN_GPU_TESTS=1`): 15 passed.
- Sphinx build: clean.
Migration notes
- `--gpu` is still supported but `--gpus` is preferred. If both are passed, `--gpus` wins and TRIDENT prints a one-line deprecation warning.
- `.lock` cleanup is no longer automatic on startup. If you previously relied on TRIDENT wiping locks for you, add `--clear_dead_locks` to the next run. Existing scripts that don't pass the flag are now safer (they won't step on another job's locks) but may need this one-time migration.
- `--wsi_cache` directory is still wiped at startup (unchanged).
- Output path: `<job_dir>/<mag>x_<patch>px_<overlap>px_overlap/features_<encoder>/<slide>.h5`. The README previously documented the wrong path; the actual code is unchanged.
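
For scripts that locate feature files themselves, the documented layout can be assembled as below. This is a sketch built only from the path pattern in the last migration note; `feature_path` is a hypothetical helper, not a TRIDENT function, and the example values (`20`, `256`, `0`, `uni_v1`) are illustrative.

```python
from pathlib import Path


def feature_path(job_dir: str, mag: int, patch_size: int, overlap: int,
                 encoder: str, slide: str) -> Path:
    """Build <job_dir>/<mag>x_<patch>px_<overlap>px_overlap/features_<encoder>/<slide>.h5."""
    coords_dir = f"{mag}x_{patch_size}px_{overlap}px_overlap"
    return Path(job_dir) / coords_dir / f"features_{encoder}" / f"{slide}.h5"


if __name__ == "__main__":
    # Matches the corrected README path ./trident_processed/20x_256px_0px_overlap/...
    print(feature_path("./trident_processed", 20, 256, 0, "uni_v1", "slide1").as_posix())
```

Scripts that hard-coded the old README path (`20x_256px/...`) would need the `_0px_overlap` suffix added to the directory name.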