DM-54879: Support reprocessing when upstream outputs are selectively retained #561
d4f69bd to 4bf23e3
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   88.66%   88.77%   +0.10%
==========================================
  Files         160      160
  Lines       22151    22357     +206
  Branches     2627     2657      +30
==========================================
+ Hits        19640    19847     +207
  Misses       1866     1866
+ Partials      645      644       -1
4bf23e3 to d6ef7aa
  return zstandard.ZstdCompressionDict(b"")
  self.comms.log.info("Training compression dictionary.")
- training_inputs: list[bytes] = []
+ training_inputs: list[bytes | bytearray | memoryview[int]] = []
I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.
I took it from mypy's suggestion directly, but this has now been fixed by ac0bcb1
3e4cff4 to 52abd2a
8237d96 to 4bdec93
@hsinfang are you imminently merging this or is it going to be a few days? I am trying to sync with the v30 release so wondered what your plan was.
@timj this won't be imminent, and can wait longer too if that makes other things easier.
0fad777 to 9e00c45
9e00c45 to abb7cc9
The skip_existing_in behavior of QuantumGraphBuilder was previously only covered through test_separable_pipeline_executor.py, where SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder. No tests exercised the builder directly at the unit level.
abb7cc9 to 671d9f0
Extract the read-only metadata check and the skeleton mutation in _skip_quantum_if_metadata_exists into two helpers, _compute_skip_decision and _apply_skip_decision. No behavior change.
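For readers unfamiliar with this kind of refactor, here is a minimal sketch of splitting a read-only check from the mutation that acts on it. It is not the pipe_base code: `SkipDecision`, `ToySkeleton`, and the function signatures are hypothetical stand-ins, and only the two-helper shape is taken from the commit message above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkipDecision:
    """Hypothetical result of the read-only check (not a pipe_base type)."""

    skip: bool
    existing_outputs: frozenset[str] = frozenset()


@dataclass
class ToySkeleton:
    """Toy stand-in for the graph skeleton: quantum -> its output datasets."""

    outputs: dict[str, set[str]] = field(default_factory=dict)
    skipped: set[str] = field(default_factory=set)


def compute_skip_decision(
    skeleton: ToySkeleton,
    quantum: str,
    existing_metadata: set[str],
    existing_datasets: set[str],
) -> SkipDecision:
    """Read-only step: inspect state and return a decision, mutating nothing."""
    if quantum not in existing_metadata:
        return SkipDecision(skip=False)
    found = frozenset(skeleton.outputs[quantum] & existing_datasets)
    return SkipDecision(skip=True, existing_outputs=found)


def apply_skip_decision(
    skeleton: ToySkeleton, quantum: str, decision: SkipDecision
) -> bool:
    """Mutating step: record a previously computed decision in the skeleton."""
    if decision.skip:
        skeleton.skipped.add(quantum)
    return decision.skip
```

Keeping the decision as a plain value means a caller can compute it, inspect or override it, and only then apply it, which is the property a later unskipping pass would presumably rely on.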
…quanta
Daytime AP runs against data produced by Prompt Processing, which does not retain all intermediate outputs. With --skip-existing-in, tasks whose metadata exists are skipped even when their outputs are absent. When a downstream task then needs to run, some of its inputs may be missing, and it is dropped with no work found. retained_dataset_types lists the dataset types expected to exist in skip_existing_in. Dataset types that are not retained trigger backward unskipping of the ancestor quanta needed to regenerate them.
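The backward unskipping described above can be illustrated with a toy graph walk. This is a sketch under simplifying assumptions, not the builder's implementation: quanta and dataset types are plain strings, each dataset type has at most one producer, and the function name `unskip_ancestors` and the task names in the usage note are hypothetical.

```python
from collections import deque


def unskip_ancestors(
    inputs: dict[str, list[str]],
    producer: dict[str, str],
    skipped: set[str],
    retained: set[str],
) -> set[str]:
    """Return the quanta that remain skipped after backward unskipping.

    inputs:   quantum -> dataset types it consumes (every quantum is a key)
    producer: dataset type -> quantum that produces it (toy: one producer)
    skipped:  quanta whose metadata exists in skip_existing_in
    retained: dataset types expected to exist in skip_existing_in
    """
    still_skipped = set(skipped)
    # Start from every quantum that already has to run.
    to_visit = deque(q for q in inputs if q not in still_skipped)
    while to_visit:
        quantum = to_visit.popleft()
        for dataset_type in inputs[quantum]:
            upstream = producer.get(dataset_type)
            if upstream is None or dataset_type in retained:
                # Overall input, or expected to be available already.
                continue
            if upstream in still_skipped:
                # The input was not retained, so its producer must rerun,
                # and that producer's own inputs need checking in turn.
                still_skipped.discard(upstream)
                to_visit.append(upstream)
    return still_skipped
```

With a hypothetical chain `isr -> calibrate -> subtract -> associate` in which only `associate` lacks metadata, retaining `calexp` unskips `subtract` alone, while retaining nothing unskips the whole chain:

```python
inputs = {
    "isr": ["raw"],
    "calibrate": ["post_isr"],
    "subtract": ["calexp"],
    "associate": ["dia_sources"],
}
producer = {"post_isr": "isr", "calexp": "calibrate", "dia_sources": "subtract"}
skipped = {"isr", "calibrate", "subtract"}

print(unskip_ancestors(inputs, producer, skipped, retained={"calexp"}))
print(unskip_ancestors(inputs, producer, skipped, retained=set()))
```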
SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option to it and get it tested there.
671d9f0 to 4aeecd6
Checklist
package-docs build
doc/changes