ADR: Remote pipeline inclusion #7213
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Great write up, thanks for this Ben! As you might expect, I'm most concerned about the params. You characterise it as a one-off cost which is mitigated by LLMs, however that doesn't take into account updates to included pipelines (a core functionality with included modules). The params drift with updates would be dangerous and a constant source of dev work.

I'd still love to look into how we could bulk import nested config and apply it at root level, even if it is a separate import + apply mechanism (e.g. like config profiles in a sense?). I think without it, the use of the meta-pipeline functionality is substantially limited.
3. No use of project-level assets (`projectDir`, `bin`, `lib`) within the core workflow. Module-level assets can be used through the module `resources/` bundle and `moduleDir`.
4. Declare software dependencies (`container`, `conda`) in the process definition, not in config.
5. No default `ext` settings in config -- specify these defaults in the process definition or use explicit process inputs. Otherwise, any default `ext` settings must be replicated manually in the meta-pipeline.
6. No plugin functions within the core workflow.
It's not clear to me why some of these best practices matter or what goes wrong when they aren't followed; it might be good to add an example.
Following these guidelines makes it so that when you include the core workflow and its dependent modules/subworkflows, it is self-contained.
For example:
- if the core workflow uses project-level assets like `bin` or `lib`, I have to remember to copy them into the meta-pipeline
- if the core workflow uses a param directly and I import that into the meta-pipeline, I have to remember to define the same param (with the same meaning) in the meta-pipeline
- and so on
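
A minimal sketch of the param case, assuming a hypothetical core workflow `CORE`, process `ALIGN`, and param `aligner` (none of these names come from the ADR). The core workflow receives everything through `take:`, and only the entry workflow reads `params`, so a meta-pipeline could include `CORE` without having to define a matching param.

```groovy
// Hypothetical example: ALIGN, CORE and params.aligner are illustrative names only.

process ALIGN {
    input:
    val sample
    val aligner

    output:
    stdout

    script:
    """
    echo "aligning ${sample} with ${aligner}"
    """
}

// Core workflow: no direct use of `params`, so it stays self-contained when included.
workflow CORE {
    take:
    ch_samples   // channel of sample ids
    aligner      // plain value supplied by the caller

    main:
    ch_aligned = ALIGN(ch_samples, aligner)

    emit:
    aligned = ch_aligned
}

// Entry workflow (the pipeline "shell"): the only place that reads `params`.
workflow {
    ch_samples = channel.of('sample1', 'sample2')
    CORE(ch_samples, params.aligner)
    CORE.out.aligned.view()
}
```

Under these assumptions it would be run as `nextflow run main.nf --aligner star`. A meta-pipeline that includes `CORE` decides for itself where the `aligner` value comes from.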
While pipeline chaining has always been possible in theory, new language features such as [workflow outputs](20251020-workflow-outputs.md) and [record types](20260306-record-types.md) make it much more practical. Each pipeline can define a structured output which can be passed to the next pipeline via JSON. Mismatches between an upstream output and downstream input (e.g. missing columns, different column names) can be resolved by a small adapter pipeline.
I remain to be convinced of the point of pipeline chaining if we can trivially make meta pipelines.
Agreed, I think pipeline chaining is used because metapipelines don't work right now. If they did, the number of pipeline chains would drop.
That's not to say they're never useful, but it's much less common.
Two main use cases:
- Run major pipeline (sarek, rnaseq) and add a few auxiliary processes
- Daisy chain two pipelines (fetchngs -> rnaseq)
Both are solved better by metapipelines than pipeline chaining.
The main use case for daisy chaining is actually wiring Nextflow up to non-Nextflow tools, e.g. Nextflow into an ETL system. In this case structured inputs and outputs are still very useful.
at this point the value prop of pipeline chaining appears to be low development overhead (just plug A into B)
Well, chaining has development overhead, it's quite a faff, all we have to do is bring meta-pipeline dev under that faff level
Agreed. Feel like we need some sort of auto-import of the params of child workflows, so e.g. they appear automatically in Platform, and I could say e.g. Then some auto-assembly of docs as well. Basically we need to standardise at the nextflow level where a bunch of the non-nextflow pieces need to live.
3. No use of project-level assets (`projectDir`, `bin`, `lib`) within the core workflow. Module-level assets can be used through the module `resources/` bundle and `moduleDir`.
4. Declare software dependencies (`container`, `conda`) in the process definition, not in config.
5. No default `ext` settings in config -- specify these defaults in the process definition or use explicit process inputs. Otherwise, any default `ext` settings must be replicated manually in the meta-pipeline.
6. No plugin functions within the core workflow.
Plugin support feels like a requirement, functions like a webhook or logging statement could be critical for the workflow. The main challenge might be supporting multiple versions (e.g. WORKFLOW1 uses plugin@1.2.3 and WORKFLOW2 uses plugin@2.4.1), but maybe we can just say "ONE PLUGIN ONLY"
plugins for webhooks / logging typically live outside the core workflow. so the meta-pipeline would just import the core workflow logic and decide whether to include those plugins in its own shell
I have yet to see a plugin that is actually used in a workflow's core logic, although it's certainly possible. Most plugins provide third-party integrations at the pipeline boundary
I agree but this might become more popular with the plugin registry + vibe coding.
Sounds like premature optimization on my part; it's easier to just tell people to be careful and deal with it if it becomes a problem.
sure, that's why I call them out as best practices instead of hard rules. you can use a plugin function as long as you remember to declare it in the meta-pipeline config
2. No `publishDir` -- use the `output` block.
3. No use of project-level assets (`projectDir`, `bin`, `lib`) within the core workflow. Module-level assets can be used through the module `resources/` bundle and `moduleDir`.
4. Declare software dependencies (`container`, `conda`) in the process definition, not in config.
5. No default `ext` settings in config -- specify these defaults in the process definition or use explicit process inputs. Otherwise, any default `ext` settings must be replicated manually in the meta-pipeline.
ext.args is soooooo powerful, yet clearly breaks the interface contract for processes.
I still think we should promote args to a directive and it will solve a number of these issues (process.args) 😉 .
process {
    args "--concise"
    // etc...
}

// main.nf
my_process(ch_inputs, args: "--verbose")

// nextflow.config
process.withName 'my_process' {
    args = "--verbose"
}
both ext.args and process.args can work, as long as the default value for the arg is defined in the process definition rather than in config
the core problem is that when I import a workflow, Nextflow doesn't know which config is "tied" to that workflow
5. No default `ext` settings in config -- specify these defaults in the process definition or use explicit process inputs. Otherwise, any default `ext` settings must be replicated manually in the meta-pipeline.
6. No plugin functions within the core workflow.

For process directives, it is helpful to distinguish *what* is executed vs *how* it is executed. Directives that affect the *what* (`container`, `ext` settings) should be owned by the process definition. Directives that affect the *how* (`cpus`, `memory`, `executor`, `queue`, `errorStrategy`) should be owned by the meta-pipeline.
I don't understand the distinction here.
in other words, some directives affect the task result while others don't
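
As a rough sketch of that split, assuming a hypothetical process `MY_PROCESS` and a placeholder container image (neither is from the ADR): the directives that change the task result live in the process definition, while the meta-pipeline config only sets resources and error handling.

```groovy
// main.nf -- hypothetical process; the name and the image are placeholders
process MY_PROCESS {
    // "what": these change the task result, so they travel with the process
    container 'example.org/my-tool:1.0.0'
    ext args: '--concise'

    input:
    val sample

    output:
    stdout

    script:
    """
    echo my-tool ${task.ext.args} ${sample}
    """
}

workflow {
    MY_PROCESS(channel.of('sample1'))
    MY_PROCESS.out.view()
}
```

```groovy
// nextflow.config of the meta-pipeline -- "how": resources and scheduling only
process {
    withName: 'MY_PROCESS' {
        cpus = 4
        memory = '8 GB'
        errorStrategy = 'retry'
    }
}
```

With this layout, including `MY_PROCESS` elsewhere carries its container and default arguments along, and the including project only has to supply settings that depend on its own infrastructure.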
Alternatively, these core plugin dependencies could be specified in the pipeline spec under `requires.plugins`. When installing a pipeline, Nextflow could copy these plugin declarations into the meta-pipeline config and/or spec.

Since this use case is rare -- plugin functions are typically used in the entry workflow outside the core workflow -- it can be deferred in the first iteration.
With more private plugin registries, I expect more utility methods in plugins (e.g. updateLims(sampleId, status)), but maybe this is premature optimization.
A LIMS integration sounds like something that could live outside the core workflow
The Nextflow-in-Nextflow approach treats the included pipeline as a *black box* -- it preserves the exact pipeline behavior (core workflow + entry workflow + config) while forfeiting dataflow composition (separate dataflow graphs).

An ideal solution might combine the best of both: compose pipelines into a single dataflow graph (white box) while inheriting each pipeline's params, outputs, and config so they need not be replicated (black box). We considered such a model, where an included pipeline contributes its shell as namespaced, overridable defaults, but rejected it. Dataflow composition fundamentally requires exposing the core workflow as a set of channel ports, so the white-box mechanism is unavoidable; inheritance would only layer implicit behavior on top of it. That behavior comes at a steep cost: it relocates a one-time *write* cost (boilerplate) into a recurring *read* cost (hidden defaults, auto-bound arguments, auto-published outputs), burdens every tool that must now understand it (linter, type checker, config resolution, resume), and conflicts with the frozen-island philosophy that otherwise governs vendored code.
I agree with this. The added complexity is enormous.
@ewels @pinin4fjords @adamrtalbot
Pulling everyone into this thread to talk about auto-inheritance
> As you might expect, I'm most concerned about the params. You characterise it as a one-off cost which is mitigated by LLMs, however that doesn't take into account updates to included pipelines (a core functionality with included modules). The params drift with updates would be dangerous and a constant source of dev work.
That's fair, but not my main point. The core problem is this -- if you want to preserve dataflow concurrency between pipelines, then you can't really just auto-import params into the meta-pipeline. You have to define which params are replaced with inter-pipeline wiring vs exposed to the top-level. That amounts to just writing the meta-workflow.
The development overhead is what it is. I suggest the AI skill just as an idea. I'm sure it could also handle updates. All of that is better than having loads of hidden behavior that makes the meta-pipeline impossible to reason about
> I'd still love to look into how we could bulk import nested config and apply it at root level. Even if it is a separate import + apply mechanism (eg. like config profiles in a sense?). I think without it, the use of the meta pipeline functionality is substantially limited.
Not sure I understand this point. Most of the config is just standard boilerplate, so it doesn't make sense to auto-import it because you will just get lots of duplicate config
Unless you are talking about ext config. That will depend on whether we can move the default ext settings into the process definition
Building on what Adam said:
> In a scenario where I update my workflow from v1.1 to v1.2, an update to params should be explicit in the input block, not implicit and I hope it doesn't change too much.
The nice thing about an explicit meta-pipeline definition is that when I update the included pipeline, the linter / language server will immediately pick up on any inconsistencies, because it's just regular code. I'm not sure the tooling would be able to do that if there was a lot of implicit behavior
}
// perform RNAseq analysis
multiqc_report = NFCORE_RNASEQ( ch_samples )
Side note - I would remove MultiQC from all nf-core pipelines and put them in the metapipelines, i.e. no MultiQC repeats, but that's a matter of opinion.
FETCHNGS(ch_inputs)
RNASEQ(FETCHNGS.out)
MULTIQC(RNASEQ.out.qc_files)
I was wondering about that. Wasn't sure if you would want a meta-pipeline to produce one multiqc report per pipeline or just one for the whole thing
I disagree. Having unpredictable global scope params blocks is just weird and if we were designing Nextflow today we would never include this behaviour. In other languages, globals need to be used with caution and are generally not advised. Having random

In a scenario where I update my workflow from v1.1 to v1.2, an update to params should be explicit in the input block, not implicit and I hope it doesn't change too much.

If we really want to make them importable, we could add a dedicated params block to the workflow definition:

workflow THING {
    params:
    foo: Int
    bar: Bool
    baz: String

    take:
    // etc
}

but this doesn't feel very different to:

record ThingParams {
    foo: Int
    bar: Bool
    baz: String
}

workflow THING {
    take:
    params: ThingParams
    // etc
}
When a pipeline is included, it is vendored into the meta-pipeline project under `workflows/<scope>/<name>/`. Included pipelines are isolated -- each included pipeline has its own `modules/` and `workflows/` directories. This way, two pipelines can use different versions of the same module without compromising reproducibility.

Included pipelines should be committed to the meta-pipeline repository. The pipeline should have a *pipeline spec* (`nextflow_spec.json`) which specifies the pipeline version, so that Nextflow can track local changes.
My main concern here is versioning of imported workflows. Do we include a lock file or something to ensure consistency or just trust in the files that are copied into the workflow code?
See here. Like modules, we will likely want to have some sort of checksum verification (e.g. .pipeline-info)
I guess the simplest way would be to commit the entire pipeline, even though only the core workflow will be used. Then you can have a single checksum for the entire pipeline directory
It's probably still useful to keep the pipeline shell in the meta-pipeline repo, since e.g. your agent will want to refer to it when updating the meta-pipeline
nf-core copy+pastes modules for subworkflows and it works well!
```groovy
include { NFCORE_FETCHNGS } from 'nf-core/fetchngs'
include { NFCORE_RNASEQ } from 'nf-core/rnaseq'
```
For anyone feeling adventurous, here is what Claude and I came up with while exploring auto-inheritance:
include { NFCORE_FETCHNGS } from 'nf-core/fetchngs'
include { NFCORE_RNASEQ } from 'nf-core/rnaseq'

params {
    input: Path                      // meta entry point
    strandedness: String = 'auto'    // one new knob
    // aligner / fasta / ... inherited from rnaseq's params, override on CLI as --rnaseq.fasta=...
}

workflow {
    main:
    ch_ids = channel.fromPath(params.input).splitCsv()
    ch_samples = NFCORE_FETCHNGS( ch_ids )
    ch_samples = ch_samples.map { r -> r + record(strandedness: params.strandedness) }
    // rnaseq.* params automagically passed to rnaseq workflow via named arguments
    NFCORE_RNASEQ( samples: ch_samples )
    // no publish/output blocks: each pipeline's outputs publish under <output-dir>/<pipeline>/
    // question: what if I don't want to publish something (e.g. fetchngs output)?
}

Feel free to take it and run with it...
Exactly the way I was thinking. We just namespace the children's params
pditommaso left a comment
Thanks for putting this together, Ben — the dataflow-composition motivation and the rejection of the runtime-inheritance hybrid are both nicely argued. A few thoughts to share before this moves past draft:
1. The key technical challenge could be expanded. At its core this proposes a mechanism to include a fully-fledged Nextflow workflow into another, mimicking how we already include modules and sub-workflows. The part I'd love to see fleshed out is how channels and values get bound into the included workflow's inputs. The ADR sets out the policy (params live at the top level, the core workflow consumes everything via take:) but doesn't yet describe the binding mechanics: how a scalar value vs. a streaming channel is bound at the call site, the value-channel/queue-channel broadcast semantics, and whether a typed take: can accept a bare value type like String/Path. The example here (take: aligner: String) also reads a bit differently from the typed-workflows ADR, where every take: input is a channel type. Since this binding question largely determines feasibility, it'd be great to work it out explicitly.
2. The nomenclature can be better shaped. The document moves between "meta-pipeline" and "remote pipeline", and I think the framing could be sharpened. Terms like workflow modularisation / workflow inclusion / workflow composition might describe what's happening (composing one workflow into another) more directly than introducing a new "meta-pipeline" category.
3. There's some overlap with existing sub-workflow inclusion. Once you discard the entry workflow, params, and output block and import only the core workflow, what's left looks a lot like a sub-workflow. It'd be helpful to clarify how this differs from including a remote sub-workflow, and what the main benefit is that justifies a separate mechanism (separate storage layout, a new nextflow_spec.json, a separate CLI, etc.).
4. A possible framing. I'd lean toward framing the next step as enabling remote sub-workflows — the natural progression after remote modules (processes). Module (process) → sub-workflow → composition feels like a clean, incremental story that reuses the conventions we already have, rather than introducing a "pipeline" as a new top-level artifact with its own resolution rules, storage path, and spec file. If we get remote sub-workflow inclusion right, "meta-pipelines" might largely fall out of it as a usage pattern rather than a new concept.
@pditommaso thanks for the review
There isn't much to say here because it just works like normal. In the appendix example,
A workflow take can be a channel, a dataflow value, or a regular value. This is how it has always worked
"Meta-pipeline" is the top-line feature that everyone is after, but the only actual new feature proposed by the ADR is "remote pipeline inclusion" -- how to install a pipeline as a component and keep it in sync with the source. This is why the ADR is titled "Remote pipeline inclusion". Once you have that, everything else is just normal workflow composition and convention. They are distinct concepts -- the ADR does not treat them as interchangeable.
The core workflow looks like a subworkflow because it is a subworkflow 😄 The only new thing that we introduce here is installing a pipeline into a project as a component and keeping it in sync with the remote source (either from Git or the registry). For that you likely need a pipeline spec (version, checksum) and a CLI (installing, updating). I just haven't spelled all that out yet because the bigger question right now is how to minimize developer overhead
Looks like you arrived at the same place as me. Remote workflows are the real feature, meta-pipelines emerge naturally as a convention on top. I'm not sure whether it's worth trying to distinguish between pipelines / workflows / subworkflows. They're all basically the same thing, especially if we add the ability to execute named workflows directly (#7208). The difference boils down to boilerplate, which we want to minimize anyway. This is why I just talk about "remote pipeline inclusion", because when I import a workflow, I don't really care whether that workflow is a "pipeline" like rnaseq or a "subworkflow" like Happy to rename the ADR to "remote workflow inclusion" to align with the
Modules have a
@adamrtalbot agreed, I never said global. I would love it if the pipeline config were imported within a dedicated scope and treated as a baseline default. Then the importing pipeline can override anything, but doesn't need to duplicate config that isn't being changed. Doing this would not be trivial. The only way I can think of is to do something fairly radical like rendering the config at import time and saving that to a locked config file somewhere. Or some other crazy mechanism.
Config or params? In my mind they are very different concepts, I was referring to parameters here.
I agree with this. They're all workflows*, the only thing that separates a "pipeline" from a subworkflow is perception. *except the anonymous entry workflow, which is where the sticky point about params and config comes in 😉
Ideally params, but might need to be config for all the
Yeah as it stands I think this basically boils down to the functionality we already have with
Can the nf-core tooling install a workflow from a pipeline repo? e.g. NFCORE_RNASEQ from nf-core/rnaseq? I think that is the main thing that this ADR adds
// module
include { BWA_MEM } from 'nf-core/bwa/mem'
// pipeline
include { NFCORE_RNASEQ } from 'nf-core/rnaseq'
One point that makes me hesitant to reframe the ADR as "remote workflow inclusion" -- here we are referencing the pipeline by name (nf-core/rnaseq)
It could be the GitHub repo or an entity in the Nextflow registry, but either way, the pipeline itself plays a role in facilitating the inclusion. Even if we only include the core workflow (NFCORE_RNASEQ), we likely need to store the entire pipeline code in the meta-pipeline repo, because that is the thing that is versioned
As a user, I will want to know that my meta-pipeline is using a specific pipeline version (e.g. nf-core/rnaseq 3.3.0), so in effect we have to say that we are including the entire pipeline
This PR adds an ADR for remote pipeline inclusion, aka "meta-pipelines".
It describes an approach for including remote pipelines into a meta-pipeline in a way that preserves dataflow concurrency between pipeline inputs/outputs.
It discusses alternative approaches such as pipeline chaining / nf-cascade and why they don't satisfy certain use cases (preserving dataflow concurrency).
It also walks through a basic example of fetchngs -> rnaseq.