acmrep25/index.html at main · George-Marchment/acmrep25 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
<!DOCTYPE html>
<html lang="en">

<head>
    <title>ACM Tutoriel</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.1/css/bootstrap.min.css">
    <link rel="stylesheet" href="assets/styles.css">
</head>

<body>
    <h1 id="computational-reproducibility-with-scientific-workflows-analysing-viral-genomes-with-nextflow"><p>Computational
        Reproducibility With Scientific Workflows: Analysing viral genomes with Nextflow</p></h1>
    An ACM Rep 2025 tutoriel
    <h4> <div style="color:rgb(20, 83, 56)">George Marchment, Sarah Cohen-Boulakia and Frédéric Lemoine </div></h4>
    <p><img src="img/comic.png" alt="comic" width="70%" /></p>

    <h2>Table of content</h2>
    <ul>
        <li>I. <a href="#Abstract">Abstract</a></li>
        <li>II. <a href="#LearningObjectivesandoutline">Learning Objectives and outline</a></li>
        <li>III. <a href="#TutorielMaterial">Tutoriel Material</a>
            <ul>
                <li>A. <a href="#Lecture">Lecture</a></li>
                <li>B. <a href="#PraticalSession">Pratical Session</a>
                    <ul>
                        <li>a. <a href="#Objectif">Objectif</a></li>
                        <li>b. <a href="#Inputdata">Input data</a></li>
                        <li>c. <a href="#Detailedsteps">Detailed steps</a></li>
                        <li>d. <a href="#Listoftoolsneeded">List of tools needed</a></li>
                        <li>e. <a href="#AnalysisQuestions">Analysis Questions</a></li>
                        <li>f. <a href="#Correction">Correction</a></li>
                    </ul>
                </li>
                <li>C. <a href="#Reproducibilityconsensus">Reproducibility consensus</a></li>
            </ul>
        </li>
        <li>IV. <a href="#IntendedAudienceFormatandSpecialEquipmentNeeds">Intended Audience, Format and Special
                Equipment
                Needs</a></li>
        <li>V. <a href="#Authors">Authors</a></li>
    </ul>


    <!--ABSTRACT-->
    <h2 id="i-a-name-abstract-a-abstract">I. <a name='Abstract'></a>Abstract</h2>
    <p>In an era of generation of large datasets and complex scientific analyses, ensuring the reproducibility of data
        analyses has become paramount. <b>Workflow management systems have emerged as a key solution to this
            challenge</b>. By
        managing:
    <ul>
        <li> the software environment,
        <li> task scheduling,
        <li> parallelisation and
        <li> communication with the execution machines (HPC, cloud, etc.).
    </ul>
    They significantly facilitate workflow development compared to historical practices (e.g.,
    simple bash scripts), all while ensuring <b>scalablility</b> and a <b>high level of reproducibility</b>.
    However, while
    they
    are becoming more popular, workflow management systems have not yet gained wide adoption within the scientific
    community, largely due to established practices and the perceived high learning curve associated with their use.
    </p>

    <p><img src="img/intro.png" alt="intro" width="70%" /></p>
    <p><b>This tutorial aims at demonstrating the critical role of workflow management systems in implementing
            reproducible
            data analyses</b>, with an emphasis on their capacity to encapsulate heterogeneous code, manage software
        environments,
        scale with the data size, and leverage heterogeneous computational resources efficiently. To do so, we will use
        the
        Nextflow workflow system and a viral genome sequence reconstruction pipeline as a use case. This will
        demonstrate the fundamentals of Nextflow and illustrate how it can be used to easily implement, execute, and
        share a
        simple workflow. </p>


    <!--Learning Objectives and outline-->
    <h2 id="ii-a-name-learningobjectivesandoutline-a-learning-objectives-and-outline">II. <a
            name='LearningObjectivesandoutline'></a>Learning Objectives and outline</h2>
    <p>Key learning outcomes include
    <ul>
        <li> acquiring basic workflow concepts,
        <li> learning how to implement simple workflows and
        <li> understanding the capabilities of workflow management systems in encapsulating heterogeneous code,
            scalability, software environment management, and computational resource management.
    </ul>
    The tutorial will be organized in three phases.</p>
    <ol>
        <li>We will start with a short lecture to present the main challenges in implementing reproducible data
            analyses,
            specifically using a motivating example for illustration. We will then introduce how workflow management
            systems
            are capable of solving these challenges. To achieve this, we will present the Nextflow framework and then
            demonstrate how it can be used to solve the motivating example. Finally, we will introduce the analysis
            pipeline
            to implement as a workflow. It consists of several viral genome sequencing datasets that need to go through
            multiple analysis steps in order to reconstruct full viral genomes with their annotations.</li>
        <li>The second part of the tutorial will consist of a practical session, during which the participants will
            implement the analysis using the Nextflow management system. The session will be highly interactive,
            featuring
            coding demonstrations and structured group discussions, which will allow participants to apply their
            knowledge,
            with organizers providing support and answering questions. </li>
        <li>The tutorial will conclude with a discussion and survey on the challenges encountered. We will then conduct
            a
            reproducibility consensus by evaluating the results provided by the participants, where we as a group will
            assess the level of reproducibility achieved, fostering a collaborative evaluation of the reproducibility
            achieved. </li>
    </ol>
    <p>By the end of the tutorial, participants will have a solid foundation in workflow management systems and be
        capable
        of designing and implementing reproducible data analysis workflows, aligning with the broader goals and themes
        of <a href="https://acm-rep.github.io/2025/">ACM REP 2025</a>.</p>


    <!--Tutoriel Material-->
    <h2 id="iii-a-name-tutorielmaterial-a-tutoriel-material">III. <a name='TutorielMaterial'></a>Tutoriel Material</h2>
    <h3 id="a-a-name-lecture-a-lecture">A. <a name='Lecture'></a>Lecture</h3>
    <p>Link to the lecture slides can be found <a
            href="https://github.com/George-Marchment/acmrep25/blob/main/tutoriel_material/slides.pdf">here</a>.</p>
    <h3 id="b-a-name-praticalsession-a-pratical-session">B. <a name='PraticalSession'></a>Pratical Session</h3>
    <h4 id="a-a-name-objectif-a-objectif">a. <a name='Objectif'></a>Objectif</h4>


    <p>
        The aim of this practical session is to create a Nextflow workflow to analyse a SARS-CoV-2 sequencing dataset.
        The
        objectives are:
    <ol>
        <li>Infer the full sequence of the virus</li>
        <li>Detect the <a href="https://clades.nextstrain.org/">clade</a> (alpha, beta, etc.). </li>
    </ol>


    To do so we will start from a sample that has been sequenced on an Illumina sequencer, and we will run the following
    steps:
    <ol>
        <li>Map the reads onto a reference genome.</li>
        <li>Build the consensus sequence from the mapped reads.</li>
        <li>Identify the clade of the virus based on the consensus sequence.</li>
    </ol>

    Performing these types of analysis, understanding the genetic sequence and clade of a virus is crucial for multiple
    reasons.
    <ul>
        <li> Firstly, they help in the tracking of the evolution and spread of the virus, providing insights into how it
            mutates over time.
            This information is vital for public health officials to make informed decisions and implement effective
            control measures.
        <li> Secondly, identifying specific clades can help in understanding the virulence and transmissibility of
            different variants, which is essential for developing targeted treatments and vaccines.
            Additionally, this analysis supports epidemiological studies by enabling the identification of outbreak
            sources and transmission patterns.
    </ul> Overall, the detailed genetic analysis of viruses enhances the ability to respond to outbreaks.
    </p>

    <p>
    <figure>
        <img src="img/phylo.png" alt="coronavirus" width="70%" />
        <figcaption>SARS-CoV-2 phylogeny from <a href="https://nextstrain.org">nextstrain.org</a>. Data updated
            2025-06-09</figcaption>
    </figure>

    </p>

    <h4 id="b-a-name-inputdata-a-input-data">b. <a name='Inputdata'></a>Input data</h4>
    <p>It consists of: </p>
    <ul>
        <li>Two compressed fastq files containing paired-end reads from Amplicon sequencing a SARS-CoV2 sample. Which
            can be
            downloaded here:<ul>
                <li>Reads 1: <a
                        href="https://github.com/George-Marchment/acmrep25/blob/main/data/reads/SRR13182925_1.fastq.gz">here</a>
                </li>
                <li>Reads 2: <a
                        href="https://github.com/George-Marchment/acmrep25/blob/main/data/reads/SRR13182925_2.fastq.gz">here</a>
                </li>
            </ul>
        </li>
        <li>The reference genome to map the reads against (<a
                href="https://www.ncbi.nlm.nih.gov/nuccore/MN908947">https://www.ncbi.nlm.nih.gov/nuccore/MN908947</a>).
            Which can be download here:<ul>
                <li><a
                        href="https://github.com/George-Marchment/acmrep25/blob/main/data/genome/reference.fa">reference</a>
                </li>
            </ul>
        </li>
    </ul>


    <h4 id="c-a-name-detailedsteps-a-detailed-steps">c. <a name='Detailedsteps'></a>Detailed steps</h4>

    <p>The resulting workflow should look like this:</p>
    <p><img src="img/wf.png" alt="Workflow graph" width="70%" /></p>

    <h5 id="1-mapping-the-reads-on-a-reference-genome">1. Mapping the reads on a reference genome</h5>
    <ul>
        <li>
            <p>First the reference genome needs to be indexed, this steps is important since it allows to accelerate the
                mapping process. For more information regarding genome indexing see <a
                    href="https://pmbio.org/module-02-inputs/0002/04/01/Indexing/">here</a>.</p>
        </li>
        <ul>
            <li>
                Input of step</p>
                <pre><kbd>file ref</kbd></pre>
            </li>
            <li>
                Output of step</p>
                <pre><kbd>tuple val(ref.name), file("${ref.baseName}.*")</kbd></pre>
            </li>
            <li>
                Command lines</p>
                <pre><kbd>bwa index ${ref}</kbd></pre>
            </li>
        </ul>

        <li>
            <p>The next step is the mapping of the reads to reference genome. Essentially, this step allows to find where in the genome our reads originate from. Read mapping will be
                performed using <a href="https://github.com/lh3/bwa">bwa mem</a>.</p>
        </li>
        <ul>
            <li>
                Inputs of step</p>
                <pre><kbd>tuple val(name), file(f1), file(f2)</kbd>
<kbd>tuple val(refName), file(ref)</kbd></pre>
            </li>
            <li>
                Output of step</p>
                <pre><kbd>tuple val(name), file("*.bam"), file("*.bai")</kbd></pre>
            </li>
            <li>
                Command lines</p>
                <pre><kbd>bwa mem -t 1 reference.fa reads1.fq reads2.fq > tmp.sam</kbd>
<kbd>samtools sort -o sample.bam tmp.sam</kbd>
<kbd>samtools index sample.bam</kbd></pre>
            </li>
        </ul>


    </ul>


    <h5 id="2-building-consensus-sequence">2. Building consensus sequence</h5>
    <ul>
        <li>
            <p>Consensus sequence will be inferred also using <a
                    href="https://andersen-lab.github.io/ivar/html/manualpage.html">iVAR</a>.
                The consensus sequence is a theoretical representative of a nucleotide sequence in which each nucleotide
                is the one which occurs most frequently at that site in the different sequences. The goal of using a
                consensus
                sequence is by using the most frequent nucleotide at each site, the conserved regions are preserved,
                these being the most functionnaly important.
                Additionaly, sequencing errors are also neglected (by averaging them out), creating
                a sequence in which we have more confidence. For more information regarding the consensus sequence see
                <a
                    href="https://www.ncbi.nlm.nih.gov/mesh?Db=mesh&Cmd=DetailsSearch&Term=%22Consensus+Sequence%22%5BMeSH+Terms%5D">here</a>.
            </p>
        </li>
        <ul>
            <li>
                Input of step</p>
                <pre><kbd>tuple val(name), file(bam), file(bai)</kbd></pre>
            </li>
            <li>
                Output of step</p>
                <pre><kbd>file "${name}.fa"</kbd></pre>
            </li>
            <li>
                Command lines</p>
                <pre><kbd>samtools mpileup -d 600000 -A -Q 0 -F 0 ${bam} | ivar consensus -q 20 -t 0 -m 5 -n N -p ${name}</kbd></pre>
            </li>
        </ul>

    </ul>


    <h5 id="3-detecting-clade">3. Detecting clade</h5>
    <p>
        Viral diversity is often broken down into Clades or lineages which are defined by specific combinations of
        signature mutations.
        Clades are groups of related sequences that share a common ancestor.
        To detect a sequences clade we use 2 different methods Pangolin and NextClade.
        For more information see
        <a
            href="https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/04-clade-assignment.html">here</a>.
    </p>

    <ul>
        <li>
            <p>Detecting clade (<a href="https://clades.nextstrain.org/">NextClade</a>)</p>
        </li>
        <p>Nextclade assigns sequences to clades by placing the sequences on a phylogenetic tree annotated with clade
            definitions. More specifically, Nextclade assigns the clade of the nearest reference node found during the
            Phylogenetic placement step.</p>
        <ul>
            <li>First, the nextclade sars-cov-2 reference files have to be downloaded.</li>
            <ul>
                <li>
                    Output of step</p>
                    <pre><kbd>path "ncref"</kbd></pre>
                </li>
                <li>
                    Command lines</p>
                    <pre><kbd>nextclade dataset get --name 'sars-cov-2' --output-dir 'ncref'</kbd></pre>
                </li>
            </ul>

            <li>Then, use these reference files along the fasta file to annotate the sample consensus.</li>
            <ul>
                <li>
                    Inputs of step</p>
                    <pre><kbd>path ncref</kbd>
<kbd>path seq</kbd></pre>
                </li>
                <li>
                    Output of step</p>
                    <pre><kbd>path "annotations.tsv"</kbd></pre>
                </li>
                <li>
                    Command lines</p>
                    <pre><kbd>nextclade run --in-order --input-dataset ${ncref} --output-tsv &#39;annotations.tsv&#39; ${seq}</kbd>
<kbd>sed -i &#39;s/\x0D\$//&#39; annotations.tsv</kbd></pre>
                </li>
            </ul>

        </ul>

    </ul>


    <ul>
        <li>
            <p>Detecting clade (<a href="https://cov-lineages.org/">Pangolin</a>)</p>
            <p>
                Pangolin will assign the most likely lineage out of all currently designated lineages.
                For more information see <a
                    href="https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/04-clade-assignment.html">here</a>.
            </p>
            <ul>
                <li>
                    Inputs of step</p>
                    <pre><kbd>file fa</kbd>
<kbd>path seq</kbd></pre>
                </li>
                <li>
                    Output of step</p>
                    <pre><kbd>file "*.csv"</kbd></pre>
                </li>
                <li>
                    Command lines</p>
                    <pre><kbd>PATH=/opt/conda/envs/pangolin/bin/:\$PATH</kbd>
<kbd>pangolin --usher 'sample_consensus.fa' -t 20 --outfile Pangolin_lineage_report.csv</kbd></pre>
                </li>
            </ul>

        </li>
    </ul>
    <h4 id="d-a-name-listoftoolsneeded-a-list-of-tools-needed">d. <a name='Listoftoolsneeded'></a>List of tools
        needed
    </h4>
    <p>Here are the list of tools you will need in the workflow with a corresponding container to use them:</p>
    <table class="table table-striped">
        <thead>
            <tr>
                <th>Tool</th>
                <th>Container</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td><a href="https://github.com/samtools/samtools">samtools</a></td>
                <td><code>evolbioinfo/samtools:v1.11</code></td>
            </tr>
            <tr>
                <td><a href="https://andersen-lab.github.io/ivar/html/manualpage.html">iVAR</a></td>
                <td><code>evolbioinfo/ivar:v1.3.1</code></td>
            </tr>
            <tr>
                <td><a href="https://clades.nextstrain.org/">Nextclade</a></td>
                <td><code>nextstrain/nextclade:3.13.3</code></td>
            </tr>
            <tr>
                <td><a href="https://cov-lineages.org/">Pangolin</a></td>
                <td><code>evolbioinfo/pangolin:v4.3.1-v1.33-v0.3.19-v0.1.12</code></td>
            </tr>
            <tr>
                <td><a href="https://github.com/lh3/bwa">bwa</a></td>
                <td><code>evolbioinfo/bwa:v0.7.17</code></td>
            </tr>
        </tbody>
    </table>


    <h4 id="e-a-name-analysisquestions-a-analysis-questions">e. <a name='AnalysisQuestions'></a>Analysis Questions
    </h4>
    <p>After running the workflow, we can analyse some of its resutls</p>
    <ul>
        <li>Using the <code>annotations.tsv</code>, determine what clade has been predicted for the samples?</li>
        <li>Using the <code>lineage_report.csv</code>, determine what lineage has been predicted for the samples?</li>
        <li>What can you deduce?</li>
        <li>What year do you estimate the sequence samples were obtained (using the clade and the <a
                href="https://clades.nextstrain.org/">Nextclade</a> website)?</li>
    </ul>
    <h4 id="f-a-name-correction-a-correction">f. <a name='Correction'></a>Correction</h4>
    <p>A link to the correction workflow will be added at the end of the tutoriel.</p>

    <h3 id="c-a-name-reproducibilityconsensus-a-reproducibility-consensus">C. <a
            name='Reproducibilityconsensus'></a>Reproducibility consensus</h3>

    <p>After implementing and running the workflow described above. Please send the results of the workflow (the <code>annotations.tsv</code> as well as the <code>lineage_report.csv</code> file) to <code>george.marchment[at]universite-paris-saclay.fr</code>.</p>

    <p>The goal of the reproducibility consensus is to establish the level to which the results from participants are similar and thus the analysis reproducible. Once all participants have submitted their results to the organisers, the consensus results will be shared, followed by a discussion on the level of reproducibility achieved among all participants.</p>

    <h2
        id="iv-a-name-intendedaudienceformatandspecialequipmentneeds-a-intended-audience-format-and-special-equipment-needs">
        IV. <a name='IntendedAudienceFormatandSpecialEquipmentNeeds'></a>Intended Audience, Format and Special
        Equipment
        Needs</h2>
    <ul>
        <li>
            <p>Intended Audience</p>
            <ul>
                <li>
                    <p>This introductory-level tutorial is targeted at scientists with an informatics
                        background
                        who
                        analyse data in their projects. Participants should have an intermediate level of proficiency in
                        Bash
                        (navigate a terminal, install software, manage dependencies). No prior bioinformatics or
                        biological knowledge
                        is needed.</p>
                </li>
            </ul>
        </li>
        <li>
            <p>Format</p>
            <ul>
                <li>
                    <p>The tutorial will follow a hybrid format, the instructors will be connected remotely.</p>
                </li>
            </ul>
        </li>
        <li>
            <p>Length</p>
            <ul>
                <li>
                    <p>The tutorial will last half a day (3 hours).</p>
                </li>
            </ul>
        </li>
        <li>
            <p>Special Equipment Needs</p>
            <ul>
                <li>
                    <p>Participants will need access to a terminal (Linux or Mac) and should have
                        Apptainer/Docker and Nextflow installed prior to the tutorial, for infromation to install the
                        different
                        software, see here.</p>
                    <ul>
                        <li><a href="https://www.nextflow.io/docs/latest/install.html">Install Nextflow</a></li>
                        <li><a href="https://apptainer.org/docs/admin/main/installation.html#">Install Apptainer</a>
                        </li>
                        <li><a href="https://docs.docker.com/get-started/get-docker/">Install Docker</a></li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
    <h2 id="v-a-name-authors-a-authors">V. <a name='Authors'></a>Authors</h2>
    <ul>
        <li><a href="https://orcid.org/0000-0002-4565-3940">George Marchment</a></li>
        <li><a href="https://orcid.org/0000-0002-7439-1441">Sarah Cohen-Boulakia</a></li>
        <li><a href="https://orcid.org/0000-0001-9576-4449">Frédéric Lemoine</a></li>
    </ul>
</body>

</html>