
Various Fixes for Missing and Problematic Data #56

Merged: sgsutcliffe merged 18 commits into dev from fix/missing_data on May 21, 2026
Conversation

@emarinier emarinier commented May 15, 2026

Member

Changes to scheduled pipeline output:

1. Acceptance criteria

  1. The "genomic_address_name" and "national_outbreak_code" fields and all other passed metadata fields should be treated as strings.
  2. The "fastmatch_top_samples" and "fastmatch_top_genomic_address" metadata fields should include the top samples metadata values according to the following logic:
    1. The "--fastmatch_top_samples_threshold" parameter controls the maximum number of samples to consider for summarizing (the default is 5, which includes the sample matched to itself).
    2. Duplicate values should be removed from the "genomic_address_name" and sample puid values before writing to the respective metadata fields.
    3. Empty values for "genomic_address_name" or sample puid (should never happen) should be ignored when writing the comma-separated list of values to the top metadata fields.
    4. If all metadata values are empty (e.g., all "genomic_address_name" values are empty), then an empty string "" should be written to the respective metadata value (which means any previous values in this metadata field will be deleted from IRIDA Next).
    5. See example below
  3. The "fastmatch_code_match" metadata field should contain a list of all matched "national_outbreak_code" values for the query sample (not just top X).
    1. Duplicate values should be removed from the list of matches prior to writing the list to "fastmatch_code_match".
    2. Empty values for "national_outbreak_code" should be ignored prior to writing the list to "fastmatch_code_match".
    3. If all metadata values are empty (e.g., all "national_outbreak_code" values are empty), then an empty string "" should be written to the respective metadata value (which means any previous values in this metadata field will be deleted from IRIDA Next).
    4. See example below
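
Item 1 of the criteria can be illustrated with a small sketch of why metadata fields are kept as strings. The CSV layout, the sample values "1.10" and "0012", and the helper name `read_metadata` are all hypothetical and chosen only for illustration; the pipeline's actual input handling may differ.

```python
import csv
import io

# Hypothetical metadata table; only the two field names come from the criteria.
RAW = (
    "sample,genomic_address_name,national_outbreak_code\n"
    "A,1.10,0012\n"
    "B,1.1,\n"
)


def read_metadata(text):
    """Read a CSV table, leaving every field as the string that was written."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_metadata(RAW)

# As strings, "1.10" and "1.1" stay distinct and "0012" keeps its leading zeros.
addresses = [row["genomic_address_name"] for row in rows]
codes = [row["national_outbreak_code"] for row in rows]

# Numeric coercion would collapse the two addresses into the same value.
coerced = [float(address) for address in addresses]
```

In this sketch a missing value is simply the empty string, which is the same representation the criteria use for "no value".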

1.1. Example matches and summarized results

Matches between a query sample A and reference samples with different cases of metadata values

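| Query | Reference | Case 1: All Different, genomic_address_name | Case 1: All Different, national_outbreak_code | Case 2: All Same, genomic_address_name | Case 2: All Same, national_outbreak_code | Case 3: All Missing, genomic_address_name | Case 3: All Missing, national_outbreak_code | Case 4: Some Missing, Some Different, genomic_address_name | Case 4: Some Missing, Some Different, national_outbreak_code |
|---|---|---|---|---|---|---|---|---|---|
| A | A | 1.1 | a | 1.1 | a | | | | |
| A | B | 1.2 | b | 1.1 | a | | | | |
| A | C | 1.3 | c | 1.1 | a | | | | |
| A | D | 1.4 | d | 1.1 | a | | | | |
| A | E | 1.5 | e | 1.1 | a | | | | |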

Expected metadata output written back to query sample A for different cases

| Cases | Sample | fastmatch_top_samples | fastmatch_top_genomic_address | fastmatch_code_match |
|---|---|---|---|---|
| Case 1: All Different | A | A,B,C,D,E | 1.1,1.2,1.3,1.4,1.5 | a,b,c,d,e |
| Case 2: All Same | A | A,B,C,D,E | 1.1 | a |
| Case 3: All Missing | A | A,B,C,D,E | | |
| Case 4: Some Missing, Some Different | A | A,B,C,D,E | 1.1,1.2 | a,b |
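
The summarizing rules in the acceptance criteria can be sketched in a few lines of Python. This is a minimal illustration and not the code in `bin/process_output.py`: the function names, the dict keys `reference`, `genomic_address_name` and `national_outbreak_code`, and the assumption that matches arrive ordered closest first are all hypothetical.

```python
def unique_non_empty(values):
    """Drop empty values and duplicates, keeping first-seen order."""
    seen = []
    for value in values:
        text = "" if value is None else str(value).strip()
        if text and text not in seen:
            seen.append(text)
    return seen


def summarize_matches(matches, top_samples_threshold=5):
    """Summarize the matches of one query sample.

    `matches` is assumed to be a list of dicts ordered closest first,
    with the query's match to itself included.
    """
    top = matches[:top_samples_threshold]
    return {
        # Top-N fields only look at the first `top_samples_threshold` matches.
        "fastmatch_top_samples": ",".join(
            unique_non_empty(m["reference"] for m in top)
        ),
        "fastmatch_top_genomic_address": ",".join(
            unique_non_empty(m["genomic_address_name"] for m in top)
        ),
        # The code match field considers every match, not just the top N.
        "fastmatch_code_match": ",".join(
            unique_non_empty(m["national_outbreak_code"] for m in matches)
        ),
    }


def build_case(addresses, codes):
    """Build matches of query A against references A to E."""
    return [
        {"reference": ref, "genomic_address_name": addr, "national_outbreak_code": code}
        for ref, addr, code in zip("ABCDE", addresses, codes)
    ]


case_1 = summarize_matches(build_case(["1.1", "1.2", "1.3", "1.4", "1.5"], list("abcde")))
case_2 = summarize_matches(build_case(["1.1"] * 5, ["a"] * 5))
case_3 = summarize_matches(build_case([""] * 5, [""] * 5))
```

Joining an empty list yields `""`, so the "all missing" case writes an empty string without a special branch, which matches the behaviour the criteria describe for clearing previous values.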

 


@apetkau apetkau left a comment

Member


This is great work, Eric. Thanks so much 😄. I'm having Steven continue to work on this.

I've made a number of comments on this PR below.

Comment thread conf/modules.config
  def scheduled_argument = (params.output_type == "scheduled") ? "--scheduled" : ""
  def date_argument = (params.output_type == "scheduled") ? "--date_string ${date_prefix}": ""
- def top_samples_threshold_argument = (params.output_type == "scheduled" && params.fastmatch_top_samples_threshold) ? "--top_samples_threshold ${params.fastmatch_top_samples_threshold}": ""
+ def top_samples_threshold_argument = (params.output_type == "scheduled" && !(params.fastmatch_top_samples_threshold == null)) ? "--top_samples_threshold ${params.fastmatch_top_samples_threshold}": ""


Small suggestion, but why not change this to params.fastmatch_top_samples_threshold != null instead?
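
The two versions of `top_samples_threshold_argument` in the thread above differ in how they treat a threshold of 0. The thread does not say why the condition was changed; one plausible reason is that a truthiness check silently drops a value of 0, while an explicit null check passes it through. The Python sketch below mirrors the two conditions, with `None` standing in for Groovy's `null`; the function names are hypothetical.

```python
def threshold_argument_truthy(output_type, threshold):
    """Mirror of the earlier condition: emit the flag only for truthy thresholds."""
    if output_type == "scheduled" and threshold:
        return f"--top_samples_threshold {threshold}"
    return ""


def threshold_argument_not_none(output_type, threshold):
    """Mirror of the later condition: emit the flag for any threshold that is set."""
    if output_type == "scheduled" and threshold is not None:
        return f"--top_samples_threshold {threshold}"
    return ""


# A threshold of 0 is falsy, so only the explicit None check forwards it.
dropped = threshold_argument_truthy("scheduled", 0)
forwarded = threshold_argument_not_none("scheduled", 0)
```

Both forms agree for a non-zero threshold and for an unset one; they only diverge at 0.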

@sgsutcliffe

Collaborator

Replaced empty matches with "" rather than "NULL" (commit f8ddbbb)

@sgsutcliffe sgsutcliffe self-assigned this May 21, 2026

@apetkau apetkau left a comment

Member


This looks perfect. Thanks so much @emarinier and @sgsutcliffe 😄

@sgsutcliffe sgsutcliffe merged commit 60088d3 into dev May 21, 2026
16 checks passed
@sgsutcliffe sgsutcliffe deleted the fix/missing_data branch May 21, 2026 23:33
@sgsutcliffe sgsutcliffe mentioned this pull request May 21, 2026