Various Fixes for Missing and Problematic Data by emarinier · Pull Request #56 · phac-nml/fastmatchirida

emarinier · 2026-05-15T18:45:45Z

Changes to scheduled pipeline output:

1. Acceptance criteria

The "genomic_address_name" and "national_outbreak_code" fields and all other passed metadata fields should be treated as strings.
The "fastmatch_top_samples" and "fastmatch_top_genomic_address" metadata fields should summarize the sample puid and "genomic_address_name" values of the top-matching samples according to the following logic:
1. The "--fastmatch_top_samples_threshold" parameter controls the maximum number of samples to consider when summarizing (default is 5; this count includes the query sample's match to itself).
2. Duplicate values should be removed from the "genomic_address_name" and sample puid values before writing to the respective metadata fields.
3. Empty values for "genomic_address_name" or sample puid (should never happen) should be ignored when writing the comma-separated list of values to the top metadata fields.
4. If all metadata values are empty (e.g., all "genomic_address_name" values are empty), then an empty string "" should be written to the respective metadata value (which means any previous values in this metadata field will be deleted from IRIDA Next).
5. See example below
The "fastmatch_code_match" metadata field should contain a list of all matched "national_outbreak_code" values for the query sample (not just top X).
1. Duplicate values should be removed from the list of matches prior to writing the list to "fastmatch_code_match".
2. Empty values for "national_outbreak_code" should be ignored prior to writing the list to "fastmatch_code_match".
3. If all metadata values are empty (e.g., all "national_outbreak_code" values are empty), then an empty string "" should be written to the respective metadata value (which means any previous values in this metadata field will be deleted from IRIDA Next).
4. See example below
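
The acceptance criteria above can be sketched in Python as follows. This is a minimal illustration and not the pipeline's actual code: the function names, the dictionary keys "reference", "genomic_address_name" and "national_outbreak_code" as row fields, the whitespace stripping, and the choice to keep values in first-seen order are all assumptions. It also assumes the matches for a query are already sorted best-first, with the self-match leading.

```python
from typing import Iterable, List, Optional


def summarize(values: Iterable[Optional[str]]) -> str:
    """Join values with commas, skipping empties and duplicates (first-seen order)."""
    seen = set()
    kept: List[str] = []
    for value in values:
        # Treat every value as a string; None and blank strings count as empty.
        text = "" if value is None else str(value).strip()
        if text == "" or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    # If every value was empty, this returns "" (which clears the field).
    return ",".join(kept)


def summarize_query(matches: List[dict], top_threshold: int = 5) -> dict:
    """Summarize one query sample's matches (hypothetical row layout)."""
    top = matches[:top_threshold]  # a threshold of 0 yields no top samples
    return {
        "fastmatch_top_samples": summarize(m.get("reference") for m in top),
        "fastmatch_top_genomic_address": summarize(
            m.get("genomic_address_name") for m in top
        ),
        # Outbreak codes are collected from all matches, not just the top ones.
        "fastmatch_code_match": summarize(
            m.get("national_outbreak_code") for m in matches
        ),
    }


# Case 2 ("All Same"): five references sharing one address and one code.
case2 = [
    {"reference": ref, "genomic_address_name": "1.1", "national_outbreak_code": "a"}
    for ref in ["A", "B", "C", "D", "E"]
]
print(summarize_query(case2))
```

Under these assumptions, Case 2 collapses to a single address and a single code, while the sample list keeps all five references.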

1.1. Example matches and summarized results

Matches between a query sample A and reference samples with different cases of metadata values

		Case 1: All Different		Case 2: All Same		Case 3: All Missing		Case 4: Some Missing, Some Different	
Query	Reference	genomic_address_name	national_outbreak_code	genomic_address_name	national_outbreak_code	genomic_address_name	national_outbreak_code	genomic_address_name	national_outbreak_code
A	A	1.1	a	1.1	a
A	B	1.2	b	1.1	a
A	C	1.3	c	1.1	a
A	D	1.4	d	1.1	a
A	E	1.5	e	1.1	a

Expected metadata output written back to query sample A for different cases

Cases	Sample	fastmatch_top_samples	fastmatch_top_genomic_address	fastmatch_code_match
Case 1: All Different	A	A,B,C,D,E	1.1,1.2,1.3,1.4,1.5	a,b,c,d,e
Case 2: All Same	A	A,B,C,D,E	1.1	a
Case 3: All Missing	A	A,B,C,D,E
Case 4: Some Missing, Some Different	A	A,B,C,D,E	1.1,1.2	a,b
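
The requirement to treat metadata fields as strings matters because values such as the "genomic_address_name" entries in these tables look numeric. The short Python sketch below shows the hazard; the values "1.10" and "01.1" are hypothetical and chosen only to illustrate it.

```python
# Hypothetical address values that are distinct as strings.
raw_addresses = ["1.1", "1.10", "01.1"]

# Parsed as floats, all three collapse to the same number,
# so de-duplication would wrongly merge them.
unique_as_floats = set(float(value) for value in raw_addresses)

# Kept as strings, the three values stay distinct (first-seen order).
unique_as_strings = list(dict.fromkeys(raw_addresses))

print(len(unique_as_floats), len(unique_as_strings))
```

Keeping the values as strings from the moment they are read avoids both the false merge and any reformatting of the value that is written back.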


apetkau

This is great work Eric. Thanks so much 😄. I'm having Steven continue to work on this.

I've made a number of comments on this PR below.

apetkau · 2026-05-19T17:46:21Z

            def scheduled_argument = (params.output_type == "scheduled") ? "--scheduled" : ""
            def date_argument = (params.output_type == "scheduled") ? "--date_string ${date_prefix}": ""
-            def top_samples_threshold_argument = (params.output_type == "scheduled" && params.fastmatch_top_samples_threshold) ? "--top_samples_threshold ${params.fastmatch_top_samples_threshold}": ""
+            def top_samples_threshold_argument = (params.output_type == "scheduled" && !(params.fastmatch_top_samples_threshold == null)) ? "--top_samples_threshold ${params.fastmatch_top_samples_threshold}": ""


Small suggestion: why not write this as params.fastmatch_top_samples_threshold != null instead? It is equivalent to !(... == null) and easier to read.
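
The reason for moving from a truthiness check to a null check is that 0 is falsy in Groovy, so a threshold of 0 would never be passed on. The same pitfall exists in Python, and the sketch below is a Python analogue rather than the pipeline's Groovy code; both function names are hypothetical.

```python
from typing import Optional


def threshold_argument_truthy(threshold: Optional[int]) -> str:
    # Truthiness check: 0 is falsy, so an explicit 0 is silently dropped.
    return f"--top_samples_threshold {threshold}" if threshold else ""


def threshold_argument_explicit(threshold: Optional[int]) -> str:
    # Explicit null check: only an unset value suppresses the flag.
    return f"--top_samples_threshold {threshold}" if threshold is not None else ""


print(repr(threshold_argument_truthy(0)))
print(repr(threshold_argument_explicit(0)))
```

With the explicit check, a threshold of 0 still produces the flag, which matches the commits that allow the top samples parameter to be 0.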

sgsutcliffe · 2026-05-20T19:23:57Z

Replaced empty matches with "" rather than "NULL" f8ddbbb

apetkau

This looks perfect. Thanks so much @emarinier and @sgsutcliffe 😄

emarinier added 13 commits May 13, 2026 12:06

Handling missing genomic address names.

d3c65e5

More tests.

a6cafd7

Adding test data.

7832627

Better handling of types.

ab9bca0

Handling duplicates, all missing.

229d4c1

More testing.

67a4414

Forcing self-hit to always report first.

65e109c

Adding complicated test.

cd14789

Adding missing newline

fa2c922

No matches within either threshold. Allowing top samples parameter to…

ad0dca4

… be 0.

Testing for 0 top samples, 0 prefix, falsy evaluations.

617abf7

Proportional distances.

b0da2ed

Forcing float type in processed output threshold column.

f8ddbbb