
Add feature/species specific params #42

Open
sgsutcliffe wants to merge 13 commits into dev from add-feature/species-specific-params

Conversation

@sgsutcliffe

Collaborator

Update starAMR parameters

Add parameters for starAMR module

Parameters used in starAMR are now modifiable in the config for the module.

        minimum_N50_value                    = 10000
        minimum_contig_length                = 300
        unacceptable_number_contigs          = 1000
        pid_threshold                        = 98
        percent_length_overlap_plasmidfinder = 60       
        no_exclude_genes                     = false
        exclude_negatives                    = false
        exclude_resistance_phenotypes        = false

Add species specific starAMR parameters

Parameters whose values are assigned from the species classification (this can be turned off with --skip_species_classification). The defaults are:

  • --genome_size_lower_bound : 4000000
  • --genome_size_upper_bound : 6000000
  • --percent_length_overlap_resfinder : 60
  • --percent_length_overlap_pointfinder : 95

The species with specific settings are:

Salmonella, Shigella, or Escherichia coli

  • Lower bound for our genome size for quality metrics = 4,000,000
  • Upper bound for our genome size for quality metrics = 6,700,000
  • Percent length overlap of BLAST hit for ResFinder Database = 52
  • PointFinder Database = Salmonella or PointFinder Database = E. coli

Campylobacter

  • Lower bound for our genome size for quality metrics = 1,250,000
  • Upper bound for our genome size for quality metrics = 2,500,000
  • Percent length overlap of BLAST hit for ResFinder Database = 52
  • Percent length overlap of BLAST hit for PointFinder Database = 58
  • PointFinder Database = Campylobacter
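
As a rough illustration of the lookup described above, the following Groovy sketch picks a settings map from the first word of the species string and falls back to the defaults otherwise. The values are the ones listed in this description; `resolveStaramrSettings` is a hypothetical helper rather than the pipeline's actual code, and the matching rule (lowercased first token of the species name) is an assumption.

```groovy
// Values copied from the PR description; the map layout is an assumption.
def defaultStaramr = [
    genome_size_lower_bound           : 4000000,
    genome_size_upper_bound           : 6000000,
    percent_length_overlap_resfinder  : 60,
    percent_length_overlap_pointfinder: 95,
]

// Salmonella, Shigella, and Escherichia share the same overrides.
def enteric = defaultStaramr + [
    genome_size_upper_bound         : 6700000,
    percent_length_overlap_resfinder: 52,
]

def genusSettings = [
    salmonella   : enteric,
    escherichia  : enteric,
    shigella     : enteric,
    campylobacter: [
        genome_size_lower_bound           : 1250000,
        genome_size_upper_bound           : 2500000,
        percent_length_overlap_resfinder  : 52,
        percent_length_overlap_pointfinder: 58,
    ],
]

// Hypothetical helper: take the genus as the lowercased first word of the
// species string and return its settings, or the defaults if it is not listed.
def resolveStaramrSettings = { String species ->
    def genus = (species ?: '').trim().toLowerCase().tokenize(' ')[0]
    genusSettings.getOrDefault(genus, defaultStaramr)
}

assert resolveStaramrSettings('Escherichia coli').genome_size_upper_bound == 6700000
assert resolveStaramrSettings('Listeria monocytogenes') == defaultStaramr
```

Under these assumptions, a species outside the four listed genera (or a missing species value) simply receives the default bounds and overlap thresholds.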

Comment thread tests/main.nf.test
Comment on lines +71 to +75
// Ecoli with wrong species in samplesheet
assert path("$outputDir/staramr/B2_results/B2_settings.staramr.txt").exists()
def ecoli_settings = new File("$outputDir/staramr/B2_results/B2_settings.staramr.txt")
def ecoli_cmd = ecoli_settings.readLines().get(0)
- assert ecoli_cmd == "command_line = /usr/local/bin/staramr search --pointfinder-organism escherichia_coli --minimum-contig-length 300 --genome-size-lower-bound 4000000 --genome-size-upper-bound 6000000 --minimum-N50-value 10000 --minimum-contig-length 300 --unacceptable-number-contigs 1000 --pid-threshold 98 --percent-length-overlap-plasmidfinder 60 --percent-length-overlap-resfinder 60 --percent-length-overlap-pointfinder 95 --nprocs 1 -o B2_results B2.fasta"
+ assert ecoli_cmd == "command_line = /usr/local/bin/staramr search --pointfinder-organism escherichia_coli --minimum-contig-length 300 --minimum-N50-value 10000 --minimum-contig-length 300 --unacceptable-number-contigs 1000 --pid-threshold 98 --percent-length-overlap-plasmidfinder 60 --genome-size-lower-bound 4000000 --genome-size-upper-bound 6700000 --percent-length-overlap-resfinder 52 --percent-length-overlap-pointfinder 95 --nprocs 1 -o B2_results B2.fasta"

Member

Can you explain what this is testing and how it works?

The sample sheet is:

sample,sample_name,contigs,species
GCA_000008105,A 1#,https://github.com/phac-nml/staramrnf/raw/dev/tests/genomes/salmonella/GCA_000008105.1_ASM810v1_genomic.fna.gz,Salmonella
GCA_000947975,B2,https://github.com/phac-nml/staramrnf/raw/dev/tests/genomes/ecoli/GCA_000947975.1_ASM94797v1_genomic.fna.gz,Escherichia coli
GCF_000196035,B2,https://github.com/phac-nml/staramrnf/raw/dev/tests/genomes/listeria/GCF_000196035.1_ASM19603v1_genomic.fna,Listeria monocytogenes
GCF_000196035_B,,https://github.com/phac-nml/staramrnf/raw/dev/tests/genomes/listeria/GCF_000196035.1_ASM19603v1_genomic.fna,Listeria monocytogenes

There are two B2 sample_name entries: one is E. coli (GCA_000947975) and the other is Listeria (GCF_000196035).

Is the test checking the Listeria one or the E. coli one? I don't really understand which one the B2_results is supposed to be corresponding to. If it corresponds to the E. coli one, why is E. coli the wrong species in the sample sheet? It looks like there's Escherichia coli in the sample sheet for GCA_000947975,B2. Is it not actually E. coli?

Collaborator Author

I made this change simply because the new feature changes the output, so I adjusted the expected command line in the test.

As for the test itself, I believe the original test was somewhat poorly planned (it was my first pipeline). It is a kind of all-in-one test that confirms both the sample renaming and the resulting outputs in a single full run.

Collaborator Author

Not sure why the last sample is not tested. Maybe it is a good time to fix things up.
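
One possible direction, offered only as a sketch and not something this PR does: since the flag order in the settings file changed with this feature, the assertion could compare parsed flag values instead of the whole string. `parseFlags` below is a hypothetical helper, and the command line is a shortened version of the one in the test.

```groovy
// Hypothetical helper: collect "--flag value" pairs from a staramr command line.
// A long flag followed by another flag (or by nothing) is recorded as true.
def parseFlags = { String commandLine ->
    def tokens = commandLine.tokenize(' ')
    def parsed = [:]
    tokens.eachWithIndex { token, i ->
        if (token.startsWith('--')) {
            def following = i + 1 < tokens.size() ? tokens[i + 1] : null
            parsed[token] = (following != null && !following.startsWith('-')) ? following : true
        }
    }
    parsed
}

// Shortened command line for illustration; the real one carries more flags.
def cmd = 'command_line = /usr/local/bin/staramr search --pointfinder-organism escherichia_coli ' +
          '--genome-size-upper-bound 6700000 --percent-length-overlap-resfinder 52 ' +
          '--nprocs 1 -o B2_results B2.fasta'

def flags = parseFlags(cmd)
assert flags['--genome-size-upper-bound'] == '6700000'
assert flags['--percent-length-overlap-resfinder'] == '52'
```

Values come back as strings, and short options such as `-o` are ignored, so this only covers the long flags that the species-specific settings touch.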

Comment thread nextflow.config
validationFailUnrecognisedParams = false
validationLenientMode = false
- validationSchemaIgnoreParams = 'genomes,igenomes_base'
+ validationSchemaIgnoreParams = 'genomes,igenomes_base,genus_list,default_staramr,salmonella,escherichia,shigella,campylobacter'

Member

What are the warnings or errors when not ignored?

Collaborator Author

When they are not ignored, a warning like the one below appears at the top of every run. Since they are more "settings" than "parameters", I decided to hide them. For example:

 N E X T F L O W   ~  version 24.10.6

Launching `main.nf` [amazing_engelbart] DSL2 - revision: 38dd4bef6a

WARN: The following invalid input values have been detected:

* --genus_list: [salmonella, campylobacter, escherichia, shigella]
* --default_staramr: [genome_size_lower_bound:4000000, genome_size_upper_bound:6000000, percent_length_overlap_resfinder:60, percent_length_overlap_pointfinder:95]
* --salmonella: [genome_size_lower_bound:4000000, genome_size_upper_bound:6700000, percent_length_overlap_resfinder:52, percent_length_overlap_pointfinder:95]
* --escherichia: [genome_size_lower_bound:4000000, genome_size_upper_bound:6700000, percent_length_overlap_resfinder:52, percent_length_overlap_pointfinder:95]
* --shigella: [genome_size_lower_bound:4000000, genome_size_upper_bound:6700000, percent_length_overlap_resfinder:52, percent_length_overlap_pointfinder:95]
* --campylobacter: [genome_size_lower_bound:1250000, genome_size_upper_bound:2500000, percent_length_overlap_resfinder:52, percent_length_overlap_pointfinder:58]
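
For context, the warning is consistent with these settings being declared as list- and map-valued entries under `params` without matching schema entries. A sketch of what such a declaration might look like, reconstructed from the warning output above (the exact layout in the pipeline's config is an assumption, and only two of the per-genus maps are shown):

```groovy
params {
    // Reconstructed from the WARN output; the real config layout is assumed.
    genus_list = ['salmonella', 'campylobacter', 'escherichia', 'shigella']

    default_staramr = [
        genome_size_lower_bound           : 4000000,
        genome_size_upper_bound           : 6000000,
        percent_length_overlap_resfinder  : 60,
        percent_length_overlap_pointfinder: 95,
    ]

    campylobacter = [
        genome_size_lower_bound           : 1250000,
        genome_size_upper_bound           : 2500000,
        percent_length_overlap_resfinder  : 52,
        percent_length_overlap_pointfinder: 58,
    ]
}
```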


Comment thread README.md
Comment on lines +175 to +180
### Genus specific settings

They are used when the `species` column has any of the following genus selected:
```
genus_list = ['salmonella', 'campylobacter', 'escherichia', 'shigella']
```

Member

### Genus-specific settings

Genus-specific settings are used when the `species` column of the sample sheet contains any of the following genera:

Comment thread README.md Outdated
Comment on lines +193 to +195



Member

Possibly too much whitespace

Comment thread README.md Outdated
- Upper bound for our genome size for quality metrics= 2,500,000
- Percent length overlap of BLAST hit for ResFinder Database = 52 
- Percent length overlap ofBLAST hit for PointFinder Database = 58
- Point Finder Database = Campylobacter

Member

PointFinder (no space)

Comment thread README.md Outdated
- Lower bound for our genome size for quality metrics= 1,250,000
- Upper bound for our genome size for quality metrics= 2,500,000
- Percent length overlap of BLAST hit for ResFinder Database = 52 
- Percent length overlap ofBLAST hit for PointFinder Database = 58

Member

of BLAST (space)

Comment thread README.md Outdated
Comment on lines +183 to +189
- Lower bound for our genome size for quality metrics = 4,000,000
- Upper bound for our genome size for quality metrics= 6,700,000 
- Percent length overlap of BLAST hit for ResFinder Database = 52
- PointFinder Database = Salmonella or PointFinder Database = E.coli
#### Campylobacter
- Lower bound for our genome size for quality metrics= 1,250,000
- Upper bound for our genome size for quality metrics= 2,500,000

Member

Some = have space before, others do not. Recommend making consistent.
