idekerlab/codonprime-toolkit

CodonPrime-toolkit

License

The CodonPrime-toolkit code is distributed under the PolyForm-Strict license. Model weights for CodonPrime models are distributed under the FAIR license.

Both licenses may be found in LICENSE.md.

Dependencies

Optional

System requirements

  • Linux operating system, including Windows Subsystem for Linux.

  • Testing has been conducted on Ubuntu 24.04 LTS and Ubuntu 25.04.

  • CodonPrime requires Python 3.10 to maintain compatibility with dependencies.

    And the Lord spake, saying, "First shalt thou download the repository. Then shalt thou use Python three point ten, no more, no less. Three point ten shall be the version thou shalt use, and the number of the version shall be three point ten. Three point eleven shalt thou not count, neither count thou three point nine, excepting that thou then proceed to three point ten. Three point twelve is right out."

Installation

If your system version of Python is not 3.10, first create and activate a virtual environment in which the Python version is 3.10.
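Before installing, it can help to confirm that the active interpreter really is 3.10. The following is a minimal sketch, not part of the toolkit; the function name `is_supported` is hypothetical.

```python
import sys

REQUIRED = (3, 10)  # the version this README asks for


def is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's major.minor matches `required` exactly."""
    return tuple(version_info[:2]) == required


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported():
        print(f"Python {found}: OK")
    else:
        print(f"Python {found}: please activate a Python 3.10 environment")
```

Run it with the interpreter of the environment you intend to install into.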

git clone https://github.com/idekerlab/codonprime
cd codonprime
pip install -r requirements_dev.txt
make install

The above should take no more than a few minutes. After this point, you can run CodonPrime for single amino-acid variants only, since reference sequence retrieval will rely on real-time sequence fetching (see below).

For tabular/batch input, local sequence retrieval must be set up. This requires parts of the NCBI BLAST+ toolkit, for which precompiled binaries are available at the NCBI FTP site.

These must be in your $PATH for proper execution.
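A quick way to confirm that the binaries are visible on $PATH is sketched below. Only `blastdbcmd` is named in this README; `makeblastdb` is an assumption about what the setup script uses to build the database, and the function name `missing_tools` is hypothetical.

```python
import shutil

# blastdbcmd is named in this README; makeblastdb is an assumption about
# which BLAST+ program the data setup step relies on.
TOOLS = ("blastdbcmd", "makeblastdb")


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the tools that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not found on $PATH:", ", ".join(absent))
    else:
        print("All BLAST+ tools found.")
```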

Then, execute the following:

cd data
bash setup_data.sh
python package_gff.py
cd ..

The above code will download the current NCBI human genome build and annotations, and build a matching BLAST database. It will also package relevant annotations in a lightweight fashion for fast access. The whole process should not take more than a few minutes.

Usage

Basic usage is:

codonprime-cli [-h] {command} [options]

Where command is one of direct, maf, tabular, clinvar, cosmic, and imports. The imports command is for utility/debugging purposes only.

Command    Modes                    Number of variants
direct     Either (by --blastdb)    1
maf        Local                    many
tabular    Local                    many
clinvar    Remote                   1
cosmic     Remote                   1

Most options are common to all commands, and will be covered later.

MAF

codonprime-cli maf maf_file_name [options]
maf_file_name
MAF file name for batch mutation input. The MAF file format is specified at the following URL: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/. If you are generating only amino-acid variants found in tumors and/or patients, this input format is almost certainly the most straightforward to use. Note that not all platforms hosting MAF files use all of these columns, and some include their own. The columns required here are 'Hugo_Symbol', 'Transcript_ID', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', and 'HGVSp', which should be present in almost any genuine MAF output. This command supports the use of Ensembl transcript identifiers.

This mode has the following unique options:

--use_xscript_for_reference
 When creating short names for amino-acid variant references, use the Transcript_ID (e.g. NM_1234567) rather than the Hugo_Symbol. Relevant in cases where non-reference transcript isoforms are used.
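Since MAF files from different platforms vary, it may be worth checking for the six required columns before starting a batch run. The sketch below is not part of the toolkit; it assumes a tab-separated file whose header row may be preceded by comment lines starting with '#', and the function name `missing_maf_columns` is hypothetical.

```python
import csv
import io

# Column names as listed in the maf_file_name description above.
REQUIRED_COLUMNS = (
    "Hugo_Symbol", "Transcript_ID", "Reference_Allele",
    "Tumor_Seq_Allele1", "Tumor_Seq_Allele2", "HGVSp",
)


def missing_maf_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the MAF header row.

    Assumption: leading lines that start with '#' are comments and are skipped.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader, [])
    return [name for name in required if name not in header]


# Usage with an in-memory example that lacks Tumor_Seq_Allele2.
example = io.StringIO(
    "#version example\n"
    "Hugo_Symbol\tTranscript_ID\tReference_Allele\tTumor_Seq_Allele1\tHGVSp\n"
)
print(missing_maf_columns(example))
```

For a file on disk, pass an open text handle, for example `open(path, newline="")`.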

Direct

codonprime-cli direct [options]

Specify a variant by target gene, transcript, or protein, plus the protein mutation site and the amino acid to install. This command has the following unique arguments:

--identifier IDENTIFIER, -i IDENTIFIER
 Identifier of target gene or protein. Accepted identifiers are GeneID (integer), HUGO Gene Symbol, Protein accession (e.g. NP_), RNA accession (e.g. NM_). Gene-level identifiers will be mapped to MANE transcript/protein isoforms.
--site SITE, -s SITE
 integer amino-acid residue to be mutated.
--installed INSTALLED, -n INSTALLED
 single-letter amino acid symbol to be installed at the site.
--nt_original NT_ORIGINAL
 original nucleotide sequence spanning the intended substitution.
--nt_installed NT_INSTALLED
 substituted nucleotide sequence spanning the intended substitution. The above two options must both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original if it is provided in error; in the future, it will print a warning if the arguments are provided but no 'Strict' designs are detected.
--reference REFERENCE
 AA mutation name, intended to be human-readable. Procedurally generated if not supplied by this argument.
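When scripting several single-variant runs, the documented options can be assembled into an argument list. This is a sketch under assumptions: the helper `direct_command` is hypothetical, it uses only the flags listed above plus --blastdb from the common options, and the KRAS example values are illustrative.

```python
import shlex


def direct_command(identifier, site, installed, reference=None, blastdb=None):
    """Assemble an argument list for `codonprime-cli direct` from documented flags."""
    if len(installed) != 1:
        raise ValueError("installed must be a single-letter amino acid symbol")
    argv = [
        "codonprime-cli", "direct",
        "--identifier", str(identifier),
        "--site", str(int(site)),
        "--installed", installed,
    ]
    if reference is not None:
        argv += ["--reference", reference]
    if blastdb is not None:
        argv += ["--blastdb", blastdb]
    return argv


argv = direct_command("KRAS", 12, "D", blastdb="remote")
print(shlex.join(argv))
# prints: codonprime-cli direct --identifier KRAS --site 12 --installed D --blastdb remote
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the toolkit is installed.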

Tabular

codonprime-cli tabular table_file_name [options]
table_file_name
File name for tabular mutation input. The file must be tab-separated text, with case-insensitive column headers on the first row.

Column headers MUST include:

  • site : integer amino-acid residue to be mutated.

  • installed : single-letter amino acid symbol to be installed at the site (* = terminator).

  • identifier : acceptable identifiers (see "--identifier" in the "Direct" section). Whichever form of identifier is chosen, it must be consistent throughout the column.

Column headers MAY include:

  • reference : reference string for the amino acid mutation in the pegRNA data table (for user convenience).

  • nt_original : original nucleotide sequence spanning the intended substitution.

  • nt_installed : substituted nucleotide sequence spanning the intended substitution.

The nt_original and nt_installed headers must both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original if it is provided in error; in the future, it will print a warning if the columns are provided but no 'Strict' designs are detected.
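As an illustration of the layout described above, the sketch below writes a small tab-separated table with the three required columns plus the optional reference column. The gene symbols, sites, and reference strings are illustrative only, and the helper name `write_table` is hypothetical.

```python
import csv
import io

FIELDS = ["identifier", "site", "installed", "reference"]


def write_table(rows, handle):
    """Write rows as tab-separated text with a header on the first row."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# All identifiers use the same form (HUGO gene symbols), as required.
rows = [
    {"identifier": "KRAS", "site": 12, "installed": "D", "reference": "KRAS_G12D"},
    {"identifier": "TP53", "site": 175, "installed": "H", "reference": "TP53_R175H"},
]
buffer = io.StringIO()
write_table(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(path, "w", newline="")` and pass that handle.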

ClinVar

codonprime-cli clinvar clinvar_identifier [options]
clinvar_identifier
ClinVar identifier: either a long identifier (e.g. NM_000410.4(HFE):c.845G>A (p.Cys282Tyr)), a Variation ID (e.g. 9), or an accession (e.g. VCV000000009.137).
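The three accepted forms can be told apart by shape. The sketch below is an assumption about how one might pre-check an identifier before calling the tool; it is not the toolkit's own parser, and the patterns are inferred only from the examples above.

```python
import re


def clinvar_id_kind(text):
    """Guess which of the three documented ClinVar identifier forms `text` is."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        return "variation_id"
    if re.fullmatch(r"VCV\d+(\.\d+)?", text):
        return "accession"
    # Long identifiers start with a transcript accession followed by "(GENE)".
    if re.match(r"[A-Z]{2}_\d+(\.\d+)?\(", text):
        return "long_identifier"
    return "unknown"


for example in ("NM_000410.4(HFE):c.845G>A (p.Cys282Tyr)", "9", "VCV000000009.137"):
    print(example, "->", clinvar_id_kind(example))
```

Remember to quote the long form on the command line, since it contains parentheses, spaces, and '>'.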

COSMIC

codonprime-cli cosmic cosmic_accession [options]
cosmic_accession
COSMIC accession (e.g. COSV55497419)

Common options

-h, --help show help message and exit
--blastdb BLASTDB
 File path accessed by blastdbcmd, used for batch input. Providing "remote" will use remote fetchers (generally slower). Some options may require or disallow the use of remote. The default is the data directory.
--output OUTPUT
 Directory containing output; default ./output
--nicking Additionally design nicking guides for the edit (PE3), with DeepSpCas9 prediction. This is barely supported and not advised.
--processes PROCESSES, --proc PROCESSES
 Number of simultaneous processes to run. The default is the machine's core count minus 1.
--n_summary N_SUMMARY
 Number of guides to be included in the summary file for each amino-acid variant (default 10). If 0, all guides will be included. If negative, the summary file will not be generated.
--pre_triage_level PRE_TRIAGE_LEVEL
 To save computation time, candidate pegRNA designs that are highly unlikely to score well are triaged before evaluation by the transformer model. Higher values will discard more guides. At the default value, discarded guides have a 2.5% chance of showing an actual editing efficiency greater than 5%. Lower values of this parameter will result in longer computation times. Negative values (but not 0) remove the pre-triage step entirely, but any time saved by skipping the step is almost certainly offset by the larger number of efficiencies that must then be predicted.
--run_spliceai

Run spliceAI (Jaganathan et al., Cell 2019) on pegRNA designs. Two additional columns will be added to the output tables:

spliceai_deltascore : maximum delta score (see ref.) for this pegRNA. The spliceAI authors suggest scores ≥0.5 as "high-confidence", but their own data suggest that even such scores often correspond to low penetrance and validation rates; the cutoff used in Zhao et al. is 0.8.

splice_summary : The nature of the splice site change (donor/acceptor, gain/loss) associated with the maximum deltascore.
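Once the spliceai_deltascore column is present, designs above a chosen cutoff can be flagged. The sketch below assumes the output table has already been read into a list of dictionaries (for example with `csv.DictReader`); the output file name and delimiter are not specified here, and the function name `high_splice_risk` is hypothetical.

```python
def high_splice_risk(rows, cutoff=0.5):
    """Return the rows whose spliceai_deltascore is at or above `cutoff`.

    Rows with a missing or non-numeric score are skipped.
    """
    flagged = []
    for row in rows:
        raw = row.get("spliceai_deltascore", "")
        try:
            score = float(raw)
        except (TypeError, ValueError):
            continue
        if score >= cutoff:
            flagged.append(row)
    return flagged


# Illustrative rows only; the splice_summary values are made up for the example.
rows = [
    {"spliceai_deltascore": "0.62", "splice_summary": "donor gain"},
    {"spliceai_deltascore": "0.10", "splice_summary": "acceptor loss"},
    {"spliceai_deltascore": "", "splice_summary": ""},
]
print(len(high_splice_risk(rows)))              # cutoff 0.5
print(len(high_splice_risk(rows, cutoff=0.8)))  # stricter cutoff
```

The two cutoffs shown, 0.5 and 0.8, are the ones mentioned in the column description above.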

--check_offtargets
 

Run DeepPrime-Off (Yu et al., Cell 2023) on pegRNA designs. DeepPrime-Off requires cas-offtarget (redistributed with this GitHub repo), and cas-offtarget in turn requires an OpenCL-enabled device. This should encompass most Linux setups with a graphics card, but one notable exception is the Windows Subsystem for Linux (WSL). You should not expect this option to work without OpenCL bindings.

The following columns will be added to the full-length output table, with values referring to the single most likely off-target site:

DPO: Location : chromosome containing site

DPO: Position : chromosomal coordinates of site

DPO: Strand : strand (+/-)

DPO: MM_num : number of mismatches at off-target site

Additionally, this will create the file offtarget_predictions_full.csv.gz in your chosen output directory.
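Two of the rules in the common options above, the --processes default and the three cases of --n_summary, can be written out as a short sketch. This only illustrates the documented behaviour and is not the toolkit's code. The floor of one process and the assumption that guides arrive ranked best-first are mine, and both function names are hypothetical.

```python
import os


def default_processes(core_count=None):
    """Core count minus 1, as documented; never below 1 (an assumption)."""
    if core_count is None:
        core_count = os.cpu_count() or 1
    return max(1, core_count - 1)


def summary_rows(ranked_guides, n_summary=10):
    """Apply the --n_summary rule to guides that are already ranked best-first.

    Positive: keep the first n_summary guides. Zero: keep all guides.
    Negative: return None, meaning no summary file is written.
    """
    if n_summary < 0:
        return None
    if n_summary == 0:
        return list(ranked_guides)
    return list(ranked_guides[:n_summary])


print(default_processes(8))               # 7
print(summary_rows(list("abcde"), 2))     # ['a', 'b']
print(summary_rows(list("abc"), 0))       # ['a', 'b', 'c']
print(summary_rows(list("abc"), -1))      # None
```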

Advanced

These arguments change parameters that are likely best left alone by the typical user. Exceptions include users working with non-human organisms or who wish to use alternative models for pegRNA efficiency prediction.

--data DATA File path for data. It should contain everything generated by running cd data && bash setup_data.sh && python package_gff.py.
--chunksize CHUNKSIZE
 Frame chunking size (by number of contexts). For debugging/optimization purposes.
--model_root_dir MODEL_ROOT_DIR
 Directory where trained models in run#/ folders exist.
--debug Set debug mode for logging.
--prevent_secret_parallelism
 For debugging purposes, prevent built-in parallelism from underlying modules such as numpy and pytorch.
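For context, one common way to limit implicit threading in numerical libraries is to set their thread-count environment variables before the libraries are imported. The sketch below shows that general technique only; whether --prevent_secret_parallelism works this way is an assumption, and the function name `limit_library_threads` is hypothetical.

```python
import os

# Widely used thread-count variables for OpenMP, Intel MKL, and OpenBLAS.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_library_threads(env=os.environ, n_threads=1):
    """Set each thread-count variable in `env` and return the values set.

    Call this before importing numpy or torch, since the libraries read
    these variables when they are first loaded.
    """
    for name in THREAD_VARS:
        env[name] = str(n_threads)
    return {name: env[name] for name in THREAD_VARS}


# Demonstrated on a plain dict so the real environment is left untouched.
print(limit_library_threads(env={}))
```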
