The CodonPrime-toolkit code is distributed under the PolyForm-Strict license. Model weights for CodonPrime models are distributed under the FAIR license.
Both licenses may be found in LICENSE.md.
- Python 3.10
- requests
- biopython
- pandas
- numpy
- swalign
- lxml
- pyyaml
- genet
- viennaRNA
- scipy
- prettytable
- optuna
- tqdm
- torch
- scikit-learn
- matplotlib
- Levenshtein
- NCBI Blast
Linux operating system, including Windows Subsystem for Linux.
Testing has been conducted on Ubuntu 24.04 LTS and Ubuntu 25.04.
CodonPrime requires Python 3.10 to maintain compatibility with dependencies.
And the Lord spake, saying, "First shalt thou download the repository. Then shalt thou use Python three point ten, no more, no less. Three point ten shall be the version thou shalt use, and the number of the version shall be three point ten. Three point eleven shalt thou not count, neither count thou three point nine, excepting that thou then proceed to three point ten. Three point twelve is right out."
If your system version of Python is not 3.10, first create and activate a virtual environment that uses Python 3.10.
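The version requirement can be confirmed from inside the interpreter before installing anything. The following is a minimal sketch and is not part of the toolkit; the helper name `is_supported_python` is hypothetical.

```python
import sys

REQUIRED = (3, 10)  # (major, minor) version stated in this README


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is exactly 3.10.

    Hypothetical helper for illustration; not part of CodonPrime.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if is_supported_python():
        print("Python 3.10 detected.")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python {found} detected; create a 3.10 virtual environment first.")
```

Run it with the same `python` executable you intend to use for the installation, so that the check reflects the active environment.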

    git clone https://github.com/idekerlab/codonprime
    cd CodonPrime
    pip install -r requirements_dev.txt
    make install

The above should take no more than a few minutes. After this point, you can run CodonPrime for single amino-acid variants only, since reference sequence retrieval will rely on real-time sequence fetching (see below).
Tabular/batch input requires local sequence retrieval to be set up. This requires parts of the NCBI-BLAST+ toolkit, for which precompiled binaries are available at the NCBI FTP site.
These must be in your $PATH for proper execution.
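Whether the BLAST+ programs are visible on `$PATH` can be checked with the standard library. This is a sketch under an assumption: this README names only `blastdbcmd` (in the `--blastdb` option), and `makeblastdb` is a guess based on the database-building step, so adjust the tuple to whatever your setup actually needs. The helper name `missing_tools` is hypothetical.

```python
import shutil

# Assumption: only blastdbcmd is named in this README; makeblastdb is a
# guess based on the database-building step described in the setup.
ASSUMED_TOOLS = ("blastdbcmd", "makeblastdb")


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on $PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All assumed BLAST+ tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.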
Then execute the following:

    cd data
    bash setup_data.sh
    python package_gff.py
    cd ..

The above code will download the current NCBI human genome build and annotations, and build a matching BLAST database. It will also package relevant annotations in a lightweight fashion for fast access. The whole process should not take more than a few minutes.
Basic usage is:
codonprime-cli [-h] {command} [options]
Where command is one of direct, maf, tabular, clinvar, cosmic, and imports.
The imports command is for utility/debugging purposes only.
| Command | Modes | Number of variants |
|---|---|---|
| direct | Either (by --blastdb) | 1 |
| maf | Local | many |
| tabular | Local | many |
| clinvar | Remote | 1 |
| cosmic | Remote | 1 |
Most options are common to all commands, and will be covered later.
codonprime-cli maf maf_file_name [options]
- maf_file_name
- MAF file name for batch mutation input. The MAF file format is specified at https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/. If you are generating only amino-acid variants found in tumors and/or patients, this input format is almost certainly the most straightforward to use. Note that not all platforms hosting MAF files use all of the specified columns, and some include their own. The columns required here are 'Hugo_Symbol', 'Transcript_ID', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', and 'HGVSp', which should be present in almost any genuine MAF output. This command supports the use of Ensembl transcript identifiers.
This mode has the following unique options:
--use_xscript_for_reference When creating short names for amino-acid variant references, use the Transcript_ID (e.g. NM_1234567) rather than the Hugo_Symbol. Relevant in cases where non-reference transcript isoforms are used.
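Because MAF files from different platforms vary in their columns, it can help to confirm that the six required columns are present before starting a batch run. The sketch below is not part of the toolkit; `missing_maf_columns` is a hypothetical helper, and it assumes exact, case-sensitive column names and that any leading comment lines start with `#`.

```python
import csv
import io

# Column names required by the maf command, as listed in this README.
REQUIRED_MAF_COLUMNS = (
    "Hugo_Symbol",
    "Transcript_ID",
    "Reference_Allele",
    "Tumor_Seq_Allele1",
    "Tumor_Seq_Allele2",
    "HGVSp",
)


def missing_maf_columns(handle):
    """Return required columns absent from the header of an open MAF file.

    Blank lines and leading lines starting with '#' are skipped.
    """
    for line in handle:
        if line.strip() and not line.startswith("#"):
            header = next(csv.reader([line], delimiter="\t"))
            break
    else:
        header = []  # no header found: every required column is missing
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_MAF_COLUMNS if name not in present]


if __name__ == "__main__":
    demo = io.StringIO("#version 2.4\nHugo_Symbol\tTranscript_ID\tHGVSp\n")
    print("Missing columns:", missing_maf_columns(demo))
```

For a real file, pass an open text handle, for example `with open(path) as handle: missing_maf_columns(handle)`.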
codonprime-cli direct [options]
Specify by target gene, transcript or protein, plus protein mutation site and installed amino acid. This command has the following unique arguments:
| Option | Description |
|---|---|
| --identifier IDENTIFIER, -i IDENTIFIER | Identifier of the target gene or protein. Accepted identifiers are GeneID (integer), HUGO Gene Symbol, protein accession (e.g. NP_), and RNA accession (e.g. NM_). Gene-level identifiers will be mapped to MANE transcript/protein isoforms. |
| --site SITE, -s SITE | Integer amino-acid residue to be mutated. |
| --installed INSTALLED, -n INSTALLED | Single-letter amino acid symbol to be installed at the site. |
| --nt_original NT_ORIGINAL | Original nucleotide sequence spanning the intended substitution. |
| --nt_installed NT_INSTALLED | Substituted nucleotide sequence spanning the intended substitution. The --nt_original and --nt_installed options must either both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original if it is provided in error; in the future, it will print a warning if the arguments are provided but no 'Strict' designs are detected. |
| --reference REFERENCE | AA mutation name, intended to be human-readable. Procedurally generated if not supplied by this argument. |
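When the direct command is called from a script, the options above can be assembled into an argument list. This is a minimal sketch that uses only flags documented in this section; the wrapper `build_direct_argv` and the reference label `HFE_C282Y` are hypothetical, and the both-or-neither check mirrors the rule stated for the two nucleotide options.

```python
import shlex


def build_direct_argv(identifier, site, installed,
                      nt_original=None, nt_installed=None, reference=None):
    """Assemble an argument list for `codonprime-cli direct`.

    Hypothetical convenience wrapper; not part of the toolkit.
    """
    # The two nucleotide options must be given together or not at all.
    if (nt_original is None) != (nt_installed is None):
        raise ValueError("nt_original and nt_installed must be given together")
    if len(installed) != 1:
        raise ValueError("installed must be a single-letter amino acid symbol")
    argv = ["codonprime-cli", "direct",
            "--identifier", str(identifier),
            "--site", str(int(site)),
            "--installed", installed]
    if nt_original is not None:
        argv += ["--nt_original", nt_original, "--nt_installed", nt_installed]
    if reference is not None:
        argv += ["--reference", reference]
    return argv


if __name__ == "__main__":
    # HFE p.Cys282Tyr, the variant used as the ClinVar example in this README.
    argv = build_direct_argv("HFE", 282, "Y", reference="HFE_C282Y")
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` once the toolkit is installed; building a list rather than a string avoids shell-quoting problems.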
codonprime-cli tabular table_file_name [options]
- table_file_name
- File name for tabular mutation input. The file must be tab-separated text, with case-insensitive column headers on the first row.
Column headers MUST include:

- site : integer amino-acid residue to be mutated.
- installed : single-letter amino acid symbol to be installed at the site. * = terminator.
- identifier : acceptable identifiers (see "--identifier" in the "Direct" section). Whichever form of identifier is chosen, it must be consistent throughout the column.

Column headers MAY include:

- reference : reference string for the amino acid mutation in the pegRNA data table (for user convenience).
- nt_original : original nucleotide sequence spanning the intended substitution.
- nt_installed : substituted nucleotide sequence spanning the intended substitution.

The nt_original and nt_installed headers must either both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original if it is provided in error; in the future, it will print a warning if the columns are provided but no 'Strict' designs are detected.
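An input table in the format described above can be produced with the standard `csv` module. The sketch below is illustrative only: `write_variant_table` is a hypothetical helper, and the two HFE rows are example data rather than anything shipped with the toolkit.

```python
import csv
import io

REQUIRED_COLUMNS = ("identifier", "site", "installed")
OPTIONAL_COLUMNS = ("reference", "nt_original", "nt_installed")


def write_variant_table(rows, handle):
    """Write dict rows as tab-separated text with headers on the first row."""
    optional = [name for name in OPTIONAL_COLUMNS
                if any(name in row for row in rows)]
    # nt_original and nt_installed must be both included or both omitted.
    if ("nt_original" in optional) != ("nt_installed" in optional):
        raise ValueError("nt_original and nt_installed must appear together")
    for row in rows:
        missing = [name for name in REQUIRED_COLUMNS if name not in row]
        if missing:
            raise ValueError(f"row is missing required columns: {missing}")
    writer = csv.DictWriter(handle, fieldnames=list(REQUIRED_COLUMNS) + optional,
                            delimiter="\t", lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Both rows use the same identifier form (gene symbol), as required.
    rows = [
        {"identifier": "HFE", "site": 282, "installed": "Y"},
        {"identifier": "HFE", "site": 63, "installed": "D"},
    ]
    buffer = io.StringIO()
    write_variant_table(rows, buffer)
    print(buffer.getvalue(), end="")
```

Writing to a file instead of a buffer only requires opening it with `open(path, "w", newline="")` and passing the handle.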
codonprime-cli clinvar clinvar_identifier [options]
- clinvar_identifier
- ClinVar identifier: either the long identifier (e.g. NM_000410.4(HFE):c.845G>A (p.Cys282Tyr)), the Variation ID (e.g. 9), or the accession (e.g. VCV000000009.137).
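The three accepted identifier forms can be told apart with simple patterns, and the long form needs quoting on a command line because it contains parentheses, a space, and `>`. The sketch below is an illustration only: the patterns are inferred from the examples in this README, not taken from the toolkit's own parsing, and `clinvar_identifier_kind` is a hypothetical name.

```python
import re
import shlex

# Patterns inferred from the examples in this README (assumption).
VARIATION_ID = re.compile(r"^\d+$")
ACCESSION = re.compile(r"^VCV\d{9}(\.\d+)?$")


def clinvar_identifier_kind(identifier):
    """Classify an identifier as variation_id, accession, or long_identifier."""
    text = identifier.strip()
    if VARIATION_ID.match(text):
        return "variation_id"
    if ACCESSION.match(text):
        return "accession"
    return "long_identifier"


if __name__ == "__main__":
    long_id = "NM_000410.4(HFE):c.845G>A (p.Cys282Tyr)"
    for example in (long_id, "9", "VCV000000009.137"):
        print(example, "->", clinvar_identifier_kind(example))
    # shlex.join adds the quoting a shell needs for the long identifier.
    print(shlex.join(["codonprime-cli", "clinvar", long_id]))
```

Without quoting, a shell would treat the `>` in the long identifier as an output redirection, so passing arguments as a list or quoting the identifier is the safer choice.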
codonprime-cli cosmic cosmic_accession [options]
- cosmic_accession
- COSMIC accession (e.g. COSV55497419)
| Option | Description |
|---|---|
| -h, --help | Show help message and exit. |
| --blastdb BLASTDB | File path accessed by blastdbcmd, used for batch input. Providing "remote" will use remote fetchers (generally slower). Some options may require or disallow the use of remote. The default is the data directory. |
| --output OUTPUT | Directory containing output; default ./output. |
| --nicking | Additionally design nicking guides for the edit (PE3) with DeepSpCas9 prediction. This is barely supported and not advised. |
| --processes PROCESSES, --proc PROCESSES | Number of simultaneous processes to run. The default is the machine core count minus 1. |
| --n_summary N_SUMMARY | Number of guides to be included in the summary file for each amino-acid variant (default 10). If 0, all guides will be included. If negative, the summary file will not be generated. |
| --pre_triage_level PRE_TRIAGE_LEVEL | To save computation time, candidate pegRNA designs that are highly unlikely to score well are triaged before evaluation by the transformer model. Higher values discard more guides. At the default value, discarded guides have a 2.5% chance of showing an actual editing efficiency greater than 5%. Lower values of this parameter result in longer computation times. Negative values (but not 0) remove the pre-triage step entirely, but any time saved by skipping the step will almost certainly be offset by the increased number of efficiency predictions. |
| --run_spliceai | Run spliceAI (Jaganathan et al., Cell 2019) on pegRNA designs. Two additional columns will be added to the output tables. spliceai_deltascore : maximum delta score (see ref.) for this pegRNA. The spliceAI authors suggest scores ≥0.5 as "high-confidence", but their own data suggest that even such scores often correspond to low penetrance and validation rates; the cutoff used in Zhao et al. is 0.8. splice_summary : the nature of the splice site change (donor/acceptor, gain/loss) associated with the maximum delta score. |
| --check_offtargets | Run DeepPrime-Off (Yu et al., Cell 2023) on pegRNA designs. DeepPrime-Off requires cas-offtarget (redistributed with this GitHub repo), and cas-offtarget in turn requires an OpenCL-enabled device. This should encompass most Linux setups with a graphics card, but one notable exception is the Windows Subsystem for Linux (WSL). You should not expect this option to work without OpenCL bindings. The following columns will be added to the full-length output table, with values referring to the single most likely off-target site. DPO: Location : chromosome containing the site. DPO: Position : chromosomal coordinates of the site. DPO: Strand : strand (+/-). DPO: MM_num : number of mismatches at the off-target site. Additionally, this will create the file offtarget_predictions_full.csv.gz in your chosen output directory. |
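Two of the numeric options above have rules that are easy to misread, so the sketch below restates them as code. It is not the toolkit's implementation: both function names are hypothetical, `select_for_summary` assumes the guide list is already ordered best-first, and flooring the process count at 1 is an assumption made here for single-core machines.

```python
import os


def default_processes(core_count=None):
    """Documented default for --processes: machine core count minus 1.

    The floor of 1 is an assumption, so a single-core machine gets a worker.
    """
    if core_count is None:
        core_count = os.cpu_count() or 1
    return max(1, core_count - 1)


def select_for_summary(guides, n_summary=10):
    """Apply the documented --n_summary rules to one variant's guides.

    Assumes `guides` is ordered best-first. Returns None when no summary
    file should be generated.
    """
    if n_summary < 0:
        return None            # negative: summary file is not generated
    if n_summary == 0:
        return list(guides)    # zero: all guides are included
    return list(guides[:n_summary])


if __name__ == "__main__":
    ranked = [f"guide_{i}" for i in range(25)]
    print(len(select_for_summary(ranked)))      # default of 10
    print(len(select_for_summary(ranked, 0)))   # all guides
    print(select_for_summary(ranked, -1))       # no summary
    print(default_processes())
```

The point to remember is that 0 and negative values behave differently for `--n_summary`: 0 keeps everything, while a negative value suppresses the summary file.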
These arguments change parameters that are likely best left alone by the typical user. Exceptions include users working with non-human organisms or who wish to use alternative models for pegRNA efficiency prediction.
| Option | Description |
|---|---|
| --data DATA | File path for data. It should contain everything generated by cd data && bash setup_data.sh && python package_gff.py. |
| --chunksize CHUNKSIZE | Frame chunking size (by number of contexts). For debugging/optimization purposes. |
| --model_root_dir MODEL_ROOT_DIR | Directory containing the trained models, in run#/ folders. |
| --debug | Set debug mode for logging. |
| --prevent_secret_parallelism | For debugging purposes, prevent built-in parallelism from underlying modules such as numpy and pytorch. |
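For context on what "built-in parallelism" means here, numerical backends used by numpy and pytorch commonly read thread-count environment variables when they start. The sketch below shows that general technique only; it is an assumption that this resembles what `--prevent_secret_parallelism` does internally, and `single_thread_env` is a hypothetical helper.

```python
import os

# Environment variables read by common numerical backends
# (OpenMP, Intel MKL, OpenBLAS) to decide how many threads to use.
THREAD_VARIABLES = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def single_thread_env(env=None):
    """Return a copy of `env` with backend thread counts pinned to 1.

    Generic illustration; not taken from the CodonPrime source.
    """
    base = dict(os.environ if env is None else env)
    for name in THREAD_VARIABLES:
        base[name] = "1"
    return base


if __name__ == "__main__":
    env = single_thread_env()
    for name in THREAD_VARIABLES:
        print(f"{name}={env[name]}")
```

Such an environment would be passed to a child process, for example through the `env` argument of `subprocess.run`, because the variables generally need to be set before the numerical libraries are imported.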