CodonPrime-toolkit

License

The CodonPrime-toolkit code is distributed under the PolyForm-Strict license. Model weights for CodonPrime models are distributed under the FAIR license.

Both licenses may be found in LICENSE.md.

Dependencies

Optional

NCBI Blast

System requirements

Linux operating system, including Windows Subsystem for Linux.
Testing has been conducted on Ubuntu 24.04 LTS and Ubuntu 25.04.
CodonPrime requires Python 3.10 to maintain compatibility with dependencies.

And the Lord spake, saying, "First shalt thou download the repository. Then shalt thou use Python three point ten, no more, no less. Three point ten shall be the version thou shalt use, and the number of the version shall be three point ten. Three point eleven shalt thou not count, neither count thou three point nine, excepting that thou then proceed to three point ten. Three point twelve is right out."
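A minimal sketch of a pre-flight check for the version requirement above. The helper name `is_supported` is hypothetical and not part of CodonPrime; it only compares the interpreter's major and minor version against 3.10.

```python
import sys

REQUIRED = (3, 10)  # the only version this README says is supported


def is_supported(version=None):
    """Return True when the (major, minor) pair is exactly 3.10."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python {found} found; create a 3.10 virtual environment first.")
```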

Installation

If your system version of Python is not 3.10, first create and activate a virtual environment in which the Python version is 3.10.

git clone https://github.com/idekerlab/codonprime
cd codonprime
pip install -r requirements_dev.txt
make install

The above should take no more than a few minutes. At this point, you can run CodonPrime for single amino-acid variants only, because reference sequences are fetched remotely in real time (see below).

Tabular/batch input requires local sequence retrieval, which depends on parts of the NCBI BLAST+ toolkit. Precompiled binaries are available at the NCBI FTP site.

These must be in your $PATH for proper execution.
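One way to confirm the binaries are reachable before running the data setup is sketched below. The tool list is an assumption: `blastdbcmd` is named under `--blastdb` in this README, while `makeblastdb` is a guess based on the database build step. The function name `missing_tools` is hypothetical.

```python
import shutil

# Assumption: these are the BLAST+ programs CodonPrime needs.
ASSUMED_TOOLS = ("blastdbcmd", "makeblastdb")


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on $PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not found on $PATH:", ", ".join(absent))
```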

Then, execute the following:

cd data
bash setup_data.sh
python package_gff.py
cd ..

The above code will download the current NCBI human genome build and annotations, and build a matching BLAST database. It will also package relevant annotations in a lightweight fashion for fast access. The whole process should not take more than a few minutes.

Usage

Basic usage is:

codonprime-cli [-h] {command} [options]

Where command is one of direct, maf, tabular, clinvar, cosmic, and imports. The imports command is for utility/debugging purposes only.

Command	Modes	Number of variants
direct	Either (by `--blastdb`)	1
maf	Local	many
tabular	Local	many
clinvar	Remote	1
cosmic	Remote	1

Most options are common to all commands, and will be covered later.
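For scripting around the CLI, the command table can be captured as a small lookup. This is only a sketch; the names `COMMAND_MODES` and `requires_local_database` are hypothetical and do not come from the CodonPrime code base.

```python
# Modes per command, copied from the table above.
COMMAND_MODES = {
    "direct": ("local", "remote"),
    "maf": ("local",),
    "tabular": ("local",),
    "clinvar": ("remote",),
    "cosmic": ("remote",),
}


def requires_local_database(command):
    """True when the command can only run against a local BLAST database."""
    return COMMAND_MODES[command] == ("local",)
```

A wrapper script could use this to fail early when `maf` or `tabular` is requested before the local data setup has been run.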

MAF

codonprime-cli maf maf_file_name [options]

maf_file_name: MAF file name for batch mutation input. The MAF file format is specified at https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/. If you are generating only amino-acid variants found in tumors and/or patients, this input format is almost certainly the most straightforward to use. Note that not all platforms hosting MAF files use all of the specified columns, and some add their own. The columns required here are 'Hugo_Symbol', 'Transcript_ID', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', and 'HGVSp', which should be present in almost any genuine MAF output. This command supports the use of Ensembl transcript identifiers.

This mode has the following unique options:

`--use_xscript_for_reference`
	When creating short names for amino-acid variant references, use the Transcript_ID (e.g. NM_1234567) rather than the Hugo_Symbol. Relevant in cases where non-reference transcript isoforms are used.
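Because MAF files from different platforms vary, it can help to check for the six required columns before starting a long run. The sketch below assumes a tab-separated file whose leading comment lines start with `#`; the function name `missing_maf_columns` is hypothetical.

```python
import csv
import io

REQUIRED_MAF_COLUMNS = {
    "Hugo_Symbol", "Transcript_ID", "Reference_Allele",
    "Tumor_Seq_Allele1", "Tumor_Seq_Allele2", "HGVSp",
}


def missing_maf_columns(handle):
    """Return required columns absent from the header of an open MAF file."""
    # Skip comment lines such as "#version ..." that precede the header.
    data_lines = (line for line in handle if not line.startswith("#"))
    header = next(csv.reader(data_lines, delimiter="\t"), [])
    return sorted(REQUIRED_MAF_COLUMNS - set(header))


if __name__ == "__main__":
    demo = io.StringIO("#version 2.4\nHugo_Symbol\tTranscript_ID\tHGVSp\n")
    print(missing_maf_columns(demo))
```

For a real file, pass `open(maf_file_name, newline="")` instead of the in-memory demo.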

Direct

codonprime-cli direct [options]

Specify by target gene, transcript or protein, plus protein mutation site and installed amino acid. This command has the following unique arguments:

`--identifier IDENTIFIER, -i IDENTIFIER`
	Identifier of target gene or protein. Accepted identifiers are GeneID (integer), HUGO Gene Symbol, Protein accession (e.g. NP_), RNA accession (e.g. NM_). Gene-level identifiers will be mapped to MANE transcript/protein isoforms.
`--site SITE, -s SITE`
	integer amino-acid residue to be mutated.
`--installed INSTALLED, -n INSTALLED`
	single-letter amino acid symbol to be installed at the site.
`--nt_original NT_ORIGINAL`
	original nucleotide sequence spanning the intended substitution.
`--nt_installed NT_INSTALLED`
	substituted nucleotide sequence spanning the intended substitution.
	`--nt_original` and `--nt_installed` must either both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original, so a value provided in error is accepted silently. In the future, it will print a warning if the arguments are provided but no 'Strict' designs are detected.
`--reference REFERENCE`
	AA mutation name, intended to be human-readable. Procedurally generated if not supplied by this argument.
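A sketch of assembling a `direct` invocation from Python, using only the flags documented above. The helper `direct_command` is hypothetical; it builds the argument list and leaves running it (for example with `subprocess.run(argv, check=True)`) to the caller.

```python
import shlex


def direct_command(identifier, site, installed, reference=None):
    """Build the argument list for `codonprime-cli direct`."""
    argv = [
        "codonprime-cli", "direct",
        "--identifier", str(identifier),
        "--site", str(int(site)),
        "--installed", installed,
    ]
    if reference is not None:
        argv += ["--reference", reference]
    return argv


if __name__ == "__main__":
    # Example values only: HFE residue 282 changed to tyrosine.
    print(shlex.join(direct_command("HFE", 282, "Y")))
```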

Tabular

codonprime-cli tabular table_file_name [options]

table_file_name: File name for tabular mutation input. The file must be tab-separated text, with case-insensitive column headers on the first row.

Column headers MUST include:

- site : integer amino-acid residue to be mutated.
- installed : single-letter amino acid symbol to be installed at the site (* = terminator).
- identifier : any accepted identifier (see `--identifier` in the "Direct" section). Whichever form of identifier is chosen, it must be consistent throughout the column.

Column headers MAY include:

- reference : reference string for the amino acid mutation in the pegRNA data table (for user convenience).
- nt_original : original nucleotide sequence spanning the intended substitution.
- nt_installed : substituted nucleotide sequence spanning the intended substitution.

The nt_original and nt_installed headers must either both be included or both be omitted. They are used for the detection of Flex- and Strict-codon pegRNAs. The script WILL NOT validate nt_original, so a value provided in error is accepted silently. In the future, it will print a warning if the columns are provided but no 'Strict' designs are detected.
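The input table can be produced with the standard `csv` module. The following is a minimal sketch, not CodonPrime code: `write_table` is a hypothetical helper that enforces the required headers and the both-or-neither rule for nt_original and nt_installed described above.

```python
import csv
import io

REQUIRED = ("identifier", "site", "installed")


def write_table(rows, handle):
    """Write a list of dict rows as tab-separated text with a header row."""
    for row in rows:
        absent = [name for name in REQUIRED if name not in row]
        if absent:
            raise ValueError(f"row is missing required columns: {absent}")
    # Required columns first, then optional ones in first-seen order.
    fieldnames = list(REQUIRED)
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    if ("nt_original" in fieldnames) != ("nt_installed" in fieldnames):
        raise ValueError("nt_original and nt_installed must appear together")
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_table([{"identifier": "HFE", "site": 282, "installed": "Y"}], buffer)
    print(buffer.getvalue(), end="")
```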

ClinVar

codonprime-cli clinvar clinvar_identifier [options]

clinvar_identifier: ClinVar identifier, either a long identifier (e.g. NM_000410.4(HFE):c.845G>A (p.Cys282Tyr)), a Variation ID (e.g. 9), or an accession (e.g. VCV000000009.137)
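The three identifier forms can be told apart by shape alone. The sketch below is an assumption about how one might classify them in a wrapper script. It is not taken from CodonPrime, and the function name and return labels are hypothetical.

```python
import re


def classify_clinvar_identifier(text):
    """Guess which of the three documented identifier forms `text` is."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        return "variation_id"
    if re.fullmatch(r"VCV\d+(\.\d+)?", text):
        return "accession"
    # Anything else is treated as a long, HGVS-style identifier.
    return "long_identifier"
```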

COSMIC

codonprime-cli cosmic cosmic_accession [options]

cosmic_accession: COSMIC accession (e.g. COSV55497419)

Common options

`-h, --help`	show help message and exit
`--blastdb BLASTDB`
	File path accessed by blastdbcmd, used for batch input. Providing "remote" will use remote fetchers (generally slower). Some options may require or disallow the use of `remote`. The default is the data directory.
`--output OUTPUT`
	Directory containing output; default `./output`
`--nicking`	Additionally, design nicking guides for edit (PE3) with DeepSpCas9 prediction. This is barely supported, and not advised.
`--processes PROCESSES, --proc PROCESSES`
	Number of simultaneous processes to run. The default is the machine's core count minus one.
`--n_summary N_SUMMARY`
	Number of guides to be included in the summary file for each amino-acid variant (default 10). If 0, all guides will be included. If negative, the summary file will not be generated.
`--pre_triage_level PRE_TRIAGE_LEVEL`
	To save computation time, candidate pegRNA designs that are highly unlikely to score well are triaged before evaluation by the transformer model. Higher values will discard more guides. At the default value, discarded guides have a 2.5% chance of showing an actual editing efficiency greater than 5%. Lower values of this parameter will result in longer computation times. Negative values (but not 0) remove this pre-triage step entirely, but whatever time is saved by skipping the step will almost certainly be offset by the increased number of efficiencies to predict.
`--run_spliceai`	Run spliceAI (Jaganathan et al., Cell 2019) on pegRNA designs. Two additional columns will be added to the output tables:
	- spliceai_deltascore : maximum delta score (see ref.) for this pegRNA. The spliceAI authors suggest scores ≥0.5 as "high-confidence", but their own data suggest that even such scores often correspond to low penetrance and validation rates; the cutoff used in Zhao et al. is 0.8.
	- splice_summary : the nature of the splice site change (donor/acceptor, gain/loss) associated with the maximum deltascore.
`--check_offtargets`
	Run DeepPrime-Off (Yu et al., Cell 2023) on pegRNA designs. DeepPrime-Off requires cas-offinder (redistributed with this GitHub repo), and cas-offinder in turn requires an OpenCL-enabled device. This should encompass most Linux setups with a graphics card, but one notable exception is the Windows Subsystem for Linux (WSL). Do not expect this option to work without OpenCL bindings. The following columns will be added to the full-length output table, with values referring to the single most likely off-target site:
	- DPO: Location : chromosome containing site
	- DPO: Position : chromosomal coordinates of site
	- DPO: Strand : strand (+/-)
	- DPO: MM_num : number of mismatches at off-target site
	Additionally, this will create the file offtarget_predictions_full.csv.gz in your chosen output directory.
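When `--run_spliceai` is used, the spliceai_deltascore column can be used to flag designs for manual review. The sketch below works on rows already loaded as dictionaries (for example via `csv.DictReader`). The function name is hypothetical, and treating the 0.8 cutoff as "greater than or equal to" is an assumption.

```python
def flag_splice_risk(rows, cutoff=0.8):
    """Return rows whose spliceai_deltascore meets or exceeds `cutoff`."""
    flagged = []
    for row in rows:
        try:
            score = float(row.get("spliceai_deltascore", ""))
        except (TypeError, ValueError):
            continue  # missing or non-numeric score: nothing to flag
        if score >= cutoff:
            flagged.append(row)
    return flagged
```

Passing `cutoff=0.5` applies the more permissive threshold suggested by the spliceAI authors.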

Advanced

These arguments change parameters that are likely best left alone by the typical user. Exceptions include users working with non-human organisms or who wish to use alternative models for pegRNA efficiency prediction.

`--data DATA`	File path for data. It should probably have everything generated by `cd data && bash setup_data.sh && python package_gff.py`
`--chunksize CHUNKSIZE`
	Frame chunking size (by number of contexts). For debugging/optimization purposes.
`--model_root_dir MODEL_ROOT_DIR`
	Directory where trained models in run#/ folders exist.
`--debug`	Set debug mode for logging.
`--prevent_secret_parallelism`
	For debugging purposes, prevent built-in parallelism from underlying modules such as numpy and pytorch.
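For context on what "built-in parallelism" refers to: numerical back ends commonly size their thread pools from environment variables. The sketch below shows that general technique only. Whether `--prevent_secret_parallelism` works this way is not stated in this README, and `single_thread_env` is a hypothetical helper.

```python
import os

# Widely used thread-count variables for OpenMP, MKL and OpenBLAS back ends.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def single_thread_env(base=None):
    """Return a copy of the environment with thread pools capped at one."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = "1"
    return env
```

Such an environment would be passed to a child process, for example `subprocess.run(argv, env=single_thread_env())`, since these variables are generally read when the libraries are first loaded.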
