Skip to content

caotianyu0427/JuncForge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

primer_design

qPCR primer design pipeline for cryptic exon (CE) junction detection in mouse (default mm39). Given GenBank files that annotate a cryptic exon inside an intron, the pipeline designs primer pairs that span at least one CE boundary (exon↔CE junction) and uses pair-wise BLAST to eliminate off-target amplifications, so only primers that should only amplify the CE are kept.

IMPORTANT: This repository is CE-specific. The BLAST-based selection rules in this README are designed to detect/quantify cryptic exon inclusion (a CE-containing splice junction), which is not the same objective as standard qPCR targeting a normal mRNA.

TL;DR

  • For cryptic exon (CE) junction assays (this repo): require mRNA_amplicons == 0 to avoid WT transcript amplification.
  • For standard mRNA qPCR (not this repo’s default): you typically want Genome_amplicons == 0 and mRNA_amplicons >= 1.

1. What the pipeline actually does

GenBank (.gb)
   │
   ▼
(1) Locate the CE feature and its two junctions (left/right)
   │
   ▼
(2) Primer3 design (three valid qPCR geometries, merged)
     - Tm 57-63 C, GC 40-65 %, product 100-200 bp
     - `lr`: LEFT near the LEFT junction, RIGHT near the RIGHT junction (classic
       CE bridge when length allows).
     - `right_at_L`: LEFT upstream only; RIGHT near the LEFT junction (only RIGHT
       may cross the left boundary — still valid qPCR if junction_coverage >= 1).
     - `left_at_R`: LEFT near the RIGHT junction; RIGHT downstream only (only LEFT
       may cross the right boundary).
     - Each layout is a separate Primer3 run with SEQUENCE_PRIMER_PAIR_OK_REGION_LIST;
       SEQUENCE_OVERLAP_JUNCTION_LIST + PRIMER_MIN_*_PRIME_OVERLAP_OF_JUNCTION=1
       still apply.  Candidates are deduped by (F,R) sequence, capped at 20 pairs,
       sorted by Primer3 pair penalty (then layout preference: lr first).
   │
   ▼
(3) Junction coverage
     - Left/right junction indices ``j`` come from the CE feature (same as before).
     - For each ``j``, the relevant 2 bp on the template are ``j-1`` and ``j``
       (0-based). A junction counts as covered only if **either** primer's binding
       interval fully contains **both** bases (primer straddles the boundary).
     - Count 0 / 1 / 2 for the two junctions; hard filter still requires >= 1.
   │
   ▼
(4) Pair BLAST - check 1: Genome (mm39)
     - blastn-short each primer against mm39.
     - Count in-silico PCR amplicons (F and R on opposite strands, same chrom,
       F+ upstream of R-, length <= --max_amplicon_bp).
     - Expectation for a valid CE pair:
         * 0 amplicons if the pair truly spans a junction (the junction does
           not exist on genomic DNA, so no PCR product is formed).
         * 1 amplicon if both primers happen to lie fully inside the CE (no
           junction spanned; would be dropped in step (3) anyway).
     - DISQUALIFY if >= 2 amplicons (off-target gDNA amplification).
   │
   ▼
(5) Pair BLAST - check 2: curated RefSeq mRNA (mouse, NM_ only)
     - blastn-short each primer against the curated RefSeq mRNA DB.
     - Expectation: 0 amplicons, because a CE is not in any normal mRNA.
     - DISQUALIFY if >= 1 amplicon (primer would amplify a wild-type transcript).
   │
   ▼
(6) Hard filter = {junction_coverage >= 1} AND {genome_amplicons < 2}
                  AND {mrna_amplicons == 0}
     No soft scoring - Primer3's own ranking is already good enough once
     these three binary checks pass.
   │
   ▼
(7) Sort passing pairs by:
       (a) HardFilter_pass (True first)
       (b) Junction_coverage_count DESC (prefer pairs that span both junctions)
       (c) Primer3_rank ASC (fall back to Primer3's penalty order)
     Emit top N (default 5) pairs per sequence to Excel.

Why "0 genome amplicons is GOOD" is not a contradiction: a junction-spanning primer places its 3' end across a sequence boundary that exists only on the spliced CE-containing transcript; the corresponding bases are not contiguous on genomic DNA, so no in-silico amplicon should form - which is exactly what we want.

If you want standard mRNA qPCR (not CE detection)

This pipeline intentionally rejects any primer pair that can amplify a curated RefSeq mRNA (mrna_amplicons >= 1) because that would indicate amplification of wild-type transcripts — which is exactly what we want to avoid when detecting a cryptic exon junction.

If your goal is a normal mRNA qPCR assay instead, the selection logic should be reversed:

  • Genome (gDNA) pair BLAST: prefer 0 amplicons (typically by designing a primer that spans a canonical exon–exon junction, so gDNA cannot produce a product).
  • mRNA pair BLAST: require >= 1 amplicon (the assay must amplify the intended transcript in the curated mRNA DB).

In other words, for standard mRNA qPCR, a good pair is usually: Genome_amplicons == 0 AND mRNA_amplicons >= 1, not the CE rule mRNA_amplicons == 0.


2. Setup

2.1 Python env

Requires Python 3.9+ and BLAST+ (blastn, makeblastdb) on PATH.

cd primer_design
python3 -m venv .venv
source .venv/bin/activate
pip install -r primers_pipeline/requirements.txt

Install BLAST+ if it is missing:

# Ubuntu / Debian
sudo apt-get install ncbi-blast+
# macOS
brew install blast

2.2 Build the two local BLAST databases

The pipeline runs entirely offline and expects two local blast databases under ~/blastdb/ (paths are configurable, see CLI below):

~/blastdb/mm39/mm39.{nhr,nin,nsq,...}
~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse.{nhr,nin,nsq,...}

Genome DB (mm39)

Any standard way of obtaining the mouse genome FASTA works. A minimal recipe:

mkdir -p ~/blastdb/mm39 && cd ~/blastdb/mm39
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gunzip mm39.fa.gz
makeblastdb -in mm39.fa -dbtype nucl -parse_seqids -out mm39 -title mm39

Curated mouse mRNA DB (NM_ only, no predicted transcripts)

mkdir -p ~/blastdb/refseq_mrna_mouse && cd ~/blastdb/refseq_mrna_mouse

# 1. Fetch the list of mouse mRNA chunks (wget does not expand HTTP wildcards
#    so we scrape the directory listing first).
curl -s https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/ \
    | grep -oE 'mouse\.[0-9]+\.rna\.fna\.gz' | sort -u > files.txt

while read -r f; do
    wget -c "https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/$f"
done < files.txt

# 2. Keep only curated transcripts (NM_*). Drop predicted/non-coding (XM_/XR_).
zcat mouse.*.rna.fna.gz \
    | awk 'BEGIN{p=0} /^>/{p = ($0 ~ /^>NM_/)} {if(p) print}' \
    > refseq_mrna_mouse.fna

# 3. Build the BLAST database.
makeblastdb -in refseq_mrna_mouse.fna -dbtype nucl -parse_seqids \
    -out refseq_mrna_mouse -title refseq_mrna_mouse

3. Run

Minimum invocation - design primers for every .gb file under my_inputs/:

source .venv/bin/activate
python primers_pipeline/run_pipeline.py --input my_inputs/

With non-default paths:

python primers_pipeline/run_pipeline.py \
  --input my_inputs/ \
  --outdir primers_pipeline/output \
  --genome_db ~/blastdb/mm39/mm39 \
  --mrna_db   ~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse \
  --num_pairs 5 \
  --max_amplicon_bp 1000

CLI reference

flag default meaning
--input (required) One or more .gb/.gbk/.genbank files or directories (scanned recursively).
--outdir primers_pipeline/output Destination for primers_output.xlsx + intermediate_results.json + failure CSV.
--blast_mode local local (recommended) or none (skip BLAST; every pair will be BLAST pending).
--genome_db ~/blastdb/mm39/mm39 BLAST prefix for the genome DB.
--mrna_db ~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse BLAST prefix for the curated mouse mRNA DB.
--local_blast_db (none) Deprecated alias for --genome_db.
--blastn_bin blastn Override blastn binary path if not on PATH.
--max_amplicon_bp 1000 Upper bound on in-silico PCR amplicon length when counting pair BLAST amplicons.
--num_pairs 5 Primer pairs emitted per sequence in the Excel.
--feature_query cryptic exon Case-insensitive substring used to locate the CE feature in the GenBank file.
--resume Reuse intermediate_results.json for records that already completed.
--retry_pending With --resume, retry pairs that had failed BLAST last time.
--no_blast Alias for --blast_mode none.

4. Output

Inside --outdir:

  • primers_output.xlsx - one sheet per GenBank record + a Summary sheet. Selected ("green") row per sheet: first pair whose Specificity_flag == ok, otherwise top-ranked pair.
  • intermediate_results.json - raw per-pair data including cached BLAST hits (for resume / debugging).
  • failed_records.csv - records that could not be processed (e.g. no CE feature found).

Key Excel columns

column meaning
Rank Row rank after sorting (pass > junction coverage desc > Primer3 rank asc).
Junction_coverage_count 0 / 1 / 2 - how many junctions have both j-1 and j inside either primer.
Primer3_rank 0-based rank in the merged candidate list (penalty-sorted; see step 2).
Primer3_layout Which geometry produced the pair: lr, right_at_L, or left_at_R.
Genome_amplicons Pair-BLAST amplicons vs --genome_db. Must be 0 or 1 (2+ => off-target gDNA).
mRNA_amplicons Pair-BLAST amplicons vs --mrna_db. Must be 0 (1+ => WT mRNA would be amplified).
HardFilter_pass True iff jc>=1 and genome<2 and mrna==0.
HardFilter_fail_reasons Comma-separated reason codes when HardFilter_pass is False.
Specificity_flag ok, off-target risk, or BLAST pending.

5. Files

primer_design/
├── README.md                         <- this file
├── primers_pipeline/
│   ├── run_pipeline.py               <- the single pipeline entry point
│   └── requirements.txt              <- primer3-py, biopython, pandas, openpyxl
└── .gitignore                        <- ignores output*/ and venvs

About

CE-specific qPCR primer design pipeline: Primer3 + pair-wise BLAST against genome & curated RefSeq mRNA to select junction-spanning assays.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages