qPCR primer design pipeline for cryptic exon (CE) junction detection in
mouse (default mm39). Given GenBank files that annotate a cryptic exon inside
an intron, the pipeline designs primer pairs that span at least one CE
boundary (exon↔CE junction) and uses pair-wise BLAST to eliminate off-target
amplifications, so only primers that should only amplify the CE are kept.
IMPORTANT: This repository is CE-specific. The BLAST-based selection rules in this README are designed to detect/quantify cryptic exon inclusion (a CE-containing splice junction), which is not the same objective as standard qPCR targeting a normal mRNA.
TL;DR
- For cryptic exon (CE) junction assays (this repo): require
mRNA_amplicons == 0to avoid WT transcript amplification. - For standard mRNA qPCR (not this repo’s default): you typically want
Genome_amplicons == 0andmRNA_amplicons >= 1.
GenBank (.gb)
│
▼
(1) Locate the CE feature and its two junctions (left/right)
│
▼
(2) Primer3 design (three valid qPCR geometries, merged)
- Tm 57-63 C, GC 40-65 %, product 100-200 bp
- `lr`: LEFT near the LEFT junction, RIGHT near the RIGHT junction (classic
CE bridge when length allows).
- `right_at_L`: LEFT upstream only; RIGHT near the LEFT junction (only RIGHT
may cross the left boundary — still valid qPCR if junction_coverage >= 1).
- `left_at_R`: LEFT near the RIGHT junction; RIGHT downstream only (only LEFT
may cross the right boundary).
- Each layout is a separate Primer3 run with SEQUENCE_PRIMER_PAIR_OK_REGION_LIST;
SEQUENCE_OVERLAP_JUNCTION_LIST + PRIMER_MIN_*_PRIME_OVERLAP_OF_JUNCTION=1
still apply. Candidates are deduped by (F,R) sequence, capped at 20 pairs,
sorted by Primer3 pair penalty (then layout preference: lr first).
│
▼
(3) Junction coverage
- Left/right junction indices ``j`` come from the CE feature (same as before).
- For each ``j``, the relevant 2 bp on the template are ``j-1`` and ``j``
(0-based). A junction counts as covered only if **either** primer's binding
interval fully contains **both** bases (primer straddles the boundary).
- Count 0 / 1 / 2 for the two junctions; hard filter still requires >= 1.
│
▼
(4) Pair BLAST - check 1: Genome (mm39)
- blastn-short each primer against mm39.
- Count in-silico PCR amplicons (F and R on opposite strands, same chrom,
F+ upstream of R-, length <= --max_amplicon_bp).
- Expectation for a valid CE pair:
* 0 amplicons if the pair truly spans a junction (the junction does
not exist on genomic DNA, so no PCR product is formed).
* 1 amplicon if both primers happen to lie fully inside the CE (no
junction spanned; would be dropped in step (3) anyway).
- DISQUALIFY if >= 2 amplicons (off-target gDNA amplification).
│
▼
(5) Pair BLAST - check 2: curated RefSeq mRNA (mouse, NM_ only)
- blastn-short each primer against the curated RefSeq mRNA DB.
- Expectation: 0 amplicons, because a CE is not in any normal mRNA.
- DISQUALIFY if >= 1 amplicon (primer would amplify a wild-type transcript).
│
▼
(6) Hard filter = {junction_coverage >= 1} AND {genome_amplicons < 2}
AND {mrna_amplicons == 0}
No soft scoring - Primer3's own ranking is already good enough once
these three binary checks pass.
│
▼
(7) Sort passing pairs by:
(a) HardFilter_pass (True first)
(b) Junction_coverage_count DESC (prefer pairs that span both junctions)
(c) Primer3_rank ASC (fall back to Primer3's penalty order)
Emit top N (default 5) pairs per sequence to Excel.
Why "0 genome amplicons is GOOD" is not a contradiction: a junction-spanning primer places its 3' end across a sequence boundary that exists only on the spliced CE-containing transcript; the corresponding bases are not contiguous on genomic DNA, so no in-silico amplicon should form - which is exactly what we want.
This pipeline intentionally rejects any primer pair that can amplify a curated
RefSeq mRNA (mrna_amplicons >= 1) because that would indicate amplification of
wild-type transcripts — which is exactly what we want to avoid when
detecting a cryptic exon junction.
If your goal is a normal mRNA qPCR assay instead, the selection logic should be reversed:
- Genome (gDNA) pair BLAST: prefer 0 amplicons (typically by designing a primer that spans a canonical exon–exon junction, so gDNA cannot produce a product).
- mRNA pair BLAST: require >= 1 amplicon (the assay must amplify the intended transcript in the curated mRNA DB).
In other words, for standard mRNA qPCR, a good pair is usually:
Genome_amplicons == 0 AND mRNA_amplicons >= 1,
not the CE rule mRNA_amplicons == 0.
Requires Python 3.9+ and BLAST+ (blastn, makeblastdb) on PATH.
cd primer_design
python3 -m venv .venv
source .venv/bin/activate
pip install -r primers_pipeline/requirements.txtInstall BLAST+ if it is missing:
# Ubuntu / Debian
sudo apt-get install ncbi-blast+
# macOS
brew install blastThe pipeline runs entirely offline and expects two local blast databases
under ~/blastdb/ (paths are configurable, see CLI below):
~/blastdb/mm39/mm39.{nhr,nin,nsq,...}
~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse.{nhr,nin,nsq,...}
Any standard way of obtaining the mouse genome FASTA works. A minimal recipe:
mkdir -p ~/blastdb/mm39 && cd ~/blastdb/mm39
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gunzip mm39.fa.gz
makeblastdb -in mm39.fa -dbtype nucl -parse_seqids -out mm39 -title mm39mkdir -p ~/blastdb/refseq_mrna_mouse && cd ~/blastdb/refseq_mrna_mouse
# 1. Fetch the list of mouse mRNA chunks (wget does not expand HTTP wildcards
# so we scrape the directory listing first).
curl -s https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/ \
| grep -oE 'mouse\.[0-9]+\.rna\.fna\.gz' | sort -u > files.txt
while read -r f; do
wget -c "https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/$f"
done < files.txt
# 2. Keep only curated transcripts (NM_*). Drop predicted/non-coding (XM_/XR_).
zcat mouse.*.rna.fna.gz \
| awk 'BEGIN{p=0} /^>/{p = ($0 ~ /^>NM_/)} {if(p) print}' \
> refseq_mrna_mouse.fna
# 3. Build the BLAST database.
makeblastdb -in refseq_mrna_mouse.fna -dbtype nucl -parse_seqids \
-out refseq_mrna_mouse -title refseq_mrna_mouseMinimum invocation - design primers for every .gb file under my_inputs/:
source .venv/bin/activate
python primers_pipeline/run_pipeline.py --input my_inputs/With non-default paths:
python primers_pipeline/run_pipeline.py \
--input my_inputs/ \
--outdir primers_pipeline/output \
--genome_db ~/blastdb/mm39/mm39 \
--mrna_db ~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse \
--num_pairs 5 \
--max_amplicon_bp 1000| flag | default | meaning |
|---|---|---|
--input |
(required) | One or more .gb/.gbk/.genbank files or directories (scanned recursively). |
--outdir |
primers_pipeline/output |
Destination for primers_output.xlsx + intermediate_results.json + failure CSV. |
--blast_mode |
local |
local (recommended) or none (skip BLAST; every pair will be BLAST pending). |
--genome_db |
~/blastdb/mm39/mm39 |
BLAST prefix for the genome DB. |
--mrna_db |
~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse |
BLAST prefix for the curated mouse mRNA DB. |
--local_blast_db |
(none) | Deprecated alias for --genome_db. |
--blastn_bin |
blastn |
Override blastn binary path if not on PATH. |
--max_amplicon_bp |
1000 |
Upper bound on in-silico PCR amplicon length when counting pair BLAST amplicons. |
--num_pairs |
5 |
Primer pairs emitted per sequence in the Excel. |
--feature_query |
cryptic exon |
Case-insensitive substring used to locate the CE feature in the GenBank file. |
--resume |
Reuse intermediate_results.json for records that already completed. |
|
--retry_pending |
With --resume, retry pairs that had failed BLAST last time. |
|
--no_blast |
Alias for --blast_mode none. |
Inside --outdir:
primers_output.xlsx- one sheet per GenBank record + aSummarysheet. Selected ("green") row per sheet: first pair whoseSpecificity_flag == ok, otherwise top-ranked pair.intermediate_results.json- raw per-pair data including cached BLAST hits (for resume / debugging).failed_records.csv- records that could not be processed (e.g. no CE feature found).
| column | meaning |
|---|---|
Rank |
Row rank after sorting (pass > junction coverage desc > Primer3 rank asc). |
Junction_coverage_count |
0 / 1 / 2 - how many junctions have both j-1 and j inside either primer. |
Primer3_rank |
0-based rank in the merged candidate list (penalty-sorted; see step 2). |
Primer3_layout |
Which geometry produced the pair: lr, right_at_L, or left_at_R. |
Genome_amplicons |
Pair-BLAST amplicons vs --genome_db. Must be 0 or 1 (2+ => off-target gDNA). |
mRNA_amplicons |
Pair-BLAST amplicons vs --mrna_db. Must be 0 (1+ => WT mRNA would be amplified). |
HardFilter_pass |
True iff jc>=1 and genome<2 and mrna==0. |
HardFilter_fail_reasons |
Comma-separated reason codes when HardFilter_pass is False. |
Specificity_flag |
ok, off-target risk, or BLAST pending. |
primer_design/
├── README.md <- this file
├── primers_pipeline/
│ ├── run_pipeline.py <- the single pipeline entry point
│ └── requirements.txt <- primer3-py, biopython, pandas, openpyxl
└── .gitignore <- ignores output*/ and venvs