primer_design

qPCR primer design pipeline for cryptic exon (CE) junction detection in mouse (default mm39). Given GenBank files that annotate a cryptic exon inside an intron, the pipeline designs primer pairs that span at least one CE boundary (exon↔CE junction) and uses pair-wise BLAST to eliminate off-target amplifications, so only primers that should only amplify the CE are kept.

IMPORTANT: This repository is CE-specific. The BLAST-based selection rules in this README are designed to detect/quantify cryptic exon inclusion (a CE-containing splice junction), which is not the same objective as standard qPCR targeting a normal mRNA.

TL;DR

For cryptic exon (CE) junction assays (this repo): require mRNA_amplicons == 0 to avoid WT transcript amplification.
For standard mRNA qPCR (not this repo’s default): you typically want Genome_amplicons == 0 and mRNA_amplicons >= 1.

1. What the pipeline actually does

GenBank (.gb)
   │
   ▼
(1) Locate the CE feature and its two junctions (left/right)
   │
   ▼
(2) Primer3 design (three valid qPCR geometries, merged)
     - Tm 57-63 C, GC 40-65 %, product 100-200 bp
     - `lr`: LEFT near the LEFT junction, RIGHT near the RIGHT junction (classic
       CE bridge when length allows).
     - `right_at_L`: LEFT upstream only; RIGHT near the LEFT junction (only RIGHT
       may cross the left boundary — still valid qPCR if junction_coverage >= 1).
     - `left_at_R`: LEFT near the RIGHT junction; RIGHT downstream only (only LEFT
       may cross the right boundary).
     - Each layout is a separate Primer3 run with SEQUENCE_PRIMER_PAIR_OK_REGION_LIST;
       SEQUENCE_OVERLAP_JUNCTION_LIST + PRIMER_MIN_*_PRIME_OVERLAP_OF_JUNCTION=1
       still apply.  Candidates are deduped by (F,R) sequence, capped at 20 pairs,
       sorted by Primer3 pair penalty (then layout preference: lr first).
   │
   ▼
(3) Junction coverage
     - Left/right junction indices ``j`` come from the CE feature (same as before).
     - For each ``j``, the relevant 2 bp on the template are ``j-1`` and ``j``
       (0-based). A junction counts as covered only if **either** primer's binding
       interval fully contains **both** bases (primer straddles the boundary).
     - Count 0 / 1 / 2 for the two junctions; hard filter still requires >= 1.
   │
   ▼
(4) Pair BLAST - check 1: Genome (mm39)
     - blastn-short each primer against mm39.
     - Count in-silico PCR amplicons (F and R on opposite strands, same chrom,
       F+ upstream of R-, length <= --max_amplicon_bp).
     - Expectation for a valid CE pair:
         * 0 amplicons if the pair truly spans a junction (the junction does
           not exist on genomic DNA, so no PCR product is formed).
         * 1 amplicon if both primers happen to lie fully inside the CE (no
           junction spanned; would be dropped in step (3) anyway).
     - DISQUALIFY if >= 2 amplicons (off-target gDNA amplification).
   │
   ▼
(5) Pair BLAST - check 2: curated RefSeq mRNA (mouse, NM_ only)
     - blastn-short each primer against the curated RefSeq mRNA DB.
     - Expectation: 0 amplicons, because a CE is not in any normal mRNA.
     - DISQUALIFY if >= 1 amplicon (primer would amplify a wild-type transcript).
   │
   ▼
(6) Hard filter = {junction_coverage >= 1} AND {genome_amplicons < 2}
                  AND {mrna_amplicons == 0}
     No soft scoring - Primer3's own ranking is already good enough once
     these three binary checks pass.
   │
   ▼
(7) Sort passing pairs by:
       (a) HardFilter_pass (True first)
       (b) Junction_coverage_count DESC (prefer pairs that span both junctions)
       (c) Primer3_rank ASC (fall back to Primer3's penalty order)
     Emit top N (default 5) pairs per sequence to Excel.

Why "0 genome amplicons is GOOD" is not a contradiction: a junction-spanning primer places its 3' end across a sequence boundary that exists only on the spliced CE-containing transcript; the corresponding bases are not contiguous on genomic DNA, so no in-silico amplicon should form - which is exactly what we want.

If you want standard mRNA qPCR (not CE detection)

This pipeline intentionally rejects any primer pair that can amplify a curated RefSeq mRNA (mrna_amplicons >= 1) because that would indicate amplification of wild-type transcripts — which is exactly what we want to avoid when detecting a cryptic exon junction.

If your goal is a normal mRNA qPCR assay instead, the selection logic should be reversed:

Genome (gDNA) pair BLAST: prefer 0 amplicons (typically by designing a primer that spans a canonical exon–exon junction, so gDNA cannot produce a product).
mRNA pair BLAST: require >= 1 amplicon (the assay must amplify the intended transcript in the curated mRNA DB).

In other words, for standard mRNA qPCR, a good pair is usually: Genome_amplicons == 0 AND mRNA_amplicons >= 1, not the CE rule mRNA_amplicons == 0.

2. Setup

2.1 Python env

Requires Python 3.9+ and BLAST+ (blastn, makeblastdb) on PATH.

cd primer_design
python3 -m venv .venv
source .venv/bin/activate
pip install -r primers_pipeline/requirements.txt

Install BLAST+ if it is missing:

# Ubuntu / Debian
sudo apt-get install ncbi-blast+
# macOS
brew install blast

2.2 Build the two local BLAST databases

The pipeline runs entirely offline and expects two local blast databases under ~/blastdb/ (paths are configurable, see CLI below):

~/blastdb/mm39/mm39.{nhr,nin,nsq,...}
~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse.{nhr,nin,nsq,...}

Genome DB (`mm39`)

Any standard way of obtaining the mouse genome FASTA works. A minimal recipe:

mkdir -p ~/blastdb/mm39 && cd ~/blastdb/mm39
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gunzip mm39.fa.gz
makeblastdb -in mm39.fa -dbtype nucl -parse_seqids -out mm39 -title mm39

Curated mouse mRNA DB (`NM_` only, no predicted transcripts)

mkdir -p ~/blastdb/refseq_mrna_mouse && cd ~/blastdb/refseq_mrna_mouse

# 1. Fetch the list of mouse mRNA chunks (wget does not expand HTTP wildcards
#    so we scrape the directory listing first).
curl -s https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/ \
    | grep -oE 'mouse\.[0-9]+\.rna\.fna\.gz' | sort -u > files.txt

while read -r f; do
    wget -c "https://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/$f"
done < files.txt

# 2. Keep only curated transcripts (NM_*). Drop predicted/non-coding (XM_/XR_).
zcat mouse.*.rna.fna.gz \
    | awk 'BEGIN{p=0} /^>/{p = ($0 ~ /^>NM_/)} {if(p) print}' \
    > refseq_mrna_mouse.fna

# 3. Build the BLAST database.
makeblastdb -in refseq_mrna_mouse.fna -dbtype nucl -parse_seqids \
    -out refseq_mrna_mouse -title refseq_mrna_mouse

3. Run

Minimum invocation - design primers for every .gb file under my_inputs/:

source .venv/bin/activate
python primers_pipeline/run_pipeline.py --input my_inputs/

With non-default paths:

python primers_pipeline/run_pipeline.py \
  --input my_inputs/ \
  --outdir primers_pipeline/output \
  --genome_db ~/blastdb/mm39/mm39 \
  --mrna_db   ~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse \
  --num_pairs 5 \
  --max_amplicon_bp 1000

CLI reference

flag	default	meaning
`--input`	(required)	One or more `.gb/.gbk/.genbank` files or directories (scanned recursively).
`--outdir`	`primers_pipeline/output`	Destination for `primers_output.xlsx` + `intermediate_results.json` + failure CSV.
`--blast_mode`	`local`	`local` (recommended) or `none` (skip BLAST; every pair will be `BLAST pending`).
`--genome_db`	`~/blastdb/mm39/mm39`	BLAST prefix for the genome DB.
`--mrna_db`	`~/blastdb/refseq_mrna_mouse/refseq_mrna_mouse`	BLAST prefix for the curated mouse mRNA DB.
`--local_blast_db`	(none)	Deprecated alias for `--genome_db`.
`--blastn_bin`	`blastn`	Override `blastn` binary path if not on `PATH`.
`--max_amplicon_bp`	`1000`	Upper bound on in-silico PCR amplicon length when counting pair BLAST amplicons.
`--num_pairs`	`5`	Primer pairs emitted per sequence in the Excel.
`--feature_query`	`cryptic exon`	Case-insensitive substring used to locate the CE feature in the GenBank file.
`--resume`		Reuse `intermediate_results.json` for records that already completed.
`--retry_pending`		With `--resume`, retry pairs that had failed BLAST last time.
`--no_blast`		Alias for `--blast_mode none`.

4. Output

Inside --outdir:

primers_output.xlsx - one sheet per GenBank record + a Summary sheet. Selected ("green") row per sheet: first pair whose Specificity_flag == ok, otherwise top-ranked pair.
intermediate_results.json - raw per-pair data including cached BLAST hits (for resume / debugging).
failed_records.csv - records that could not be processed (e.g. no CE feature found).

Key Excel columns

column	meaning
`Rank`	Row rank after sorting (pass > junction coverage desc > Primer3 rank asc).
`Junction_coverage_count`	0 / 1 / 2 - how many junctions have both `j-1` and `j` inside either primer.
`Primer3_rank`	0-based rank in the merged candidate list (penalty-sorted; see step 2).
`Primer3_layout`	Which geometry produced the pair: `lr`, `right_at_L`, or `left_at_R`.
`Genome_amplicons`	Pair-BLAST amplicons vs `--genome_db`. Must be 0 or 1 (2+ => off-target gDNA).
`mRNA_amplicons`	Pair-BLAST amplicons vs `--mrna_db`. Must be 0 (1+ => WT mRNA would be amplified).
`HardFilter_pass`	`True` iff `jc>=1` and `genome<2` and `mrna==0`.
`HardFilter_fail_reasons`	Comma-separated reason codes when `HardFilter_pass` is `False`.
`Specificity_flag`	`ok`, `off-target risk`, or `BLAST pending`.

5. Files

primer_design/
├── README.md                         <- this file
├── primers_pipeline/
│   ├── run_pipeline.py               <- the single pipeline entry point
│   └── requirements.txt              <- primer3-py, biopython, pandas, openpyxl
└── .gitignore                        <- ignores output*/ and venvs

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
primers_pipeline		primers_pipeline
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

primer_design

1. What the pipeline actually does

If you want standard mRNA qPCR (not CE detection)

2. Setup

2.1 Python env

2.2 Build the two local BLAST databases

Genome DB (`mm39`)

Curated mouse mRNA DB (`NM_` only, no predicted transcripts)

3. Run

CLI reference

4. Output

Key Excel columns

5. Files

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

primer_design

1. What the pipeline actually does

If you want standard mRNA qPCR (not CE detection)

2. Setup

2.1 Python env

2.2 Build the two local BLAST databases

Genome DB (mm39)

Curated mouse mRNA DB (NM_ only, no predicted transcripts)

3. Run

CLI reference

4. Output

Key Excel columns

5. Files

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Genome DB (`mm39`)

Curated mouse mRNA DB (`NM_` only, no predicted transcripts)

Packages