FRANKENSEARCH

Chance-similarity search of short fusion ("franken") protein sequences against species-restricted protein databases.

Unlike an ordinary BLAST search, FRANKENSEARCH is not looking for homologs. Its inputs are artificial fusion proteins, so the usual E-value/homology statistics are the wrong lens. Results are therefore ranked by identity and are not filtered by E-value — the goal is to surface proteins that are similar by chance, including short, high-identity matches that homology search would discard.

Installation

FRANKENSEARCH drives the NCBI BLAST+ tools (blastp, makeblastdb), so those must be installed. The easiest way is a dedicated conda environment:

conda create -n frankensearch -c conda-forge -c bioconda python=3.11 blast
conda activate frankensearch
pip install -e ".[dev]"

Check everything is ready:

frankensearch doctor

doctor verifies that BLAST+ is found and that the identity matrix loads, then reports the taxonomy cache and the databases you have built.
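If you want to run the BLAST+ part of that check from your own scripts, the standard library is enough. This is only a sketch, not FRANKENSEARCH's own code: `find_blast_tools` and `missing_tools` are hypothetical helpers that look up the two executables named above on PATH.

```python
import shutil

REQUIRED_TOOLS = ("blastp", "makeblastdb")  # the BLAST+ programs named above


def find_blast_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(found):
    """Return the names of tools that could not be located, in sorted order."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_blast_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    if missing_tools(found):
        print("Hint: conda activate frankensearch (or install BLAST+)")
```

An empty result from `missing_tools` only means the executables are on PATH; it says nothing about the identity matrix or the databases, which `frankensearch doctor` also covers.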

Quick start

# 1. Build a local database for each species you want to search (once per species)
frankensearch setup --taxids 9606            # human (UniProt reference proteome)

# 2. Search your franken proteins against it
frankensearch search myproteins.fasta --taxids 9606 -n 10

# 3. Read the results
#    myproteins.txt         -> human-readable, with alignments
#    myproteins.tsv         -> for downstream analysis
#    myproteins.summary.md  -> methods-grade record of the run

There is an example input at examples/franken_demo.fasta.

Inputs

Query sequences — short amino-acid sequences (typically < 200 residues). The format is auto-detected:

| Format | Shape |
| --- | --- |
| FASTA | `>name` lines followed by sequence |
| TSV | two columns: name `<tab>` sequence |
| CSV | two columns: name `,` sequence |

Bad records (invalid characters, empty sequences, etc.) are skipped with a plain warning rather than aborting the run.
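To make the auto-detection and skip-with-warning behaviour concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the parser FRANKENSEARCH ships: the function names are hypothetical, the format is guessed from the first non-blank line, and the accepted alphabet is assumed to be the 20 standard amino acids.

```python
import warnings

# Assumption: only the 20 standard amino-acid letters are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def detect_format(text):
    """Guess the input format from the first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"
        if "\t" in line:
            return "tsv"
        if "," in line:
            return "csv"
        break
    raise ValueError("unrecognised input format")


def iter_records(text):
    """Yield raw (name, sequence) pairs for any of the three formats."""
    fmt = detect_format(text)
    if fmt == "fasta":
        name, chunks = None, []
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            elif line:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)
    else:
        sep = "\t" if fmt == "tsv" else ","
        for line in text.splitlines():
            if line.strip():
                name, _, seq = line.partition(sep)
                yield name.strip(), seq.strip()


def parse_queries(text):
    """Return the valid records; warn about bad ones and skip them."""
    good = []
    for name, seq in iter_records(text):
        seq = seq.upper()
        if not seq:
            warnings.warn(f"skipping {name!r}: empty sequence")
        elif set(seq) - VALID_RESIDUES:
            warnings.warn(f"skipping {name!r}: invalid characters")
        else:
            good.append((name, seq))
    return good
```

With this sketch, a file containing one good record, one record with a digit in the sequence, and one empty record yields a single query and two warnings, and the run continues.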

Taxids — one or more NCBI Taxonomy IDs (--taxids 9606,10090). They should be species-level (e.g. human, mouse), not higher clades; FRANKENSEARCH warns if a taxid is not species-rank. Each taxid is searched separately so no single taxon drowns out another.

Key options (`frankensearch search --help`)

| Option | Meaning |
| --- | --- |
| `-n, --num-hits` | Top hits to report per (query, species) (default 10). |
| `--rank-by {identity-alignment,identity-query,alignment-length}` | How to rank hits within each (query, species) group: identity ÷ alignment length (default), identity ÷ query length, or longest alignment first. Both ratios and the alignment length are always reported. |
| `--matrix {identity,pam30,blosum45,blosum62}` | Scoring matrix; `identity` (built-in pure-identity) is the default. |
| `--ungapped` | Ungapped alignments only. |
| `--remote` | Search NCBI remotely instead of using local databases (see below). |
| `-o, --output` | Output path prefix (defaults to the input file's name). |
| `--dry-run` | Show the plan (parsed queries, resolved species) without searching. |
| `--debug` | Show full tracebacks (otherwise errors are concise, friendly messages). |
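The three `--rank-by` modes differ only in the sort key applied within each (query, species) group. The sketch below shows that idea; the `Hit` record, its field names, and `top_hits` are hypothetical, and ties are simply left in input order, which may not match what the tool does.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:  # hypothetical record; field names are illustrative only
    query: str
    taxid: int
    target: str
    identities: int        # number of identical aligned positions
    alignment_length: int
    query_length: int

    @property
    def identity_alignment(self):
        return self.identities / self.alignment_length

    @property
    def identity_query(self):
        return self.identities / self.query_length


RANK_KEYS = {
    "identity-alignment": lambda h: h.identity_alignment,
    "identity-query": lambda h: h.identity_query,
    "alignment-length": lambda h: h.alignment_length,
}


def top_hits(hits, rank_by="identity-alignment", num_hits=10):
    """Keep the best num_hits per (query, taxid) group, highest key first."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.query, hit.taxid)].append(hit)
    key = RANK_KEYS[rank_by]
    return {
        group: sorted(members, key=key, reverse=True)[:num_hits]
        for group, members in groups.items()
    }
```

Under the default key, a 6-residue alignment with 6 identities (ratio 1.0) ranks above a 40-residue alignment with 32 identities (ratio 0.8); with `identity-query` and a 50-residue query the order flips (0.12 versus 0.64). Because grouping happens before truncation, each species keeps its own top hits regardless of how the others score.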

Output

Three files are written per run, sharing the -o/--output prefix:

.txt — human-readable, in two sections: (1) a table of every hit (query, species, target, both identity ratios, alignment length, bit score, E-value, query/subject start–end coordinates, and a final match column), then (2) the BLAST-style pairwise alignments grouped by query then species. The match column / middle "match" line shows the query residue where the two sequences are identical and a dot (.) where they differ (e.g. MKL.EV).
.tsv — one row per hit for downstream processing, with columns including the query ID, queried taxid + species, target accession + name, both identity ratios (over alignment length and over query length), bit score, E-value, alignment coordinates, and the alignment itself (as a single field with newlines escaped as \n).
.summary.md — a methods-grade record of the run (command, input checksum, per-species database provenance, full effective parameters, software versions, references, and a ready-to-paste Methods paragraph).

E-value is reported for reference only; it is never used to filter results.
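The match line and the two identity ratios can be recomputed from any pair of aligned strings, which is handy when checking rows of the .tsv by hand. This is a small sketch with hypothetical function names; it assumes that a gap character (`-`) on either side counts as a difference, which the description above does not specify.

```python
def match_line(query_aln, subject_aln):
    """Build the match string for two aligned, equal-length sequences.

    Assumption: a gap ('-') on either side is treated as a difference.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        q if q == s and q != "-" else "."
        for q, s in zip(query_aln, subject_aln)
    )


def identity_ratios(query_aln, subject_aln, query_length):
    """Return (identity / alignment length, identity / query length)."""
    line = match_line(query_aln, subject_aln)
    identities = sum(ch != "." for ch in line)
    return identities / len(line), identities / query_length


print(match_line("MKLAEV", "MKLGEV"))  # MKL.EV
```

For that pair and a 12-residue query, `identity_ratios` gives 5/6 over the alignment and 5/12 over the query, which shows why the two ratios can rank the same hit very differently.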

Local vs. remote

Local (default, recommended). setup downloads each species' UniProt proteome and builds a BLAST database under ~/.frankensearch/blastdb/. Reproducible, offline after setup, and uses the pure-identity matrix.
- --proteome-set {reference,swissprot,all} chooses what to download:
  - reference (default) — the species' reference proteome: one protein per gene, mixing reviewed (Swiss-Prot) and unreviewed (TrEMBL) entries. Complete but non-redundant; the recommended search space.
  - swissprot — reviewed entries only. Small and high quality, but can be sparse or empty for non-model organisms.
  - all — every UniProtKB entry for the organism. Largest and most redundant (isoforms, fragments, strains).
  Because reference uses the reviewed (Swiss-Prot) entry for a gene where one exists, it overlaps swissprot heavily, but it is not a strict superset of it.
Remote (--remote). Searches NCBI's nr remotely (no local database needed), restricting to each taxid. Convenient for one-offs, but:
- NCBI's remote service has no IDENTITY matrix, so it falls back to PAM30 (with a warning).
- nr is non-redundant: identical sequences from different organisms are merged into a single entry, so a hit's listed organism may differ from the queried taxid (the output notes this).
- It is slower and subject to NCBI's load.
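The matrix fallback described above amounts to a small decision rule. The sketch below is an assumption-laden illustration rather than the tool's implementation: `effective_matrix` is a hypothetical name, and passing the other matrices through unchanged in remote mode is a guess, since only the identity case is documented here.

```python
import warnings

LOCAL_MATRICES = ("identity", "pam30", "blosum45", "blosum62")  # --matrix choices
REMOTE_FALLBACK = "pam30"


def effective_matrix(requested="identity", remote=False):
    """Return the matrix a run would use, warning when remote mode swaps it."""
    if requested not in LOCAL_MATRICES:
        raise ValueError(f"unknown matrix: {requested!r}")
    if remote and requested == "identity":
        warnings.warn("remote search has no identity matrix; using PAM30 instead")
        return REMOTE_FALLBACK
    # Assumption: any other matrix is passed through unchanged.
    return requested
```

The practical consequence is that local and remote runs of the same query are not directly comparable unless the local run is also given `--matrix pam30`.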

See what you have built:

frankensearch databases

Where data lives

Everything is stored under ~/.frankensearch/ (taxonomy cache + BLAST databases). Override the location with the FRANKENSEARCH_HOME environment variable.
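Scripts that need to locate the data directory can mirror this lookup. The helpers below are hypothetical and make two assumptions: an empty FRANKENSEARCH_HOME is treated as unset, and the blastdb/ subdirectory follows the override.

```python
import os
from pathlib import Path


def frankensearch_home(environ=None):
    """Resolve the data directory: FRANKENSEARCH_HOME if set, else ~/.frankensearch."""
    environ = os.environ if environ is None else environ
    override = environ.get("FRANKENSEARCH_HOME")
    if override:  # assumption: an empty value counts as unset
        return Path(override).expanduser()
    return Path.home() / ".frankensearch"


def blastdb_dir(environ=None):
    """Assumed location of the BLAST databases: blastdb/ under the home directory."""
    return frankensearch_home(environ) / "blastdb"
```

Passing a plain dict as `environ` makes the lookup easy to exercise without touching the real environment.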

Troubleshooting

"BLAST+ tools were not found" — conda activate frankensearch (or install BLAST+). Run frankensearch doctor to confirm.
"No local database for ..." — build it: frankensearch setup --taxids <id>.
A taxid won't resolve — check it at https://www.ncbi.nlm.nih.gov/taxonomy; taxonomy lookups need internet on first use (results are then cached).
For a full traceback when reporting a bug, re-run with --debug.

License

FRANKENSEARCH is free software, licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the LICENSE file for the full text.
