Welcome! This repository contains the open-source implementation of proto-language, a Python package for designing biological sequences (DNA, RNA, and proteins) through constraint-based optimization. A design is specified as a set of constraints, and the framework runs a propose–score–refine loop to search for sequences that satisfy them, drawing on a large suite of computational biology and biological AI tools to score candidates.
proto-language is built on top of the proto-tools execution layer, so each computationally intensive tool (structure predictors, protein language models, inverse folding, sequence and structure aligners, gene annotation, and more) runs in its own automatically managed, isolated environment. Programs can run locally or as hosted optimization runs through the proto-client Python SDK.
Proto-language is open source under an MIT license. Contributions are welcome!
The package requires Python 3.10 or later and pip:
pip install git+https://github.com/evo-design/proto-language.git

System tools that standalone tool environments require in order to build (git, curl, gcc, make, cmake) are automatically provisioned on first use through proto-tools' shared foundation environment, so no manual setup is necessary.
Note: A direct PyPI install (pip install proto-language) is planned.

Note: Contributors should instead use the editable installation described in CONTRIBUTING.md.
All persistent data (model weights, tool environments, micromamba) is stored under PROTO_HOME, which defaults to ~/.proto/ and is inherited from proto-tools.
To customize the storage location (recommended for laboratory and HPC environments):
# Add to your shell profile:
export PROTO_HOME=/path/to/your/proto_home

To override only the model-weights location, set export PROTO_MODEL_CACHE=/path/to/shared/weights. See notes/filesystem.md for all options.
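The lookup order for these two variables can be sketched in Python. This is an illustration of the documented defaults only, not the resolution code that proto-tools actually runs: the helper names `resolve_proto_home` and `resolve_model_cache` are hypothetical, and the `models` subdirectory used as the fallback weights location is an assumption (notes/filesystem.md describes the real layout).

```python
import os
from pathlib import Path


def resolve_proto_home(env=None):
    """Return PROTO_HOME if set, otherwise the documented default ~/.proto."""
    env = os.environ if env is None else env
    return Path(env.get("PROTO_HOME") or "~/.proto").expanduser()


def resolve_model_cache(env=None):
    """Return PROTO_MODEL_CACHE if set, otherwise a directory under PROTO_HOME.

    The "models" subdirectory name is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    override = env.get("PROTO_MODEL_CACHE")
    if override:
        return Path(override).expanduser()
    return resolve_proto_home(env) / "models"


if __name__ == "__main__":
    print("PROTO_HOME: ", resolve_proto_home())
    print("model cache:", resolve_model_cache())
```

Running the script in a fresh shell is a quick way to confirm that an export added to the shell profile is visible to Python.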
Some generators and constraints load gated models (for example ESM3, AlphaGenome, and AlphaFold3) that require accepting a license and authenticating with HuggingFace. Set HF_TOKEN in the environment after accepting each model's terms. See proto-tools/README.md for the full procedure and the list of gated models.
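Before launching a program that uses gated models, it can help to confirm that the token is visible to the Python process. The check below is a minimal sketch; `hf_token_configured` is a hypothetical helper, not part of proto-language, and it only tests that the variable is non-empty, not that the token is valid or that the model terms were accepted.

```python
import os


def hf_token_configured(env=None):
    """Return True if HF_TOKEN is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    if hf_token_configured():
        print("HF_TOKEN is set.")
    else:
        print("HF_TOKEN is not set; gated models may not be downloadable.")
```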
Tip: Setup is complete. See the Quickstart to run a program from end to end.
Working programs are provided under examples/:
- examples/scripts/: runnable Python programs, ranging from a minimal end-to-end example (toy.py) to broader workloads.
- examples/jsons/: declarative JSON program definitions (the optimization_stages schema). These illustrate program structure and are not loaded by a Python consumer.
The framework is built around seven primitives in proto_language/core/ — three data containers, three pluggable interfaces, and one orchestrator:
- Sequence: a typed string (DNA, RNA, or protein) together with optional logits, a folded structure, and namespaced metadata. The atomic unit of design.
- Segment: a single design region. It holds the proposal Sequences for that region and the surviving result Sequences after scoring.
- Construct: an ordered list of Segments that concatenate into a full biological construct (for example, a promoter plus a coding region; a multi-chain protein; or a designed gene).
- Constraint (registered via @constraint): scores a Sequence against a target property, returning a score and namespaced metadata, and may optionally provide gradients.
- Generator (registered via @generator): proposes new Sequences for a Segment.
- Optimizer (registered via @optimizer): a search strategy that drives the propose–score–refine loop.
- Program: the top-level orchestrator. It owns the Construct and composes one or more Optimizer stages.
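To make the relationship between the three data containers concrete, here is a toy model built from plain dataclasses. These classes are stand-ins written for this README, not the classes in proto_language/core/; the field names (`residues`, `kind`, `proposals`, `results`) and the `assemble` method are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Toy stand-in: a typed string plus namespaced metadata."""
    residues: str
    kind: str  # "dna", "rna", or "protein"
    metadata: dict = field(default_factory=dict)


@dataclass
class Segment:
    """Toy stand-in: one design region with proposals and survivors."""
    name: str
    proposals: list = field(default_factory=list)
    results: list = field(default_factory=list)


@dataclass
class Construct:
    """Toy stand-in: an ordered list of segments."""
    segments: list

    def assemble(self):
        """Concatenate the first result of each segment, in order."""
        return "".join(seg.results[0].residues for seg in self.segments)


promoter = Segment("promoter", results=[Sequence("TATAAT", "dna")])
coding = Segment("coding", results=[Sequence("ATGGCCTAA", "dna")])
construct = Construct([promoter, coding])
print(construct.assemble())  # TATAATATGGCCTAA
```

The point of the sketch is the nesting: a construct is assembled from segments, and each segment carries its own candidate and surviving sequences.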
All three pluggable interfaces share a BaseConfig Pydantic configuration pattern and declare parameters with ConfigField.
Program.run() iterates through its optimizer stages. Each stage performs the following steps:
- The Optimizer requests proposal Sequences from its Generator for one or more Segments.
- Each Constraint evaluates the proposals and records its score and metadata on the proposal Sequences.
- The Optimizer aggregates the constraint scores and selects survivors. These become the Segment's result Sequences and feed into the next iteration, or the next stage.
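The three steps above can be sketched as a self-contained toy loop. Nothing here uses the proto-language API: `propose`, `gc_constraint`, and `run_stage` are hypothetical functions, the generator is a single-point mutator, and a single GC-content constraint with top-k selection stands in for the framework's constraint aggregation and search strategies.

```python
import random

ALPHABET = "ACGT"


def propose(parent, rng):
    """Toy generator: copy the parent and mutate one random position."""
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def gc_constraint(seq, target=0.5):
    """Toy constraint: score in [0, 1], higher when GC fraction is near target."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return 1.0 - abs(gc - target)


def run_stage(start, iterations=50, proposals_per_parent=8, keep=4, seed=0):
    """Toy optimizer: propose, score, keep the best `keep` sequences, repeat."""
    rng = random.Random(seed)
    survivors = [start]
    history = []
    for _ in range(iterations):
        # 1. Request proposals derived from the current survivors.
        candidates = list(survivors)
        for parent in survivors:
            candidates += [propose(parent, rng) for _ in range(proposals_per_parent)]
        # 2. Score every candidate with the constraint and rank by score.
        scored = sorted(candidates, key=gc_constraint, reverse=True)
        # 3. Select survivors; they seed the next iteration.
        survivors = scored[:keep]
        history.append(gc_constraint(survivors[0]))
    return survivors, history


survivors, history = run_stage("A" * 20)
print(survivors[0], history[-1])
```

Because the current survivors stay in the candidate pool, the best score in this sketch never decreases from one iteration to the next. Whether a given optimizer in the framework has that property depends on its search strategy.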
When the program finishes, Program.export(path=...) writes a directory containing tables for sequences, constraints, constructs, and optimization steps, a FASTA file, and an assets/ sidecar directory.
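For readers unfamiliar with FASTA, the snippet below writes a two-record file by hand. It is not a reimplementation of Program.export: the `write_fasta` helper, the `sequences.fasta` file name, and the `design_N` record names are all assumptions made for this sketch, and the actual export layout may differ.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs as a FASTA file, one record per entry."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = write_fasta(
        [("design_0", "ATGGCC"), ("design_1", "ATGGCA")],
        Path(tmp) / "export" / "sequences.fasta",
    )
    fasta_text = out.read_text()

print(fasta_text)
```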
See CONTRIBUTING.md for developer setup, code style, testing, and agent conventions.
If you use Proto in your research, please cite our preprint:
Merchant AT, Guo D, Viggiano B, Brennan-Almaraz LE, Hur E, Mai T, Yin P, King SH, Ashley E, Hie BL. A high-level programming language for generative biology with Proto. bioRxiv (2026). doi: 10.64898/2026.06.22.733870
@article{Merchant2026.06.22.733870,
author = {Merchant, Aditi T and Guo, Daniel and Viggiano, Ben and Brennan-Almaraz, Lucas Emmanuel and Hur, Evelyn and Mai, Tina and Yin, Peter and King, Samuel H and Ashley, Euan and Hie, Brian L},
title = {A high-level programming language for generative biology with Proto},
elocation-id = {2026.06.22.733870},
year = {2026},
doi = {10.64898/2026.06.22.733870},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/10.64898/2026.06.22.733870},
journal = {bioRxiv}
}