Feature/starting residue number by CesarPuentes · Pull Request #241 · aqlaboratory/openfold-3

CesarPuentes · 2026-05-30T18:24:09Z

Summary

Implements the feature requested in issue #58: "Allow to specify the number of the first residue of a chain".

Protein fragments predicted by OpenFold-3 always come out numbered from residue 1. This PR adds a convenience option, starting_residue_number, that lets the user choose the number assigned to the first residue of a given chain. This is useful for matching PDB conventions, mature protein numbering, or signal peptide regions.

Usage

Add starting_residue_number to any protein chain definition. Omitting it (or setting it to null) preserves the default 1-based behavior — fully backward-compatible.

```json
{
    "queries": {
        "mature_ubiquitin": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": ["A"],
                    "sequence": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
                    "starting_residue_number": 3
                }
            ]
        }
    }
}
```

Supported values:

Any integer, including zero and negative numbers (e.g., -5 for signal peptides)
Multi-copy chains ("chain_ids": ["C", "D"]) all receive the same offset
Ligand chains ignore the field
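
To make the numbering rule concrete, here is a minimal sketch in plain Python of the residue numbers one chain definition should end up with in the output. The helper expected_numbering is hypothetical and not part of OpenFold-3; it only mirrors the rules listed above, and it assumes a ligand chain is marked with "molecule_type": "ligand".

```python
from __future__ import annotations


def expected_numbering(chain: dict) -> dict[str, list[int]]:
    """Map each chain ID to the residue numbers it should carry in the output.

    Hypothetical helper for illustration only; `chain` is one entry of the
    "chains" list from the query JSON.
    """
    # Assumption: ligands are marked with molecule_type "ligand".
    # They ignore the field, so nothing is renumbered for them.
    if chain.get("molecule_type") == "ligand":
        return {}

    # A missing field or an explicit null keeps the default 1-based numbering.
    start = chain.get("starting_residue_number")
    if start is None:
        start = 1

    length = len(chain["sequence"])
    # Every copy of a multi-copy chain receives the same numbering.
    # Numbering is a plain shift, so it passes through 0 without skipping it.
    return {
        chain_id: list(range(start, start + length))
        for chain_id in chain["chain_ids"]
    }


homodimer = {
    "molecule_type": "protein",
    "chain_ids": ["C", "D"],
    "sequence": "MQIF",
    "starting_residue_number": -2,
}
print(expected_numbering(homodimer))
# {'C': [-2, -1, 0, 1], 'D': [-2, -1, 0, 1]}
```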

Changes

inference_query_format.py — Added starting_residue_number: int | None = None to Chain. Accepts any integer including negative. Fully backward-compatible.
single_datasets/inference.py — After building the AtomArray, computes offset = starting_residue_number - 1 per chain and stores it (e.g., {"A": offset}) in the feature dict. The AtomArray stays 1-based so the model is unaffected.
data_module.py — Pops residue_number_offsets from samples before pad_sequence (a plain dict would cause a TypeError), then re-inserts it as a list of dicts.
writer.py — Before writing each diffusion sample, copies the shared AtomArray and adds the per-chain offset to res_id at the very last moment — after the model has finished.
docs/source/input_format_reference.md — Documented the new field in the protein chain schema (Section 3.1).
examples/example_inference_inputs/query_ubiquitin_starting_residue_number.json — Added example input file demonstrating the feature.
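
The first three changes can be sketched as follows. This is a simplified illustration using only the standard library, not the code from the PR: compute_offsets and collate_with_offsets are hypothetical names, tensor_collate stands in for whatever the data module really does with pad_sequence, and skipping chains that do not set the field is an assumption.

```python
from __future__ import annotations

from typing import Callable


def compute_offsets(chains: list[dict]) -> dict[str, int]:
    """Build the per-chain offset dict, e.g. {"A": 2} for a start of 3."""
    offsets: dict[str, int] = {}
    for chain in chains:
        start = chain.get("starting_residue_number")
        # Assumption: chains without the field (and ligands) get no entry.
        if start is None or chain.get("molecule_type") == "ligand":
            continue
        for chain_id in chain["chain_ids"]:
            # The internal numbering is 1-based, so the shift is start - 1.
            offsets[chain_id] = start - 1
    return offsets


def collate_with_offsets(
    samples: list[dict],
    tensor_collate: Callable[[list[dict]], dict],
) -> dict:
    """Keep the plain-dict entry away from tensor collation, then restore it."""
    # Pop the offsets from every sample before the tensor code sees them.
    offsets = [s.pop("residue_number_offsets", {}) for s in samples]
    batch = tensor_collate(samples)
    # Re-insert as a list of dicts, one per sample in the batch.
    batch["residue_number_offsets"] = offsets
    return batch
```

With the ubiquitin example above, compute_offsets would return {"A": 2}, which is the value later added to res_id when the output is written.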

Design Decision

The offset is applied at write time in OF3OutputWriter rather than earlier in the pipeline. This keeps the model's internal 1-based indexing untouched throughout inference, so the offset never reaches the folding logic and cannot change its results.
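
A minimal sketch of the write-time step, assuming a toy stand-in for the AtomArray that has only chain_id and res_id. ToyAtomArray and renumbered_copy are hypothetical names used for illustration; the real writer works on the project's AtomArray and may differ in detail.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass


@dataclass
class ToyAtomArray:
    """Toy stand-in with one entry per atom (not the real AtomArray)."""

    chain_id: list[str]
    res_id: list[int]


def renumbered_copy(atom_array: ToyAtomArray, offsets: dict[str, int]) -> ToyAtomArray:
    """Return a shifted copy; the shared array keeps its 1-based numbering."""
    out = copy.deepcopy(atom_array)
    # Chains without an entry in `offsets` are shifted by 0.
    out.res_id = [
        res + offsets.get(chain, 0)
        for chain, res in zip(out.chain_id, out.res_id)
    ]
    return out


shared = ToyAtomArray(chain_id=["A", "A", "B"], res_id=[1, 2, 1])
for _ in range(2):  # e.g. two diffusion samples written from one shared array
    to_write = renumbered_copy(shared, {"A": 2})
    print(to_write.res_id)  # [3, 4, 1] each time
print(shared.res_id)  # [1, 2, 1], unchanged
```

Copying before shifting matters because the same array is reused for every diffusion sample; shifting it in place would accumulate the offset from one sample to the next.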

Related Issues

Closes #58

Testing

Unit Tests

21 tests in openfold3/tests/test_custom_residue_numbering.py covering:

Schema Validation: Confirms starting_residue_number accepts integers (including negatives) and defaults to None for backward compatibility.
Data Pipeline: Ensures the dictionary of offsets safely bypasses tensor collation in the data loader without crashing PyTorch's pad_sequence.
Offset Computation: Validates the logic mapping the JSON offsets to specific chains, particularly for homomers sharing offsets.
Writer Output: Checks that the output writer shifts the res_id in the final CIF file for the specified chains without altering the model's internal 1-based indexing.

Manual Verification

Inference was run for several configurations (single chain, multiple chains with different offsets, protein-ligand complexes) and the resulting CIF files were inspected both with a text editor and structurally in PyMOL, confirming:

Residue numbering in the output starts at the specified value
3D coordinates are correctly placed in runs both with and without starting_residue_number, as expected given that the offset is metadata-only and does not enter the model.
Multi-chain queries correctly apply independent offsets per chain

Full Test Suite

Ran pixi run -e openfold3-cpu pytest openfold3/tests/ twice. Result: 487 passed, 46 skipped, 86 warnings in 1381.04s (0:23:01). One timing benchmark failed (test_inference_load_state_dict_benchmark_under_ten_seconds took 23s under high CPU load); this failure is unrelated to this PR.

Other Notes

Code formatted with ruff format and ruff check --fix.
LLM disclosure: Used an LLM to assist in designing some of the tests in openfold3/tests/test_custom_residue_numbering.py.

CesarPuentes · 2026-06-01T22:26:39Z

Sorry, I didn't notice that PR #69 was already addressing this issue! I'll still keep an eye out for any comments. Closing this PR.

CesarPuentes added 2 commits May 28, 2026 21:37

add starting_residue_number feature (a973a6f)
linting (5b5d910)

CesarPuentes closed this Jun 1, 2026
