Feature/starting residue number #241

Closed
CesarPuentes wants to merge 2 commits into aqlaboratory:main from CesarPuentes:feature/starting_residue_number

Conversation

@CesarPuentes

Contributor

Summary

Implements the feature requested in issue #58: "Allow to specify the number of the first residue of a chain".

Protein fragments predicted by OpenFold-3 always come out numbered from residue 1. This PR adds a convenience option, starting_residue_number, that lets the user choose the number assigned to the first residue of a given chain. This is useful for matching PDB conventions, mature protein numbering, or signal peptide regions.

Usage

Add starting_residue_number to any protein chain definition. Omitting it (or setting it to null) preserves the default 1-based behavior — fully backward-compatible.

```json
{
    "queries": {
        "mature_ubiquitin": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": ["A"],
                    "sequence": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
                    "starting_residue_number": 3
                }
            ]
        }
    }
}
```


Supported values:

  • Any integer, including zero and negative numbers (e.g., -5 for signal peptides)
  • Multi-copy chains ("chain_ids": ["C", "D"]) all receive the same offset
  • Ligand chains ignore the field
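To make the numbering semantics above concrete, here is a minimal sketch of the residue numbers a protein chain should end up with in the output. The helper names `expected_residue_numbers` and `numbering_per_chain` are hypothetical and are not part of this PR; the sketch only mirrors the behavior described in the list.

```python
from typing import Optional


def expected_residue_numbers(
    sequence: str, starting_residue_number: Optional[int] = None
) -> list[int]:
    """Residue numbers one protein chain should carry in the output file."""
    # Omitting the field (or passing None) keeps the default 1-based numbering.
    start = 1 if starting_residue_number is None else starting_residue_number
    # Numbering is contiguous, so a negative start simply counts up through zero.
    return list(range(start, start + len(sequence)))


def numbering_per_chain(chain: dict) -> dict[str, list[int]]:
    """Map each chain ID of a protein chain definition to its numbering."""
    numbers = expected_residue_numbers(
        chain["sequence"], chain.get("starting_residue_number")
    )
    # Multi-copy chains share one definition, so every copy gets the same numbers.
    return {chain_id: list(numbers) for chain_id in chain["chain_ids"]}


homodimer = {
    "molecule_type": "protein",
    "chain_ids": ["C", "D"],
    "sequence": "MQIFVK",
    "starting_residue_number": -5,
}
print(numbering_per_chain(homodimer))
# {'C': [-5, -4, -3, -2, -1, 0], 'D': [-5, -4, -3, -2, -1, 0]}
```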

Changes

  • inference_query_format.py — Added starting_residue_number: int | None = None to Chain. Accepts any integer including negative. Fully backward-compatible.

  • single_datasets/inference.py — After building the AtomArray, computes offset = starting_residue_number - 1 per chain and stores it (e.g., {"A": offset}) in the feature dict. The AtomArray stays 1-based so the model is unaffected.

  • data_module.py — Pops residue_number_offsets from samples before pad_sequence (a plain dict would cause a TypeError), then re-inserts it as a list of dicts.

  • writer.py — Before writing each diffusion sample, copies the shared AtomArray and adds the per-chain offset to res_id at the very last moment — after the model has finished.

  • docs/source/input_format_reference.md — Documented the new field in the protein chain schema (Section 3.1).

  • examples/example_inference_inputs/query_ubiquitin_starting_residue_number.json — Added example input file demonstrating the feature.
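The offset computation and the collation workaround listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the function names and the `collate_tensors` callable (standing in for the real padding and stacking step) are assumptions, samples are plain dicts, and chains without a starting number are assumed to be left out of the offset dict.

```python
from typing import Callable

OFFSET_KEY = "residue_number_offsets"


def compute_residue_number_offsets(chains: list[dict]) -> dict[str, int]:
    """Per-chain offset to add to 1-based residue IDs at write time."""
    offsets: dict[str, int] = {}
    for chain in chains:
        start = chain.get("starting_residue_number")
        # Ligand chains ignore the field; None keeps the default numbering.
        if chain.get("molecule_type") != "protein" or start is None:
            continue
        for chain_id in chain["chain_ids"]:
            # The first residue is internally 1, so start=3 means offset=2.
            offsets[chain_id] = start - 1
    return offsets


def collate_with_offsets(
    samples: list[dict], collate_tensors: Callable[[list[dict]], dict]
) -> dict:
    """Keep the plain-dict offsets away from the tensor collation step."""
    # Shallow-copy each sample so the caller's dicts are not mutated
    # (a choice made for this sketch).
    samples = [dict(sample) for sample in samples]
    # Pop the offsets first: a tensor-padding routine cannot handle dicts.
    offsets = [sample.pop(OFFSET_KEY, {}) for sample in samples]
    batch = collate_tensors(samples)
    # Re-insert as a list of dicts, one entry per sample in the batch.
    batch[OFFSET_KEY] = offsets
    return batch


chains = [
    {"molecule_type": "protein", "chain_ids": ["A"], "starting_residue_number": 3},
    {"molecule_type": "protein", "chain_ids": ["C", "D"], "starting_residue_number": -5},
    {"molecule_type": "ligand", "chain_ids": ["L"], "starting_residue_number": 10},
]
print(compute_residue_number_offsets(chains))
# {'A': 2, 'C': -6, 'D': -6}
```

Carrying the offsets as a per-sample list keeps them aligned with the batch dimension, so the writer can look up the right dict for each query.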

Design Decision

The offset is applied at write-time in OF3OutputWriter rather than earlier in the pipeline. This keeps the model's internal 1-based indexing untouched throughout inference, ensuring zero risk of regression in the folding logic.
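As an illustration of this write-time approach, the sketch below shifts residue IDs on a copy and leaves the shared structure untouched. `ToyAtomArray` is a hypothetical stand-in with two parallel per-atom lists, not the real AtomArray, and `with_output_numbering` is an assumed name rather than a method of OF3OutputWriter.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyAtomArray:
    """Hypothetical stand-in for an atom array: parallel per-atom lists."""

    chain_id: list[str] = field(default_factory=list)
    res_id: list[int] = field(default_factory=list)


def with_output_numbering(
    atoms: ToyAtomArray, offsets: dict[str, int]
) -> ToyAtomArray:
    """Return a copy whose res_id values carry the per-chain offset."""
    # Copy first so the shared, 1-based array can be reused for other samples.
    shifted = copy.deepcopy(atoms)
    shifted.res_id = [
        # Chains without an entry keep their 1-based numbering.
        res + offsets.get(chain, 0)
        for chain, res in zip(atoms.chain_id, atoms.res_id)
    ]
    return shifted


shared = ToyAtomArray(chain_id=["A", "A", "B"], res_id=[1, 2, 1])
written = with_output_numbering(shared, {"A": 2})
print(written.res_id)  # [3, 4, 1]
print(shared.res_id)   # [1, 2, 1]
```

Because only the copy is renumbered, anything that still reads the shared array after writing sees the original 1-based IDs.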

Related Issues

Closes #58

Testing

Unit Tests

21 tests in openfold3/tests/test_custom_residue_numbering.py covering:

  • Schema Validation: Confirms starting_residue_number accepts integers (including negatives) and defaults to None for backward compatibility.
  • Data Pipeline: Ensures the dictionary of offsets safely bypasses tensor collation in the data loader without crashing PyTorch's pad_sequence.
  • Offset Computation: Validates the logic mapping the JSON offsets to specific chains, particularly for homomers sharing offsets.
  • Writer Output: Proves the output writer correctly shifts the res_id in the final CIF file for specified chains without altering the model's internal 1-based indexing.

Manual Verification

Inference was run for several configurations (single chain, multiple chains with different offsets, protein-ligand complexes) and the resulting CIF files were inspected both with a text editor and structurally in PyMOL, confirming:

  • Residue numbering in the output starts at the specified value
  • 3D coordinates are correctly placed in both runs, with and without starting_residue_number — the offset is metadata-only and does not enter the model.
  • Multi-chain queries correctly apply independent offsets per chain

Full Test Suite

Ran pixi run -e openfold3-cpu pytest openfold3/tests/ twice: 487 passed, 46 skipped, 86 warnings in 1381.04s (0:23:01). One timing benchmark failed (test_inference_load_state_dict_benchmark_under_ten_seconds took 23s under high CPU load); the failure is unrelated to this PR.

Other Notes

  • Code formatted with ruff format and ruff check --fix.
  • LLM disclosure: Used an LLM to assist in designing some of the tests in openfold3/tests/test_custom_residue_numbering.py.

@CesarPuentes

Contributor Author

Sorry, I didn't notice that PR #69 was already addressing this issue! I'll still keep an eye out for any comments. Closing this PR.
