This repository contains the code and data processing pipelines for the paper Protein Function Prediction: Capabilities and Limitations of Alignment-based Approaches.
At their core, alignment-based annotation transfers work by homology-matching proteins before transferring annotation from one (single-template) or more (multi-template) paired proteins:
While computational prediction of protein function is dominated by deep learning architectures, we find that homology-based methods can surpass state-of-the-art deep learning models when provided with appropriate contextual data. \
This repository provides the framework we used to explore and mitigate the primary failure point of homology methods (i.e. when there is either no protein to transfer from, or no function to transfer).
This repository provides a pipeline covering preprocessing steps (data downloading, annotation propagation, alignment using Diamond), running prediction algorithms and evaluating the alignment-based scoring methods discussed in the article:
- IDScore (Sequence Identity-Based): Transferring GO annotations from the highest sequence identity match.
- DiamondScore (Bit-Score Aggregation): A multi-template method aggregating annotations weighted by bitscores.
- DiamondKNN: A top-$k$ constrained variant of DiamondScore.
Advanced configurations like One-vs-All (OVA) alignment, integration of curated Labels (CUR), and STRING interactions are also supported through simple input flags.
The following Python packages are required:
pip install obonet networkx tqdm pandas scipy biopython matplotlib seaborn pickleAdditionally, the Diamond software is required for sequence alignment. On Linux-based systems:
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.11/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gzDownload necessary information (e.g., protein sequences, GO annotations) from UniProt (SwissProt) at different releases and parse them into required formats:
python download_swissprot.pyIf you wish to use functionalities that rely on STRING protein interaction data, download with:
python download_stringdb.pyNote: This can take a while.
Propagate GO annotations from specific terms to broader parent terms using the ontology structure:
python propagate_swissprot_terms.pyRun sequence alignment using Diamond. This step finds homologous proteins required for annotation transfer:
echo "Creating Diamond database..."
diamond makedb --in data/2024_01/swissprot_2024_01.fasta -d data/2024_01/swissprot_2024_01_proteins_set
echo "Running Diamond blast on protein sequences against themselves..."
diamond blastp --very-sensitive --db data/swissprot/2024_01/swissprot_2024_01_proteins_set.dmnd --query data/swissprot/2024_01/swissprot_2024_01.fasta --out data/swissprot/2024_01/diamond_swissprot_2024_01_alignment.tsv -e 0.001Note: The all-vs-all alignment for the entire SwissProt database (~570,000 proteins) takes about 1 hour, depending on hardware.
To more accurately evaluate the predictive performance, we use
Example for the ATGO dataset:
python background.py \
--cco ./data/ATGO/ATGO_CCO_train_annotations.tsv \
--bpo ./data/ATGO/ATGO_BPO_train_annotations.tsv \
--mfo ./data/ATGO/ATGO_MFO_train_annotations.tsv \
--output ./data/ATGO/background_ATGO.pkl \
--test_cco ./data/ATGO/ATGO_CCO_test_annotations.tsv \
--test_bpo ./data/ATGO/ATGO_BPO_test_annotations.tsv \
--test_mfo ./data/ATGO/ATGO_MFO_test_annotations.tsvpython main.py \
--db_versions '' \
--dataset ATGO \
--k_values 1 3 5 \
--aspects BPO CCO MFOParameters:
--dataset:D1(BeProf D1),H30(low-homology),ATGO, orCAFA3.--db_versions: One or more SwissProt versions (e.g.,2024_01). Set to empty string''to execute annotation transfer strictly from the benchmark train set to the test set. Note that not passing the flag will run on all available SwissProt versions, based on theSWISSPROT_VERSIONSconstant.--k_values: Limits for KNN baseline (e.g.,DiamondKNN(k=1)).--aspects: Restrict GO subontologies (BPO,CCO,MFO).--experimental_only: Restricts transfered annotations to experimental GO terms.--eval: Evaluation script to use, can be eithercafaorbeprof. Use cafa if in doubt.--norm: Normalization mode used by CAFA evaluator, can be eithercafaorpredorgt.
To reproduce enhancements proposed in the paper, add the following flags:
- DS – OVA (One-vs-all): Include
--one_vs_allto allow all proteins in the dataset to contribute annotations to the target. - DS – CUR (Curated): Omit
--experimental_onlyto incorporate non-experimental manually reviewed labels. - DS – STRING: Add
--stringdb_weightfollowed by the weight (between 0 and 1) to balance STRING interaction contribution predictions.
Example:
# Blend alignment (50%) and StringDB (50%) for all proteins
# Will run DiamondScore, IDScore and DiamondKNN(k=1,3,5) on the H30 dataset at release 2024_01
python main.py \
--dataset H30 \
--db_versions 2024_01 \
--k_values 1 3 5 \
--stringdb_weight 0.5
--experimental_onlyEvaluation of predictions is run automatically during main.py, logging
You can also manually evaluate:
python evaluation.py \
--input_dir './results/ATGO/baselines_ATGO_2024_01_BPO_exp_one_vs_all' \
--aspect BPO \
--k_values 1 3 5