Hey,
Congrats on BioReason-Pro, it's very cool!
I was trying to obtain the SFT and RL generated texts for a subset of proteins (yeast, to be specific) by running predict.py, but the texts differ from what's released in the `protein_catalogue` dataset on Hugging Face. All three sections are affected (think, summary, and terms, i.e. InterPro and GO BP/CC/MF terms): the terms are mostly the same, and the summary is semantically similar but worded differently.
Because the predicted GO terms (`go_pred`) are released in the Hugging Face training and test datasets, I used the released `go_pred` values and bypassed the GO-GPT step, running GO-GPT only for proteins that have no `go_pred` data. Could this be where the discrepancy comes from?
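In case it helps pin down the difference, here is a minimal sketch of the fallback logic described above. Only `go_pred` is a real field name; `resolve_go_terms`, `run_go_gpt`, and the `sequence` key are hypothetical placeholders rather than names from the repo, and I'm assuming `go_pred` holds a list of GO term IDs.

```python
from typing import Callable, Mapping, Sequence


def resolve_go_terms(
    record: Mapping[str, object],
    run_go_gpt: Callable[[str], Sequence[str]],
) -> tuple[list[str], str]:
    """Pick the GO terms for one protein and report where they came from.

    `record` is one dataset row (assumed to be dict-like).
    `run_go_gpt` is a hypothetical stand-in for the GO-GPT prediction step.
    """
    released = record.get("go_pred")
    if released:
        # Released predictions exist and are non-empty: reuse them as-is.
        return list(released), "released"
    # No released predictions: fall back to running GO-GPT on the sequence.
    # The "sequence" key is an assumption about the row layout.
    return list(run_go_gpt(record["sequence"])), "go_gpt"
```

Recording the source next to each protein makes it possible to check whether the mismatches are concentrated in one of the two paths.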
For reproducibility, I think it would be best if the prompt used for each protein were released alongside the `generation` field of `protein_catalogue`.
Qi