Is it possible to also release the prompt used for each protein in protein_catalogue that was used to generate the `generation`? #5

Hey,
Congrats on BioReason-Pro, it's really cool!
I was trying to obtain the SFT- and RL-generated texts for a subset of proteins (yeast, to be specific) by running `predict.py`, but the texts (all three sections: think, summary, and terms, i.e. InterPro and GO BP/CC/MF terms) differ from what's released in the Hugging Face dataset `protein_catalogue`: the terms are mostly the same and the summary remains semantically similar, but the `<think>` section is different.

Because the predicted GO terms (`go_pred`) are released in the Hugging Face training and testing datasets, I just used the released `go_pred` and bypassed the GO-GPT step; I only ran GO-GPT for proteins that do not have `go_pred` data. Might this be where the discrepancy comes from?
So I think it would be best for reproducibility if the prompt used for each protein were released alongside the `generation` field of `protein_catalogue`.
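
To make the setup concrete, here is a minimal sketch of the fallback described above: use the released `go_pred` when a record has one, otherwise call GO-GPT. It is not the actual `predict.py` code. The record layout, the `sequence` key, and the `run_go_gpt` callable are assumptions made for illustration; only the `go_pred` field name comes from the released datasets.

```python
from typing import Any, Callable, Dict, Tuple


def resolve_go_terms(
    record: Dict[str, Any],
    run_go_gpt: Callable[[str], Any],
) -> Tuple[Any, str]:
    """Return (go_terms, source) for one protein record.

    `source` is "released" when the record already carries a non-empty
    `go_pred`, and "go_gpt" when the prediction had to be recomputed.
    """
    released = record.get("go_pred")
    if released:  # present and non-empty
        return released, "released"
    # Hypothetical fallback: `run_go_gpt` stands in for the real GO-GPT step.
    return run_go_gpt(record["sequence"]), "go_gpt"


def fake_go_gpt(sequence: str) -> list:
    """Placeholder predictor used only so that this sketch runs."""
    return ["GO:PLACEHOLDER"]


# Toy records; the sequences and terms are placeholders, not real data.
records = [
    {"id": "protein_a", "sequence": "MKT", "go_pred": ["GO:EXAMPLE"]},
    {"id": "protein_b", "sequence": "MKT", "go_pred": None},
]

for rec in records:
    terms, source = resolve_go_terms(rec, fake_go_gpt)
    print(rec["id"], source, terms)
```

Recording the source per protein makes it possible to check whether the differing `<think>` sections cluster in one of the two paths.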

Qi

