6 Comments
Gustavo Seabra

Another issue is the _type_ of data. Can you please comment on what kind of data you are collecting? IC50s are relatively abundant in open-source datasets, but they are notoriously low in quality and even worse in reproducibility.

Gustavo Seabra

Thanks a lot! I'm glad to see this, as I've been harping on this point for a long time now.

I see too many new models out there claiming to "beat SOTA" that bring only small, incremental improvements which, more often than not, fall within the error bars and so are not statistically significant at all. The methods are great, but there's a barrier we cannot seem to cross, and I've always attributed that to the data, not the model. No matter how much we tinker with the models, there is only so much one can do with the low-quality open-source data available to most academic researchers.

Dom

Interesting results. Have you tried a train/test split where you hold out one protein family for validation while training the model on the rest? For example, train on all non-kinases, then validate on a kinase.
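To make the suggestion concrete, here is a minimal sketch of that kind of family hold-out split. It is only an illustration under stated assumptions: the record layout, the `family` field, the placeholder IDs, and the function name `leave_family_out_split` are all hypothetical and not taken from the post.

```python
def leave_family_out_split(records, held_out_family):
    """Put every record of one protein family in the test set, the rest in train.

    `records` is assumed to be a list of dicts with a "family" key
    (a hypothetical layout chosen for this example).
    """
    train, test = [], []
    for rec in records:
        if rec["family"] == held_out_family:
            test.append(rec)
        else:
            train.append(rec)
    if not test:
        raise ValueError(f"no records found for family {held_out_family!r}")
    return train, test


# Toy records with placeholder identifiers, not real measurements.
records = [
    {"protein": "P1", "family": "kinase", "ligand": "L1"},
    {"protein": "P2", "family": "kinase", "ligand": "L2"},
    {"protein": "P3", "family": "gpcr", "ligand": "L3"},
    {"protein": "P4", "family": "protease", "ligand": "L4"},
    {"protein": "P5", "family": "gpcr", "ligand": "L5"},
]

# Train on all non-kinases, validate on kinases only.
train, test = leave_family_out_split(records, "kinase")
```

Because the split is made on the family label rather than on individual protein-ligand pairs, no kinase ever appears in training, which is the point of the test. scikit-learn's `LeaveOneGroupOut` does the same thing for array-style data if you pass the family labels as `groups`.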

Kevin Lerner

Very nice! Do you think this predictive modeling can translate to small molecule metabolism? It seems like a logical step to me… if you can predict small molecule binding based on amino acid sequence and chemical formula, might you be able to go one step further and determine whether that chemical formula is altered by the binding event?

Andrew Blevins

The dream is that we keep improving this general binding model, and then we can get smaller datasets and fine-tune on those problems. A model that truly learns how molecules and proteins interact should need less data to understand these similar tasks.
