Good binding data is all you need

Jul 16

We trained a simple model on huge amounts of in-house data and match state-of-the-art small molecule-protein binding prediction


6 Comments

Gustavo Seabra

Jul 24

Another issue is the _type_ of data. Can you please comment on what kind of data you are collecting? IC50s are relatively abundant in the open-source sets, but notoriously low in quality and even worse in reproducibility.


Gustavo Seabra

Jul 24

Thanks a lot! I'm glad to see this, as I've been hitting on this key for a long time now.

I see too many new models out there claiming to "beat SOTA" while bringing only small, incremental improvements that more often than not fall within the error bars, so they are not really statistically significant at all. The methods are great, but there's a barrier we cannot seem to cross, and I've always attributed that to the data, not the model. No matter how much we tinker with the models, there is only so much one can do with the low-quality open-source data available to most academic researchers.


Dom

Jul 19

Interesting results. Have you tried a train/test split where you hold out one protein family for validation while training the model on the rest? For example, train on all non-kinases, then validate on kinases.
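A minimal sketch of the kind of split described in this comment, assuming each binding record carries a protein-family label. The function name `leave_family_out_splits`, the record fields, and the toy records are all hypothetical and only for illustration; nothing here reflects how the post's authors actually split their data.

```python
from collections import defaultdict


def leave_family_out_splits(records, family_key="family"):
    """Yield (held_out_family, train, test), holding out one family at a time.

    `records` is a list of dicts; `family_key` names the field holding the
    protein-family label (an assumed schema, not one from the post).
    """
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec[family_key]].append(rec)

    for held_out in sorted(by_family):
        # Test set: every record from the held-out family.
        test = by_family[held_out]
        # Training set: every record from all other families.
        train = [
            rec
            for fam, recs in by_family.items()
            if fam != held_out
            for rec in recs
        ]
        yield held_out, train, test


# Toy, made-up records (placeholders, not real measurements).
records = [
    {"protein": "P1", "family": "kinase", "molecule": "M1", "binds": 1},
    {"protein": "P2", "family": "kinase", "molecule": "M2", "binds": 0},
    {"protein": "P3", "family": "protease", "molecule": "M1", "binds": 0},
    {"protein": "P4", "family": "protease", "molecule": "M3", "binds": 1},
    {"protein": "P5", "family": "gpcr", "molecule": "M2", "binds": 1},
]

for family, train, test in leave_family_out_splits(records):
    print(f"held out {family}: {len(train)} train, {len(test)} test")
```

Because no protein from the held-out family appears in training, the test score reflects generalization to an unseen family rather than interpolation within a familiar one.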


Armand B. Cognetta III

Jul 17

Epic


Kevin Lerner

Jul 17

Very nice! Do you think this predictive modeling can translate to small molecule metabolism? It seems like a logical step to me: if you can predict small molecule binding based on amino acid sequence and chemical formula, might you be able to go one step further and determine whether that chemical formula is altered by the binding event?



Andrew Blevins

Jul 17

The dream is that we keep improving this general binding model, and then we can get smaller datasets and fine-tune on those problems. A model that truly learns how to think about how molecules and proteins interact should need less data to understand these similar tasks.


Leash