Amazon did us a solid
Leash got into Amazon Web Services’s (AWS) Generative AI Accelerator program
The AWS Generative AI Accelerator is a program that includes 10 weeks of working with Amazon’s technical and business people to help startups do fancy ML stuff. Out of about four thousand teams who applied, we were fortunate enough to be among the 80 who got in. AWS blog post here, ad in the Washington Post here. Hooray! This program will help us build chemistry models at scale with the huge amount of data we’re collecting over here (as of this writing we’re at ~22B physical measurements).
One thing we’re especially looking forward to is the compute resources and expertise that will let us effectively wield our data. As part of the accelerator program, Amazon provides us with a lot of compute (up to $1M worth) and smart people to help us use it. A big goal of ours at Leash is to train computers to accurately predict whether a given small molecule will bind to a given protein target, and training computational models at that scale will go a lot easier if we establish a scaling law for how best to teach them. Amazon is going to help us get there.
Scaling laws describe how resource tradeoffs affect model performance
Training big models is expensive. They can require huge amounts of data (GPT-4 used about 13T tokens, 1) and huge amounts of compute (GPT-4 cost “more than” $100M to train, 2). Is more data important, or more compute, or something else? To get a better understanding of the process, research groups started tweaking these factors (data, compute, model size, and more) and evaluating how performance turned out. An early effort by OpenAI (3) explored these tradeoffs, allowing them to “determine the optimal allocation of a fixed compute budget”. This was a demonstration of a scaling law: a mathematical function one could use to predict how model training would go given defined parameters.
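To make that concrete, here’s a minimal sketch of what fitting a scaling law looks like in practice. This is our own toy illustration (not code from any of the papers cited above), and every number in it is made up: run a handful of small pilot experiments, fit a power law relating a resource to validation loss, and use the fitted curve to extrapolate.

```python
# Toy scaling-law fit: loss as a power law in dataset size, plus an
# irreducible floor. All data points below are fabricated for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    """Loss model: L(n) = a * n**(-alpha) + c, where c is the irreducible loss."""
    return a * np.power(n, -alpha) + c

# Hypothetical (dataset size, validation loss) pairs from small pilot runs.
n_examples = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
val_loss   = np.array([2.10, 1.85, 1.62, 1.48, 1.39])

params, _ = curve_fit(power_law, n_examples, val_loss,
                      p0=[10.0, 0.2, 1.0], maxfev=10000)
a, alpha, c = params
print(f"fitted exponent alpha = {alpha:.2f}, irreducible loss ~ {c:.2f}")

# Extrapolate: what loss should we expect with 10x more data than we've tried?
print(f"predicted loss at 1e9 examples: {power_law(1e9, a, alpha, c):.2f}")
```

The real papers fit far richer functional forms over far more runs, but the workflow (measure, fit, extrapolate) is the same.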
OpenAI followed this work by looking at transfer learning (4), which is the practice of pretraining a model on a certain task and then fine-tuning it to perform a different, but related, task (5). They found “pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.” By way of analogy, a person might learn to surf and discover that learning to snowboard or ski - skills that require similar balance and quick decisions - comes a lot faster afterwards, and models are no different. Sometimes these pretrained models are called foundation models (6,7).
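One way to write that transfer result down (our paraphrase of the paper’s framing, with placeholder constants rather than their fitted values) is as an “effective data transferred” term: pretraining behaves like extra fine-tuning examples, and that bonus scales as a power law in fine-tuning data and model size.

```python
# Sketch of the "effective data transferred" idea from the transfer scaling
# work: D_T = k * D_F**alpha * N**beta. The constants here are illustrative
# placeholders, not values fitted in the paper.
def effective_data_transferred(d_finetune: float, n_params: float,
                               k: float = 3.0, alpha: float = 0.3,
                               beta: float = 0.4) -> float:
    """Extra 'virtual' fine-tuning examples contributed by pretraining."""
    return k * d_finetune**alpha * n_params**beta

# A hypothetical 1e8-parameter pretrained model fine-tuned on 1e5 examples:
d_f = 1e5
bonus = effective_data_transferred(d_f, 1e8)
print(f"{d_f:.0e} real examples + ~{bonus:.1e} effective examples from pretraining")
```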
The OpenAI team then turned their attention to scaling training batch sizes (8) in different domains. It’s clear that the optimal number of examples per batch - the number of examples of the thing you’re trying to learn that you show your model at a time - is different when training on images, for example, than when training agents that play Atari (9). They reported that “a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications.” Researchers can use this statistic to tune model training for their task of interest, which helps them use their resources effectively.
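The intuition behind that statistic is a signal-to-noise comparison: how big is the variation between individual examples’ gradients relative to the average gradient they share? Here’s a simplified sketch of that idea (not the estimator from the paper, and the gradients are random stand-ins for ones you’d pull out of a real training loop).

```python
# Simplified "gradient noise scale": trace of the per-example gradient
# covariance divided by the squared norm of the mean gradient. A big value
# suggests larger batches would still help average out the noise.
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """per_example_grads: shape (num_examples, num_params), one flattened gradient per example."""
    mean_grad = per_example_grads.mean(axis=0)               # estimate of the true gradient
    trace_cov = per_example_grads.var(axis=0, ddof=1).sum()  # summed per-parameter variance
    return float(trace_cov / (mean_grad @ mean_grad))

# Toy data: noisy per-example gradients scattered around a shared direction.
rng = np.random.default_rng(0)
shared_direction = rng.normal(size=1000)
grads = shared_direction + 5.0 * rng.normal(size=(256, 1000))
print(f"estimated noise scale: ~{simple_noise_scale(grads):.0f}")
```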
With all this work, the community got a better and better understanding of the best ways to train models for all sorts of domains and applications.
Finally, there’s DeepMind’s now-famous Chinchilla paper (10, nice summary in 11), which found “that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training [examples] should also be doubled.” This has massive implications for where one puts effort: compute nowadays is generally much more scalable than data generation in most domains - you can just buy it - but finding more data, which is usually harder to do, might be the better way to boost performance.
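As a back-of-the-envelope illustration of what “scaled equally” buys you, here’s a tiny sketch using two common approximations from the language-model setting: training compute C ≈ 6 × parameters × tokens, and roughly 20 training tokens per parameter at the compute-optimal point. Those constants are specific to that setting and are exactly the sort of thing we’d expect to re-measure for chemistry.

```python
# Chinchilla-style allocation under a fixed compute budget, using the common
# approximations C ~= 6 * N * D and D ~= 20 * N at the compute-optimal point.
import math

def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, training tokens) that spend the budget compute_flops."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Example: a hypothetical 1e23-FLOP training budget.
n, d = compute_optimal_allocation(1e23)
print(f"~{n:.1e} parameters trained on ~{d:.1e} tokens")
```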
While these efforts focused on language models, there’s even a protein sequence scaling law paper, which evaluates these various tradeoffs in protein space (12).
However, there is no such scaling law for small molecules interacting with proteins.
We want a scaling law for medicinal chemistry
The Chinchilla paper argues that many large language models are undertrained, and that investment in getting more data is a better current strategy than more compute. At Leash, we generate data at scale, and we took the Chinchilla findings to suggest that we were on the right track: there is currently way more compute available in the world than measurements of small molecules interacting with proteins (13), and we could be in a position to fix that. We’ve got 22B physical measurements we made ourselves (as of September 2024) and we just built the machinery to screen ~200 new, in-house manufactured proteins against millions of molecules every month (figure 1). We believe that tackling medicinal chemistry is going to need huge diversity not only on the small molecule side but on the protein side as well. We plan to screen thousands of proteins by the end of 2025.
A scaling law for medicinal chemistry could let us make concrete predictions about what resources it takes to get a certain level of performance. Put another way, we want to know what it will cost to get a model that does the things we want it to do: to predict interactions between arbitrary proteins and arbitrary small molecules to a certain level of confidence. Such a model could help us find new chemical material for therapeutics, guide medicinal chemists during drug program development, and have lots of other applications in agriculture, synthetic biology, and more. It would help a lot!
This means we will have to sort of replicate the huge amount of work the researchers above did on language models, varying the data size, batch size, compute, and so on with our chemistry and protein datasets. Testing all those parameters takes lots of expertise and lots of money for compute; shoveling billions or trillions of measurements into models is a nontrivial task. Fortunately, Amazon is gonna help us out here.
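To give a flavor of what that replication looks like, here’s a hedged sketch of the kind of sweep we mean: train models across a grid of dataset sizes and model sizes, record each run’s held-out loss, and fit a joint scaling curve to the results. The losses below are synthetic stand-ins so the sketch runs end to end; in practice every point would come from a real (if small) training run on our binding data.

```python
# Sweep-and-fit sketch: fit L(N, D) = E + A/N**alpha + B/D**beta to a grid of
# (model size, dataset size, loss) points. Losses are synthetic placeholders.
import itertools
import numpy as np
from scipy.optimize import curve_fit

def joint_scaling_law(nd, E, A, alpha, B, beta):
    n, d = nd
    return E + A / n**alpha + B / d**beta

model_sizes   = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # parameters
dataset_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # binding measurements

# Pretend sweep: generate "measured" losses from a known curve plus noise.
rng = np.random.default_rng(0)
grid = np.array(list(itertools.product(model_sizes, dataset_sizes)))
N, D = grid[:, 0], grid[:, 1]
L = joint_scaling_law((N, D), 0.8, 50.0, 0.30, 200.0, 0.28)
L = L + rng.normal(0.0, 0.01, size=len(L))

params, _ = curve_fit(joint_scaling_law, (N, D), L,
                      p0=[1.0, 30.0, 0.3, 100.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = params
print(f"fitted exponents: alpha = {alpha:.2f} (model size), beta = {beta:.2f} (data)")

# With a fitted curve in hand, we can ask what mix of model size, data, and
# compute should reach a target loss before committing to the big run.
```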
The AWS Generative AI Accelerator Program was built for folks like us
We are thrilled to collaborate with the AWS team - not just because they’re providing a lot of compute for model training (although that is really really nice) but also because they have people who are very good at thinking about, and tweaking, parameters for training language models. Their cohort from last year (14) has lots of companies working to optimize language models for many applications. Their cohort this year includes our buddies over at Noetik, who are building foundation models on patient tumor data they measure themselves.
We’re super excited to dig in with the experts on the AWS crew. With their expertise and compute availability, we’re going to push hard into training models to learn the underlying behaviors of protein/small molecule interactions, and try to get our arms wrapped around a scaling law for this domain and point it at improving human health.
Thanks, AWS. Let’s go!