Introducing BELKA
We just released the biggest public molecule-protein interaction dataset on earth
In early August of 2021, we started Leash Biosciences and quietly built a lab in the basement apartment of my house. Even back then, it was clear to us that we really wanted to make a meaningful data release to help others tackle medicinal chemistry with machine learning approaches.
Over here, we believe that while rules can be useful, examples of the thing you’re trying to learn are the best way to learn it, something we write about now and again (1,2). These days there are technologies to measure examples of lots of things at scale, and those measurements can help us, and machines, learn to do useful tasks. For medicinal chemistry, there are methods to make chemicals at a truly staggering scale (Enamine offers 48B molecules in its catalog, 3) and also approaches to measure them interacting with targets at a comparable scale (Nuevolution made a 40T pooled molecule library, 4). Using large numbers of examples, plus machine learning, has enabled great leaps in the exploration of chemical space (5).
And yet: it felt like while large numbers of examples probably exist, we had never seen them.
If you are a genomics researcher, you learn that nearly any academic journal you might publish in requires you to deposit your raw data - we’re talking sequence reads and experimental details - to some public repository before the paper is released. The repository is often NCBI GEO and I have done many such submissions myself (6). NCBI GEO contains some 6.5M samples (7). Each sample typically contains huge amounts of data - counts of roughly 20,000 genes (for critters like us), or maybe full genome sequences. There are other places, too: the UK Biobank has full genome sequences for half a million individuals, plus their health records (8).
When we started Leash, we figured we could get our hands on at least a little small molecule + protein data that we could evaluate. There was this flood of genomic data; surely similar amounts of chemical data were out there somewhere. In particular, we wanted data from DNA-encoded library screens (9).
There was essentially no data of this kind at all in the public domain to explore, and that made us furious. We’ve been mad about it ever since and thinking about how to fix it.
Kaggle is a great venue for people with data
When people with a lot of data want smart scientists to look at it, a good option is Kaggle. Kaggle is an outfit that hosts competitions wherein the group with the data lets the scientist nerd types extract meaning out of it in a competitive fashion, and that format often produces excellent solutions.
Our favorite intuitive example of a Kaggle competition is one focused on whale identification (10). Here is the host’s description of the contest:
To aid whale conservation efforts, scientists use photo surveillance systems to monitor ocean activity. They use the shape of whales’ tails and unique markings found in footage to identify what species of whale they’re analyzing and meticulously log whale pod dynamics and movements. For the past 40 years, most of this work has been done manually by individual scientists, leaving a huge trove of data untapped and underutilized.
In this competition, you’re challenged to build an algorithm to identify individual whales in images. You’ll analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors. By contributing, you’ll help to open rich fields of understanding for marine mammal population dynamics around the globe.
Here is what an individual whale looks like. Check out that lil white circle!
Kaggle gets folks together to try new approaches to problems, and a new dataset can help those approaches along. It felt natural to us to bring this community’s expertise to bear on our own medicinal chemistry problems. We also figured that if we were so mad about there not being any good public collections of medicinal chemistry data, maybe other folks were too.
The Big Encoded Library for Chemical Assessment
To address this deficiency, we generated the Big Encoded Library for Chemical Assessment (BELKA). We named it after my dog, who spent a lot of time in the basement lab in those days. BELKA is a collection of 133M molecules screened against 3 protein targets. We did 3 rounds of selection (think of each one as a pass through a sieve) and ran each round in triplicate. We measured which molecules bound by sequencing a DNA barcode attached to each one, and we sequenced each of those triplicates in each round of selection quite deeply to get accurate counts. Altogether, BELKA is about 3.6B physical measurements of small molecules binding (or not) to those protein targets (3 targets * 3 rounds * 3 replicates * 133M molecules).
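For readers who like to see the arithmetic, here is a quick back-of-the-envelope check of that 3.6B figure, using only the numbers quoted above (this is not a real data schema, just the multiplication spelled out):

```python
# Back-of-the-envelope check of the BELKA measurement count described above.
# All numbers come straight from the text.
n_molecules  = 133_000_000   # unique DNA-encoded molecules
n_targets    = 3             # protein targets screened
n_rounds     = 3             # rounds of selection ("passes through a sieve")
n_replicates = 3             # each round run in triplicate

total = n_molecules * n_targets * n_rounds * n_replicates
print(f"{total:,}")          # 3,591,000,000 -- roughly the 3.6B quoted
```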
This is a lot of measurements in the chemistry world.
The biggest public database of chemistry in biological systems is PubChem. PubChem has about 300M measurements (11), from patents, many journals, and contributions from nearly 1,000 organizations, but these include RNAi screens, cell-based assays, that sort of thing. Even so, BELKA is >10x bigger than PubChem. A better comparator is BindingDB (12), which has 2.8M direct small molecule-protein binding or activity assays. BELKA is >1000x bigger than BindingDB. BELKA is about 4% of the screens we’ve run here so far.
We will be releasing all these selection rounds and replicates and binding counts publicly in the summer (see more details below). When we do, we expect PubChem to ingest them and add them to its database, which we encourage and feel is only fair. But: BELKA was generated experimentally by a single pair of hands - hands belonging to our scientist Brayden Halverson - which means that when PubChem ingests BELKA, ~90% of PubChem will be from Brayden alone, made in the basement of my house. We will probably make a funny t-shirt about it.
BELKA is part of a Kaggle competition on chemistry
We are first releasing BELKA as a Kaggle competition (13). The contest is to get computers to look at chemical structures and predict whether they will bind to one of three protein targets. To get good, trustworthy examples of binders and not-binders, we aggregated all those measurements (the 3 rounds in triplicate) and made a yes/no determination. The dataset released as part of the Kaggle competition, then, is about 100M molecule yes/no assessments per protein, or about 300M total (still bigger than PubChem, but not the 3.6B quite yet, stay tuned for that release).
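To make that "aggregate the replicate counts into a yes/no call" step concrete, here is a minimal sketch of one way such a binarization could look. The median rule, the read-count threshold, and the function name are assumptions for illustration only, not the procedure we actually used to build the competition labels.

```python
import numpy as np

def yes_no_label(replicate_counts, min_reads=10):
    """Illustrative only: call a molecule a binder if its median sequencing
    read count across replicates clears a floor. The threshold and the median
    rule are assumptions, not the actual BELKA labeling procedure."""
    counts = np.asarray(replicate_counts, dtype=float)
    return bool(np.median(counts) >= min_reads)

print(yes_no_label([42, 37, 51]))  # True: consistently enriched after selection
print(yes_no_label([0, 1, 0]))     # False: essentially washed out
```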
We designed the Kaggle competition to be half easy and half difficult. I’ve written before about how machine learning algorithms are very good at cheating (1), and chemistry algorithms are no different. Our molecules share many pieces, and if one piece is particularly good at binding in the training set, the algorithm can memorize it and simply predict any test-set molecule containing it to be good. So we built the test set to have some shared pieces (so contestants feel a sense of accomplishment) but also some pieces that are completely novel (which is really the problem we’re trying to solve: having computers predict well in novel chemical space). The novel ones are likely to be much harder to predict accurately.
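As a rough illustration of that split design, here is a small sketch that holds out some building blocks entirely, so held-out molecules contain pieces the model has never seen. The record format and the building_block_id field are hypothetical, not the competition’s actual schema.

```python
import random

def split_by_building_block(records, frac_novel=0.3, seed=0):
    """Hypothetical sketch: hold out a fraction of building blocks entirely,
    so test molecules contain pieces absent from training."""
    blocks = sorted({r["building_block_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(blocks)
    held_out = set(blocks[: int(len(blocks) * frac_novel)])
    train = [r for r in records if r["building_block_id"] not in held_out]
    test_novel = [r for r in records if r["building_block_id"] in held_out]
    return train, test_novel
```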
We are very proud of the dataset, and not just because it provides fodder for model training.
The Ultimate Fighting Championship of medicinal chemistry
When the Ultimate Fighting Championship was first conceived, the idea was to take fighters of many different disciplines - judo enthusiasts, boxers - and figure out which fighting style would win in the same ring. Does a sumo wrestler beat a karate expert? (He does not.) There are many ways to predict small molecule interactions with protein targets: one could dock (14) or impute (15) or try to fold the protein around the ligand (16, 17, 18). Or one could train on the empirical examples we provided in BELKA.
We believe the empirical data will help. Maybe we’re right and maybe we’re not, but until today, there was no way to tell. There aren’t any other chemistry datasets at this scale to test different prediction approaches. Our Kaggle competition is deliberately designed to allow contestants to use any method they choose to tackle this problem - they can submit solutions without using our data at all. It’s like the Ultimate Fighting Championship, a level ground on which to compete with your method of choice.
Our sense over here is that every time machine learning solves a problem, it starts off with people who study the problem applying their own strategies. Chess people taught their algorithms the Sicilian Defense; protein folding people taught theirs electron shells and bond angles. Inevitably, though, a group that didn’t much care about human strategies showed up with huge numbers of examples and mopped the floor with everyone else (19). This happened with chess, Go, Atari, object recognition in images, language translation, text generation, and protein folding. It’s happened a lot.
Maybe medicinal chemistry is different. Maybe rules-based strategies will win this time, even though they usually don’t. We’re honestly not sure - but you can probably guess how we feel, since we just released more examples than anyone else on earth. It’s going to be fun to find out who’s right.