Create test dataset #98

peastman · 2024-03-28T00:58:40Z

This script generates a test set for evaluating models trained on SPICE. It tries to measure how well models generalize to new molecules that weren't in the training set, and more specifically how well they generalize to larger molecules than they were trained on.

It includes the following.

200 LigandExpo molecules with between 40 and 50 atoms. The amino-acid/ligand subset of the training set also uses LigandExpo molecules, but the largest of those have only 36 atoms, so none of these test molecules appear in it. The training set does contain many PubChem molecules of this size, so this subset measures generalization to new molecules of the same size as the training data.
200 LigandExpo molecules with between 70 and 80 atoms. These are larger than any single molecule in the training set (though some clusters are this large). This subset measures generalization to larger molecules.
200 random pentapeptides. The training set contains all possible dipeptides, so this measures generalization to longer peptides (and hopefully to proteins, but running QM on full proteins would be very expensive).

There are 10 conformations for each molecule, giving a total of 6000 conformations.
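
As a rough illustration of the composition described above, here is a minimal sketch that samples random pentapeptide sequences and checks the conformation count. It is not the script from this PR. The one-letter residue alphabet, the `sample_pentapeptides` helper, the fixed seed, and the requirement that sequences be distinct are all assumptions made for the example. The actual script may use a different residue set, capping groups, or protonation variants.

```python
import random

# Assumption: the 20 standard amino acids as one-letter codes. The real
# script may draw from a different residue set or add capping groups.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_pentapeptides(count, length=5, seed=0):
    """Return `count` distinct random peptide sequences (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sequences = set()
    # 20**5 = 3.2 million possible pentapeptides, so collisions are rare
    # and this loop finishes quickly for small counts.
    while len(sequences) < count:
        sequences.add("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return sorted(sequences)


# Subset sizes as stated in the PR description.
SUBSET_SIZES = {
    "LigandExpo, 40 to 50 atoms": 200,
    "LigandExpo, 70 to 80 atoms": 200,
    "random pentapeptides": 200,
}
CONFORMATIONS_PER_MOLECULE = 10

peptides = sample_pentapeptides(SUBSET_SIZES["random pentapeptides"])
total_molecules = sum(SUBSET_SIZES.values())
total_conformations = total_molecules * CONFORMATIONS_PER_MOLECULE

print(f"{len(peptides)} pentapeptides sampled")
print(f"{total_molecules} molecules, {total_conformations} conformations")
```

Seeding a local `random.Random` instance keeps the sample reproducible without touching the global random state, which matters if a test set is meant to stay fixed across runs.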

peastman added 2 commits March 27, 2024 17:48

Create test dataset

7541c16

Generated conformations

224b438

peastman merged commit 44fea2f into openmm:main Mar 31, 2024

peastman deleted the test branch March 31, 2024 16:05
