Create test dataset #98

Merged: 2 commits, Mar 31, 2024

peastman
Member

This script generates a test set for evaluating models trained on SPICE. It tries to measure how well models generalize to new molecules that weren't in the training set, and more specifically how well they generalize to larger molecules than they were trained on.

It includes the following.

  • 200 LigandExpo molecules with between 40 and 50 atoms. The amino-acid/ligand subset of the training set used LigandExpo molecules, but the largest of those have only 36 atoms, so none of these test molecules were included in it. The training set does have lots of PubChem molecules of this size, so this subset measures generalization to new molecules of the same size as the training data.
  • 200 LigandExpo molecules with between 70 and 80 atoms. These are larger than any single molecule in the training set (though some clusters are this large). This subset measures generalization to larger molecules.
  • 200 random pentapeptides. The training set contains all possible dipeptides, so this measures generalization to longer peptides (and hopefully to proteins, but running QM on full proteins would be very expensive).
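
For illustration only, the size-based selection described above could be sketched roughly as follows. This is not the script from this PR: `pick_by_size`, the `(name, atom_count)` input format, the inclusive bounds, and the fixed seed are all assumptions, and the library here is synthetic data rather than LigandExpo.

```python
import random


def pick_by_size(molecules, min_atoms, max_atoms, count, seed=0):
    """Randomly pick `count` molecules with min_atoms <= atom count <= max_atoms.

    `molecules` is a list of (name, atom_count) pairs. This input format is
    hypothetical, and treating the bounds as inclusive is an assumption.
    """
    eligible = [m for m in molecules if min_atoms <= m[1] <= max_atoms]
    if len(eligible) < count:
        raise ValueError(f"only {len(eligible)} molecules in range, need {count}")
    # A dedicated Random instance keeps the selection reproducible.
    return random.Random(seed).sample(eligible, count)


# Synthetic stand-in for a molecule library; names and sizes are made up.
library = [(f"mol{i}", 20 + i % 70) for i in range(5000)]

medium = pick_by_size(library, 40, 50, 200)          # same size range as training data
large = pick_by_size(library, 70, 80, 200, seed=1)   # larger than any training molecule
```

Sampling without replacement from the filtered list guarantees that no molecule appears twice in a subset. Checking that a molecule is absent from the training set would be a separate step that this sketch does not attempt.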

There are 10 conformations for each molecule, giving a total of 6000 conformations.
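
As a rough sketch of the pentapeptide subset and the conformation count, assuming sequences are drawn uniformly from the 20 standard residues and kept unique (the PR text does not say how they are drawn, and capping groups and protonation states are ignored here). The function name `random_peptides` and the subset labels are hypothetical.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptides(length, count, seed=0):
    """Return `count` distinct random sequences of `length` residues."""
    rng = random.Random(seed)
    sequences = set()
    # Redraw on collisions so that every sequence is unique (an assumption).
    while len(sequences) < count:
        sequences.add("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return sorted(sequences)


pentapeptides = random_peptides(length=5, count=200)

# Test set composition as described in the PR; the labels are hypothetical.
SUBSETS = {
    "LigandExpo, 40-50 atoms": 200,
    "LigandExpo, 70-80 atoms": 200,
    "pentapeptides": len(pentapeptides),
}
CONFORMATIONS_PER_MOLECULE = 10

total_molecules = sum(SUBSETS.values())
total_conformations = total_molecules * CONFORMATIONS_PER_MOLECULE  # 600 * 10
```

With 20^5 = 3,200,000 possible pentapeptides, drawing 200 distinct sequences needs very few redraws.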

@peastman peastman merged commit 44fea2f into openmm:main Mar 31, 2024
@peastman peastman deleted the test branch March 31, 2024 16:05