This project implements a Variational Autoencoder (VAE) for generating novel molecular structures. It's particularly useful in drug discovery, where the goal is to generate new potential drug candidates. The project is designed to run in Google Colab, leveraging GPU acceleration for efficient training and generation.
- 🧠 VAE Architecture: Utilizes a Variational Autoencoder to learn a compact representation of molecular structures and generate new ones.
- 🧬 SMILES Representation: Uses SMILES (Simplified Molecular-Input Line-Entry System) strings for molecular representation.
- 📊 QM9 Dataset: Trains on the QM9 dataset, a standard benchmark in molecular machine learning.
- 👁️ Molecule Visualization: Generates and visualizes molecular structures using RDKit.
- ⚗️ Property Calculation: Computes basic molecular properties for generated molecules.
- ✅ Validity and Novelty Checks: Assesses the validity of generated molecules and checks for novelty against the training set.
- ☁️ Google Colab Integration: Designed to run in Google Colab for easy access to GPU resources.
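
Before a VAE can consume SMILES strings, they have to be turned into fixed-length integer sequences. The sketch below shows one common way to do that at the character level. The function names (`build_vocab`, `encode`, `decode`) and the special tokens are illustrative assumptions, not the notebook's actual API. Character-level tokens are enough for QM9 because its elements (C, H, N, O, F) all have one-letter symbols; a dataset containing `Cl` or `Br` would need a proper tokenizer.

```python
# Character-level SMILES encoding. All names here are hypothetical.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

def build_vocab(smiles_list):
    """Map the special tokens and every character in the corpus to an integer index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    tokens = [PAD, SOS, EOS] + chars
    return {tok: i for i, tok in enumerate(tokens)}

def encode(smiles, vocab, max_len):
    """Wrap a SMILES string in start/end tokens and right-pad it to max_len."""
    ids = [vocab[SOS]] + [vocab[ch] for ch in smiles] + [vocab[EOS]]
    if len(ids) > max_len:
        raise ValueError(f"{smiles!r} needs {len(ids)} positions, max_len is {max_len}")
    return ids + [vocab[PAD]] * (max_len - len(ids))

def decode(ids, vocab):
    """Invert encode(): skip special tokens and stop at the end token."""
    inverse = {i: tok for tok, i in vocab.items()}
    out = []
    for i in ids:
        tok = inverse[i]
        if tok == EOS:
            break
        if tok not in (PAD, SOS):
            out.append(tok)
    return "".join(out)

corpus = ["C", "CCO", "C1CC1", "N#CC=O"]  # a few small molecules as toy data
vocab = build_vocab(corpus)
encoded = encode("CCO", vocab, max_len=8)
print(encoded)
print(decode(encoded, vocab))
```

Padding every sequence to the same length lets a batch be stacked into a single tensor, and the end token tells the decoder when to stop generating.
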
- Google Colab environment
- Required libraries (automatically installed in the notebook):
  - PyTorch
  - RDKit
  - Pandas
  - Pillow
  - IPython
- Open the notebook in Google Colab.
- Run the cells in order, following the instructions in the notebook.
- The notebook will guide you through:
  - Setting up the environment
  - Loading and preprocessing the QM9 dataset
  - Defining and training the VAE model
  - Generating new molecules
  - Visualizing and analyzing the generated molecules
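
The "defining and training the VAE model" step comes down to minimizing a reconstruction term plus a KL divergence term. The sketch below works through that objective with plain Python numbers so each piece is easy to inspect. It assumes the standard VAE loss with a diagonal Gaussian posterior; the notebook itself uses PyTorch tensors, and its exact loss weighting may differ. All function names and the `beta` parameter are hypothetical.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def reconstruction_nll(token_probs, target_ids):
    """Negative log-likelihood of the target tokens under per-position distributions."""
    return -sum(math.log(probs[t]) for probs, t in zip(token_probs, target_ids))

def vae_loss(token_probs, target_ids, mu, logvar, beta=1.0):
    """Reconstruction term plus a beta-weighted KL term (beta=1 is the plain objective)."""
    return reconstruction_nll(token_probs, target_ids) + beta * kl_to_standard_normal(mu, logvar)

rng = random.Random(0)
mu, logvar = [0.0, 0.0], [0.0, 0.0]   # encoder output for one toy molecule
z = reparameterize(mu, logvar, rng)    # latent sample that would be fed to the decoder

# Toy decoder output: a probability distribution over a 2-token vocabulary
# at each of two sequence positions, plus the true token ids.
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
loss = vae_loss(probs, targets, mu, logvar)
print(loss)
```

Sampling through `reparameterize` rather than drawing `z` directly is what keeps the sampling step differentiable with respect to `mu` and `logvar` in a tensor framework.
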
You can modify the following parameters in the notebook:
- `hidden_dim`: Dimension of the hidden state in the GRU layers
- `latent_dim`: Dimension of the latent space
- `batch_size`: Batch size for training
- `num_epochs`: Number of training epochs
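
One way to keep these four parameters together is a small configuration object. The sketch below is an assumption about how that could look, not the notebook's actual code: the `VAEConfig` class is hypothetical, and the default values are placeholders chosen for illustration rather than the notebook's defaults.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class VAEConfig:
    # Placeholder values for illustration only; not the notebook's defaults.
    hidden_dim: int = 256   # size of the GRU hidden state
    latent_dim: int = 64    # size of the latent vector z
    batch_size: int = 128   # molecules per training batch
    num_epochs: int = 20    # passes over the training set

    def __post_init__(self):
        # Reject zero, negative, or non-integer settings early.
        for name, value in asdict(self).items():
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

base = VAEConfig()
# Derive a smaller, faster variant for a quick smoke test.
small = replace(base, latent_dim=32, num_epochs=5)
print(small)
```

Making the dataclass frozen means each experiment gets its own immutable settings object, which makes it harder to change a value halfway through a run by accident.
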
The notebook generates several outputs:
- Training loss plots
- Generated SMILES strings
- Visualizations of generated molecules
- Analysis of molecular properties
- Validity and novelty statistics
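
The validity and novelty statistics can be sketched as simple ratios. The definitions below are a common convention and an assumption here; the notebook may compute them differently. The function `generation_stats` is hypothetical, and `balanced_parentheses` is only a stand-in so the example runs without dependencies. With RDKit, the validity predicate would typically be `Chem.MolFromSmiles(s) is not None`, and strings would be canonicalized (for example with `Chem.MolToSmiles`) before comparing against the training set.

```python
def generation_stats(generated, training_smiles, is_valid):
    """Return validity and novelty fractions for a list of generated SMILES.

    validity = valid strings / all generated strings
    novelty  = distinct valid strings absent from training / distinct valid strings
    Assumes all strings are already in the same canonical form.
    """
    valid = [s for s in generated if is_valid(s)]
    distinct = set(valid)
    novel = distinct - set(training_smiles)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "novelty": len(novel) / len(distinct) if distinct else 0.0,
    }

def balanced_parentheses(smiles):
    """Stand-in validity check: parentheses must balance. Not a chemistry check."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

training = {"CCO", "CC(C)O"}
generated = ["CCO", "CCN", "CCN", "CC(C", "CC(=O)O"]
stats = generation_stats(generated, training, balanced_parentheses)
print(stats)
```

Novelty is computed over distinct valid molecules so that a model which repeats one new structure many times does not score higher than one which produces many different new structures.
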
- The model's performance is limited by the size and diversity of the training dataset: QM9 contains roughly 134,000 small organic molecules with at most nine heavy atoms (C, N, O, F).
- Generated molecules may not always be synthetically feasible or stable.
- The current implementation focuses on small organic molecules.
Contributions, issues, and feature requests are welcome. Feel free to open an issue or submit a pull request.
This project is open-source and available under the MIT License.
This tool is for research and educational purposes only. Generated molecules should not be treated as actual drug candidates without extensive further testing and validation.