Loss Function vs epoch plot while training #338

Open
Awesomium10 opened this issue Aug 16, 2024 · 5 comments

Comments

@Awesomium10

We are training a model and require a plot of loss function with epochs while the model is training. How can we do this with torchmd?

@RaulPPelaez
Collaborator

RaulPPelaez commented Aug 16, 2024

When running torchmd-train (see documentation here or here for a more advanced approach), a file called metrics.csv will be generated inside the logdir (along with checkpoints and other information).

The metrics.csv file will be similar to this

epoch,lr,step,train_neg_dy_mse_loss,train_total_mse_loss,train_y_mse_loss,val_neg_dy_l1_loss,val_neg_dy_mse_loss,val_total_l1_loss,val_total_mse_loss,val_y_l1_loss,val_y_mse_loss
20.0,0.0005000000237487257,1357436,0.009475486353039742,5.106609344482422,0.012685230001807213,0.05106592923402786,0.010068569332361221,5.183582305908203,1.0168300867080688,0.07698939740657806,0.009973234497010708
21.0,0.0005000000237487257,1357593,0.012938769534230232,5.229669570922852,0.013254974037408829,0.04567231982946396,0.007519902195781469,4.6268630027771,0.7586674094200134,0.05963057279586792,0.006677159108221531
22.0,0.0005000000237487257,1357750,0.011407813988626003,5.110054969787598,0.01159985177218914,0.043068308383226395,0.00
....

This file contains, among others, the information you request (epoch and different losses).

You can plot this from a python script, for instance:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/path/to/metrics.csv')
plt.figure(figsize=(10, 6))
plt.plot(df['epoch'], df['train_total_mse_loss'], marker='o')
plt.title('Epoch vs Train Total MSE Loss')
plt.xlabel('Epoch')
plt.ylabel('Train Total MSE Loss')
plt.grid(True)
plt.show()

Note that we also provide integration with some popular frameworks for ML training visualization: https://torchmd-net.readthedocs.io/en/latest/torchmd-train.html#cmdoption-torchmd-train-wandb-use

@Awesomium10
Author

Hi!
We trained on the alanine dipeptide data using torchmd with a train.yaml configuration file. Our metrics.csv file contains several different loss functions, but we do not know which one is being minimised. So we picked one arbitrarily (train_total_mse_loss) and plotted it against the number of epochs. The plot was not decreasing but highly irregular. How do we know which loss function the algorithm uses, and is it possible to switch between them?

@RaulPPelaez
Collaborator

The metrics.csv file contains the losses for the energy (y), the forces (neg_dy) and total (sum of both) for the training, validation and test sets.
So for instance, if you chose MSE loss as function, the loss on the energy for the training set is denoted train_y_mse_loss.
It's hard to pinpoint why your loss is not going down just from the information you provided. Could you share your configuration?
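To make the naming scheme above concrete, here is a minimal sketch that groups the loss columns by quantity (y, neg_dy, total) and loss type, then plots the training and validation curves for one of them together. The helper names `group_loss_columns` and `plot_train_vs_val` are hypothetical and not part of torchmd-net, and the column pattern is only inferred from the metrics.csv header shown earlier in this thread, so it may need adjusting for other configurations.

```python
import re
from collections import defaultdict

# Pattern assumed from the header shown earlier in this thread:
# <split>_<quantity>_<loss type>_loss, e.g. train_neg_dy_mse_loss
_LOSS_COLUMN = re.compile(r"^(train|val|test)_(neg_dy|y|total)_(\w+)_loss$")


def group_loss_columns(columns):
    """Map (quantity, loss type) -> {split: column name} for every loss column."""
    groups = defaultdict(dict)
    for name in columns:
        match = _LOSS_COLUMN.match(name)
        if match:  # skips non-loss columns such as epoch, lr and step
            split, quantity, loss = match.groups()
            groups[(quantity, loss)][split] = name
    return dict(groups)


def plot_train_vs_val(csv_path, quantity="total", loss="mse"):
    """Plot every available split of one loss against the epoch."""
    # Imported here so the grouping helper works without plotting dependencies.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv(csv_path)
    # Raises KeyError if the file has no column for this quantity/loss pair.
    columns = group_loss_columns(df.columns)[(quantity, loss)]

    plt.figure(figsize=(10, 6))
    for split, name in sorted(columns.items()):
        plt.plot(df["epoch"], df[name], marker="o", label=name)
    plt.xlabel("Epoch")
    plt.ylabel(f"{quantity} {loss} loss")
    plt.legend()
    plt.grid(True)
    plt.show()
```

For example, `plot_train_vs_val('/path/to/metrics.csv', quantity='y')` would draw train_y_mse_loss and val_y_mse_loss on the same axes, which makes it easier to see whether the validation curve follows the training one.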

@Awesomium10
Author

This is the train.yaml file we are using.
train.yaml.txt

@RaulPPelaez
Collaborator

Your network seems to be very barebones (you are disabling neighbor embedding, for instance), and you are also using the defaults for parameters such as the cutoff.
I am inclined to believe this is a matter of hyperparameters.
You seem to be trying to adapt this configuration file https://github.com/torchmd/torchmd-cg/blob/master/tutorial/train.yaml for a more recent version of the project.
That repo is from long before my time, I am afraid. I am not familiar enough with the Graph Network and its older iterations to tell you how each default/parameter translates.
Perhaps the current documentation of this network will help https://torchmd-net.readthedocs.io/en/latest/models.html#graph-network
