Description
When trying to start stage 2 training after completing stage 1 using a single A100 80GB GPU with my Korean dataset, I encountered an issue where `g_loss` becomes `NaN`.

Upon investigation, I found that the `y_rec_gt_pred` output from `model.decoder` was `NaN`.

The `s` variable, an input to the decoder, had abnormally large values.
The `s` input is derived from the style encoder:
s = model.style_encoder(st.unsqueeze(1) if multispeaker else gt.unsqueeze(1))
The inputs `st` and `gt` were within normal ranges, and I found that the style encoder's weights were not properly loaded from the checkpoint.
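As a quick check (hypothetical snippet — the checkpoint file name and the top-level `'net'` dict are assumptions about the checkpoint layout, and `model` is the collection of modules built in the second-stage training script), you can compare the key names stored in the stage-1 checkpoint with the keys the wrapped model expects:

```python
import torch

# Hypothetical check: compare the key names in the stage-1 checkpoint
# with the keys the DataParallel-wrapped stage-2 model expects.
ckpt = torch.load('first_stage.pth', map_location='cpu')    # path is illustrative

ckpt_keys = list(ckpt['net']['style_encoder'].keys())       # assumed checkpoint layout
model_keys = list(model.style_encoder.state_dict().keys())  # `model` from the training script

print(ckpt_keys[:3])    # no 'module.' prefix when stage 1 ran on a single GPU
print(model_keys[:3])   # 'module.'-prefixed keys under MyDataParallel
```

If the two lists differ only by the `module.` prefix, the direct load cannot match any parameter, so the style encoder keeps its initial weights.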
Cause
The issue arises due to inconsistent key names when loading checkpoints between the first and second stages. In the second stage, the `MyDataParallel` class is used, which prefixes all model keys with `module.`. However, if you are using a single GPU, the first stage does not apply this prefix when saving checkpoints (see #120).
This inconsistency prevents the proper loading of the model parameters, leading to NaN values in the loss calculation.
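For reference, the prefix mismatch is easy to reproduce with a toy module (illustrative example using plain `nn.DataParallel`; the same key prefixing applies to `MyDataParallel`):

```python
import torch.nn as nn

# A module saved from a single-GPU run has unprefixed state_dict keys...
net = nn.Linear(4, 4)
print(list(net.state_dict().keys()))      # ['weight', 'bias']

# ...while a DataParallel-wrapped model expects 'module.'-prefixed keys.
wrapped = nn.DataParallel(net)
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']
```

Loading the unprefixed keys into the wrapped module raises a key-mismatch error with `strict=True`, or leaves the parameters untouched with `strict=False`.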
Solution
To address this, I've updated the `load_checkpoint` function to handle cases where the checkpoint keys do not match the model keys by creating a new `state_dict` with matching keys if direct loading fails.

Updated `load_checkpoint` Function
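Below is a minimal sketch of the idea (not the exact code from the PR; the checkpoint layout with `'net'`, `'optimizer'`, `'epoch'`, and `'iters'` entries is an assumption based on the training scripts): try a direct load first, and on a key mismatch rebuild the `state_dict` so its keys line up with the model by adding or stripping the `module.` prefix.

```python
import torch

def load_checkpoint(model, optimizer, path, load_only_params=True):
    state = torch.load(path, map_location='cpu')
    params = state['net']  # assumed: dict of per-module state_dicts

    for key in model:
        if key not in params:
            continue
        print(f'{key} loaded')
        try:
            # Fast path: checkpoint keys already match the model's keys.
            model[key].load_state_dict(params[key])
        except RuntimeError:
            # Fallback: rebuild the state_dict with keys that match the model,
            # adding or stripping the 'module.' prefix where necessary.
            model_keys = set(model[key].state_dict().keys())
            new_state_dict = {}
            for k, v in params[key].items():
                if k in model_keys:
                    new_state_dict[k] = v
                elif 'module.' + k in model_keys:
                    new_state_dict['module.' + k] = v
                elif k.startswith('module.') and k[len('module.'):] in model_keys:
                    new_state_dict[k[len('module.'):]] = v
            model[key].load_state_dict(new_state_dict, strict=False)

    if not load_only_params and optimizer is not None and 'optimizer' in state:
        optimizer.load_state_dict(state['optimizer'])

    return model, optimizer, state.get('epoch', 0), state.get('iters', 0)
```

With this fallback, a stage-1 checkpoint saved on a single GPU loads correctly into the `MyDataParallel`-wrapped stage-2 model, so the style encoder weights are restored before training starts and `g_loss` no longer turns into `NaN`.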
Additionally, I have submitted a PR to address this issue: #253