Describe the bug
I created a fine-tuning dataset of 100M samples from the Molecule3D dataset with the HOMO label, and used it to fine-tune the 84M Uni-Mol2 model with no pretrained checkpoint. However, the training loss does not converge correctly: it gradually decreases to 0.1, then suddenly increases to 0.55 and does not decrease any further. I have tried a variety of training parameters but got similar loss curves. Here is one example.
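For reference, here is a minimal sketch of what I assume the `finetune_smooth_mae` loss computes: a smooth-L1 (Huber-style) regression loss on standardized labels. The function and the `mean`/`std` arguments are illustrative only; the actual Uni-Mol implementation may differ.

```python
import torch
import torch.nn.functional as F

def smooth_mae_loss(pred: torch.Tensor, target: torch.Tensor,
                    mean: float, std: float) -> torch.Tensor:
    # Assumed behaviour: standardize the HOMO labels with training-set
    # statistics, then apply PyTorch's smooth-L1 loss (quadratic near zero,
    # linear for larger errors).
    target_norm = (target - mean) / std
    return F.smooth_l1_loss(pred, target_norm)
```

Under this assumption, the reported loss values are measured on standardized targets, e.g. a plateau around 0.55 sits in the linear regime of the smooth-L1 loss.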
Setup: V100, Python 3.9, PyTorch 2.0.0. Full training arguments:
seed=0, cpu=False, fp16=False, bf16=False, bf16_sr=False, allreduce_fp32_grad=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir='./unimol2', empty_cache_freq=0, all_gather_list_size=16384, suppress_crashes=False, profile=False, ema_decay=-1.0, validate_with_ema=False, loss='finetune_smooth_mae', optimizer='adam', lr_scheduler='polynomial_decay', task='mol_finetune', num_workers=8, skip_invalid_size_inputs_valid_test=False, batch_size=32, required_batch_size_multiple=1, data_buffer_size=10, train_subset='train', valid_subset='valid,test', validate_interval=1, validate_interval_updates=0, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, batch_size_valid=32, max_valid_steps=None, curriculum=0, distributed_world_size=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method='env://', distributed_port=-1, device_id=0, distributed_no_spawn=True, ddp_backend='c10d', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=True, fast_stat_sync=False, broadcast_buffers=False, nprocs_per_node=1, arch='unimol2_84M', max_epoch=40, max_update=0, stop_time_hours=0, clip_norm=1.0, per_sample_clip_norm=0, update_freq=[1], lr=[0.0001], stop_min_lr=-1, best_checkpoint_metric='valid_agg_mae', maximize_best_checkpoint_metric=False, patience=10, checkpoint_suffix='', droppath_prob=0.0, gaussian_std_width=1.0, gaussian_mean_start=0.0, gaussian_mean_stop=9.0, mode='train', data='unimol2/example_data/molecule3d', task_name='molecule3d_homo', classification_head_name='molecule3d_homo', num_classes=1, reg=True, no_shuffle=False, conf_size=1, remove_hydrogen=False, drop_feat_prob=1.0, use_2d_pos_prob=0.0, max_atoms=256, adam_betas='(0.9, 0.99)', adam_eps=1e-06, weight_decay=0.0, force_anneal=None, warmup_updates=0, warmup_ratio=0.03, end_learning_rate=0.0, power=1.0, total_num_update=1000000, pooler_dropout=0.0, no_seed_provided=False, encoder_layers=12, encoder_embed_dim=768, pair_embed_dim=512, pair_hidden_dim=64, encoder_ffn_embed_dim=768, encoder_attention_heads=48, dropout=0.1, emb_dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, max_seq_len=512, activation_fn='gelu', pooler_activation_fn='tanh', post_ln=False, masked_token_loss=-1.0, masked_coord_loss=-1.0, masked_dist_loss=-1.0, x_norm_loss=-1.0, delta_pair_repr_norm_loss=-1.0, notri=False
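For context, here is a minimal sketch of how I understand the learning-rate-related values above (lr=0.0001, warmup_ratio=0.03, total_num_update=1000000, end_learning_rate=0.0, power=1.0) are combined by a polynomial-decay schedule with warmup. This is an illustrative approximation, not the actual Uni-Core scheduler code.

```python
def poly_decay_lr(step: int,
                  base_lr: float = 1e-4,           # lr
                  end_lr: float = 0.0,             # end_learning_rate
                  total_updates: int = 1_000_000,  # total_num_update
                  warmup_ratio: float = 0.03,      # warmup_ratio
                  power: float = 1.0) -> float:    # power
    # Assumed: warmup length is derived from warmup_ratio * total_num_update.
    warmup_updates = int(warmup_ratio * total_updates)  # ~30k steps here
    if step < warmup_updates:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_updates)
    if step >= total_updates:
        return end_lr
    # Polynomial decay (linear for power=1.0) from base_lr down to end_lr.
    progress = (step - warmup_updates) / (total_updates - warmup_updates)
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
```

Under this reading, warmup covers roughly the first 30,000 updates and the learning rate reaches end_learning_rate at update 1,000,000.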
Uni-Mol Version
Uni-Mol2
Expected behavior
The training loss converges smoothly during Uni-Mol2 fine-tuning.
To Reproduce
No response
Environment
No response
Additional Context
No response