Train loss cannot converge correctly in Unimol2 finetune scenario. #312

wangyifei1992 · 2025-01-20T02:34:01Z

Describe the bug

I created a fine-tuning dataset with 100M samples from the Molecule3D dataset, using the HOMO label, and used it to fine-tune the 84M Uni-Mol2 model without loading a checkpoint. However, the training loss does not converge correctly: it gradually decreases to about 0.1, then suddenly jumps to about 0.55 and stops decreasing. I have tried a variety of training parameters but got similar loss curves. Here is one example.

V100 + Python 3.9 + PyTorch 2.0.0

seed=0, cpu=False, fp16=False, bf16=False, bf16_sr=False, allreduce_fp32_grad=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir='./unimol2', empty_cache_freq=0, all_gather_list_size=16384, suppress_crashes=False, profile=False, ema_decay=-1.0, validate_with_ema=False, loss='finetune_smooth_mae', optimizer='adam', lr_scheduler='polynomial_decay', task='mol_finetune', num_workers=8, skip_invalid_size_inputs_valid_test=False, batch_size=32, required_batch_size_multiple=1, data_buffer_size=10, train_subset='train', valid_subset='valid,test', validate_interval=1, validate_interval_updates=0, validate_after_updates=0, fixed_validation_seed=None, disable_validation=False, batch_size_valid=32, max_valid_steps=None, curriculum=0, distributed_world_size=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method='env://', distributed_port=-1, device_id=0, distributed_no_spawn=True, ddp_backend='c10d', bucket_cap_mb=25, fix_batches_to_gpus=False, find_unused_parameters=True, fast_stat_sync=False, broadcast_buffers=False, nprocs_per_node=1, arch='unimol2_84M', max_epoch=40, max_update=0, stop_time_hours=0, clip_norm=1.0, per_sample_clip_norm=0, update_freq=[1], lr=[0.0001], stop_min_lr=-1, best_checkpoint_metric='valid_agg_mae', maximize_best_checkpoint_metric=False, patience=10, checkpoint_suffix='', droppath_prob=0.0, gaussian_std_width=1.0, gaussian_mean_start=0.0, gaussian_mean_stop=9.0, mode='train', data='unimol2/example_data/molecule3d', task_name='molecule3d_homo', classification_head_name='molecule3d_homo', num_classes=1, reg=True, no_shuffle=False, conf_size=1, remove_hydrogen=False, drop_feat_prob=1.0, use_2d_pos_prob=0.0, max_atoms=256, adam_betas='(0.9, 0.99)', adam_eps=1e-06, weight_decay=0.0, force_anneal=None, warmup_updates=0, warmup_ratio=0.03, end_learning_rate=0.0, power=1.0, total_num_update=1000000, pooler_dropout=0.0, 
no_seed_provided=False, encoder_layers=12, encoder_embed_dim=768, pair_embed_dim=512, pair_hidden_dim=64, encoder_ffn_embed_dim=768, encoder_attention_heads=48, dropout=0.1, emb_dropout=0.1, attention_dropout=0.1, activation_dropout=0.0, max_seq_len=512, activation_fn='gelu', pooler_activation_fn='tanh', post_ln=False, masked_token_loss=-1.0, masked_coord_loss=-1.0, masked_dist_loss=-1.0, x_norm_loss=-1.0, delta_pair_repr_norm_loss=-1.0, notri=False
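For reference, the learning-rate settings in the dump above (lr=0.0001, warmup_ratio=0.03, total_num_update=1000000, end_learning_rate=0.0, power=1.0, batch_size=32) can be sanity-checked with a small sketch. This is a generic linear-warmup plus polynomial-decay schedule written for illustration; the function name `polynomial_decay_lr` is hypothetical, and the assumption that warmup length equals warmup_ratio times total_num_update may not match what the training framework actually does.

```python
def polynomial_decay_lr(step, lr=1e-4, end_lr=0.0, power=1.0,
                        total_num_update=1_000_000, warmup_ratio=0.03):
    """Generic linear warmup followed by polynomial decay.

    Hypothetical helper for illustration only; defaults are copied from
    the reported config, the formula itself is an assumption.
    """
    # Assumed: warmup length is a fixed fraction of the total updates.
    warmup_updates = round(warmup_ratio * total_num_update)
    if warmup_updates > 0 and step <= warmup_updates:
        return lr * step / warmup_updates
    if step >= total_num_update:
        return end_lr
    # Fraction of the decay phase that is still ahead of this step.
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (lr - end_lr) * remaining ** power + end_lr


# Updates needed for one pass over the data, taking the stated
# 100M samples and batch_size=32 at face value.
updates_per_epoch = 100_000_000 // 32

for step in (0, 30_000, 515_000, 1_000_000, updates_per_epoch):
    print(step, polynomial_decay_lr(step))
```

Under these assumptions, one epoch would take 3,125,000 updates while the schedule reaches end_learning_rate after 1,000,000 updates, so comparing the step at which the loss jumps against these boundaries may be worthwhile.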

Uni-Mol Version

Uni-Mol2

Expected behavior

Train loss should converge smoothly in Uni-Mol2 fine-tuning.

To Reproduce

No response

Environment

No response

Additional Context

No response

wangyifei1992 added the bug label on Jan 20, 2025
