[Question]: Ascend 910B3: Paddle UIE inference works normally, but training loss = NaN #9693

Open
modderBUG opened this issue Dec 25, 2024 · 0 comments
Labels: question (Further information is requested)

@modderBUG

Please describe your question

PaddlePaddle==2.6.1
PaddleCustomDevice==2.6 (built from source)
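
As a first sanity check (not part of the original report), something like the following can confirm that the custom NPU runtime is actually registered with Paddle. `paddle.device.get_all_custom_device_type()` and `paddle.utils.run_check()` are standard Paddle APIs; this is a sketch assuming the environment above:

```python
# Sanity-check sketch: verify the custom NPU runtime is visible to Paddle.
# Assumes PaddlePaddle 2.6.1 with the PaddleCustomDevice NPU plugin installed.
import paddle

# Should include "npu" once libpaddle-custom-npu.so has been loaded.
print(paddle.device.get_all_custom_device_type())

# Runs a small built-in program to verify the installation end to end.
paddle.utils.run_check()
```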

λ 5926b66120ca /app/output/PaddleNLP-2.6.1 python ./applications/information_extraction/text/finetune.py \
    --device npu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path /app/output/PaddleNLP-2.6.1/uie-base/ \
    --output_dir ./checkpoint/model_best \
    --train_path ./applications/information_extraction/text/data/train.txt \
    --dev_path ./applications/information_extraction/text/data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm False \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
I1225 11:43:11.952183 131826 init.cc:233] ENV [CUSTOM_DEVICE_ROOT]=/usr/local/lib/python3.9/dist-packages/paddle_custom_device
I1225 11:43:11.952232 131826 init.cc:142] Try loading custom device libs from: [/usr/local/lib/python3.9/dist-packages/paddle_custom_device]
I1225 11:43:12.406082 131826 custom_device.cc:1108] Successed in loading custom runtime in lib: /usr/local/lib/python3.9/dist-packages/paddle_custom_device/libpaddle-custom-npu.so
I1225 11:43:12.409425 131826 custom_kernel.cc:63] Successed in loading 326 custom kernel(s) from loaded lib(s), will be used like native ones.
I1225 11:43:12.409597 131826 init.cc:154] Finished in LoadCustomDevice with libs_path: [/usr/local/lib/python3.9/dist-packages/paddle_custom_device]
I1225 11:43:12.409634 131826 init.cc:239] CustomDevice: npu, visible devices count: 1
/usr/local/lib/python3.9/dist-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2024-12-25 11:43:15,415] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2024-12-25 11:43:15,415] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2024-12-25 11:43:15,415] [    INFO] - ============================================================
[2024-12-25 11:43:15,415] [    INFO] -      Model Configuration Arguments
[2024-12-25 11:43:15,415] [    INFO] - paddle commit id              :fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:15,416] [    INFO] - export_model_dir              :./checkpoint/model_best
[2024-12-25 11:43:15,416] [    INFO] - model_name_or_path            :/app/output/PaddleNLP-2.6.1/uie-base/
[2024-12-25 11:43:15,416] [    INFO] - multilingual                  :False
[2024-12-25 11:43:15,416] [    INFO] -
[2024-12-25 11:43:15,416] [    INFO] - ============================================================
[2024-12-25 11:43:15,416] [    INFO] -       Data Configuration Arguments
[2024-12-25 11:43:15,417] [    INFO] - paddle commit id              :fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:15,417] [    INFO] - dev_path                      :./applications/information_extraction/text/data/dev.txt
[2024-12-25 11:43:15,417] [    INFO] - dynamic_max_length            :None
[2024-12-25 11:43:15,417] [    INFO] - max_seq_length                :512
[2024-12-25 11:43:15,417] [    INFO] - train_path                    :./applications/information_extraction/text/data/train.txt
[2024-12-25 11:43:15,417] [    INFO] -
[2024-12-25 11:43:15,418] [ WARNING] - Process rank: -1, device: npu, world_size: 1, distributed training: False, 16-bits training: False
[2024-12-25 11:43:15,418] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load '/app/output/PaddleNLP-2.6.1/uie-base/'.
[2024-12-25 11:43:15,441] [    INFO] - Loading configuration file /app/output/PaddleNLP-2.6.1/uie-base/config.json
[2024-12-25 11:43:15,441] [    INFO] - Loading weights file /app/output/PaddleNLP-2.6.1/uie-base/model_state.pdparams
[2024-12-25 11:43:15,708] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-12-25 11:43:32,588] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2024-12-25 11:43:32,589] [    INFO] - All the weights of UIE were initialized from the model checkpoint at /app/output/PaddleNLP-2.6.1/uie-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2024-12-25 11:43:32,708] [    INFO] - ============================================================
[2024-12-25 11:43:32,709] [    INFO] -     Training Configuration Arguments
[2024-12-25 11:43:32,709] [    INFO] - paddle commit id              : fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:32,709] [    INFO] - _no_sync_in_gradient_accumulation: True
[2024-12-25 11:43:32,709] [    INFO] - activation_quantize_type      : None
[2024-12-25 11:43:32,709] [    INFO] - adam_beta1                    : 0.9
[2024-12-25 11:43:32,709] [    INFO] - adam_beta2                    : 0.999
[2024-12-25 11:43:32,709] [    INFO] - adam_epsilon                  : 1e-08
[2024-12-25 11:43:32,709] [    INFO] - algo_list                     : None
[2024-12-25 11:43:32,710] [    INFO] - amp_custom_black_list         : None
[2024-12-25 11:43:32,710] [    INFO] - amp_custom_white_list         : None
[2024-12-25 11:43:32,710] [    INFO] - amp_master_grad               : False
[2024-12-25 11:43:32,710] [    INFO] - batch_num_list                : None
[2024-12-25 11:43:32,710] [    INFO] - batch_size_list               : None
[2024-12-25 11:43:32,710] [    INFO] - bf16                          : False
[2024-12-25 11:43:32,710] [    INFO] - bf16_full_eval                : False
[2024-12-25 11:43:32,710] [    INFO] - bias_correction               : False
[2024-12-25 11:43:32,710] [    INFO] - current_device                : npu:0
[2024-12-25 11:43:32,711] [    INFO] - data_parallel_rank            : 0
[2024-12-25 11:43:32,711] [    INFO] - dataloader_drop_last          : False
[2024-12-25 11:43:32,711] [    INFO] - dataloader_num_workers        : 0
[2024-12-25 11:43:32,711] [    INFO] - dataset_rank                  : 0
[2024-12-25 11:43:32,711] [    INFO] - dataset_world_size            : 1
[2024-12-25 11:43:32,711] [    INFO] - device                        : npu
[2024-12-25 11:43:32,711] [    INFO] - disable_tqdm                  : False
[2024-12-25 11:43:32,711] [    INFO] - distributed_dataloader        : False
[2024-12-25 11:43:32,711] [    INFO] - do_compress                   : False
[2024-12-25 11:43:32,711] [    INFO] - do_eval                       : True
[2024-12-25 11:43:32,712] [    INFO] - do_export                     : True
[2024-12-25 11:43:32,712] [    INFO] - do_predict                    : False
[2024-12-25 11:43:32,712] [    INFO] - do_train                      : True
[2024-12-25 11:43:32,712] [    INFO] - eval_accumulation_steps       : None
[2024-12-25 11:43:32,712] [    INFO] - eval_batch_size               : 16
[2024-12-25 11:43:32,712] [    INFO] - eval_steps                    : 100
[2024-12-25 11:43:32,712] [    INFO] - evaluation_strategy           : IntervalStrategy.STEPS
[2024-12-25 11:43:32,712] [    INFO] - flatten_param_grads           : False
[2024-12-25 11:43:32,712] [    INFO] - fp16                          : False
[2024-12-25 11:43:32,712] [    INFO] - fp16_full_eval                : False
[2024-12-25 11:43:32,713] [    INFO] - fp16_opt_level                : O1
[2024-12-25 11:43:32,713] [    INFO] - gradient_accumulation_steps   : 1
[2024-12-25 11:43:32,713] [    INFO] - greater_is_better             : True
[2024-12-25 11:43:32,713] [    INFO] - hybrid_parallel_topo_order    : None
[2024-12-25 11:43:32,713] [    INFO] - ignore_data_skip              : False
[2024-12-25 11:43:32,713] [    INFO] - input_dtype                   : int64
[2024-12-25 11:43:32,713] [    INFO] - input_infer_model_path        : None
[2024-12-25 11:43:32,713] [    INFO] - label_names                   : ['start_positions', 'end_positions']
[2024-12-25 11:43:32,713] [    INFO] - lazy_data_processing          : True
[2024-12-25 11:43:32,713] [    INFO] - learning_rate                 : 1e-05
[2024-12-25 11:43:32,714] [    INFO] - load_best_model_at_end        : True
[2024-12-25 11:43:32,714] [    INFO] - load_sharded_model            : False
[2024-12-25 11:43:32,714] [    INFO] - local_process_index           : 0
[2024-12-25 11:43:32,714] [    INFO] - local_rank                    : -1
[2024-12-25 11:43:32,714] [    INFO] - log_level                     : -1
[2024-12-25 11:43:32,714] [    INFO] - log_level_replica             : -1
[2024-12-25 11:43:32,714] [    INFO] - log_on_each_node              : True
[2024-12-25 11:43:32,714] [    INFO] - logging_dir                   : ./checkpoint/model_best/runs/Dec25_11-43-15_5926b66120ca
[2024-12-25 11:43:32,714] [    INFO] - logging_first_step            : False
[2024-12-25 11:43:32,714] [    INFO] - logging_steps                 : 10
[2024-12-25 11:43:32,715] [    INFO] - logging_strategy              : IntervalStrategy.STEPS
[2024-12-25 11:43:32,715] [    INFO] - lr_end                        : 1e-07
[2024-12-25 11:43:32,715] [    INFO] - lr_scheduler_type             : SchedulerType.LINEAR
[2024-12-25 11:43:32,715] [    INFO] - max_evaluate_steps            : -1
[2024-12-25 11:43:32,715] [    INFO] - max_grad_norm                 : 1.0
[2024-12-25 11:43:32,715] [    INFO] - max_steps                     : -1
[2024-12-25 11:43:32,715] [    INFO] - metric_for_best_model         : eval_f1
[2024-12-25 11:43:32,715] [    INFO] - minimum_eval_times            : None
[2024-12-25 11:43:32,715] [    INFO] - moving_rate                   : 0.9
[2024-12-25 11:43:32,715] [    INFO] - no_cuda                       : False
[2024-12-25 11:43:32,716] [    INFO] - num_cycles                    : 0.5
[2024-12-25 11:43:32,716] [    INFO] - num_train_epochs              : 20.0
[2024-12-25 11:43:32,716] [    INFO] - onnx_format                   : True
[2024-12-25 11:43:32,716] [    INFO] - optim                         : OptimizerNames.ADAMW
[2024-12-25 11:43:32,716] [    INFO] - optimizer_name_suffix         : None
[2024-12-25 11:43:32,716] [    INFO] - output_dir                    : ./checkpoint/model_best
[2024-12-25 11:43:32,716] [    INFO] - overwrite_output_dir          : True
[2024-12-25 11:43:32,716] [    INFO] - past_index                    : -1
[2024-12-25 11:43:32,716] [    INFO] - per_device_eval_batch_size    : 16
[2024-12-25 11:43:32,716] [    INFO] - per_device_train_batch_size   : 16
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_config      :
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_degree      : -1
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_rank        : 0
[2024-12-25 11:43:32,717] [    INFO] - power                         : 1.0
[2024-12-25 11:43:32,717] [    INFO] - prediction_loss_only          : False
[2024-12-25 11:43:32,717] [    INFO] - process_index                 : 0
[2024-12-25 11:43:32,717] [    INFO] - prune_embeddings              : False
[2024-12-25 11:43:32,717] [    INFO] - recompute                     : False
[2024-12-25 11:43:32,718] [    INFO] - remove_unused_columns         : True
[2024-12-25 11:43:32,718] [    INFO] - report_to                     : ['visualdl']
[2024-12-25 11:43:32,718] [    INFO] - resume_from_checkpoint        : None
[2024-12-25 11:43:32,718] [    INFO] - round_type                    : round
[2024-12-25 11:43:32,718] [    INFO] - run_name                      : ./checkpoint/model_best
[2024-12-25 11:43:32,718] [    INFO] - save_on_each_node             : False
[2024-12-25 11:43:32,718] [    INFO] - save_sharded_model            : False
[2024-12-25 11:43:32,718] [    INFO] - save_steps                    : 100
[2024-12-25 11:43:32,718] [    INFO] - save_strategy                 : IntervalStrategy.STEPS
[2024-12-25 11:43:32,718] [    INFO] - save_total_limit              : 1
[2024-12-25 11:43:32,719] [    INFO] - scale_loss                    : 32768
[2024-12-25 11:43:32,719] [    INFO] - seed                          : 1000
[2024-12-25 11:43:32,719] [    INFO] - sharding                      : []
[2024-12-25 11:43:32,719] [    INFO] - sharding_degree               : -1
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_config      :
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_degree      : -1
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_rank        : 0
[2024-12-25 11:43:32,719] [    INFO] - should_load_dataset           : True
[2024-12-25 11:43:32,719] [    INFO] - should_load_sharding_stage1_model: False
[2024-12-25 11:43:32,719] [    INFO] - should_log                    : True
[2024-12-25 11:43:32,720] [    INFO] - should_save                   : True
[2024-12-25 11:43:32,720] [    INFO] - should_save_model_state       : True
[2024-12-25 11:43:32,720] [    INFO] - should_save_sharding_stage1_model: False
[2024-12-25 11:43:32,720] [    INFO] - skip_memory_metrics           : True
[2024-12-25 11:43:32,720] [    INFO] - skip_profile_timer            : True
[2024-12-25 11:43:32,720] [    INFO] - strategy                      : dynabert+ptq
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_config        :
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_degree        : -1
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_rank          : 0
[2024-12-25 11:43:32,720] [    INFO] - train_batch_size              : 16
[2024-12-25 11:43:32,721] [    INFO] - use_hybrid_parallel           : False
[2024-12-25 11:43:32,721] [    INFO] - use_pact                      : True
[2024-12-25 11:43:32,721] [    INFO] - warmup_ratio                  : 0.1
[2024-12-25 11:43:32,721] [    INFO] - warmup_steps                  : 0
[2024-12-25 11:43:32,721] [    INFO] - weight_decay                  : 0.0
[2024-12-25 11:43:32,721] [    INFO] - weight_name_suffix            : None
[2024-12-25 11:43:32,721] [    INFO] - weight_quantize_type          : channel_wise_abs_max
[2024-12-25 11:43:32,721] [    INFO] - width_mult_list               : None
[2024-12-25 11:43:32,721] [    INFO] - world_size                    : 1
[2024-12-25 11:43:32,721] [    INFO] -
[2024-12-25 11:43:32,722] [    INFO] - ***** Running training *****
[2024-12-25 11:43:32,723] [    INFO] -   Num examples = 1,167
[2024-12-25 11:43:32,723] [    INFO] -   Num Epochs = 20
[2024-12-25 11:43:32,723] [    INFO] -   Instantaneous batch size per device = 16
[2024-12-25 11:43:32,723] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 16
[2024-12-25 11:43:32,723] [    INFO] -   Gradient Accumulation steps = 1
[2024-12-25 11:43:32,723] [    INFO] -   Total optimization steps = 1,460
[2024-12-25 11:43:32,723] [    INFO] -   Total num train samples = 23,340
[2024-12-25 11:43:32,725] [    INFO] -   Number of trainable parameters = 117,946,370 (per device)
  0%|                                                                                                          | 0/1460 [00:00<?, ?it/s]/app/output/PaddleNLP-2.6.1/paddlenlp/transformers/tokenizer_utils_base.py:2478: FutureWarning: The `max_seq_len` argument is deprecated and will be removed in a future version, please use `max_length` instead.
  warnings.warn(
/app/output/PaddleNLP-2.6.1/paddlenlp/transformers/tokenizer_utils_base.py:1878: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  warnings.warn(
  1%|▋                                                                                                | 10/1460 [01:53<41:33,  1.72s/it]loss: nan, learning_rate: 1e-05, global_step: 10, interval_runtime: 113.6572, interval_samples_per_second: 1.4077422000673658, interval_steps_per_second: 0.08798388750421036, epoch: 0.137
loss: nan, learning_rate: 1e-05, global_step: 20, interval_runtime: 3.6064, interval_samples_per_second: 44.36522606484956, interval_steps_per_second: 2.7728266290530974, epoch: 0.274
loss: nan, learning_rate: 1e-05, global_step: 30, interval_runtime: 3.6258, interval_samples_per_second: 44.127987665259184, interval_steps_per_second: 2.757999229078699, epoch: 0.411
loss: nan, learning_rate: 1e-05, global_step: 40, interval_runtime: 3.5736, interval_samples_per_second: 44.77264604545965, interval_steps_per_second: 2.7982903778412282, epoch: 0.5479
loss: nan, learning_rate: 1e-05, global_step: 50, interval_runtime: 3.6283, interval_samples_per_second: 44.09766862567628, interval_steps_per_second: 2.7561042891047673, epoch: 0.6849
loss: 0.0, learning_rate: 1e-05, global_step: 60, interval_runtime: 3.6448, interval_samples_per_second: 43.8985043642171, interval_steps_per_second: 2.7436565227635685, epoch: 0.8219
loss: nan, learning_rate: 1e-05, global_step: 70, interval_runtime: 3.7433, interval_samples_per_second: 42.74318439757554, interval_steps_per_second: 2.6714490248484712, epoch: 0.9589
  5%|████▊                                                                                            | 72/1460 [02:16<07:23,  3.13it/s]loss: nan, learning_rate: 1e-05, global_step: 80, interval_runtime: 70.2024, interval_samples_per_second: 2.2791238350855676, interval_steps_per_second: 0.14244523969284797, epoch: 1.0959
loss: nan, learning_rate: 1e-05, global_step: 90, interval_runtime: 3.7313, interval_samples_per_second: 42.88079748004118, interval_steps_per_second: 2.680049842502574, epoch: 1.2329
loss: nan, learning_rate: 1e-05, global_step: 100, interval_runtime: 3.8448, interval_samples_per_second: 41.61510258783223, interval_steps_per_second: 2.6009439117395146, epoch: 1.3699
  7%|██████▌                                                                                         | 100/1460 [03:33<09:22,  2.42it/s][2024-12-25 11:47:05,984] [    INFO] - ***** Running Evaluation *****
[2024-12-25 11:47:05,985] [    INFO] -   Num examples = 120
[2024-12-25 11:47:05,985] [    INFO] -   Total prediction steps = 8
[2024-12-25 11:47:05,985] [    INFO] -   Pre device batch size = 16
[2024-12-25 11:47:05,985] [    INFO] -   
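
The log cuts off at the start of the first evaluation, and the thread has no replies. Two generic ways to narrow a loss=NaN like this down (a debugging sketch, not a confirmed fix): PaddlePaddle's `FLAGS_check_nan_inf` flag makes the framework raise at the first op that produces NaN/Inf, pointing at the offending kernel, and running one forward/backward step on both cpu and npu shows whether the NPU kernels alone are responsible. The `linear_start` attribute below is the UIE span-start head as defined in PaddleNLP; the input text and all-zero labels are placeholders, not real annotations:

```python
# Debugging sketch (assumptions: model path from the log above; UIE exposes a
# `linear_start` span head as in PaddleNLP 2.6; text and labels are placeholders;
# FLAGS_check_nan_inf may or may not be supported by the custom-device kernels).
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import UIE, AutoTokenizer

paddle.set_flags({"FLAGS_check_nan_inf": True})  # raise at the first NaN/Inf op

model_dir = "/app/output/PaddleNLP-2.6.1/uie-base/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

for device in ("cpu", "npu"):
    paddle.set_device(device)
    model = UIE.from_pretrained(model_dir)
    enc = tokenizer(["时间[SEP]2024年12月25日"], return_tensors="pd")
    start_prob, end_prob = model(**enc)
    # Same binary-cross-entropy span objective finetune.py uses, with dummy labels.
    loss = F.binary_cross_entropy(start_prob, paddle.zeros_like(start_prob)) \
         + F.binary_cross_entropy(end_prob, paddle.zeros_like(end_prob))
    loss.backward()
    grad = model.linear_start.weight.grad
    print(device, "loss:", float(loss),
          "NaN in grad:", paddle.isnan(grad).any().item())
```

If the cpu pass is clean while the npu pass produces NaN under the same weights and inputs, that points at a custom-device kernel rather than the data or the training configuration.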