[Question]: Ascend 910B3: Paddle UIE inference works normally, but training loss = NaN #9693

Open
modderBUG opened this issue Dec 25, 2024 · 0 comments
Labels: question (Further information is requested)

@modderBUG

Please describe your question

PaddlePaddle==2.6.1
PaddleCustomDevice==2.6 (built from source)
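
As a first sanity check (not part of the original report), something like the following can confirm that the custom NPU runtime is actually registered with Paddle. `paddle.device.get_all_custom_device_type()` and `paddle.utils.run_check()` are standard Paddle APIs; this is a sketch assuming the environment above:

```python
# Sanity-check sketch: verify the custom NPU runtime is visible to Paddle.
# Assumes PaddlePaddle 2.6.1 with the PaddleCustomDevice NPU plugin installed.
import paddle

# Should include "npu" once libpaddle-custom-npu.so has been loaded.
print(paddle.device.get_all_custom_device_type())

# Runs a small built-in program to verify the installation end to end.
paddle.utils.run_check()
```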

λ 5926b66120ca /app/output/PaddleNLP-2.6.1 python ./applications/information_extraction/text/finetune.py \
    --device npu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path /app/output/PaddleNLP-2.6.1/uie-base/ \
    --output_dir ./checkpoint/model_best \
    --train_path ./applications/information_extraction/text/data/train.txt \
    --dev_path ./applications/information_extraction/text/data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm False \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
I1225 11:43:11.952183 131826 init.cc:233] ENV [CUSTOM_DEVICE_ROOT]=/usr/local/lib/python3.9/dist-packages/paddle_custom_device
I1225 11:43:11.952232 131826 init.cc:142] Try loading custom device libs from: [/usr/local/lib/python3.9/dist-packages/paddle_custom_device]
I1225 11:43:12.406082 131826 custom_device.cc:1108] Successed in loading custom runtime in lib: /usr/local/lib/python3.9/dist-packages/paddle_custom_device/libpaddle-custom-npu.so
I1225 11:43:12.409425 131826 custom_kernel.cc:63] Successed in loading 326 custom kernel(s) from loaded lib(s), will be used like native ones.
I1225 11:43:12.409597 131826 init.cc:154] Finished in LoadCustomDevice with libs_path: [/usr/local/lib/python3.9/dist-packages/paddle_custom_device]
I1225 11:43:12.409634 131826 init.cc:239] CustomDevice: npu, visible devices count: 1
/usr/local/lib/python3.9/dist-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2024-12-25 11:43:15,415] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2024-12-25 11:43:15,415] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2024-12-25 11:43:15,415] [    INFO] - ============================================================
[2024-12-25 11:43:15,415] [    INFO] -      Model Configuration Arguments
[2024-12-25 11:43:15,415] [    INFO] - paddle commit id              :fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:15,416] [    INFO] - export_model_dir              :./checkpoint/model_best
[2024-12-25 11:43:15,416] [    INFO] - model_name_or_path            :/app/output/PaddleNLP-2.6.1/uie-base/
[2024-12-25 11:43:15,416] [    INFO] - multilingual                  :False
[2024-12-25 11:43:15,416] [    INFO] -
[2024-12-25 11:43:15,416] [    INFO] - ============================================================
[2024-12-25 11:43:15,416] [    INFO] -       Data Configuration Arguments
[2024-12-25 11:43:15,417] [    INFO] - paddle commit id              :fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:15,417] [    INFO] - dev_path                      :./applications/information_extraction/text/data/dev.txt
[2024-12-25 11:43:15,417] [    INFO] - dynamic_max_length            :None
[2024-12-25 11:43:15,417] [    INFO] - max_seq_length                :512
[2024-12-25 11:43:15,417] [    INFO] - train_path                    :./applications/information_extraction/text/data/train.txt
[2024-12-25 11:43:15,417] [    INFO] -
[2024-12-25 11:43:15,418] [ WARNING] - Process rank: -1, device: npu, world_size: 1, distributed training: False, 16-bits training: False
[2024-12-25 11:43:15,418] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load '/app/output/PaddleNLP-2.6.1/uie-base/'.
[2024-12-25 11:43:15,441] [    INFO] - Loading configuration file /app/output/PaddleNLP-2.6.1/uie-base/config.json
[2024-12-25 11:43:15,441] [    INFO] - Loading weights file /app/output/PaddleNLP-2.6.1/uie-base/model_state.pdparams
[2024-12-25 11:43:15,708] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-12-25 11:43:32,588] [    INFO] - All model checkpoint weights were used when initializing UIE.

[2024-12-25 11:43:32,589] [    INFO] - All the weights of UIE were initialized from the model checkpoint at /app/output/PaddleNLP-2.6.1/uie-base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2024-12-25 11:43:32,708] [    INFO] - ============================================================
[2024-12-25 11:43:32,709] [    INFO] -     Training Configuration Arguments
[2024-12-25 11:43:32,709] [    INFO] - paddle commit id              : fbf852dd832bc0e63ae31cd4aa37defd829e4c03
[2024-12-25 11:43:32,709] [    INFO] - _no_sync_in_gradient_accumulation: True
[2024-12-25 11:43:32,709] [    INFO] - activation_quantize_type      : None
[2024-12-25 11:43:32,709] [    INFO] - adam_beta1                    : 0.9
[2024-12-25 11:43:32,709] [    INFO] - adam_beta2                    : 0.999
[2024-12-25 11:43:32,709] [    INFO] - adam_epsilon                  : 1e-08
[2024-12-25 11:43:32,709] [    INFO] - algo_list                     : None
[2024-12-25 11:43:32,710] [    INFO] - amp_custom_black_list         : None
[2024-12-25 11:43:32,710] [    INFO] - amp_custom_white_list         : None
[2024-12-25 11:43:32,710] [    INFO] - amp_master_grad               : False
[2024-12-25 11:43:32,710] [    INFO] - batch_num_list                : None
[2024-12-25 11:43:32,710] [    INFO] - batch_size_list               : None
[2024-12-25 11:43:32,710] [    INFO] - bf16                          : False
[2024-12-25 11:43:32,710] [    INFO] - bf16_full_eval                : False
[2024-12-25 11:43:32,710] [    INFO] - bias_correction               : False
[2024-12-25 11:43:32,710] [    INFO] - current_device                : npu:0
[2024-12-25 11:43:32,711] [    INFO] - data_parallel_rank            : 0
[2024-12-25 11:43:32,711] [    INFO] - dataloader_drop_last          : False
[2024-12-25 11:43:32,711] [    INFO] - dataloader_num_workers        : 0
[2024-12-25 11:43:32,711] [    INFO] - dataset_rank                  : 0
[2024-12-25 11:43:32,711] [    INFO] - dataset_world_size            : 1
[2024-12-25 11:43:32,711] [    INFO] - device                        : npu
[2024-12-25 11:43:32,711] [    INFO] - disable_tqdm                  : False
[2024-12-25 11:43:32,711] [    INFO] - distributed_dataloader        : False
[2024-12-25 11:43:32,711] [    INFO] - do_compress                   : False
[2024-12-25 11:43:32,711] [    INFO] - do_eval                       : True
[2024-12-25 11:43:32,712] [    INFO] - do_export                     : True
[2024-12-25 11:43:32,712] [    INFO] - do_predict                    : False
[2024-12-25 11:43:32,712] [    INFO] - do_train                      : True
[2024-12-25 11:43:32,712] [    INFO] - eval_accumulation_steps       : None
[2024-12-25 11:43:32,712] [    INFO] - eval_batch_size               : 16
[2024-12-25 11:43:32,712] [    INFO] - eval_steps                    : 100
[2024-12-25 11:43:32,712] [    INFO] - evaluation_strategy           : IntervalStrategy.STEPS
[2024-12-25 11:43:32,712] [    INFO] - flatten_param_grads           : False
[2024-12-25 11:43:32,712] [    INFO] - fp16                          : False
[2024-12-25 11:43:32,712] [    INFO] - fp16_full_eval                : False
[2024-12-25 11:43:32,713] [    INFO] - fp16_opt_level                : O1
[2024-12-25 11:43:32,713] [    INFO] - gradient_accumulation_steps   : 1
[2024-12-25 11:43:32,713] [    INFO] - greater_is_better             : True
[2024-12-25 11:43:32,713] [    INFO] - hybrid_parallel_topo_order    : None
[2024-12-25 11:43:32,713] [    INFO] - ignore_data_skip              : False
[2024-12-25 11:43:32,713] [    INFO] - input_dtype                   : int64
[2024-12-25 11:43:32,713] [    INFO] - input_infer_model_path        : None
[2024-12-25 11:43:32,713] [    INFO] - label_names                   : ['start_positions', 'end_positions']
[2024-12-25 11:43:32,713] [    INFO] - lazy_data_processing          : True
[2024-12-25 11:43:32,713] [    INFO] - learning_rate                 : 1e-05
[2024-12-25 11:43:32,714] [    INFO] - load_best_model_at_end        : True
[2024-12-25 11:43:32,714] [    INFO] - load_sharded_model            : False
[2024-12-25 11:43:32,714] [    INFO] - local_process_index           : 0
[2024-12-25 11:43:32,714] [    INFO] - local_rank                    : -1
[2024-12-25 11:43:32,714] [    INFO] - log_level                     : -1
[2024-12-25 11:43:32,714] [    INFO] - log_level_replica             : -1
[2024-12-25 11:43:32,714] [    INFO] - log_on_each_node              : True
[2024-12-25 11:43:32,714] [    INFO] - logging_dir                   : ./checkpoint/model_best/runs/Dec25_11-43-15_5926b66120ca
[2024-12-25 11:43:32,714] [    INFO] - logging_first_step            : False
[2024-12-25 11:43:32,714] [    INFO] - logging_steps                 : 10
[2024-12-25 11:43:32,715] [    INFO] - logging_strategy              : IntervalStrategy.STEPS
[2024-12-25 11:43:32,715] [    INFO] - lr_end                        : 1e-07
[2024-12-25 11:43:32,715] [    INFO] - lr_scheduler_type             : SchedulerType.LINEAR
[2024-12-25 11:43:32,715] [    INFO] - max_evaluate_steps            : -1
[2024-12-25 11:43:32,715] [    INFO] - max_grad_norm                 : 1.0
[2024-12-25 11:43:32,715] [    INFO] - max_steps                     : -1
[2024-12-25 11:43:32,715] [    INFO] - metric_for_best_model         : eval_f1
[2024-12-25 11:43:32,715] [    INFO] - minimum_eval_times            : None
[2024-12-25 11:43:32,715] [    INFO] - moving_rate                   : 0.9
[2024-12-25 11:43:32,715] [    INFO] - no_cuda                       : False
[2024-12-25 11:43:32,716] [    INFO] - num_cycles                    : 0.5
[2024-12-25 11:43:32,716] [    INFO] - num_train_epochs              : 20.0
[2024-12-25 11:43:32,716] [    INFO] - onnx_format                   : True
[2024-12-25 11:43:32,716] [    INFO] - optim                         : OptimizerNames.ADAMW
[2024-12-25 11:43:32,716] [    INFO] - optimizer_name_suffix         : None
[2024-12-25 11:43:32,716] [    INFO] - output_dir                    : ./checkpoint/model_best
[2024-12-25 11:43:32,716] [    INFO] - overwrite_output_dir          : True
[2024-12-25 11:43:32,716] [    INFO] - past_index                    : -1
[2024-12-25 11:43:32,716] [    INFO] - per_device_eval_batch_size    : 16
[2024-12-25 11:43:32,716] [    INFO] - per_device_train_batch_size   : 16
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_config      :
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_degree      : -1
[2024-12-25 11:43:32,717] [    INFO] - pipeline_parallel_rank        : 0
[2024-12-25 11:43:32,717] [    INFO] - power                         : 1.0
[2024-12-25 11:43:32,717] [    INFO] - prediction_loss_only          : False
[2024-12-25 11:43:32,717] [    INFO] - process_index                 : 0
[2024-12-25 11:43:32,717] [    INFO] - prune_embeddings              : False
[2024-12-25 11:43:32,717] [    INFO] - recompute                     : False
[2024-12-25 11:43:32,718] [    INFO] - remove_unused_columns         : True
[2024-12-25 11:43:32,718] [    INFO] - report_to                     : ['visualdl']
[2024-12-25 11:43:32,718] [    INFO] - resume_from_checkpoint        : None
[2024-12-25 11:43:32,718] [    INFO] - round_type                    : round
[2024-12-25 11:43:32,718] [    INFO] - run_name                      : ./checkpoint/model_best
[2024-12-25 11:43:32,718] [    INFO] - save_on_each_node             : False
[2024-12-25 11:43:32,718] [    INFO] - save_sharded_model            : False
[2024-12-25 11:43:32,718] [    INFO] - save_steps                    : 100
[2024-12-25 11:43:32,718] [    INFO] - save_strategy                 : IntervalStrategy.STEPS
[2024-12-25 11:43:32,718] [    INFO] - save_total_limit              : 1
[2024-12-25 11:43:32,719] [    INFO] - scale_loss                    : 32768
[2024-12-25 11:43:32,719] [    INFO] - seed                          : 1000
[2024-12-25 11:43:32,719] [    INFO] - sharding                      : []
[2024-12-25 11:43:32,719] [    INFO] - sharding_degree               : -1
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_config      :
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_degree      : -1
[2024-12-25 11:43:32,719] [    INFO] - sharding_parallel_rank        : 0
[2024-12-25 11:43:32,719] [    INFO] - should_load_dataset           : True
[2024-12-25 11:43:32,719] [    INFO] - should_load_sharding_stage1_model: False
[2024-12-25 11:43:32,719] [    INFO] - should_log                    : True
[2024-12-25 11:43:32,720] [    INFO] - should_save                   : True
[2024-12-25 11:43:32,720] [    INFO] - should_save_model_state       : True
[2024-12-25 11:43:32,720] [    INFO] - should_save_sharding_stage1_model: False
[2024-12-25 11:43:32,720] [    INFO] - skip_memory_metrics           : True
[2024-12-25 11:43:32,720] [    INFO] - skip_profile_timer            : True
[2024-12-25 11:43:32,720] [    INFO] - strategy                      : dynabert+ptq
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_config        :
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_degree        : -1
[2024-12-25 11:43:32,720] [    INFO] - tensor_parallel_rank          : 0
[2024-12-25 11:43:32,720] [    INFO] - train_batch_size              : 16
[2024-12-25 11:43:32,721] [    INFO] - use_hybrid_parallel           : False
[2024-12-25 11:43:32,721] [    INFO] - use_pact                      : True
[2024-12-25 11:43:32,721] [    INFO] - warmup_ratio                  : 0.1
[2024-12-25 11:43:32,721] [    INFO] - warmup_steps                  : 0
[2024-12-25 11:43:32,721] [    INFO] - weight_decay                  : 0.0
[2024-12-25 11:43:32,721] [    INFO] - weight_name_suffix            : None
[2024-12-25 11:43:32,721] [    INFO] - weight_quantize_type          : channel_wise_abs_max
[2024-12-25 11:43:32,721] [    INFO] - width_mult_list               : None
[2024-12-25 11:43:32,721] [    INFO] - world_size                    : 1
[2024-12-25 11:43:32,721] [    INFO] -
[2024-12-25 11:43:32,722] [    INFO] - ***** Running training *****
[2024-12-25 11:43:32,723] [    INFO] -   Num examples = 1,167
[2024-12-25 11:43:32,723] [    INFO] -   Num Epochs = 20
[2024-12-25 11:43:32,723] [    INFO] -   Instantaneous batch size per device = 16
[2024-12-25 11:43:32,723] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 16
[2024-12-25 11:43:32,723] [    INFO] -   Gradient Accumulation steps = 1
[2024-12-25 11:43:32,723] [    INFO] -   Total optimization steps = 1,460
[2024-12-25 11:43:32,723] [    INFO] -   Total num train samples = 23,340
[2024-12-25 11:43:32,725] [    INFO] -   Number of trainable parameters = 117,946,370 (per device)
  0%|                                                                                                          | 0/1460 [00:00<?, ?it/s]/app/output/PaddleNLP-2.6.1/paddlenlp/transformers/tokenizer_utils_base.py:2478: FutureWarning: The `max_seq_len` argument is deprecated and will be removed in a future version, please use `max_length` instead.
  warnings.warn(
/app/output/PaddleNLP-2.6.1/paddlenlp/transformers/tokenizer_utils_base.py:1878: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  warnings.warn(
  1%|▋                                                                                                | 10/1460 [01:53<41:33,  1.72s/it]loss: nan, learning_rate: 1e-05, global_step: 10, interval_runtime: 113.6572, interval_samples_per_second: 1.4077422000673658, interval_steps_per_second: 0.08798388750421036, epoch: 0.137
loss: nan, learning_rate: 1e-05, global_step: 20, interval_runtime: 3.6064, interval_samples_per_second: 44.36522606484956, interval_steps_per_second: 2.7728266290530974, epoch: 0.274
loss: nan, learning_rate: 1e-05, global_step: 30, interval_runtime: 3.6258, interval_samples_per_second: 44.127987665259184, interval_steps_per_second: 2.757999229078699, epoch: 0.411
loss: nan, learning_rate: 1e-05, global_step: 40, interval_runtime: 3.5736, interval_samples_per_second: 44.77264604545965, interval_steps_per_second: 2.7982903778412282, epoch: 0.5479
loss: nan, learning_rate: 1e-05, global_step: 50, interval_runtime: 3.6283, interval_samples_per_second: 44.09766862567628, interval_steps_per_second: 2.7561042891047673, epoch: 0.6849
loss: 0.0, learning_rate: 1e-05, global_step: 60, interval_runtime: 3.6448, interval_samples_per_second: 43.8985043642171, interval_steps_per_second: 2.7436565227635685, epoch: 0.8219
loss: nan, learning_rate: 1e-05, global_step: 70, interval_runtime: 3.7433, interval_samples_per_second: 42.74318439757554, interval_steps_per_second: 2.6714490248484712, epoch: 0.9589
  5%|████▊                                                                                            | 72/1460 [02:16<07:23,  3.13it/s]loss: nan, learning_rate: 1e-05, global_step: 80, interval_runtime: 70.2024, interval_samples_per_second: 2.2791238350855676, interval_steps_per_second: 0.14244523969284797, epoch: 1.0959
loss: nan, learning_rate: 1e-05, global_step: 90, interval_runtime: 3.7313, interval_samples_per_second: 42.88079748004118, interval_steps_per_second: 2.680049842502574, epoch: 1.2329
loss: nan, learning_rate: 1e-05, global_step: 100, interval_runtime: 3.8448, interval_samples_per_second: 41.61510258783223, interval_steps_per_second: 2.6009439117395146, epoch: 1.3699
  7%|██████▌                                                                                         | 100/1460 [03:33<09:22,  2.42it/s][2024-12-25 11:47:05,984] [    INFO] - ***** Running Evaluation *****
[2024-12-25 11:47:05,985] [    INFO] -   Num examples = 120
[2024-12-25 11:47:05,985] [    INFO] -   Total prediction steps = 8
[2024-12-25 11:47:05,985] [    INFO] -   Pre device batch size = 16
[2024-12-25 11:47:05,985] [    INFO] -   
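
The log cuts off at the start of the first evaluation, and the thread has no replies. Two generic ways to narrow a loss=NaN like this down (a debugging sketch, not a confirmed fix): PaddlePaddle's `FLAGS_check_nan_inf` flag makes the framework raise at the first op that produces NaN/Inf, pointing at the offending kernel, and running one forward/backward step on both cpu and npu shows whether the NPU kernels alone are responsible. The `linear_start` attribute below is the UIE span-start head as defined in PaddleNLP; the input text and all-zero labels are placeholders, not real annotations:

```python
# Debugging sketch (assumptions: model path from the log above; UIE exposes a
# `linear_start` span head as in PaddleNLP 2.6; text and labels are placeholders;
# FLAGS_check_nan_inf may or may not be supported by the custom-device kernels).
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import UIE, AutoTokenizer

paddle.set_flags({"FLAGS_check_nan_inf": True})  # raise at the first NaN/Inf op

model_dir = "/app/output/PaddleNLP-2.6.1/uie-base/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

for device in ("cpu", "npu"):
    paddle.set_device(device)
    model = UIE.from_pretrained(model_dir)
    enc = tokenizer(["时间[SEP]2024年12月25日"], return_tensors="pd")
    start_prob, end_prob = model(**enc)
    # Same binary-cross-entropy span objective finetune.py uses, with dummy labels.
    loss = F.binary_cross_entropy(start_prob, paddle.zeros_like(start_prob)) \
         + F.binary_cross_entropy(end_prob, paddle.zeros_like(end_prob))
    loss.backward()
    grad = model.linear_start.weight.grad
    print(device, "loss:", float(loss),
          "NaN in grad:", paddle.isnan(grad).any().item())
```

If the cpu pass is clean while the npu pass produces NaN under the same weights and inputs, that points at a custom-device kernel rather than the data or the training configuration.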