ERROR:tensorflow:Model diverged with loss = NaN #80

Open
11lucky111 opened this issue Aug 13, 2023 · 0 comments

When I try to train the GQN, the run fails right after step 0 with a NaN loss. Can you help me?
Here is the full command and console output:

(tensorflowgqn) E:\LH\tf-gqn-master-2\tf-gqn-master>python train_gqn.py ^
More?   --data_dir data\gqn-dataset ^
More?   --dataset shepard_metzler_5_parts ^
More?   --model_dir models\shepard_metzler_5_parts\gqn
Training a GQN.
PARSED ARGV: Namespace(adam_lr_alpha=0.0005, adam_lr_beta=5e-05, anneal_lr_tau=1600000, batch_size=36, chkpt_steps=10000, context_size=5, data_dir='data\\gqn-dataset', dataset='shepard_metzler_5_parts', debug=False, img_size=64, initial_eval=False, log_steps=100, memcap=1.0, model_dir='models\\shepard_metzler_5_parts\\gqn', queue_buffer=4, queue_threads=4, seq_length=8, train_epochs=2)
UNPARSED ARGV: []
Saved model config to models\shepard_metzler_5_parts\gqn\gqn_config.json
INFO:tensorflow:Using config: {'_model_dir': 'models\\shepard_metzler_5_parts\\gqn', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10000, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
  allow_growth: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001B00D429DA0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2023-08-13 15:47:55.159960: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2023-08-13 15:47:55.256456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.665
pciBusID: 0000:01:00.0
totalMemory: 12.00GiB freeMemory: 10.84GiB
2023-08-13 15:47:55.257468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-08-13 15:47:55.704851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-08-13 15:47:55.704983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2023-08-13 15:47:55.705062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2023-08-13 15:47:55.705225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12287 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into models\shepard_metzler_5_parts\gqn\model.ckpt.
INFO:tensorflow:l2_reconstruction = [0.         0.03854151]
INFO:tensorflow:loss = 598648.6, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "train_gqn.py", line 227, in <module>
    tf.app.run(argv=[sys.argv[0]] + UNPARSED_ARGS)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gqn.py", line 212, in main
    hooks=[logging_hook],
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
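
For anyone triaging a log like the one above, here is a minimal sketch that pulls out the last finite loss and the step it was logged at, and flags whether the run diverged. It uses only the Python standard library. The function name `summarize_log` and the regular expression are hypothetical helpers written for this post, not part of tf-gqn or TensorFlow, and they assume the log-line wording shown in the output above.

```python
import math
import re

# Assumed format, taken from the log above:
#   "INFO:tensorflow:loss = 598648.6, step = 0"
LOSS_RE = re.compile(r"INFO:tensorflow:loss = (\S+), step = (\d+)")
DIVERGED_MARKER = "Model diverged with loss = NaN"


def summarize_log(lines):
    """Return (last_finite_loss, step_of_that_loss, diverged) for log lines."""
    last_loss, last_step, diverged = None, None, False
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            value = float(match.group(1))  # float() also accepts "nan"/"inf"
            if math.isfinite(value):
                last_loss, last_step = value, int(match.group(2))
        if DIVERGED_MARKER in line:
            diverged = True
    return last_loss, last_step, diverged


# Three lines copied from the output above.
sample = [
    "INFO:tensorflow:l2_reconstruction = [0.         0.03854151]",
    "INFO:tensorflow:loss = 598648.6, step = 0",
    "ERROR:tensorflow:Model diverged with loss = NaN.",
]
print(summarize_log(sample))  # (598648.6, 0, True)
```

On this log it reports that the only finite loss was 598648.6 at step 0 and that the run diverged immediately afterwards, so the NaN appears on the very first training step rather than after a gradual blow-up.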