When I try to train, the run fails with a NaN loss right after step 0. Can you help me? Here is the command I ran and the full output:
(tensorflowgqn) E:\LH\tf-gqn-master-2\tf-gqn-master>python train_gqn.py ^
More? --data_dir data\gqn-dataset ^
More? --dataset shepard_metzler_5_parts ^
More? --model_dir models\shepard_metzler_5_parts\gqn
Training a GQN.
PARSED ARGV: Namespace(adam_lr_alpha=0.0005, adam_lr_beta=5e-05, anneal_lr_tau=1600000, batch_size=36, chkpt_steps=10000, context_size=5, data_dir='data\\gqn-dataset', dataset='shepard_metzler_5_parts', debug=False, img_size=64, initial_eval=False, log_steps=100, memcap=1.0, model_dir='models\\shepard_metzler_5_parts\\gqn', queue_buffer=4, queue_threads=4, seq_length=8, train_epochs=2)
UNPARSED ARGV: []
Saved model config to models\shepard_metzler_5_parts\gqn\gqn_config.json
INFO:tensorflow:Using config: {'_model_dir': 'models\\shepard_metzler_5_parts\\gqn', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10000, '_save_checkpoints_secs': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 allow_growth: true } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001B00D429DA0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2023-08-13 15:47:55.159960: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2023-08-13 15:47:55.256456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.665
pciBusID: 0000:01:00.0
totalMemory: 12.00GiB freeMemory: 10.84GiB
2023-08-13 15:47:55.257468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-08-13 15:47:55.704851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-08-13 15:47:55.704983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2023-08-13 15:47:55.705062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2023-08-13 15:47:55.705225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12287 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into models\shepard_metzler_5_parts\gqn\model.ckpt.
INFO:tensorflow:l2_reconstruction = [0. 0.03854151]
INFO:tensorflow:loss = 598648.6, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "train_gqn.py", line 227, in <module>
    tf.app.run(argv=[sys.argv[0]] + UNPARSED_ARGS)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gqn.py", line 212, in main
    hooks=[logging_hook],
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "E:\JZL\deeping_learning\Anaconda\envs\tensorflowgqn\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.