Got loss=nan when training on gpu #730

fkurushin · 2025-01-22T07:07:32Z

Training on GPU produces loss=nan and sometimes fails with a CUDA error (again caused by the loss calculation). When I set calculate_training_loss=False, the model trains absolutely fine. With calculate_training_loss=True, I get:

Using GPU: 3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.19s/it, loss=nan]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25728193/25728193 [07:42<00:00, 55622.56it/s]
NDCG for qvec 0: 1.26 %
CuPy cache cleared on GPU 3
Using GPU: 4
  7%|█████████████▍                                                                                                                                                                                           | 1/15 [00:12<02:52, 12.33s/it]
Traceback (most recent call last):
  File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 74, in <module>
    main(args.gpus, args.sparse_matrix_paths, args.n_factors)
  File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 53, in main
    model.fit(train)
  File "/home/fkurushin/venv/implicit/lib/python3.11/site-packages/implicit/gpu/als.py", line 166, in fit
    loss = self.solver.calculate_loss(Cui, X, Y, self.regularization)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_cuda.pyx", line 265, in implicit.gpu._cuda.LeastSquaresSolver.calculate_loss
RuntimeError: Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/als.cu:276)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/matrix.cu:246)
Aborted

I am fairly sure the problem is not caused by data types or variable overflow.
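
For anyone trying to reproduce this, a sanity check of the input matrix might look like the sketch below. `check_interactions` is a hypothetical helper, not part of implicit. The individual checks (finite values, values within float32 range, sizes and indices within int32 range) are assumptions about what could plausibly go wrong on the GPU path, not a diagnosis of this bug.

```python
import numpy as np
from scipy import sparse

INT32_MAX = np.iinfo(np.int32).max
FLOAT32_MAX = float(np.finfo(np.float32).max)


def check_interactions(matrix):
    """Return a list of problems found in a user-item matrix (empty if none).

    Hypothetical helper: the checks assume the GPU code works with
    float32 values and 32-bit indices, which is not confirmed here.
    """
    csr = sparse.csr_matrix(matrix)
    problems = []

    # NaN or inf in the stored values would propagate straight into the loss.
    if not np.all(np.isfinite(csr.data)):
        problems.append("non-finite values in data")

    # Finite values that do not fit in float32 become inf after a cast.
    finite = csr.data[np.isfinite(csr.data)]
    if finite.size and np.abs(finite).max() > FLOAT32_MAX:
        problems.append("values exceed float32 range")

    # Sizes that overflow a signed 32-bit integer.
    if csr.nnz > INT32_MAX or max(csr.shape) > INT32_MAX:
        problems.append("matrix size exceeds int32 range")

    # Column indices must lie inside [0, n_cols).
    if csr.nnz and (csr.indices.min() < 0 or csr.indices.max() >= csr.shape[1]):
        problems.append("column index out of bounds")

    # Row pointers must start at 0, end at nnz, and never decrease.
    indptr = csr.indptr
    if indptr[0] != 0 or indptr[-1] != csr.nnz or np.any(np.diff(indptr) < 0):
        problems.append("malformed indptr")

    return problems
```

An empty list from `check_interactions(train)` only rules out these particular input problems. It says nothing about what happens inside the loss calculation itself.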

Additional Information:

implicit: 0.7.2 (built from source)
Python: 3.11.2
CUDA: 12.3
OS: Debian GNU/Linux 12
Scipy: 1.14.0
