
Cannot reproduce model results #20

Open · arvoelke opened this issue Apr 10, 2017 · 14 comments

@arvoelke commented Apr 10, 2017

After running all of the steps in the README.md verbatim, the prediction_scores.txt contains:

Model MSE: 0.201040
Previous Frame MSE: 0.021246

and all of the generated plots contain only two colours (pink and blue). For example, here are the first five:
[Images: plot_0, plot_3, plot_4, plot_5, plot_7]

I am using Python 2.7, Theano==0.9.0 (GPU, with pygpu==0.6.2), and Keras==1.0.8 (higher versions seem to be incompatible; see #18). I am not using cuDNN.
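For reference, the two numbers come from kitti_evaluate.py, which (if I'm reading it correctly) compares the model's next-frame predictions against simply copying the previous frame. A sketch:

```python
import numpy as np

# X_test: ground-truth sequences, X_hat: model predictions, both shaped
# (n_sequences, nt, ...); the first frame is skipped since nothing predicts it
mse_model = np.mean((X_test[:, 1:] - X_hat[:, 1:]) ** 2)   # Model MSE
mse_prev = np.mean((X_test[:, 1:] - X_test[:, :-1]) ** 2)  # Previous Frame MSE
```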

@bill-lotter (Contributor)

I'm not sure why this happened; it's never happened for me. This was using weights that you trained, correct, and not the downloaded weights? Maybe it was just a particularly bad initialization and the training got stuck. Have you tried again and run into the same issue? Have you tried with cuDNN?

@arvoelke (Author)

I wasn't able to get cuDNN working. I get the message:

Can not use cuDNN on context None: Device not supported
Mapped name None to device cuda: Tesla C2075 (0000:02:00.0)

(My guess is that the Tesla C2075 is simply too old: cuDNN requires a newer compute capability than Fermi-generation cards provide.)

I used the weights that I trained. If I use the downloaded weights, then it works! :) But I would really like to train it myself (I intend to experiment with some minor variants).

I will try retraining, but it will take my GPU ~5 days, so I'll report back with the results then.

@bill-lotter (Contributor)

Hmm, yeah, I'm not sure about the cuDNN issue, but if you can get it working on your machine, it should speed up training a lot. If you end up training again, I would run model.fit_generator with verbose=1; then you should at least be able to see whether it's getting anywhere as it starts training. Something like the sketch below.
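A sketch against the Keras 1.x API; the generator, callback, and count names here are stand-ins for whatever kitti_train.py actually defines:

```python
# Keras 1.x-style call with per-batch progress output enabled
model.fit_generator(train_generator,
                    samples_per_epoch=500,           # 500 per epoch, per the repo defaults
                    nb_epoch=150,
                    verbose=1,                       # prints a running loss per batch
                    callbacks=callbacks,
                    validation_data=val_generator,
                    nb_val_samples=n_val_sequences)  # hypothetical name
```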

@arvoelke (Author) commented Apr 14, 2017

I tried training again. It was a day or two faster this time, for some reason. Adding verbose=1 to the model.fit_generator call only changed one line of output as far as I could tell (or changed nothing, and this line was always there):

/keras/engine/training.py:1460: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results. Set `samples_per_epoch` correctly to avoid this warning.
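(If I understand that warning, it fires when the generator yields past samples_per_epoch within an epoch, i.e. when samples_per_epoch isn't an exact multiple of the generator's batch size; that might also explain the 502/500 at epoch 82 in the full log below. A sketch of the usual fix, with a hypothetical batch size:)

```python
batch_size = 8                    # hypothetical; use whatever the generator yields
samples_per_epoch = 500
# round down to a multiple of batch_size so Keras never overshoots an epoch
samples_per_epoch -= samples_per_epoch % batch_size   # 500 -> 496 here
```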

The Model MSE was slightly better:

Model MSE: 0.144585
Previous Frame MSE: 0.021246

But something is still clearly wrong:

[Images: plot_0, plot_2, plot_3, plot_4, plot_7]

@arvoelke (Author)

Here is the full output:

Using Theano backend.
Can not use cuDNN on context None: Device not supported
Mapped name None to device cuda: Tesla C2075 (0000:02:00.0)
/home/arvoelke/.virtualenvs/CTN/local/lib/python2.7/site-packages/keras/backend/theano_backend.py:1237: UserWarning: DEPRECATION: the 'ds' parameter is not going to exist anymore as it is going to be replaced by the parameter 'ws'.
  mode='max')
/home/arvoelke/.virtualenvs/CTN/local/lib/python2.7/site-packages/keras/backend/theano_backend.py:1237: UserWarning: DEPRECATION: the 'st' parameter is not going to exist anymore as it is going to be replaced by the parameter 'stride'.
  mode='max')
/home/arvoelke/.virtualenvs/CTN/local/lib/python2.7/site-packages/keras/backend/theano_backend.py:1237: UserWarning: DEPRECATION: the 'padding' parameter is not going to exist anymore as it is going to be replaced by the parameter 'pad'.
  mode='max')
Epoch 1/150
500/500 [==============================] - 1488s - loss: 0.1931 - val_loss: 0.2059
Epoch 2/150
500/500 [==============================] - 1486s - loss: 0.1590 - val_loss: 0.2103
Epoch 3/150
500/500 [==============================] - 1487s - loss: 0.1446 - val_loss: 0.1801
Epoch 4/150
500/500 [==============================] - 1487s - loss: 0.1404 - val_loss: 0.1577
Epoch 5/150
500/500 [==============================] - 1487s - loss: 0.1393 - val_loss: 0.1565
Epoch 6/150
500/500 [==============================] - 1487s - loss: 0.1390 - val_loss: 0.1548
Epoch 7/150
500/500 [==============================] - 1487s - loss: 0.1381 - val_loss: 0.1548
Epoch 8/150
500/500 [==============================] - 1486s - loss: 0.1385 - val_loss: 0.1556
Epoch 9/150
500/500 [==============================] - 1487s - loss: 0.1392 - val_loss: 0.1523
Epoch 10/150
500/500 [==============================] - 1491s - loss: 0.1391 - val_loss: 0.1516
Epoch 11/150
500/500 [==============================] - 1486s - loss: 0.1406 - val_loss: 0.1529
Epoch 12/150
500/500 [==============================] - 1486s - loss: 0.1383 - val_loss: 0.1529
Epoch 13/150
500/500 [==============================] - 1487s - loss: 0.1389 - val_loss: 0.1514
Epoch 14/150
500/500 [==============================] - 1486s - loss: 0.1383 - val_loss: 0.1522
Epoch 15/150
500/500 [==============================] - 1486s - loss: 0.1377 - val_loss: 0.1517
Epoch 16/150
500/500 [==============================] - 1487s - loss: 0.1389 - val_loss: 0.1511
Epoch 17/150
500/500 [==============================] - 1486s - loss: 0.1374 - val_loss: 0.1513
Epoch 18/150
500/500 [==============================] - 1487s - loss: 0.1401 - val_loss: 0.1508
Epoch 19/150
500/500 [==============================] - 1487s - loss: 0.1398 - val_loss: 0.1507
Epoch 20/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.1452
Epoch 21/150
500/500 [==============================] - 1487s - loss: 0.1394 - val_loss: 0.1443
Epoch 22/150
500/500 [==============================] - 1487s - loss: 0.1394 - val_loss: 0.1435
Epoch 23/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.1433
Epoch 24/150
500/500 [==============================] - 1487s - loss: 0.1383 - val_loss: 0.1432
Epoch 25/150
500/500 [==============================] - 1487s - loss: 0.1388 - val_loss: 0.1432
Epoch 26/150
500/500 [==============================] - 1486s - loss: 0.1387 - val_loss: 0.1432
Epoch 27/150
500/500 [==============================] - 1486s - loss: 0.1403 - val_loss: 0.1434
Epoch 28/150
500/500 [==============================] - 1486s - loss: 0.1397 - val_loss: 0.1433
Epoch 29/150
500/500 [==============================] - 1486s - loss: 0.1408 - val_loss: 0.1433
Epoch 30/150
500/500 [==============================] - 1486s - loss: 0.1393 - val_loss: 0.1434
Epoch 31/150
500/500 [==============================] - 1486s - loss: 0.1390 - val_loss: 0.1435
Epoch 32/150
500/500 [==============================] - 1486s - loss: 0.1366 - val_loss: 0.1436
Epoch 33/150
500/500 [==============================] - 1486s - loss: 0.1356 - val_loss: 0.1436
Epoch 34/150
500/500 [==============================] - 1486s - loss: 0.1372 - val_loss: 0.1435
Epoch 35/150
500/500 [==============================] - 1486s - loss: 0.1371 - val_loss: 0.1435
Epoch 36/150
500/500 [==============================] - 1486s - loss: 0.1402 - val_loss: 0.1436
Epoch 37/150
500/500 [==============================] - 1486s - loss: 0.1394 - val_loss: 0.1444
Epoch 38/150
500/500 [==============================] - 1486s - loss: 0.1404 - val_loss: 0.1448
Epoch 39/150
500/500 [==============================] - 1486s - loss: 0.1374 - val_loss: 0.1436
Epoch 40/150
500/500 [==============================] - 1486s - loss: 0.1392 - val_loss: 0.1437
Epoch 41/150
500/500 [==============================] - 1486s - loss: 0.1381 - val_loss: 0.1439
Epoch 42/150
500/500 [==============================] - 1486s - loss: 0.1400 - val_loss: 0.1436
Epoch 43/150
500/500 [==============================] - 1486s - loss: 0.1369 - val_loss: 0.1446
Epoch 44/150
500/500 [==============================] - 1486s - loss: 0.1382 - val_loss: 0.1453
Epoch 45/150
500/500 [==============================] - 1486s - loss: 0.1373 - val_loss: 0.1441
Epoch 46/150
500/500 [==============================] - 1486s - loss: 0.1365 - val_loss: 0.1436
Epoch 47/150
500/500 [==============================] - 1486s - loss: 0.1393 - val_loss: 0.1434
Epoch 48/150
500/500 [==============================] - 1486s - loss: 0.1364 - val_loss: 0.1432
Epoch 49/150
500/500 [==============================] - 1486s - loss: 0.1387 - val_loss: 0.1445
Epoch 50/150
500/500 [==============================] - 1487s - loss: 0.1378 - val_loss: 0.1430
Epoch 51/150
500/500 [==============================] - 1487s - loss: 0.1402 - val_loss: 0.1425
Epoch 52/150
500/500 [==============================] - 1486s - loss: 0.1379 - val_loss: 0.1444
Epoch 53/150
500/500 [==============================] - 1486s - loss: 0.1402 - val_loss: 0.2258
Epoch 54/150
500/500 [==============================] - 1486s - loss: 0.1388 - val_loss: 0.2253
Epoch 55/150
500/500 [==============================] - 1486s - loss: 0.1387 - val_loss: 0.2280
Epoch 56/150
500/500 [==============================] - 1486s - loss: 0.1371 - val_loss: 0.2277
Epoch 57/150
500/500 [==============================] - 1486s - loss: 0.1388 - val_loss: 0.2269
Epoch 58/150
500/500 [==============================] - 1486s - loss: 0.1378 - val_loss: 0.2263
Epoch 59/150
500/500 [==============================] - 1486s - loss: 0.1368 - val_loss: 0.2261
Epoch 60/150
500/500 [==============================] - 1486s - loss: 0.1380 - val_loss: 0.2259
Epoch 61/150
500/500 [==============================] - 1486s - loss: 0.1382 - val_loss: 0.2248
Epoch 62/150
500/500 [==============================] - 1486s - loss: 0.1389 - val_loss: 0.2247
Epoch 63/150
500/500 [==============================] - 1486s - loss: 0.1364 - val_loss: 0.2235
Epoch 64/150
500/500 [==============================] - 1486s - loss: 0.1379 - val_loss: 0.2234
Epoch 65/150
500/500 [==============================] - 1486s - loss: 0.1384 - val_loss: 0.2227
Epoch 66/150
500/500 [==============================] - 1486s - loss: 0.1390 - val_loss: 0.2218
Epoch 67/150
500/500 [==============================] - 1486s - loss: 0.1396 - val_loss: 0.2218
Epoch 68/150
500/500 [==============================] - 1486s - loss: 0.1386 - val_loss: 0.2212
Epoch 69/150
500/500 [==============================] - 1486s - loss: 0.1387 - val_loss: 0.2131
Epoch 70/150
500/500 [==============================] - 1486s - loss: 0.1382 - val_loss: 0.2066
Epoch 71/150
500/500 [==============================] - 1487s - loss: 0.1377 - val_loss: 0.2070
Epoch 72/150
500/500 [==============================] - 1487s - loss: 0.1378 - val_loss: 0.2032
Epoch 73/150
500/500 [==============================] - 1487s - loss: 0.1380 - val_loss: 0.1999
Epoch 74/150
500/500 [==============================] - 1487s - loss: 0.1373 - val_loss: 0.1990
Epoch 75/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.2092
Epoch 76/150
500/500 [==============================] - 1487s - loss: 0.1388 - val_loss: 0.2098
Epoch 77/150
500/500 [==============================] - 1487s - loss: 0.1389 - val_loss: 0.2098
Epoch 78/150
500/500 [==============================] - 1487s - loss: 0.1387 - val_loss: 0.2105
Epoch 79/150
500/500 [==============================] - 1486s - loss: 0.1383 - val_loss: 0.2105
Epoch 80/150
500/500 [==============================] - 1486s - loss: 0.1371 - val_loss: 0.2102
Epoch 81/150
500/500 [==============================] - 1486s - loss: 0.1380 - val_loss: 0.2100
Epoch 82/150
498/500 [============================>.] - ETA: 5s - loss: 0.1394 /home/arvoelke/.virtualenvs/CTN/local/lib/python2.7/site-packages/keras/engine/training.py:1460: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results. Set `samples_per_epoch` correctly to avoid this warning.
502/500 [==============================] - 1492s - loss: 0.1394 - val_loss: 0.2095
Epoch 83/150
500/500 [==============================] - 1487s - loss: 0.1399 - val_loss: 0.2091
Epoch 84/150
500/500 [==============================] - 1487s - loss: 0.1390 - val_loss: 0.2089
Epoch 85/150
500/500 [==============================] - 1487s - loss: 0.1369 - val_loss: 0.2084
Epoch 86/150
500/500 [==============================] - 1487s - loss: 0.1386 - val_loss: 0.2080
Epoch 87/150
500/500 [==============================] - 1486s - loss: 0.1380 - val_loss: 0.2078
Epoch 88/150
500/500 [==============================] - 1486s - loss: 0.1379 - val_loss: 0.2077
Epoch 89/150
500/500 [==============================] - 1486s - loss: 0.1401 - val_loss: 0.2075
Epoch 90/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.2074
Epoch 91/150
500/500 [==============================] - 1486s - loss: 0.1388 - val_loss: 0.2074
Epoch 92/150
500/500 [==============================] - 1486s - loss: 0.1396 - val_loss: 0.2068
Epoch 93/150
500/500 [==============================] - 1487s - loss: 0.1385 - val_loss: 0.2066
Epoch 94/150
500/500 [==============================] - 1486s - loss: 0.1375 - val_loss: 0.2071
Epoch 95/150
500/500 [==============================] - 1486s - loss: 0.1386 - val_loss: 0.2068
Epoch 96/150
500/500 [==============================] - 1486s - loss: 0.1397 - val_loss: 0.2067
Epoch 97/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.2049
Epoch 98/150
500/500 [==============================] - 1487s - loss: 0.1386 - val_loss: 0.2045
Epoch 99/150
500/500 [==============================] - 1487s - loss: 0.1367 - val_loss: 0.2024
Epoch 100/150
500/500 [==============================] - 1487s - loss: 0.1376 - val_loss: 0.1993
Epoch 101/150
500/500 [==============================] - 1487s - loss: 0.1358 - val_loss: 0.1967
Epoch 102/150
500/500 [==============================] - 1486s - loss: 0.1396 - val_loss: 0.1960
Epoch 103/150
500/500 [==============================] - 1486s - loss: 0.1400 - val_loss: 0.1958
Epoch 104/150
500/500 [==============================] - 1486s - loss: 0.1385 - val_loss: 0.1958
Epoch 105/150
500/500 [==============================] - 1486s - loss: 0.1387 - val_loss: 0.1955
Epoch 106/150
500/500 [==============================] - 1486s - loss: 0.1393 - val_loss: 0.1954
Epoch 107/150
500/500 [==============================] - 1486s - loss: 0.1393 - val_loss: 0.1952
Epoch 108/150
500/500 [==============================] - 1487s - loss: 0.1392 - val_loss: 0.1951
Epoch 109/150
500/500 [==============================] - 1486s - loss: 0.1390 - val_loss: 0.1953
Epoch 110/150
500/500 [==============================] - 1487s - loss: 0.1386 - val_loss: 0.1951
Epoch 111/150
500/500 [==============================] - 1486s - loss: 0.1380 - val_loss: 0.1954
Epoch 112/150
500/500 [==============================] - 1486s - loss: 0.1380 - val_loss: 0.1953
Epoch 113/150
500/500 [==============================] - 1487s - loss: 0.1370 - val_loss: 0.1957
Epoch 114/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.1959
Epoch 115/150
500/500 [==============================] - 1487s - loss: 0.1388 - val_loss: 0.1957
Epoch 116/150
500/500 [==============================] - 1487s - loss: 0.1381 - val_loss: 0.1955
Epoch 117/150
500/500 [==============================] - 1487s - loss: 0.1397 - val_loss: 0.1955
Epoch 118/150
500/500 [==============================] - 1487s - loss: 0.1374 - val_loss: 0.1975
Epoch 119/150
500/500 [==============================] - 1487s - loss: 0.1368 - val_loss: 0.1972
Epoch 120/150
500/500 [==============================] - 1487s - loss: 0.1358 - val_loss: 0.1976
Epoch 121/150
500/500 [==============================] - 1487s - loss: 0.1384 - val_loss: 0.1974
Epoch 122/150
500/500 [==============================] - 1487s - loss: 0.1355 - val_loss: 0.1984
Epoch 123/150
500/500 [==============================] - 1487s - loss: 0.1388 - val_loss: 0.1974
Epoch 124/150
500/500 [==============================] - 1487s - loss: 0.1375 - val_loss: 0.1970
Epoch 125/150
500/500 [==============================] - 1487s - loss: 0.1396 - val_loss: 0.1968
Epoch 126/150
500/500 [==============================] - 1487s - loss: 0.1384 - val_loss: 0.1953
Epoch 127/150
500/500 [==============================] - 1487s - loss: 0.1377 - val_loss: 0.1949
Epoch 128/150
500/500 [==============================] - 1487s - loss: 0.1369 - val_loss: 0.1948
Epoch 129/150
500/500 [==============================] - 1487s - loss: 0.1376 - val_loss: 0.1949
Epoch 130/150
500/500 [==============================] - 1487s - loss: 0.1383 - val_loss: 0.1957
Epoch 131/150
500/500 [==============================] - 1487s - loss: 0.1381 - val_loss: 0.1962
Epoch 132/150
500/500 [==============================] - 1487s - loss: 0.1371 - val_loss: 0.1961
Epoch 133/150
500/500 [==============================] - 1487s - loss: 0.1390 - val_loss: 0.1985
Epoch 134/150
500/500 [==============================] - 1487s - loss: 0.1396 - val_loss: 0.1965
Epoch 135/150
500/500 [==============================] - 1487s - loss: 0.1367 - val_loss: 0.1973
Epoch 136/150
500/500 [==============================] - 1487s - loss: 0.1384 - val_loss: 0.1980
Epoch 137/150
500/500 [==============================] - 1487s - loss: 0.1382 - val_loss: 0.1975
Epoch 138/150
500/500 [==============================] - 1487s - loss: 0.1377 - val_loss: 0.1995
Epoch 139/150
500/500 [==============================] - 1487s - loss: 0.1383 - val_loss: 0.2060
Epoch 140/150
500/500 [==============================] - 1487s - loss: 0.1364 - val_loss: 0.2000
Epoch 141/150
500/500 [==============================] - 1487s - loss: 0.1380 - val_loss: 0.1998
Epoch 142/150
500/500 [==============================] - 1486s - loss: 0.1359 - val_loss: 0.1992
Epoch 143/150
500/500 [==============================] - 1486s - loss: 0.1382 - val_loss: 0.1992
Epoch 144/150
500/500 [==============================] - 1486s - loss: 0.1372 - val_loss: 0.2002
Epoch 145/150
500/500 [==============================] - 1486s - loss: 0.1376 - val_loss: 0.1995
Epoch 146/150
500/500 [==============================] - 1486s - loss: 0.1399 - val_loss: 0.1996
Epoch 147/150
500/500 [==============================] - 1486s - loss: 0.1361 - val_loss: 0.2003
Epoch 148/150
500/500 [==============================] - 1486s - loss: 0.1389 - val_loss: 0.2026
Epoch 149/150
500/500 [==============================] - 1486s - loss: 0.1384 - val_loss: 0.2024
Epoch 150/150
500/500 [==============================] - 1486s - loss: 0.1397 - val_loss: 0.2025

The validation loss seems to diverge dramatically after epoch 52 (jumping from 0.1444 to 0.2258)?

@bill-lotter (Contributor)

That's really weird; I'm not sure what's going on and have never seen that. The loss should drop much more quickly than that, and the divergence is also strange. I would start with just one training example and make sure you can train to essentially zero error on it, which should happen quickly, so you can experiment easily. Something like the sketch below.
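A minimal overfitting sanity check, as a sketch; X_train is a hypothetical array of training sequences, model is the compiled PredNet training model, and the all-zero target assumes the error-based loss used in kitti_train.py:

```python
import numpy as np

X = X_train[:1]                    # one sequence, shape (1, nt, ...)
y = np.zeros((1, 1), np.float32)   # target: drive the weighted errors to zero
model.fit(X, y, batch_size=1, nb_epoch=200, verbose=1)
# the loss should approach ~0 quickly; if it plateaus high, something upstream is broken
```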

@Faur commented Apr 19, 2017

I just tried doing exactly what the README suggests, and everything looks fine. This is the result after 5 epochs. I am using the TF backend and cuDNN:

[Image: plot_54]

@kikyou123

@Faur I ran kitti_train.py, but my MSE is 0.018. What are your results?

@Faur commented Sep 27, 2017

Sorry, I don't remember, and I no longer have access to the setup I had back then.

@nistha21 commented Dec 9, 2017

I trained the model myself, and the predictions contain only red components. Here is a sample:

[Image: plot_17]

Any pointers?

@nistha21

Changing the seed during training changes the color of the output as well as the MSE. I have tried a bunch of seeds, yet cannot achieve a balanced image.

@bill-lotter (Contributor)

@nistha21 It seems like there is an issue with TimeDistributed in Keras 2, where it overrides the initial weights of the wrapped layer (keras-team/keras#8895). In our case, this results in a meaningless loss function. I adjusted the code for this (9f6482e); the gist of the pitfall is sketched below. Give it a shot and let me know if you still have issues - thanks!
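A self-contained sketch of the pitfall; the shapes and the weighting matrix are hypothetical stand-ins for what kitti_train.py actually builds:

```python
import numpy as np
from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

nt, n_layers = 10, 4                                     # hypothetical sizes
layer_loss_weights = np.array([[1.], [0.], [0.], [0.]])  # weight only layer 0

errors = Input(shape=(nt, n_layers))
# intended: a frozen Dense whose fixed weights implement the layer weighting;
# in Keras 2, TimeDistributed can silently re-initialize the inner layer's weights
weighted = TimeDistributed(
    Dense(1, trainable=False, weights=[layer_loss_weights, np.zeros(1)]),
    trainable=False)(errors)
model = Model(inputs=errors, outputs=weighted)

# workaround: assign the intended weights again after the model is built
model.layers[1].set_weights([layer_loss_weights, np.zeros(1)])
```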

@wongjoel

I wasn't getting the strange colourisation of the output, but I was getting some poorer-quality output before the latest commit (9f6482e). Compare the results from before and after:

[Images: plot_69, before and after the fix]

So I just wanted to say thanks to bill-lotter for fixing the issue, and to encourage anyone with issues to try again with the latest commit.

@mitkina

mitkina commented Jan 22, 2019

Hello,

I have recently noticed the commit with the TimeDistributed Keras 2 syntax fix. Before the fix, my loss plots looked relatively alright on my data. After implementing the fix, however, my loss curves behave strangely: they jump significantly partway through training and do not fall to their lowest value by the end. The loss is also higher than before the fix. Have you seen anything of this sort? Do you think this behavior could be avoided simply by training longer?

Thanks!
