
Difficulty In Reproducing The Paper Results (e.g., Table 4., BYOL reported 66%, measured <<60%) #12

Open
kilickaya opened this issue Feb 7, 2023 · 13 comments


kilickaya commented Feb 7, 2023

Hi,

Thanks a lot for your amazing work and for releasing the code. I have been trying to reproduce your Table 4 for some time. I use the code and the scripts directly, with NO modification.

For example, in that table, BYOL fine-tuning on ImageNet-100 for the 5-task class-incremental setting is reported at 66.0. Instead, I measured below 60.0, at least 6% lower. Please see the full results table attached below if interested (a 5 x 5 table).

results.pdf

Any idea what may be causing the gap? Are there any nuances in the evaluation method? For example, for average accuracy I simply take the mean of the table below across all rows and columns (as also suggested by GEM, which you reference).
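Concretely, the averaging I do is nothing more than this (a minimal sketch; `acc` stands for the attached 5 x 5 table, with `acc[i, j]` the accuracy on task j measured after training on task i):

```python
import numpy as np

# Hypothetical 5 x 5 results table (values to be filled in from results.pdf):
# acc[i, j] = accuracy (%) on task j, evaluated with the checkpoint
# obtained after training on task i.
acc = np.zeros((5, 5))

# Average accuracy as I compute it: plain mean over all rows and columns.
avg_acc = acc.mean()
print(f"average accuracy: {avg_acc:.1f}")
```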

Thanks a lot again for your response and your eye-opening work.


DonkeyShot21 commented Feb 8, 2023

Hi, I have just checked my logs and the results seem consistent with the ones we published. See the screenshot below:
[screenshot: wandb logs of the published runs]
The second run (65.4%) used slightly different hyperparams. We got around 59% with online linear eval and ~66% after offline linear eval. It might be that you are having some issues with the offline linear eval parameters. How much did you get with online eval? Maybe I can look for the checkpoint and you can try just running the offline linear eval to debug?

I don't fully understand the results you are reporting; you need to look at val_acc1 in the wandb run. Maybe you are not looking at the right metric?
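In case it helps, one way to check the metric directly is through the wandb public API (a minimal sketch; the entity/project/run id below are placeholders for your own run):

```python
import wandb

api = wandb.Api()
# Placeholder path: "<entity>/<project>/<run_id>" of your linear-eval run.
run = api.run("my-entity/my-project/abc123")

# Final logged value of the linear-eval top-1 accuracy.
print(run.summary.get("val_acc1"))

# Full per-epoch history of the metric, e.g. to plot the linear-eval curve.
history = run.history(keys=["val_acc1"])
print(history.tail())
```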

kilickaya commented:

Hi Enrico,

Many thanks for the swift response!

Please see the wandb output for val_acc1 on ImageNet-100, for all 5 checkpoints:

[wandb chart: BYOL_Finetune_ImageNet100, val_acc1 curves for the 5 task checkpoints]

As is evident, the last model (task 4, the highest curve) reaches 62% accuracy at the very end of linear probing.

Please see my offline linear-probing parameters below, equivalent to yours:


python main_linear.py \
    --dataset imagenet100 \
    --encoder resnet18 \
    --data_dir $DATA_DIR \
    --train_dir imagenet-100/train \
    --val_dir imagenet-100/val \
    --split_strategy class \
    --num_tasks 5 \
    --max_epochs 100 \
    --gpus 0 \
    --precision 16 \
    --optimizer sgd \
    --scheduler step \
    --lr 3.0 \
    --lr_decay_steps 60 80 \
    --weight_decay 0 \
    --batch_size 256 \
    --num_workers 8 \
    --dali \
    --name byol-imagenet100-5T-linear-eval \
    --pretrained_feature_extractor $PRETRAINED_PATH \
    --project benchmark \
    --entity swordrock \
    --wandb \
    --save_checkpoint

Is the accuracy in the paper just the accuracy of the final model (which I found to be 62%)?

It would be great if you could share the checkpoint indeed; then I could debug my evaluation code.

It would also be great if you could share the evaluation script. It does not have to be clean, just whatever gives the clearest possible idea.

Thank you.

DonkeyShot21 commented:

Yes, it is the accuracy of the final model that is reported in the paper. Intermediate checkpoints are only used to compute forgetting. Please see this screenshot from the paper below:
[screenshot: evaluation protocol description from the paper]
In this case, since the number of samples per task is roughly constant, the average over tasks is the same as the plain linear eval accuracy.
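As a rough sketch of how the two reported numbers come out of the per-task table (the definitions below are the standard continual-learning ones; the paper's exact implementation may differ slightly, and the table values here are random placeholders):

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
# Hypothetical table: acc[i, j] = accuracy (%) on task j, evaluated with
# the checkpoint obtained after training on task i.
acc = rng.uniform(40, 80, size=(T, T))

# Reported accuracy: the final model evaluated on all tasks. With roughly
# equal task sizes this matches the plain linear-eval accuracy of the
# last checkpoint on the whole validation set.
final_accuracy = acc[-1].mean()

# Forgetting (standard definition): for each earlier task, the drop from
# its best accuracy at any intermediate checkpoint to its final accuracy,
# averaged over the first T-1 tasks.
forgetting = np.mean(acc[:-1, :-1].max(axis=0) - acc[-1, :-1])

print(f"accuracy:   {final_accuracy:.1f}")
print(f"forgetting: {forgetting:.1f}")
```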

I will look for the checkpoint and post it here asap.

I have some questions:

  • Did you run hyperparam tuning for the offline linear evaluation? This is very important: different checkpoints might exhibit different feature scales, so you need to retune everything if you change the checkpoint (see the quick check sketched after this list). For instance, your curves here don't look right; I think your lr is too low. This is especially true for BYOL, because it tends to be unstable with less data, so it might end up in a very different configuration each time you retrain it.
  • Did you notice instability during pre-training? How much did you get with online eval?
  • Do you have the exact package versions that we suggest in the readme?
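On the feature-scale point, a quick sanity check is to compare the average feature norm produced by different checkpoints (a minimal sketch, assuming a torchvision resnet18 backbone and a hypothetical checkpoint whose backbone weights sit under an `encoder.` prefix; a few real validation images would be a better probe than the random batch used here only to keep the sketch self-contained):

```python
import torch
from torchvision.models import resnet18

# Hypothetical checkpoint path; the prefix stripping below assumes the
# backbone is stored under "encoder.*", which may differ between versions.
ckpt = torch.load("task4-final.ckpt", map_location="cpu")
state = {k.replace("encoder.", ""): v for k, v in ckpt["state_dict"].items()}

backbone = resnet18()
backbone.fc = torch.nn.Identity()          # keep only the 512-d features
backbone.load_state_dict(state, strict=False)
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(8, 3, 224, 224))
print("mean feature L2 norm:", feats.norm(dim=1).mean().item())
```

If two checkpoints produce very different norms, the same linear-probe lr will not behave the same way on both.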


kilickaya commented Feb 8, 2023

Thanks for your response.

  • Hyper-param tuning: I have not performed any hyper-param tuning for linear probing; I directly used yours for a fair comparison. But since you suggest it, I will try. Thank you.
  • Instability: BYOL training is stable and smooth in my case; please see below. My question was more general, as I ran all the other models too, not only BYOL. My online evaluation accuracy was around 60% as well.
  • Environment: Yes, I use the exact same configuration.

[wandb chart: BYOL pre-training curves]

Will have a look at tuning parameters further. Thank you.


DonkeyShot21 commented Feb 8, 2023

I found some checkpoints that might be relevant: https://drive.google.com/drive/folders/1gOejzl4Q0cqAcmEjUhyStYPDbXPn1o9R?usp=share_link
You can find the pre-train args there as well.

I am not 100% sure that this is the correct checkpoint, so use it at your own risk.

EDIT: this checkpoint was probably obtained with a different version of the code, so you might have issues resuming it.
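If the keys don't line up with the current code, a first debugging step is to inspect what the checkpoint actually contains before trying to resume it (a minimal sketch; the file name is a placeholder for whatever you download from the folder):

```python
import torch

# Placeholder file name for the downloaded checkpoint.
ckpt = torch.load("byol-imagenet100-5T.ckpt", map_location="cpu")

print(list(ckpt.keys()))            # e.g. "state_dict", "hyper_parameters", ...
state = ckpt["state_dict"]
for k in list(state)[:20]:          # peek at the first few parameter names
    print(k, tuple(state[k].shape))

# If prefixes changed between code versions, remap them before loading, and
# use strict=False so leftover mismatches are reported instead of crashing:
# model.load_state_dict({k.replace("old_prefix.", "new_prefix."): v
#                        for k, v in state.items()}, strict=False)
```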

DonkeyShot21 commented:

Yes, your curves look similar to mine. I think the gap is likely due to hyperparam tuning of the offline linear eval. Also, always remember that there might be some randomness involved, so a small decrease in performance might be due to that.

kilickaya commented:

Thanks for the model, args and the info. I will have a look at these. Thanks!


DonkeyShot21 commented Feb 8, 2023

One last thing that just came to mind: we recently found that Pillow-SIMD can have a detrimental effect on some models (see the issue here: vturrisi/solo-learn#313). I am not sure whether we used it in our experiments. It might be another thing to check.

EDIT: also make sure you use DALI for pre-training.
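A quick way to check which Pillow build is active in an environment (a minimal sketch; Pillow-SIMD releases usually carry a `.postN` suffix in the version string, while plain Pillow does not):

```python
import PIL

# Pillow-SIMD installs itself under the same "PIL" package name; its version
# string usually ends in ".postN" (e.g. "7.0.0.post3").
print(PIL.__version__)
if ".post" in PIL.__version__:
    print("Pillow-SIMD appears to be installed")
else:
    print("plain Pillow appears to be installed")
```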

kilickaya commented:

Cool. I was using it, actually. Will try without it and report any difference.

kilickaya commented:

Update 1: I've spent the day doing hyper-param tuning for the offline linear eval. I'm posting an update here in case someone else wants to see the end result as well.

Tl;dr: I could not get above 62% despite a brute-force search, which is still much lower than 66%. So the conclusion is that the gap is not about the linear-probing stage, but about the actual pre-training.

Setting (the author's recommended values from this code base are in bold; they yield the best accuracy I can get, 62%; see the launcher sketch after the list):

BYOL, fine-tuned for 400 epochs per task
5-class incremental
ImageNet-100
Offline eval via linear probe
lr: {0.01, 0.1, 1, **3**, 10, 100}
batch_size: {64, 128, **256**}
weight_decay: {**0**, 1e-5}
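The sweep itself was just a loop over this grid, launching main_linear.py once per combination (a minimal sketch; data dirs, pretrained checkpoint path and wandb flags are the same as in the command earlier in this thread and are omitted here):

```python
import itertools
import subprocess

# Grid matching the sets above; the bold values are the defaults
# recommended in this code base.
lrs = [0.01, 0.1, 1, 3, 10, 100]
batch_sizes = [64, 128, 256]
weight_decays = [0, 1e-5]

for lr, bs, wd in itertools.product(lrs, batch_sizes, weight_decays):
    cmd = [
        "python", "main_linear.py",
        "--dataset", "imagenet100",
        "--encoder", "resnet18",
        "--split_strategy", "class",
        "--num_tasks", "5",
        "--max_epochs", "100",
        "--optimizer", "sgd",
        "--scheduler", "step",
        "--lr_decay_steps", "60", "80",
        "--lr", str(lr),
        "--batch_size", str(bs),
        "--weight_decay", str(wd),
        "--name", f"byol-linear-lr{lr}-bs{bs}-wd{wd}",
        # plus the data dirs, --pretrained_feature_extractor, --dali,
        # --gpus / --precision and wandb flags from the command above
    ]
    subprocess.run(cmd, check=True)
```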

Results: some runs appear shorter because a different batch size leads to a different number of iterations.
[W&B chart: offline linear eval sweep, Feb 8, 2023]

To-do: I'll try without Pillow-SIMD. Then I'll focus on improving the pre-training part.

DonkeyShot21 commented:

How much did you get with online linear eval?

kilickaya commented:

Generally 4-5% below the offline counterpart.

DonkeyShot21 commented:

OK, so around 57%. The checkpoint I shared should have an online eval accuracy of 58.8%.
