
Difficulty In Reproducing The Paper Results (e.g., Table 4., BYOL reported 66%, measured <<60%) #12

Open
kilickaya opened this issue Feb 7, 2023 · 13 comments


kilickaya commented Feb 7, 2023

Hi,

Thanks a lot for your amazing work and for releasing the code. I have been trying to reproduce your Table 4 for some time. I use the code and the scripts directly, with NO modification.

For example, in that table, BYOL fine-tuning on ImageNet-100 for the 5-task class-incremental setting is reported at 66.0. Instead, I measured below 60.0, at least 6% lower. Please see the full results table attached below if interested (a 5 x 5 table).

results.pdf

Any idea what may be causing the gap? Are there any nuances in the evaluation method? For example, for average accuracy I simply take the mean of the table below across all rows and columns (as also suggested by GEM, which you reference).
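Concretely, the averaging I do is nothing more than this (a minimal sketch; `acc` stands for the attached 5 x 5 table, with `acc[i, j]` the accuracy on task j measured after training on task i):

```python
import numpy as np

# Hypothetical 5 x 5 results table (values to be filled in from results.pdf):
# acc[i, j] = accuracy (%) on task j, evaluated with the checkpoint
# obtained after training on task i.
acc = np.zeros((5, 5))

# Average accuracy as I compute it: plain mean over all rows and columns.
avg_acc = acc.mean()
print(f"average accuracy: {avg_acc:.1f}")
```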

Thanks a lot again for your response and your eye-opening work.


DonkeyShot21 commented Feb 8, 2023

Hi, I have just checked my logs and the results seem consistent with the ones we published. See the screenshot below:
[screenshot: wandb logs of the published runs]
The second run (65.4%) used slightly different hyperparams. We got around 59% with online linear eval and ~66% after offline linear eval. It might be that you are having some issues with the offline linear eval parameters. How much did you get with online eval? Maybe I can look for the checkpoint and you can try just running the offline linear eval to debug?

I don't fully understand the results you are reporting; you need to look at val_acc1 in the wandb run. Maybe you are not looking at the right metric?
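In case it helps, one way to check the metric directly is through the wandb public API (a minimal sketch; the entity/project/run id below are placeholders for your own run):

```python
import wandb

api = wandb.Api()
# Placeholder path: "<entity>/<project>/<run_id>" of your linear-eval run.
run = api.run("my-entity/my-project/abc123")

# Final logged value of the linear-eval top-1 accuracy.
print(run.summary.get("val_acc1"))

# Full per-epoch history of the metric, e.g. to plot the linear-eval curve.
history = run.history(keys=["val_acc1"])
print(history.tail())
```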

kilickaya commented:

Hi Enrico,

Many thanks for the swift response!

Please see the wandb output for val_acc1 on ImageNet-100, for all 5 checkpoints:

[wandb chart: BYOL_Finetune_ImageNet100, val_acc1 curves for the 5 task checkpoints]

As is evident, the last model (task 4, the highest curve) reaches 62% accuracy at the very end of linear probing.

Please see my offline linear-probing parameters below, equivalent to yours:


python main_linear.py \
    --dataset imagenet100 \
    --encoder resnet18 \
    --data_dir $DATA_DIR \
    --train_dir imagenet-100/train \
    --val_dir imagenet-100/val \
    --split_strategy class \
    --num_tasks 5 \
    --max_epochs 100 \
    --gpus 0 \
    --precision 16 \
    --optimizer sgd \
    --scheduler step \
    --lr 3.0 \
    --lr_decay_steps 60 80 \
    --weight_decay 0 \
    --batch_size 256 \
    --num_workers 8 \
    --dali \
    --name byol-imagenet100-5T-linear-eval \
    --pretrained_feature_extractor $PRETRAINED_PATH \
    --project benchmark \
    --entity swordrock \
    --wandb \
    --save_checkpoint

Is the accuracy in the paper just the accuracy of the final model (which I found to be 62%)?

It would be great if you could share the checkpoint indeed; then I could debug my evaluation code.

It would also be great if you could share the evaluation script. It does not have to be clean, just whatever gives the clearest possible idea.

Thank you.

DonkeyShot21 commented:

Yes, it is the accuracy of the final model that is reported in the paper. Intermediate checkpoints are only used to compute forgetting. Please see this screenshot from the paper below:
[screenshot: evaluation protocol description from the paper]
In this case, since the number of samples per task is roughly constant, the average over tasks is the same as the plain linear eval accuracy.
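As a rough sketch of how the two reported numbers come out of the per-task table (the definitions below are the standard continual-learning ones; the paper's exact implementation may differ slightly, and the table values here are random placeholders):

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
# Hypothetical table: acc[i, j] = accuracy (%) on task j, evaluated with
# the checkpoint obtained after training on task i.
acc = rng.uniform(40, 80, size=(T, T))

# Reported accuracy: the final model evaluated on all tasks. With roughly
# equal task sizes this matches the plain linear-eval accuracy of the
# last checkpoint on the whole validation set.
final_accuracy = acc[-1].mean()

# Forgetting (standard definition): for each earlier task, the drop from
# its best accuracy at any intermediate checkpoint to its final accuracy,
# averaged over the first T-1 tasks.
forgetting = np.mean(acc[:-1, :-1].max(axis=0) - acc[-1, :-1])

print(f"accuracy:   {final_accuracy:.1f}")
print(f"forgetting: {forgetting:.1f}")
```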

I will look for the checkpoint and post it here asap.

I have some questions:

  • Did you run hyperparam tuning for the offline linear evaluation? This is very important: different checkpoints might exhibit different feature scales, so you need to retune everything if you change the checkpoint (see the quick check sketched after this list). For instance, your curves here don't look right; I think your lr is too low. This is especially true for BYOL, because it tends to be unstable with less data, so it might end up in a very different configuration each time you retrain it.
  • Did you notice instability during pre-training? How much did you get with online eval?
  • Do you have the exact package versions that we suggest in the readme?
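On the feature-scale point, a quick sanity check is to compare the average feature norm produced by different checkpoints (a minimal sketch, assuming a torchvision resnet18 backbone and a hypothetical checkpoint whose backbone weights sit under an `encoder.` prefix; a few real validation images would be a better probe than the random batch used here only to keep the sketch self-contained):

```python
import torch
from torchvision.models import resnet18

# Hypothetical checkpoint path; the prefix stripping below assumes the
# backbone is stored under "encoder.*", which may differ between versions.
ckpt = torch.load("task4-final.ckpt", map_location="cpu")
state = {k.replace("encoder.", ""): v for k, v in ckpt["state_dict"].items()}

backbone = resnet18()
backbone.fc = torch.nn.Identity()          # keep only the 512-d features
backbone.load_state_dict(state, strict=False)
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(8, 3, 224, 224))
print("mean feature L2 norm:", feats.norm(dim=1).mean().item())
```

If two checkpoints produce very different norms, the same linear-probe lr will not behave the same way on both.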


kilickaya commented Feb 8, 2023

Thanks for your response.

  • Hyper-param tuning: I have not performed any hyper-param tuning for linear probing; I directly used yours for a fair comparison. But since you suggest it, I will try. Thank you.
  • Instability: BYOL training is stable and smooth in my case; please see below. My question was more general, as I ran all the other models too, not only BYOL. My online evaluation accuracy was around 60% as well.
  • Environment: Yes, I use the exact same configuration.

[wandb chart: BYOL pre-training curves]

Will have a look at tuning parameters further. Thank you.


DonkeyShot21 commented Feb 8, 2023

I found some checkpoints that might be relevant: https://drive.google.com/drive/folders/1gOejzl4Q0cqAcmEjUhyStYPDbXPn1o9R?usp=share_link
You can find the pre-train args there as well.

I am not 100% sure that this is the correct checkpoint, so use it at your own risk.

EDIT: this checkpoint was probably obtained with a different version of the code, so you might have issues resuming it.
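If the keys don't line up with the current code, a first debugging step is to inspect what the checkpoint actually contains before trying to resume it (a minimal sketch; the file name is a placeholder for whatever you download from the folder):

```python
import torch

# Placeholder file name for the downloaded checkpoint.
ckpt = torch.load("byol-imagenet100-5T.ckpt", map_location="cpu")

print(list(ckpt.keys()))            # e.g. "state_dict", "hyper_parameters", ...
state = ckpt["state_dict"]
for k in list(state)[:20]:          # peek at the first few parameter names
    print(k, tuple(state[k].shape))

# If prefixes changed between code versions, remap them before loading, and
# use strict=False so leftover mismatches are reported instead of crashing:
# model.load_state_dict({k.replace("old_prefix.", "new_prefix."): v
#                        for k, v in state.items()}, strict=False)
```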

DonkeyShot21 commented:

Yes, your curves look similar to mine. I think the gap is likely due to hyperparam tuning of the offline linear eval. Also, always remember that there might be some randomness involved, so a small decrease in performance might be due to that.

kilickaya commented:

Thanks for the model, args and the info. I will have a look at these. Thanks!


DonkeyShot21 commented Feb 8, 2023

One last thing that just came to mind: we recently found that Pillow-SIMD can have a detrimental effect on some models (see the issue here: vturrisi/solo-learn#313). I am not sure whether we used it in our experiments. It might be another thing to check.

EDIT: also make sure you use DALI for pre-training.
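A quick way to check which Pillow build is active in an environment (a minimal sketch; Pillow-SIMD releases usually carry a `.postN` suffix in the version string, while plain Pillow does not):

```python
import PIL

# Pillow-SIMD installs itself under the same "PIL" package name; its version
# string usually ends in ".postN" (e.g. "7.0.0.post3").
print(PIL.__version__)
if ".post" in PIL.__version__:
    print("Pillow-SIMD appears to be installed")
else:
    print("plain Pillow appears to be installed")
```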

kilickaya commented:

Cool. I was using it, actually. Will try without it and report any difference.

kilickaya commented:

Update 1: I've spent the day doing hyper-param tuning for the offline linear eval. I'm posting an update here in case someone else wants to see the end result as well.

Tl;dr: I could not get above 62% despite a brute-force search, which is still much lower than 66%. So the conclusion is that the gap is not about the linear-probing stage, but about the actual pre-training.

Setting (the author's recommended values from this code base are in bold; they yield the best accuracy I can get, 62%; see the launcher sketch after the list):

BYOL, fine-tuned for 400 epochs per task
5-class incremental
ImageNet-100
Offline eval via linear probe
lr: {0.01, 0.1, 1, **3**, 10, 100}
batch_size: {64, 128, **256**}
weight_decay: {**0**, 1e-5}
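The sweep itself was just a loop over this grid, launching main_linear.py once per combination (a minimal sketch; data dirs, pretrained checkpoint path and wandb flags are the same as in the command earlier in this thread and are omitted here):

```python
import itertools
import subprocess

# Grid matching the sets above; the bold values are the defaults
# recommended in this code base.
lrs = [0.01, 0.1, 1, 3, 10, 100]
batch_sizes = [64, 128, 256]
weight_decays = [0, 1e-5]

for lr, bs, wd in itertools.product(lrs, batch_sizes, weight_decays):
    cmd = [
        "python", "main_linear.py",
        "--dataset", "imagenet100",
        "--encoder", "resnet18",
        "--split_strategy", "class",
        "--num_tasks", "5",
        "--max_epochs", "100",
        "--optimizer", "sgd",
        "--scheduler", "step",
        "--lr_decay_steps", "60", "80",
        "--lr", str(lr),
        "--batch_size", str(bs),
        "--weight_decay", str(wd),
        "--name", f"byol-linear-lr{lr}-bs{bs}-wd{wd}",
        # plus the data dirs, --pretrained_feature_extractor, --dali,
        # --gpus / --precision and wandb flags from the command above
    ]
    subprocess.run(cmd, check=True)
```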

Results: some runs appear shorter because a different batch size leads to a different number of iterations.
[W&B chart: offline linear eval sweep, Feb 8, 2023]

To-do: I'll try without Pillow-SIMD. Then I'll focus on improving the pre-training part.

DonkeyShot21 commented:

How much did you get with online linear eval?

kilickaya commented:

Generally 4-5% below the offline counterpart.

DonkeyShot21 commented:

OK, so around 57%. The checkpoint I shared should have an online eval accuracy of 58.8%.
