
Loss Convergence and whether ViT is Trained #18

Open

SuperStacie opened this issue Apr 17, 2024 · 1 comment

@SuperStacie

Hi, thanks for the interesting work!

  1. I've been running the updated code and observed that at the pretraining stage the loss converges to ~3 (slightly above 3). Does my training show a similar tendency to your official experiment setting? If so, how should I interpret the difference from the original LLaVA-1.5 pretraining, where the loss finally converges to ~2?

  2. May I know the rough converged loss value of the fine-tuning stage?

  3. Your paper, Sec. 3.1, states "In our experiments, we show that ViT and position embedding parameters can be kept frozen during pretraining, and updating these parameters during the instruction-tuning stage is sufficient for good performance", which implies the ViT is fine-tuned, yet the authors claim in another issue that the ViT is frozen all the time. Can you clarify this point? My understanding is that since the ViT positional embedding changes to adapt to the dynamic aspect ratio (similar to Pix2Struct), the ViT needs to be fine-tuned.

Many thanks!

@guozonghao96
Collaborator

guozonghao96 commented Jan 4, 2025

  1. In our new implementation, LLaVA-UHD v1 and LLaVA-UHD v2 both finally converge to ~2, which is a good signal for checking whether the model has converged (a quick smoothing check is sketched at the end of this comment).
  2. In the SFT stage, LLaVA-UHD v1 converges to about 0.75–0.8 and LLaVA-UHD v2 to about 0.65–0.7. You can reproduce our model for a detailed check.
  3. In our findings, the ViT does not need to be fine-tuned when only minimal changes are made to its position encoding; that said, fine-tuning it does improve MLLM performance (see the sketch below this list).
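
For illustration, here is a minimal sketch of how point 3 can be wired up with a HuggingFace CLIP vision tower; the checkpoint name and attribute paths are illustrative assumptions, not our repository's own code:

```python
# Minimal sketch (not the repository's actual code): freeze the whole ViT,
# then re-enable gradients only for its position embedding so that it can be
# updated during instruction tuning.
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

for param in vision_tower.parameters():
    param.requires_grad = False  # freeze every ViT parameter

# Keep only the position embedding trainable.
vision_tower.vision_model.embeddings.position_embedding.weight.requires_grad = True

# Verify what the optimizer would actually update.
trainable = [name for name, p in vision_tower.named_parameters() if p.requires_grad]
print(trainable)  # expected: ['vision_model.embeddings.position_embedding.weight']
```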

Moreover, our repository has been thoroughly improved and almost all known bugs have been fixed. For details, please refer to the main branch and the LLaVA-UHD v1 branch. If you run into any new problems, feel free to open a new issue.
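
To sanity-check convergence against the values in points 1 and 2, one simple option is to smooth the logged loss with an exponential moving average and look at the tail; the numbers below are placeholders, not logs from our runs:

```python
# Sketch: exponential-moving-average smoothing of a logged loss curve.
# A flat tail near ~2 (pretraining) or ~0.65-0.8 (SFT) suggests convergence.
def ema(values, alpha=0.1):
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

losses = [3.2, 2.9, 2.6, 2.4, 2.25, 2.15, 2.08, 2.04, 2.02, 2.01]  # placeholder values
print(f"final smoothed loss: {ema(losses)[-1]:.2f}")
```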
