
Performance of SFT #1

Open
Hanqer opened this issue Dec 16, 2024 · 5 comments
Hanqer commented Dec 16, 2024

The performance reported in the paper shows promising results, especially for the SFT setting.

My question is: can this imitation learning process be generalized to other models, such as Qwen2.5-7B or LLaMA-3.1-8B? Since the imitation data are generated by QwQ-preview, which is based on Qwen2.5-32B, does it naturally benefit only Qwen2.5-32B, or can it generalize to other models as well?

Thanks.

EliverQ (Contributor) commented Dec 22, 2024

Hi, Hanqer!

Thank you for your kind words about our work!

During our experimental process, we found that the parameter scale of the base model is quite important.

  • For the Qwen2.5 series, training on long-chain reasoning data with SFT may be more effective on models of 32B or larger, while the effects are not significant for 7B and 14B models. We speculate that this might be because the complex operations involved in long reasoning chains (e.g., reflection and planning) can confuse base models with insufficient capabilities, making it difficult for them to learn.
  • This conclusion also generally applies to the LLaMA-3.1 series; however, since LLaMA-3.1 only provides 8B and 70B models (we did not attempt the 405B model due to resource limitations), we conducted experiments only on these two scales.

We hope this answers your question.

Hanqer (Author) commented Dec 26, 2024

@EliverQ Thanks for your reply! But I still have a concern: deepseek-r1 and o1-mini are both small models (smaller than 34B), yet they have strong reasoning and chain-of-thought abilities. Why is imitation learning not effective for small models?

EliverQ (Contributor) commented Dec 26, 2024


Thank you for your question!

  • Firstly, while imitation learning is not entirely ineffective for smaller models, its impact is relatively modest compared to larger models like those with 32B parameters.
  • Secondly, both DeepSeek-R1 and O1 have undergone extensive reinforcement learning and self-exploration, which perhaps cannot be achieved merely through the straightforward "imitation" of SFT.

We are currently exploring ways to activate slow reasoning capabilities through imitation learning and then use reinforcement learning for effective scaling during training. If you're interested, please stay tuned for our upcoming work!

2proveit commented Jan 8, 2025

After SFT-ing qwen2.5-7b-instruct on your open-sourced data, the model produced repeated generations at inference time. Its performance on GPQA was poor, worse than the original qwen2.5-7b-instruct, while on the MATH dataset it was roughly on par. Do you think this phenomenon is related to model size? Have you run comparison experiments on small (<32B) models, and if so, could you share them? Looking forward to your reply! @EliverQ
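(Editorial aside for readers hitting the repetition issue described above: one quick way to quantify degenerate looping in model outputs is to measure the fraction of word-level n-grams that repeat. The helper below is a sketch of my own, not from the repo or the paper.)

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word-level n-grams that occur more than once.

    Near 0.0 means little repetition; values close to 1.0 indicate
    the degenerate looping sometimes seen after SFT on small models.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Count every n-gram occurrence that belongs to a repeated n-gram.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

print(repeated_ngram_fraction("the answer is 42 " * 5))           # 1.0 (fully repetitive)
print(repeated_ngram_fraction("a clean non repetitive sentence"))  # 0.0
```

Running such a metric over a benchmark's generations makes it easy to compare repetition rates between the original and SFT-ed checkpoints.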

EliverQ (Contributor) commented Jan 9, 2025


Thanks for your interest! We have indeed run experiments on models of different sizes and series; you can refer to this comment: #1 (comment)
