
Performance of SFT #1

Open
Hanqer opened this issue Dec 16, 2024 · 5 comments
Hanqer commented Dec 16, 2024

The performance reported in the paper shows promising results, especially for the SFT setting.

My question is: can this imitation learning process be generalized to other models, such as Qwen2.5-7B or LLaMA-3.1-8B? Since the imitation data are generated by QwQ-preview, which is based on Qwen2.5-32B, does it naturally benefit only Qwen2.5-32B, or can it generalize to other models as well?

Thanks.

EliverQ (Contributor) commented Dec 22, 2024

Hi, Hanqer!

Thank you for your kind words about our work!

During our experimental process, we found that the parameter scale of the base model is quite important.

  • For the Qwen2.5 series, training on long-chain reasoning data with SFT may be more effective on models of 32B or larger, while the effects are not significant for 7B and 14B models. We speculate that this might be because the complex operations involved in long reasoning chains (e.g., reflection and planning) can confuse base models with insufficient capabilities, making it difficult for them to learn.
  • This conclusion also generally applies to the LLaMA-3.1 series; however, since LLaMA-3.1 only provides 8B and 70B models (we did not attempt the 405B model due to resource limitations), we conducted experiments only on these two scales.

We hope this answers your question.

Hanqer (Author) commented Dec 26, 2024

@EliverQ Thanks for your reply! But I still have a concern: deepseek-r1 and o1-mini are both small models (smaller than 34B), yet they have strong reasoning and chain-of-thought abilities. Why is imitation learning not effective for small models?

EliverQ (Contributor) commented Dec 26, 2024


Thank you for your question!

  • Firstly, while imitation learning is not entirely ineffective for smaller models, its impact is relatively modest compared to larger models like those with 32B parameters.
  • Secondly, both DeepSeek-R1 and O1 have undergone extensive reinforcement learning and self-exploration, which perhaps cannot be achieved merely through the straightforward "imitation" of SFT.

We are currently exploring ways to activate slow reasoning capabilities through imitation learning and then use reinforcement learning for effective scaling during training. If you're interested, please stay tuned for our upcoming work!

2proveit commented Jan 8, 2025

After SFT-ing qwen2.5-7b-instruct on your open-sourced data, the model produced repeated generations at inference time. Its performance on GPQA was poor, worse than the original qwen2.5-7b-instruct, while on the MATH dataset it was roughly on par. Do you think this phenomenon is related to model size? Have you run comparison experiments on small (<32B) models, and if so, could you share them? Looking forward to your reply! @EliverQ
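(Editorial aside for readers hitting the repetition issue described above: one quick way to quantify degenerate looping in model outputs is to measure the fraction of word-level n-grams that repeat. The helper below is a sketch of my own, not from the repo or the paper.)

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word-level n-grams that occur more than once.

    Near 0.0 means little repetition; values close to 1.0 indicate
    the degenerate looping sometimes seen after SFT on small models.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Count every n-gram occurrence that belongs to a repeated n-gram.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

print(repeated_ngram_fraction("the answer is 42 " * 5))           # 1.0 (fully repetitive)
print(repeated_ngram_fraction("a clean non repetitive sentence"))  # 0.0
```

Running such a metric over a benchmark's generations makes it easy to compare repetition rates between the original and SFT-ed checkpoints.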

EliverQ (Contributor) commented Jan 9, 2025


Thanks for your interest! We have indeed run experiments on models of different sizes and series; you can refer to this comment: #1 (comment)
