SFT data distribution #11

Open
cyzhh opened this issue Feb 28, 2024 · 1 comment

Comments

cyzhh commented Feb 28, 2024

Congratulations on the excellent results! I have a question I would like to ask:

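I would like to understand the SFT data distribution. I see that there are 776K training examples, but my own estimate of the dataset sizes may be off. For the English mathematical datasets: the GSM8K and MATH portions appear to be annotated following ToRA, so based on the ToRA paper I estimate about 69K; the MathInstruct subset (of 260K) is hard to estimate, so I assume 200K; and Lila-OOD is 32.2K. That totals roughly 300K, and the MATH and GSM8K data inside MathInstruct should overlap with the 69K above. That would put the Chinese mathematical datasets at about 476K. Did you collect this dataset yourselves, and will it be open-sourced later?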
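
For reference, the estimate above can be written out as a quick sanity check. This is only a sketch of my own arithmetic: the 200K figure for the MathInstruct subset is a guess, the 69K figure is my reading of the ToRA paper, and the names `TOTAL_SFT_K`, `english_estimates_k`, and `estimate_remainder` are made up for illustration.

```python
# All sizes are in thousands of examples (K), taken from the comment above.
TOTAL_SFT_K = 776.0  # reported number of SFT training examples

# Per-source estimates for the English mathematical datasets.
# The MathInstruct value is a guess, not a reported number.
english_estimates_k = {
    "GSM8K + MATH (ToRA-style annotation)": 69.0,
    "MathInstruct subset (guess; full set is 260K)": 200.0,
    "Lila-OOD": 32.2,
}


def estimate_remainder(total_k, parts_k):
    """Return (sum of the parts, total minus that sum), both in K."""
    english_k = sum(parts_k.values())
    return english_k, total_k - english_k


english_k, chinese_k = estimate_remainder(TOTAL_SFT_K, english_estimates_k)
print(f"English estimate: {english_k:.1f}K")
print(f"Implied Chinese remainder: {chinese_k:.1f}K")
```

Without rounding, the English part comes to about 301.2K and the remainder to about 474.8K, which matches the rounded 300K and 476K figures. The sum ignores the overlap between MathInstruct and the 69K GSM8K/MATH data, so after deduplication the English part would be smaller and the implied Chinese part larger.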

cyzhh commented Feb 28, 2024

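One more point: for the RL dataset, I see the statement "The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions." I am also curious about this number; my confusion may simply come from the estimation error above.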
