train.py is expecting Arrow files and an entropy model but data preparation creates JSONL files #4
Comments
Hi! We're still working on getting all the preprocessing steps updated from the Lingua code to our code, sorry for the delay! I'll post here once it's up :).
Hi!
Sorry for the delay, most of us were away the past 2 weeks with US holidays. I'll see what I can do this week to get this working with the data that lingua pulls.
Hi! Has this issue been resolved? I'm currently stuck here and can't find a solution. How was this arrow file generated? I converted the corresponding jsonl file to an arrow file myself, but train.py is reporting an error.
Hi, I'm currently working on making this easier to do (between holidays and start of year company things, been busy until now). If you want to jump the gun a bit and not wait longer, here is the rough sketch of what I'm planning to do:
That should enable training from lingua data. I'll make incremental updates/PRs so you can use each piece as soon as it's done, rather than waiting for the full thing.
Hi! I'm very interested in this BLT model and want to train and reproduce it. Right now I am following the steps in the readme.
Hi @Unpredictable-12, I'm working on making this a bit clearer and fixing some scripts that are needed for a full reproduction. In general, the steps will be:
I pushed changes yesterday that fix the script to preprocess a given jsonl file to arrow for (3), and I'm working on adding a config/instructions for training the entropy model now.
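For anyone hitting the same mismatch in the meantime, a quick check of which file types a data directory actually holds can confirm whether the JSONL-to-Arrow preprocessing has been run. The following is only a minimal sketch using the Python standard library: the helper names `count_files_by_suffix` and `describe` are hypothetical, and the `.arrow` and `.jsonl` extensions are an assumption about how the shards are named, not something taken from the repository.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(data_dir):
    """Return a Counter mapping the last file suffix (e.g. '.jsonl') to a file count."""
    root = Path(data_dir)
    # rglob also walks subdirectories, since shards may be nested per dataset.
    return Counter(p.suffix for p in root.rglob("*") if p.is_file())


def describe(counts):
    """Summarise what the directory holds (assumes '.arrow' / '.jsonl' extensions)."""
    if counts.get(".arrow", 0) > 0:
        return "arrow shards present"
    if counts.get(".jsonl", 0) > 0:
        return "only JSONL found: preprocessing to Arrow has not been run"
    return "no JSONL or Arrow files found"
```

Calling `describe(count_files_by_suffix("path/to/data"))` on a data directory (the path here is a placeholder) reports which of the three cases applies. It only inspects file names, so it cannot tell whether an existing Arrow file has the columns that train.py expects.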
I am following the steps in the readme.
download_prepare_hf_data.py created a folder of JSONL files, but when running train, I get:
Zero shard_files found corresponding to: fineweb_edu_10bt using preprocess_dir=entropy_preprocess and entropy_model_name=transformer_100m
So therefore: