train.py is expecting Arrow files and an entropy model but data preparation creates JSONL files #4

Open
RefractAI opened this issue Dec 16, 2024 · 7 comments

@RefractAI

I am following the steps in the readme.

download_prepare_hf_data.py created a folder of JSONL files, but when I run train.py I get:

Zero shard_files found corresponding to: fineweb_edu_10bt using preprocess_dir=entropy_preprocess and entropy_model_name=transformer_100m

This raises two questions:

  1. How are the arrow files created from the JSONL files?
  2. Where can I find transformer_100m for preprocessing?
@EntilZha
Contributor

Hi! We're still working on getting all the preprocessing steps updated from the Lingua code to our code, sorry for the delay! I'll post here once it's up :).

@nican2018

Hi!
Is there any update on this?
Have you been able to solve the problem?

@EntilZha
Contributor

EntilZha commented Jan 6, 2025

Sorry for the delay, most of us were away the past 2 weeks for the US holidays. I'll see what I can do this week to get this working with the data that lingua pulls.

@Unpredictable-12

Hi! Has this issue been resolved? I'm currently stuck here and can't find a solution. How is this Arrow file generated? I converted the corresponding JSONL file to an Arrow file myself, but train.py reports an error at:
self.dataset = pa.dataset.dataset(self.dataset_files, format="arrow")
error:
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/root/Desktop/BLT/blt-main/bytelatent/preprocess/dclm_baseline_1.0/transformer_100m/dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow'. Is this a 'ipc' file?: Could not open IPC input source '/root/Desktop/BLT/blt-main/bytelatent/preprocess/dclm_baseline_1.0/transformer_100m/dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow': Not an Arrow file
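
For context: pa.dataset.dataset(..., format="arrow") expects the Arrow IPC file format, so a JSONL file that is merely renamed (or written by a different serializer) produces exactly this "Not an Arrow file" error. Below is a minimal sketch of writing a readable IPC file from a JSONL shard, assuming each line is a JSON object with a "text" field; the real preprocessing output also stores entropies and uses the schema from bytelatent.preprocess.preprocess_entropies, so this only illustrates the container format:

import json

import pyarrow as pa
import pyarrow.dataset
import pyarrow.ipc

# Hypothetical paths, for illustration only.
jsonl_path = "dclm_baseline_1.0.chunk.00.jsonl"
arrow_path = "dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow"

# Collect the "text" field from every JSONL line.
texts = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

table = pa.table({"text": texts})

# Write the Arrow IPC *file* format, which is what
# pa.dataset.dataset(..., format="arrow") can read.
with pa.OSFile(arrow_path, "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Sanity check: this should now load without ArrowInvalid.
ds = pa.dataset.dataset([arrow_path], format="arrow")
print(ds.head(1))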

@EntilZha
Contributor

EntilZha commented Jan 9, 2025

Hi, I'm currently working on making this easier to do (between the holidays and start-of-year company things, I've been busy until now). If you want to jump the gun a bit rather than wait, here is a rough sketch of what I'm planning to do:

  1. Double-check that the lingua download scripts work as expected, and create a config file to point to them
  2. Add a script that trains the small entropy model from this data, which is a prerequisite for the next step
  3. The preprocessing code (given the JSONL file and entropy model checkpoint) is in bytelatent.preprocess.preprocess_entropies. I need to check whether I need to make any changes for this to work with our OSS code (mainly around the entropy checkpoint loading). The rest of the code, namely the schema, will stay the same and generates the individual Arrow files in the list self.dataset_files
  4. The actual preprocessing is done by bytelatent.preprocess.parallel_entropies, which runs bytelatent.preprocess.preprocess_entropies in parallel. This should work as is, but I can provide some scripts/configs that make it work out of the box with lingua data. It also hardcodes Slurm for parallelization; if that's an issue, there is an alternate executor in submitit that just uses Python multiprocessing (see the sketch after this list).

That should enable training from lingua data. I'll make incremental updates/PRs so you can use each piece as soon as it's done rather than waiting for the full thing.
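
For anyone who wants to avoid Slurm in the meantime, here is a minimal sketch of fanning shard preprocessing out with submitit's LocalExecutor, which runs jobs as local processes. The preprocess_shard worker and its arguments are hypothetical stand-ins; the real entry point is bytelatent.preprocess.preprocess_entropies, whose exact signature may differ:

import submitit

# Hypothetical worker; the real pipeline would call into
# bytelatent.preprocess.preprocess_entropies for one JSONL shard.
def preprocess_shard(jsonl_path: str, arrow_path: str) -> str:
    # ... run the entropy model over jsonl_path, write arrow_path ...
    return arrow_path

shards = [
    ("chunk.00.jsonl", "chunk.00.shard_1.arrow"),
    ("chunk.01.jsonl", "chunk.01.shard_1.arrow"),
]

# LocalExecutor uses local processes, so no Slurm cluster is needed.
executor = submitit.LocalExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=60)

jobs = [executor.submit(preprocess_shard, src, dst) for src, dst in shards]
for job in jobs:
    print("finished:", job.result())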

@Unpredictable-12

Hi! I'm very interested in the BLT model and want to train and reproduce it. I am currently following the steps in the readme.
If I want to run a complete training of a BLT model now, ultimately obtain a model file, and also validate it, which scripts do I need to run, and in what order? The full process is currently a bit confusing, for example with scripts like download_prepare_hf_data.py, train.py, etc.

@EntilZha
Contributor

Hi @Unpredictable-12, I'm working on making this a bit clearer and fixing some scripts that are needed for a full reproduction. In general, the steps will be:

  1. Use the HF-based download scripts
  2. Train the entropy model using the JSONL files
  3. Preprocess the HF data to Arrow data, which mainly involves running the entropy model ahead of time on the training data and saving the entropies (alternatively, the entropy model could be run online, but we'd need to port the code to do this); see the sketch below
  4. Run train.py

I pushed changes yesterday that fix the script to preprocess a given JSONL file to Arrow for (3), and I'm now working on adding a config/instructions for training the entropy model.
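
To make step (3) concrete, here is a minimal, hypothetical sketch of the ahead-of-time entropy computation: run a small byte-level language model over each document and store the per-byte next-byte entropy alongside the text in an Arrow IPC file. The entropy-model interface and the two-column output schema are assumptions for illustration; the real schema and logic live in bytelatent.preprocess.preprocess_entropies:

import json

import pyarrow as pa
import pyarrow.ipc
import torch
import torch.nn.functional as F

def next_byte_entropies(model: torch.nn.Module, text: str) -> list[float]:
    # The model is assumed to map a byte-id sequence of shape [1, seq_len]
    # to next-byte logits of shape [1, seq_len, 256].
    byte_ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    with torch.no_grad():
        logits = model(byte_ids.unsqueeze(0)).squeeze(0)
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of each next-byte distribution: H = -sum(p * log p).
    return (-(log_probs.exp() * log_probs).sum(dim=-1)).tolist()

def preprocess_jsonl(model, jsonl_path: str, arrow_path: str) -> None:
    texts, entropies = [], []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            texts.append(text)
            entropies.append(next_byte_entropies(model, text))
    # Assumed schema: document text plus its per-byte entropies.
    table = pa.table({"text": texts, "entropies": entropies})
    with pa.OSFile(arrow_path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)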
