train.py is expecting Arrow files and an entropy model but data preparation creates JSONL files #4

Open
RefractAI opened this issue Dec 16, 2024 · 7 comments

@RefractAI

I am following the steps in the readme.

download_prepare_hf_data.py created a folder of JSONL files, but when I run train.py I get:

Zero shard_files found corresponding to: fineweb_edu_10bt using preprocess_dir=entropy_preprocess and entropy_model_name=transformer_100m

This raises two questions:

  1. How are the arrow files created from the JSONL files?
  2. Where can I find transformer_100m for preprocessing?
@EntilZha
Contributor

Hi! We're still working on getting all the preprocessing steps updated from the Lingua code to our code, sorry for the delay! I'll post here once it's up :).

@nican2018

Hi!
Is there any update on this?
Have you been able to solve the problem?

@EntilZha
Contributor

EntilZha commented Jan 6, 2025

Sorry for the delay, most of us were away the past 2 weeks for the US holidays. I'll see what I can do this week to get this working with the data that lingua pulls.

@Unpredictable-12

Hi! Has this issue been resolved? I'm currently stuck here and can't find a solution. How is this Arrow file generated? I converted the corresponding JSONL file to an Arrow file myself, but train.py reports an error at:
self.dataset = pa.dataset.dataset(self.dataset_files, format="arrow")
error:
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/root/Desktop/BLT/blt-main/bytelatent/preprocess/dclm_baseline_1.0/transformer_100m/dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow'. Is this a 'ipc' file?: Could not open IPC input source '/root/Desktop/BLT/blt-main/bytelatent/preprocess/dclm_baseline_1.0/transformer_100m/dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow': Not an Arrow file
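
For context: pa.dataset.dataset(..., format="arrow") expects the Arrow IPC file format, so a JSONL file that is merely renamed (or written by a different serializer) produces exactly this "Not an Arrow file" error. Below is a minimal sketch of writing a readable IPC file from a JSONL shard, assuming each line is a JSON object with a "text" field; the real preprocessing output also stores entropies and uses the schema from bytelatent.preprocess.preprocess_entropies, so this only illustrates the container format:

import json

import pyarrow as pa
import pyarrow.dataset
import pyarrow.ipc

# Hypothetical paths, for illustration only.
jsonl_path = "dclm_baseline_1.0.chunk.00.jsonl"
arrow_path = "dclm_baseline_1.0.chunk.00.jsonl.shard_1.arrow"

# Collect the "text" field from every JSONL line.
texts = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

table = pa.table({"text": texts})

# Write the Arrow IPC *file* format, which is what
# pa.dataset.dataset(..., format="arrow") can read.
with pa.OSFile(arrow_path, "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Sanity check: this should now load without ArrowInvalid.
ds = pa.dataset.dataset([arrow_path], format="arrow")
print(ds.head(1))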

@EntilZha
Contributor

EntilZha commented Jan 9, 2025

Hi, I'm currently working on making this easier to do (between the holidays and start-of-year company things, I've been busy until now). If you want to jump the gun a bit rather than wait, here is a rough sketch of what I'm planning to do:

  1. Double-check that the lingua download scripts work as expected, and create a config file to point to them
  2. Add a script that trains the small entropy model from this data, which is a prerequisite for the next step
  3. The preprocessing code (given the JSONL file and entropy model checkpoint) is in bytelatent.preprocess.preprocess_entropies. I need to check whether I need to make any changes for this to work with our OSS code (mainly around the entropy checkpoint loading). The rest of the code, namely the schema, will stay the same and generates the individual Arrow files in the list self.dataset_files
  4. The actual preprocessing is done by bytelatent.preprocess.parallel_entropies, which runs bytelatent.preprocess.preprocess_entropies in parallel. This should work as is, but I can provide some scripts/configs that make it work out of the box with lingua data. It also hardcodes Slurm for parallelization; if that's an issue, there is an alternate executor in submitit that just uses Python multiprocessing (see the sketch after this list).

That should enable training from lingua data. I'll make incremental updates/PRs so you can use each piece as soon as it's done rather than waiting for the full thing.
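
For anyone who wants to avoid Slurm in the meantime, here is a minimal sketch of fanning shard preprocessing out with submitit's LocalExecutor, which runs jobs as local processes. The preprocess_shard worker and its arguments are hypothetical stand-ins; the real entry point is bytelatent.preprocess.preprocess_entropies, whose exact signature may differ:

import submitit

# Hypothetical worker; the real pipeline would call into
# bytelatent.preprocess.preprocess_entropies for one JSONL shard.
def preprocess_shard(jsonl_path: str, arrow_path: str) -> str:
    # ... run the entropy model over jsonl_path, write arrow_path ...
    return arrow_path

shards = [
    ("chunk.00.jsonl", "chunk.00.shard_1.arrow"),
    ("chunk.01.jsonl", "chunk.01.shard_1.arrow"),
]

# LocalExecutor uses local processes, so no Slurm cluster is needed.
executor = submitit.LocalExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=60)

jobs = [executor.submit(preprocess_shard, src, dst) for src, dst in shards]
for job in jobs:
    print("finished:", job.result())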

@Unpredictable-12

Hi! I'm very interested in the BLT model and want to train and reproduce it. I am currently following the steps in the readme.
If I want to run a complete training of a BLT model now, ultimately obtain a model file, and also validate it, which scripts do I need to run, and in what order? The full process is currently a bit confusing, for example with scripts like download_prepare_hf_data.py, train.py, etc.

@EntilZha
Contributor

Hi @Unpredictable-12, I'm working on making this a bit clearer and fixing some scripts that are needed for a full reproduction. In general, the steps will be:

  1. Use the HF-based download scripts
  2. Train the entropy model using the JSONL files
  3. Preprocess the HF data to Arrow data, which mainly involves running the entropy model ahead of time on the training data and saving the entropies (alternatively, the entropy model could be run online, but we'd need to port the code to do this); see the sketch below
  4. Run train.py

I pushed changes yesterday that fix the script to preprocess a given JSONL file to Arrow for (3), and I'm now working on adding a config/instructions for training the entropy model.
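
To make step (3) concrete, here is a minimal, hypothetical sketch of the ahead-of-time entropy computation: run a small byte-level language model over each document and store the per-byte next-byte entropy alongside the text in an Arrow IPC file. The entropy-model interface and the two-column output schema are assumptions for illustration; the real schema and logic live in bytelatent.preprocess.preprocess_entropies:

import json

import pyarrow as pa
import pyarrow.ipc
import torch
import torch.nn.functional as F

def next_byte_entropies(model: torch.nn.Module, text: str) -> list[float]:
    # The model is assumed to map a byte-id sequence of shape [1, seq_len]
    # to next-byte logits of shape [1, seq_len, 256].
    byte_ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    with torch.no_grad():
        logits = model(byte_ids.unsqueeze(0)).squeeze(0)
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of each next-byte distribution: H = -sum(p * log p).
    return (-(log_probs.exp() * log_probs).sum(dim=-1)).tolist()

def preprocess_jsonl(model, jsonl_path: str, arrow_path: str) -> None:
    texts, entropies = [], []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            texts.append(text)
            entropies.append(next_byte_entropies(model, text))
    # Assumed schema: document text plus its per-byte entropies.
    table = pa.table({"text": texts, "entropies": entropies})
    with pa.OSFile(arrow_path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)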
