ArrowInvalid: Column 4 named images expected length 360 but got length 352 #5

Open · DhruvaBansal00 opened this issue on Jan 16, 2025 · 3 comments
Assignee: HaoshengZou
Labels: enhancement (New feature or request)

DhruvaBansal00 commented on Jan 16, 2025

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1
  • Platform: Linux-5.15.0-1048-oracle-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • PyTorch version: 2.3.1+cu121 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-80GB
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.45.0

Reproduction

### model
model_name_or_path: Qwen/Qwen2-VL-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true  # choices: [true, false]
train_mm_proj_only: false  # choices: [true, false]
deepspeed: examples/deepspeed/ds_z3_offload_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
use_adam_mini: true

### dataset
dataset: mllm_demo,identity,alpaca_en_demo,slimorca
template: qwen2_vl
cutoff_len: 128000
max_samples: 25000
overwrite_cache: true
preprocessing_num_workers: 16
sequence_parallel_size: 4

### output
output_dir: saves/qwen2_vl-72b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 30.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
packing: true
enable_liger_kernel: false
flash_attn: fa2
use_unsloth_gc: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

### logging
report_to: wandb
run_name: qwen2vl-72b-full-sft-1

(The above is the config file used for training.)

Stack trace:

[rank7]: ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
[rank7]: │ /360-LLaMA-Factory/src/train.py:28 in <module>                               │
[rank7]: │                                                                              │
[rank7]: │   25                                                                         │
[rank7]: │   26                                                                         │
[rank7]: │   27 if __name__ == "__main__":                                              │
[rank7]: │ ❱ 28 │   main()                                                              │
[rank7]: │   29                                                                         │
[rank7]: │                                                                              │
[rank7]: │ /360-LLaMA-Factory/src/train.py:19 in main                                   │
[rank7]: │                                                                              │
[rank7]: │   16                                                                         │
[rank7]: │   17                                                                         │
[rank7]: │   18 def main():                                                             │
[rank7]: │ ❱ 19 │   run_exp()                                                           │
[rank7]: │   20                                                                         │
[rank7]: │   21                                                                         │
[rank7]: │   22 def _mp_fn(index):                                                      │
[rank7]: │                                                                              │
[rank7]: │ /360-LLaMA-Factory/src/llamafactory/train/tuner.py:50 in run_exp             │
[rank7]: │                                                                              │
[rank7]: │    47 │   if finetuning_args.stage == "pt":                                  │
[rank7]: │    48 │   │   run_pt(model_args, data_args, training_args, finetuning_args,  │
[rank7]: │    49 │   elif finetuning_args.stage == "sft":                               │
[rank7]: │ ❱  50 │   │   run_sft(model_args, data_args, training_args, finetuning_args, │
[rank7]: │    51 │   elif finetuning_args.stage == "rm":                                │
[rank7]: │    52 │   │   run_rm(model_args, data_args, training_args, finetuning_args,  │
[rank7]: │    53 │   elif finetuning_args.stage == "ppo":                               │
[rank7]: │                                                                              │
[rank7]: │ /360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py:47 in run_sft      │
[rank7]: │                                                                              │
[rank7]: │    44 │   tokenizer_module = load_tokenizer(model_args)                      │
[rank7]: │    45 │   tokenizer = tokenizer_module["tokenizer"]                          │
[rank7]: │    46 │   template = get_template_and_fix_tokenizer(tokenizer, data_args)    │
[rank7]: │ ❱  47 │   dataset_module = get_dataset(template, model_args, data_args, trai │
[rank7]: │    48 │   model = load_model(tokenizer, model_args, finetuning_args, trainin │
[rank7]: │    49 │                                                                      │
[rank7]: │    50 │   if getattr(model, "is_quantized", False) and not training_args.do_ │
[rank7]: │                                                                              │
[rank7]: │ /360-LLaMA-Factory/src/llamafactory/data/loader.py:279 in                    │
[rank7]: │ sequence_parallel_processor                                                  │
[rank7]: │                                                                              │
[rank7]: │   276 │   │   │   │   if data_args.shuffle_for_sequence_parallel:            │
[rank7]: │   277 │   │   │   │   │   dataset = dataset.shuffle(seed=training_args.seed) │
[rank7]: │   278 │   │   │   │   padded_dataset = dataset.map(pad_sequence, batched=Tru │
[rank7]: │ ❱ 279 │   │   │   │   sp_dataset = padded_dataset.map(sp_split, batched=True │
[rank7]: │   280 │   │   │   │   dataset_module[k] = sp_dataset                         │
[rank7]: │   281 │   │                                                                  │
[rank7]: │   282 │   │   else:                                                          │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/datasets/arrow_dataset.py:560 in wrapper      │
[rank7]: │                                                                              │
[rank7]: │    557 │   │   │   "output_all_columns": self._output_all_columns,           │
[rank7]: │    558 │   │   }                                                             │
[rank7]: │    559 │   │   # apply actual function                                       │
[rank7]: │ ❱  560 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kw │
[rank7]: │    561 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance( │
[rank7]: │    562 │   │   # re-apply format to the output                               │
[rank7]: │    563 │   │   for dataset in datasets:                                      │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/datasets/arrow_dataset.py:3055 in map         │
[rank7]: │                                                                              │
[rank7]: │   3052 │   │   │   │   │   total=pbar_total,                                 │
[rank7]: │   3053 │   │   │   │   │   desc=desc or "Map",                               │
[rank7]: │   3054 │   │   │   │   ) as pbar:                                            │
[rank7]: │ ❱ 3055 │   │   │   │   │   for rank, done, content in Dataset._map_single(** │
[rank7]: │   3056 │   │   │   │   │   │   if done:                                      │
[rank7]: │   3057 │   │   │   │   │   │   │   shards_done += 1                          │
[rank7]: │   3058 │   │   │   │   │   │   │   logger.debug(f"Finished processing shard  │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/datasets/arrow_dataset.py:3481 in _map_single │
[rank7]: │                                                                              │
[rank7]: │   3478 │   │   │   │   │   │   │   ):                                        │
[rank7]: │   3479 │   │   │   │   │   │   │   │   writer.write_table(batch.to_arrow())  │
[rank7]: │   3480 │   │   │   │   │   │   │   else:                                     │
[rank7]: │ ❱ 3481 │   │   │   │   │   │   │   │   writer.write_batch(batch)             │
[rank7]: │   3482 │   │   │   │   │   │   num_examples_progress_update += num_examples_ │
[rank7]: │   3483 │   │   │   │   │   │   if time.time() > _time + config.PBAR_REFRESH_ │
[rank7]: │   3484 │   │   │   │   │   │   │   _time = time.time()                       │
[rank7]: │                                                                              │
[rank7]: │ /usr/lib/python3/dist-packages/datasets/arrow_writer.py:608 in write_batch   │
[rank7]: │                                                                              │
[rank7]: │   605 │   │   │   │   arrays.append(pa.array(typed_sequence))                │
[rank7]: │   606 │   │   │   │   inferred_features[col] = typed_sequence.get_inferred_t │
[rank7]: │   607 │   │   schema = inferred_features.arrow_schema if self.pa_writer is N │
[rank7]: │ ❱ 608 │   │   pa_table = pa.Table.from_arrays(arrays, schema=schema)         │
[rank7]: │   609 │   │   self.write_table(pa_table, writer_batch_size)                  │
[rank7]: │   610 │                                                                      │
[rank7]: │   611 │   def write_table(self, pa_table: pa.Table, writer_batch_size: Optio │
[rank7]: │                                                                              │
[rank7]: │ in pyarrow.lib.Table.from_arrays:4868                                        │
[rank7]: │                                                                              │
[rank7]: │ in pyarrow.lib.Table.validate:4214                                           │
[rank7]: │                                                                              │
[rank7]: │ in pyarrow.lib.check_status:92                                               │
[rank7]: ╰──────────────────────────────────────────────────────────────────────────────╯
[rank7]: ArrowInvalid: Column 4 named images expected length 360 but got length 352

Expected behavior

Training should proceed normally for Qwen2-VL-72B.

Others

mllm_demo is a dataset with images. Has this repo been tested with multimodal datasets yet?
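
For context on the error itself: pyarrow raises this whenever a batched `datasets.map` function ends up producing columns of unequal length, which matches the traceback above (the `images` column has 352 rows where the other columns have 360). Below is a minimal, repo-independent sketch of the same failure mode; the column names and shapes are illustrative only and are not the actual `sp_split` output.

```python
from datasets import Dataset

# Toy batch: a text-like column and an image-like column.
ds = Dataset.from_dict({
    "input_ids": [[1, 2, 3, 4]] * 8,
    "images": [["img.png"]] * 8,
})

def bad_split(batch):
    # Splits every sequence into two chunks (doubling the row count)
    # but returns the images column unchanged, so the output columns
    # disagree on length: 16 input_ids rows vs. 8 images rows.
    return {
        "input_ids": [ids[:2] for ids in batch["input_ids"]]
                     + [ids[2:] for ids in batch["input_ids"]],
        "images": batch["images"],
    }

# Raises pyarrow.lib.ArrowInvalid: the "images" column has the wrong length.
ds.map(bad_split, batched=True)
```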

DhruvaBansal00 (Author) commented:

Ran the same config through without the mllm_demo dataset and training succeeded.

I am hoping to train on multimodal datasets with sequence parallelism, so I would love advice on how we could enable training on image datasets too; a rough sketch of the kind of handling I have in mind is below.
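
To make the question concrete: my understanding is that any batched split function has to return every column, including `images`, with the same number of rows. Here is a rough, hypothetical sketch of the idea; the function name, column names, and the choice to replicate images per chunk are all assumptions on my part, not the repo's actual `sp_split` API.

```python
def sp_split_with_images(batch, sp_size=4):
    # Hypothetical sketch: split each padded sequence into sp_size chunks
    # and emit the example's image list once per chunk, so every output
    # column ends up with the same number of rows.
    out = {"input_ids": [], "attention_mask": [], "labels": [], "images": []}
    for i in range(len(batch["input_ids"])):
        seq_len = len(batch["input_ids"][i])
        chunk = seq_len // sp_size
        for r in range(sp_size):
            sl = slice(r * chunk, (r + 1) * chunk)
            out["input_ids"].append(batch["input_ids"][i][sl])
            out["attention_mask"].append(batch["attention_mask"][i][sl])
            out["labels"].append(batch["labels"][i][sl])
            # Whether the images should be replicated to every chunk or
            # routed only to the rank that holds the image tokens is a
            # design decision for the maintainers; the key constraint is
            # that this column stays row-aligned with the text columns.
            out["images"].append(batch["images"][i])
    return out
```

With something like this, the `padded_dataset.map(..., batched=True)` call should at least produce length-consistent Arrow batches; whether replicating the vision inputs is the right SP semantics is a separate question.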

HaoshengZou (Collaborator) commented:

Thanks for your interest!
Multimodal SP is our next internal milestone. The current release has not been tested with multimodal data, but we are already experimenting with it. Stay tuned for tested multimodal SP support!

HaoshengZou self-assigned this on Jan 17, 2025
HaoshengZou added the enhancement (New feature or request) label on Jan 17, 2025
DhruvaBansal00 (Author) commented:

Thanks for the update!

What timeline are you tracking internally for releasing multimodal SP? And is there any way I could help add support for it? This is relatively high on my priority list at the moment!
