
Training with a large JSON dataset (>650K rows) throws error: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #1888

Open
bofei5675 opened this issue Sep 3, 2024 · 1 comment
Labels
bug Something isn't working

Comments


bofei5675 commented Sep 3, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should work just as it does with a smaller dataset.

Current behaviour

Throws an error such as:

[rank0]:   File "pyarrow/table.pxi", line 4387, in pyarrow.lib.Table.combine_chunks
[rank0]:   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
[rank0]:   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
[rank0]: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

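For context, Arrow's default string type stores value offsets as 32-bit integers, so a single combined string array is limited to roughly 2 GiB of character data. A minimal sketch (independent of axolotl, assuming only pyarrow is installed; it needs roughly 3 GiB of RAM) that hits the same limit:

import pyarrow as pa

# Arrow's default `string` type keeps offsets as int32, so one combined
# array can hold at most ~2 GiB of character data. Three ~1 GiB chunks
# exceed that limit, so combine_chunks() raises the same
# "offset overflow while concatenating arrays" error.
chunk = pa.array(["x" * 1024] * 1_000_000)  # roughly 1 GiB of characters
table = pa.Table.from_arrays(
    [pa.chunked_array([chunk, chunk, chunk])], names=["text"]
)
table.combine_chunks()  # raises pyarrow.lib.ArrowInvalid
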
Steps to reproduce

Prepare a dataset and configure it as follows:

datasets:
  - path: data/dataset-test.json
    ds_type: json
    type: sharegpt
    conversation: qwen-7b-chat

dataset-test.json has more than 650K rows, and each JSON object looks something like:

{
  "id": "",
  "conversations": [
    {
       "from": "xxx", "value": "xxx"
    },
   ...
  ]
}

Deleting items from this dataset so that it has at most 650K rows fixes the error.
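To check the row-count threshold without hand-editing the file, a quick truncation sketch (it assumes dataset-test.json is a single top-level JSON array; the path and the 650K cutoff are simply the values from this report):

import json

# Assumes data/dataset-test.json is one top-level JSON array of
# conversation objects; keep the first 650K entries and write them
# to a new file for testing.
with open("data/dataset-test.json", "r", encoding="utf-8") as f:
    rows = json.load(f)

with open("data/dataset-test-650k.json", "w", encoding="utf-8") as f:
    json.dump(rows[:650_000], f, ensure_ascii=False)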

Config yaml

base_model: Qwen/Qwen2-1.5B-Instruct
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: data/dataset-test.json
    ds_type: json
    type: sharegpt
    conversation: qwen-7b-chat

dataset_prepared_path:
val_set_size: 0
output_dir: ./checkpoints/Qwen2-1.5B-Infinity-Instruct

sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

adapter: 
lora_model_dir:
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: project_1025
wandb_entity:
wandb_watch:
wandb_name: qwen-2-1.5B-infinity-instruct
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 40
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
save_total_limit: 1 # Checkpoints saved at a time

Possible solution

NA

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
bofei5675 added the bug label on Sep 3, 2024