
Using two 8xH100 nodes to train, encountering error "bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above." #1924

Open · michaellin99999 opened this issue Sep 23, 2024 · 7 comments
Labels: bug

@michaellin99999

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

This issue should not occur, as the H100 definitely supports bf16.

Current behaviour

Outputs the error: Value error, bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above.
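
For reference, the H100 is a Hopper GPU (compute capability 9.0), which is above Ampere (8.x), so a capability check should pass whenever PyTorch can actually see the GPUs. A minimal diagnostic sketch (assuming PyTorch with CUDA is installed; run it on each node in the same environment the training is launched from):

import torch

# Confirm CUDA is visible and report per-device capability and bf16 support.
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"device {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor})")
print("bf16 supported:", torch.cuda.is_bf16_supported())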

Steps to reproduce

Run the multi-node setup described in https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/multi-node.qmd

Config yaml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
- path: teknium/GPT4-LLM-Cleaned
type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:

lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
- full_shard
- auto_wrap
fsdp_config:
fsdp_offload_params: true
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"

Possible solution

No idea what is causing this issue.

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11.9

axolotl branch-commit

none

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
michaellin99999 added the bug label on Sep 23, 2024
@michaellin99999
Author

The same settings work in regular training.

@michaellin99999
Author

Settings in accelerate:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@michaellin99999
Author

This is the snippet for the second (worker) node's accelerate settings:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.108.22
main_process_port: 5000
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
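
Worth noting: both accelerate configs above set mixed_precision: fp16, while the yaml requests bf16: true. A minimal sketch (assuming torch and accelerate are installed) to print what Accelerate actually resolves inside the launched environment:

import torch
from accelerate import Accelerator

# Start this with `accelerate launch` on each node so the saved accelerate
# config (and its mixed_precision setting) is actually applied.
acc = Accelerator()
print("resolved mixed_precision:", acc.mixed_precision)  # "fp16" with the configs above
print("cuda available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())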

@winglian
Collaborator

I recommend not using the accelerate config and removing that file. axolotl handles much of that automatically. See https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on
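
If it helps, a quick way to check whether a saved accelerate config is being picked up at all (this is accelerate's default save location; the path is an assumption and will differ if HF_HOME or a custom --config_file is used):

from pathlib import Path

# Default location where `accelerate config` writes its settings.
cfg = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
print("accelerate default config present:", cfg.exists())
if cfg.exists():
    print(cfg.read_text())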

@michaellin99999
Author

OK, is the accelerate config causing the issue?

@ehartford
Collaborator

Often, it is

@michaellin99999
Author

We tried that and still hit the same issue. We also went through https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on, but that requires Axolotl cloud; I'm using my own two 8xH100 clusters. Are there any scripts that work?
