MLflow Checkpointing #1938

Open
6 of 8 tasks
wanderingweights opened this issue Oct 2, 2024 · 5 comments
Open
6 of 8 tasks

ML Flow Checkpointing #1938

wanderingweights opened this issue Oct 2, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@wanderingweights
Copy link

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

With the config below, I was expecting the model checkpoints to be saved as MLflow artifacts, but only the metrics and the config are being saved.

Is this expected to work or am I missing something?

Thanks for looking over!

Current behaviour

No model checkpoints are logged to MLflow as artifacts.

Steps to reproduce

I use the docker image winglian/axolotl:main-latest, then run:

hf login
accelerate launch -m axolotl.cli.train theconfig.yml

Config yaml

base_model: NousResearch/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: true
load_in_4bit: false
strict: false

chat_template: llama3
datasets:
  - path: base.yml
    type: alpaca
    ds_type: json

dataset_prepared_path:
val_set_size: 0.05
output_dir: attempt_1

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

mlflow_tracking_uri: ml_flow_url
mlflow_experiment_name: test_1
hf_mlflow_log_artifacts: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

hub_model_id: some_repo
hub_strategy: checkpoint
hub_token:

save_steps: 5
save_strategy: steps
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
max_steps: 10
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: " "

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

winglian/axolotl:main-latest

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
wanderingweights added the bug label on Oct 2, 2024
@NanoCode012
Collaborator

Hey, thanks for reporting this. I noticed that our callback doesn't inherit MLflowCallback, so it isn't properly reading that environment variable.

At least, do you notice that the yaml is saved to mlflow?

https://huggingface.co/docs/transformers/v4.45.2/en/main_classes/callback#transformers.integrations.MLflowCallback
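
(For anyone wanting to check what a run actually recorded, here is a minimal sketch using the MLflow client; it assumes you point it at the same mlflow_tracking_uri from the config and fill in the run ID from the MLflow UI.)

import mlflow
from mlflow.tracking import MlflowClient

# Point at the same tracking server as mlflow_tracking_uri in the axolotl config.
mlflow.set_tracking_uri("ml_flow_url")

client = MlflowClient()
run_id = "REPLACE_WITH_RUN_ID"  # taken from the MLflow UI

# List top-level artifacts for the run; with artifact logging working you would
# expect checkpoint directories here, not just the saved config yaml.
for artifact in client.list_artifacts(run_id):
    print(artifact.path, "(dir)" if artifact.is_dir else "")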

@winglian
Collaborator

I think @awhazell added support for MLflow and might be able to help.

@awhazell
Contributor

I can confirm that the config yaml is saved as an artifact, but the checkpoints/final model are not.

A fix could be as simple as inheriting the callback @NanoCode012 linked (or adding it to the callbacks list separately), but we would need to make sure it doesn't conflict with any of the setup in HFCausalTrainerBuilder.build.

@NanoCode012
Collaborator

Hey @awhazell , what potential conflicts were you thinking of?

My only concern is possible duplicate logging from the report_to config and the MLflowCallback.

Regarding the change needed, I believe we can just import the callback and append to the callbacks variable. Would you be interested in working on this PR?
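
Roughly something like this (a sketch only; the helper name, the variable names, and the condition are placeholders, not the actual HFCausalTrainerBuilder code):

from transformers.integrations import MLflowCallback

def maybe_add_mlflow_callback(callbacks: list, mlflow_enabled: bool) -> list:
    # Sketch of the suggested change: append transformers' MLflowCallback so the
    # HF_MLFLOW_LOG_ARTIFACTS env var is honoured and checkpoints get logged as
    # artifacts. Where this hook would live in the trainer builder is a placeholder.
    if mlflow_enabled and MLflowCallback not in callbacks:
        callbacks.append(MLflowCallback)
    return callbacks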

@awhazell
Contributor

I was thinking about whether setting MLflow options in both the trainer kwargs and environment variables could cause issues, but I think you're right that it shouldn't be a problem; they should always be consistent anyway.
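
If it helps, a trivial sanity check before launching training is to print the environment variables that transformers' MLflowCallback reads and compare them with the yaml values (whether axolotl sets them from mlflow_tracking_uri / mlflow_experiment_name / hf_mlflow_log_artifacts is the assumption here):

import os

# Print the MLflow-related environment variables consulted by transformers'
# MLflowCallback at setup time; they should agree with the axolotl yaml if both
# configuration paths are consistent.
for var in ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "HF_MLFLOW_LOG_ARTIFACTS"):
    print(f"{var}={os.environ.get(var, '<unset>')}")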

Opened a PR here #1976
