Cannot fine-tune LLM without GPU - CUDA error and DDP initialization #2371

Open
thuytrang32 opened this issue Jan 7, 2025 · 20 comments

@thuytrang32

What happened?

I am trying to fine-tune an LLM using Kubeflow without GPU devices. However, I ran into two issues:

  • When I removed the gpu key from resources_per_worker, the training job still attempted to use GPUs, which resulted in the error CUDA error: invalid device ordinal.
    image

  • To address this, I tried adding ddp_backend="gloo" to training_parameters. However, this led to another error:
    image

I followed this guide: https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/ . This is the code I ran:

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # Storage (PVC) settings for the downloaded model and dataset.
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use the first 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            ddp_backend="gloo",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=16,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        # "gpu": 0,
        "cpu": 16,
        "memory": "16G",
    },
)

What did you expect to happen?

The training job should correctly initialize without attempting to allocate GPUs.
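
One possible explanation for the CUDA error above, offered as a hypothesis that this report does not confirm: torchrun starts num_procs_per_worker processes in each pod and gives each one a LOCAL_RANK, and training code commonly uses that rank as the CUDA device index. If CUDA sees fewer devices than there are processes, some ranks would point at a device that does not exist. The sketch below only illustrates that arithmetic. The helper name invalid_local_ranks is hypothetical, and the GPU counts are made-up inputs.

```python
def invalid_local_ranks(nproc_per_node: int, visible_gpus: int) -> list[int]:
    """Return the local ranks that have no matching CUDA device index.

    Assumption: each process uses its LOCAL_RANK (0..nproc_per_node-1) as the
    CUDA device ordinal. With zero visible GPUs no CUDA device is selected,
    so no rank is reported.
    """
    if visible_gpus == 0:
        return []
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


# 16 processes per node, as in the report, on a node that exposes 1 GPU
# (the GPU count is an illustrative guess, not taken from the report).
print(invalid_local_ranks(nproc_per_node=16, visible_gpus=1))  # ranks 1..15
print(invalid_local_ranks(nproc_per_node=2, visible_gpus=2))   # []
```

If this is what happens, lowering num_procs_per_worker to at most the number of visible GPUs, or making sure the pod sees no GPU at all, would avoid the mismatch.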

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6+k3s2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6+k3s2

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

👍

@thuytrang32
Author

I also get the error [rank0]: ValueError: Please specify target_modules in peft_config. I tried removing the LoRA config, but the error persists.
image
image
image

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # Storage (PVC) settings for the downloaded model and dataset.
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # DistilBERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://distilbert/distilbert-base-uncased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use the first 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            # ddp_backend="gloo",
        ),
        # LoRA config removed here to test whether the target_modules error goes away.
        # lora_config=LoraConfig(
        #     r=8,
        #     lora_alpha=8,
        #     lora_dropout=0.1,
        #     bias="none",
        #     target_modules=["encoder.layer.*.attention.self.query", "encoder.layer.*.attention.self.key"]
        # ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=20,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "cpu": 20,
        "memory": "20G",
    },
)

@andreyvelich
Member

Thank you for creating this!
For the first error, could you please check the PyTorchJob?
It should be created without GPU resources.

kubectl get pytorchjob -n <NAMESPACE> -o yaml

@thuytrang32
Author

Hi, this is the output:
(base) jovyan@ex-0:~$ kubectl get pytorchjobs -n kubeflow-user-example-com -o yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: "2025-01-07T14:10:47Z"
    generation: 1
    name: fine-tune-bert
    namespace: kubeflow-user-example-com
    resourceVersion: "580563"
    uid: 72ed106c-5299-4de3-9f27-8e5464d4e59b
  spec:
    nprocPerNode: "20"
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
                null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
                null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
                "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
                false, "do_eval": false, "do_predict": false, "evaluation_strategy":
                "no", "prediction_loss_only": false, "per_device_train_batch_size":
                8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
                "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
                "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
                5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
                "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
                "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
                {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
                "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
                "logging_strategy": "steps", "logging_first_step": false, "logging_steps":
                500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
                500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
                false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
                "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
                false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
                "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
                false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
                null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
                false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
                null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
                true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
                false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
                false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
                0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
                "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
                {"split_batches": false, "dispatch_batches": null, "even_batches":
                true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
                0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
                "group_by_length": false, "length_column_name": "length", "report_to":
                [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
                "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
                false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
                false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
                null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
                false, "hub_always_push": false, "gradient_checkpointing": false,
                "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
                false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
                null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
                "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
                null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
                "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
                null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
                false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            initContainers:
            - args:
              - --model_provider
              - hf
              - --model_provider_parameters
              - '{"model_uri": "hf://distilbert/distilbert-base-uncased", "transformer_type":
                "AutoModelForSequenceClassification", "access_token": null, "num_labels":
                null}'
              - --dataset_provider
              - hf
              - --dataset_provider_parameters
              - '{"repo_id": "yelp_review_full", "access_token": null, "split": "train[:100]"}'
              image: docker.io/kubeflow/storage-initializer
              name: storage-initializer
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
      Worker:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
                null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
                null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
                "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
                false, "do_eval": false, "do_predict": false, "evaluation_strategy":
                "no", "prediction_loss_only": false, "per_device_train_batch_size":
                8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
                "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
                "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
                5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
                "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
                "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
                {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
                "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
                "logging_strategy": "steps", "logging_first_step": false, "logging_steps":
                500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
                500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
                false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
                "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
                false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
                "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
                false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
                null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
                false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
                null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
                true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
                false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
                false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
                0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
                "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
                {"split_batches": false, "dispatch_batches": null, "even_batches":
                true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
                0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
                "group_by_length": false, "length_column_name": "length", "report_to":
                [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
                "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
                false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
                false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
                null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
                false, "hub_always_push": false, "gradient_checkpointing": false,
                "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
                false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
                null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
                "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
                null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
                "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
                null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
                false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
    runPolicy:
      suspend: false
  status:
    conditions:
    - lastTransitionTime: "2025-01-07T14:10:47Z"
      lastUpdateTime: "2025-01-07T14:10:47Z"
      message: PyTorchJob fine-tune-bert is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-01-07T14:11:16Z"
      lastUpdateTime: "2025-01-07T14:11:16Z"
      message: PyTorchJob fine-tune-bert is running.
      reason: PyTorchJobRunning
      status: "True"
      type: Running
    replicaStatuses:
      Master:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
      Worker:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
    startTime: "2025-01-07T14:10:47Z"
kind: List
metadata:
  resourceVersion: ""
(base) jovyan@ex-0:~$

@andreyvelich
Member

As you can see, no GPU has been allocated to your PyTorchJob pods:

resources:
  limits:
    cpu: 20
    memory: 20G
  requests:
    cpu: 20
    memory: 20G

Locally, on Kind on macOS, I was able to run the example on CPU using the docker.io/kubeflow/trainer-huggingface image.

Where do you run your Kubernetes cluster?
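
The same check can be scripted. Below is a minimal sketch that walks one PyTorchJob object, in the shape printed by kubectl get pytorchjob -o json, and reports any GPU limits per replica. The helper name gpu_requests is hypothetical, and nvidia.com/gpu is assumed to be the GPU resource name on the cluster.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name for NVIDIA GPUs


def gpu_requests(pytorchjob: dict) -> dict:
    """Map 'ReplicaType/container' to the GPU limit found in its resources."""
    found = {}
    specs = pytorchjob.get("spec", {}).get("pytorchReplicaSpecs", {})
    for replica_type, replica in specs.items():
        pod_spec = replica.get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            if GPU_RESOURCE in limits:
                found[f"{replica_type}/{container['name']}"] = limits[GPU_RESOURCE]
    return found


# Trimmed-down stand-in for one item of `kubectl get pytorchjob -o json`.
sample = json.loads("""
{"spec": {"pytorchReplicaSpecs": {
  "Master": {"template": {"spec": {"containers": [
    {"name": "pytorch", "resources": {"limits": {"cpu": 20, "memory": "20G"}}}]}}},
  "Worker": {"template": {"spec": {"containers": [
    {"name": "pytorch", "resources": {"limits": {"cpu": 20, "memory": "20G"}}}]}}}
}}}
""")
print(gpu_requests(sample))  # {} because no GPU limit is set on any container
```

An empty result only says that the job spec requests no GPU. It does not show what the container runtime exposes inside the pod.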

@thuytrang32
Author

Yes, but I still see this error when I check with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com:
image

I ran this code in a Notebook in the Kubeflow UI.

@andreyvelich
Member

Are you using a public cloud or on-prem to deploy the Kubeflow Control Plane?
@deepanker13 @helenxie-bit @johnugeorge @saileshd1402 Did you see these errors while running the train API on CPU-based instances?

@thuytrang32
Author

I used Jarvice to create a Kubeflow instance.

@andreyvelich
Member

Do you know which instances they use for the Kubernetes Nodes?
E.g. are they AMD Linux machines with CPUs?

@thuytrang32
Author

Sorry, I don't know.

@andreyvelich
Member

@thuytrang32 Can you also try setting the use_cpu flag in the Trainer args?

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
)

@thuytrang32
Author

thuytrang32 commented Jan 7, 2025

When I ran with both ddp_backend and use_cpu, I still got the old error:
image

Then I ran with use_cpu=True only, and the code ran without errors. But when I checked with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com, these errors appeared again:

image

In the output of kubectl describe pod fine-tune-bert-worker-0 -n kubeflow-user-example-com, there was no CUDA error because the worker has a GPU, but it still showed this:

image
image

@helenxie-bit
Contributor

I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.

I guess the CUDA error occurs because the trainer automatically detects the available devices and tries to use CUDA if it is available. Can you try explicitly setting both the no_cuda and use_cpu flags, like this:

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)

Regarding the error ValueError: Please specify 'target_modules' in 'peft_config': this likely occurs because the model you are using is not one of the standard architectures supported in PEFT (Reference: huggingface/peft#2128 (comment)). You will need to define target_modules manually for your specific model. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.
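
As a rough illustration of how target_modules can be picked, the sketch below filters a list of module names down to the attention projection layers. The helper candidate_target_modules is hypothetical, and the module names are written by hand to mirror the usual BERT and DistilBERT layouts, so please verify them against your own model.

```python
def candidate_target_modules(module_names, wanted=("query", "value", "q_lin", "v_lin")):
    """Return the sorted leaf module names that match one of `wanted`."""
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    return sorted(leaves & set(wanted))


# Hand-written examples of attention layer names (assumed, not read from a model).
bert_names = [
    "bert.encoder.layer.0.attention.self.query",
    "bert.encoder.layer.0.attention.self.key",
    "bert.encoder.layer.0.attention.self.value",
]
distilbert_names = [
    "distilbert.transformer.layer.0.attention.q_lin",
    "distilbert.transformer.layer.0.attention.k_lin",
    "distilbert.transformer.layer.0.attention.v_lin",
]
print(candidate_target_modules(bert_names))        # ['query', 'value']
print(candidate_target_modules(distilbert_names))  # ['q_lin', 'v_lin']
```

With a loaded model, the names can be collected with [name for name, _ in model.named_modules()], and the resulting list passed as target_modules in LoraConfig.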

@thuytrang32
Author

But target_modules is just a parameter of LoraConfig. Why does the error persist even when I don't pass lora_config=LoraConfig(...)?

@thuytrang32
Author

thuytrang32 commented Jan 7, 2025

It didn't work even though I set both no_cuda=True and use_cpu=True:
image

Also, the BERT model is only 1.3GB. Why is 20GB of memory per worker still not enough?
image

@helenxie-bit
Contributor

But target_module is just parameter of LoraConfig, why i already tried not to use lora_config = LoraConfig(....) but the error still existed ?

Oh I see. It seems that when lora_config is not explicitly set, the API assigns its default values and passes them into the container, as shown in the output of kubectl get pytorchjob -n <NAMESPACE> -o yaml:

- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
"modules_to_save": null, "init_lora_weights": true}'

As a result, the trainer still attempts to configure the PEFT model as indicated in the script:

import json

from peft import LoraConfig, get_peft_model


def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    reference_lora_config = LoraConfig()
    # Cast each value to the type used by the default LoraConfig.
    for key, val in lora_config.__dict__.items():
        old_attr = getattr(reference_lora_config, key, None)
        if old_attr is not None:
            val = type(old_attr)(val)
        setattr(lora_config, key, val)
    model.enable_input_require_grads()
    # This call runs even when the user did not pass lora_config.
    model = get_peft_model(model, lora_config)
    return model

It seems that even without specifying lora_config, it is still included in the fine-tuning process. Could you try setting lora_config and explicitly defining target_modules to see if that resolves the issue?

Meanwhile, @andreyvelich do you think this might be a bug?
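
One way the trainer could avoid this, sketched here as an assumption and not as the project's actual fix, is to skip the PEFT setup when the serialized config carries no target_modules. The function lora_requested is hypothetical, and the JSON strings are shortened versions of the --lora_config argument shown above.

```python
import json


def lora_requested(lora_config_json: str) -> bool:
    """Hypothetical guard: treat a config without target_modules as 'no LoRA'."""
    if not lora_config_json:
        return False
    config = json.loads(lora_config_json)
    return bool(config.get("target_modules"))


default_cfg = '{"peft_type": "LORA", "r": 8, "target_modules": null, "bias": "none"}'
explicit_cfg = '{"peft_type": "LORA", "r": 8, "target_modules": ["q_lin", "v_lin"]}'
print(lora_requested(default_cfg))   # False
print(lora_requested(explicit_cfg))  # True
```

The trade-off is that this guard would also disable LoRA for architectures where PEFT can infer target_modules by itself, so a real fix would more likely have the SDK omit --lora_config when the user does not pass one.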

@helenxie-bit
Contributor

For the CUDA issue, sorry I don’t have a solution at the moment. It might be related to the base image used in the trainer. @andreyvelich @deepanker13 @johnugeorge @saileshd1402 Do you have any insights on this?

Regarding the memory issue, the trainer image is quite large, so please ensure the device has at least 10GB of available memory. It could be a memory constraint on the device you’re using. Could you confirm if the device meets this requirement?

@thuytrang32

This comment was marked as resolved.

@andreyvelich
Member

Meanwhile, @andreyvelich do you think this might be a bug?

Yes, I think we should fix it if users want to use train API without LoRA.
cc @deepanker13 @saileshd1402 @johnugeorge

@andreyvelich
Member

@thuytrang32 Could you please try the CPU image for your Trainer?
You can use the image that I built locally:

export TRAINER_TRANSFORMER_IMAGE=docker.io/andreyvelichkevich/llm-trainer-cpu

@deepanker13
Contributor

deepanker13 commented Jan 13, 2025

@thuytrang32 Can you please share the results with a smaller num_procs_per_worker (maybe 2 for now), cpu in resources_per_worker reduced to 8, ddp_backend="gloo" set, and the gpu key removed from resources? It may be a resource issue.
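
A back-of-the-envelope calculation shows why the process count matters for memory. Each torchrun process is a separate Python process holding its own copy of the model, and all of them share the pod's memory limit. The numbers below reuse the 20G limit and the 1.3GB model size quoted earlier in the thread. The helper memory_per_process_gb is hypothetical, and real usage is higher than the model size alone because of gradients, optimizer state, activations, and the Python runtime.

```python
def memory_per_process_gb(pod_memory_gb: float, num_procs_per_worker: int) -> float:
    """Evenly split the pod memory limit across the torchrun processes."""
    return pod_memory_gb / num_procs_per_worker


MODEL_SIZE_GB = 1.3  # model size quoted earlier in the thread

for procs in (20, 2):
    share = memory_per_process_gb(20, procs)
    fits = share > MODEL_SIZE_GB
    print(f"{procs} procs -> {share:.1f} GB each, fits one model copy: {fits}")
# 20 procs -> 1.0 GB each, fits one model copy: False
# 2 procs -> 10.0 GB each, fits one model copy: True
```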
