[USERGUIDE] LLM Hyperparameter Optimization API #3952

Open · wants to merge 22 commits into base: master

Conversation

mahdikhashan (Author):

ref: #3951


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @mahdikhashan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mahdikhashan (Author) commented Jan 7, 2025:

Hi @andreyvelich, shall I keep it under user-guides/hp-tuning/?

andreyvelich (Member):

Sure, I think we can create a new page for this feature.
FYI, please follow the contribution guide to sign the commits: https://www.kubeflow.org/docs/about/contributing/#getting-started
cc @helenxie-bit

andreyvelich (Member):

Part of: kubeflow/katib#2339

Arhell (Member) left a comment:

/ok-to-test

mahdikhashan (Author):

I'm not sure why the Netlify tests are failing. I checked the logs and it seems my links are not appropriate; I'd appreciate a hint on how to fix them.

helenxie-bit (Contributor) left a comment:

Thank you so much for your contribution! It’s very detailed, and I’m sure it will greatly help users get started with this API. I’ve left my initial comments for your review.

@@ -0,0 +1,337 @@
+++
title = "How to Optimize Language Models Hyperparameters"
helenxie-bit (Contributor):

Since "Large Language Model" is a standard term, I suggest updating the title to: "How to Optimize Large Language Models (LLMs) Hyperparameters" for consistency and clarity.

mahdikhashan (Author):

I agree with you.

mahdikhashan (Author):

Done.

weight = 20
+++

This page describes the Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
helenxie-bit (Contributor):

Same as above.

mahdikhashan (Author):

Done.

## Sections
- [Prerequisites](#Prerequisites)
- [Load Model and Dataset](#Load-Model-and-Dataset)
- [Finetune](#Finetune-Language-Models)
helenxie-bit (Contributor):

In the user guide for the train API in the Training Operator, we titled it "How to Fine-Tune LLMs with Kubeflow". To differentiate between these two APIs and provide more clarity for users, I suggest replacing all instances of "fine-tune" in this user guide with "hyperparameter optimization" or "optimizing hyperparameters". WDYT @andreyvelich 👀


## Prerequisites

You need to install the following Katib components to run code in this guide:
helenxie-bit (Contributor):

Since the Kubeflow Training SDK (which includes Transformers and PEFT) is already integrated into the extras_require of the Katib Python SDK, users only need to install the following to use this API:

  1. Katib control plane;
  2. Katib Python SDK with LLM hyperparameter optimization support: pip install -U kubeflow-katib[huggingface]

Additionally, this API supports both non-distributed training and PyTorchJob distributed training. To enable PyTorchJob distributed training, in addition to the two prerequisites above, users also need to install the Training Operator control plane.
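
As a quick, illustrative sanity check (a sketch, assuming the `[huggingface]` extra bundles Transformers and PEFT as described above), these imports should succeed once the SDK is installed:

```python
# Illustrative check: all three imports are provided by
# `pip install -U kubeflow-katib[huggingface]`.
import kubeflow.katib as katib  # Katib Python SDK
import transformers             # pulled in via the [huggingface] extra
import peft                     # parameter-efficient fine-tuning (LoRA)
```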

mahdikhashan (Author):

Done.

| **Parameter**              | **Type**                            | **Description**                                                                                   |
|----------------------------|-------------------------------------|---------------------------------------------------------------------------------------------------|
| `training_parameters` | `transformers.TrainingArguments` | Contains the training arguments like learning rate, epochs, batch size, etc. |
| `lora_config` | `LoraConfig` | LoRA configuration to reduce the number of trainable parameters in the model. |
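
To make the table concrete, here is a minimal sketch of a `trainer_parameters` object, assuming the SDK's `HuggingFaceTrainerParams` wrapper and search primitives (the import path and all values are illustrative):

```python
import transformers
from peft import LoraConfig

import kubeflow.katib as katib
# Assumed import path for the wrapper; it may differ across SDK versions.
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

trainer_parameters = HuggingFaceTrainerParams(
    # Standard Hugging Face training arguments; a Katib search object can
    # stand in for any value that should be tuned.
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        num_train_epochs=1,
    ),
    # LoRA keeps the number of trainable parameters small; the rank `r`
    # is tuned here as well.
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
    ),
)
```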

helenxie-bit (Contributor):

I suggest adding a paragraph here to explain how to define the hyperparameter search space. Katib currently supports three hyperparameter search methods: float, int, and categorical. Their usage is documented here: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. And we could include example usage of these methods in the 'Example Usage' section below.
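
For illustration, the three search primitives might look like this (a sketch; exact signatures are in the `search.py` linked above):

```python
import kubeflow.katib as katib

# Continuous range, e.g. a learning rate.
lr = katib.search.double(min=1e-05, max=5e-05)

# Integer range, e.g. a LoRA rank.
rank = katib.search.int(min=8, max=32)

# Fixed set of choices, e.g. an optimizer name.
optimizer = katib.search.categorical(["adamw_torch", "sgd"])
```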

mahdikhashan (Author):

Done. Thanks for reminding me of this.

| **Parameter**              | **Description**                                     | **Required** |
|----------------------------|-----------------------------------------------------|--------------|
| `env_per_trial`            | Environment variables for each trial.               | Optional     |
| `algorithm_name` | Algorithm used for the hyperparameter search. | Required |
| `algorithm_settings` | Settings for the search algorithm. | Optional |
| `objective_metric_name` | Name of the objective metric for optimization. | Required |
helenxie-bit (Contributor):

I think objective_metric_name is optional and defaults to "None" if not specified. However, since setting an objective metric is essential for a meaningful experiment, it should actually be required. Alternatively, do you think we should update the default value to a meaningful metric name, such as 'train_loss' in the API? @andreyvelich

mahdikhashan (Author):

Changed to optional, done. Let's fix it in another issue: kubeflow/katib#2481.

| `packages_to_install` | List of additional Python packages to install. | Optional |
| `pip_index_url` | The PyPI URL from which to install Python packages. | Optional |
| `metrics_collector_config` | Configuration for the metrics collector. | Optional |

helenxie-bit (Contributor):

I suggest adding three notes here for clarity:

  1. For LLM hyperparameter optimization, we currently support only train_loss as the objective metric (this is because it is the default metric generated by the trainer.train() function in Hugging Face, which our trainer uses). We plan to add support for more metrics in the future.

  2. Users might feel confused about the presence of parameters like objective, base_image, and parameters, and what their relationships are with the model_provider_parameters, dataset_provider_parameters, and trainer_parameters we've explained.
    As mentioned, objective, base_image, and parameters are original parameters in this API which allow custom objective functions, images, and parameters for optimization, and we kept this option. Therefore, we could explain this by stating:
    If you want to define your own objective function, you need to specify objective, base_image, and parameters instead. See usage examples here: https://www.kubeflow.org/docs/components/katib/getting-started/.

  3. To enable PyTorchJob distributed training for hyperparameter optimization, users can specify a types.TrainerResources object for resources_per_trial. For example:

```python
resources_per_trial=types.TrainerResources(
    num_workers=4,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 5, "memory": "10G"},
),
```

mahdikhashan (Author):

Done. Thanks for your detailed response.

| **Parameter** | **Description** | **Required** |
|----------------------------------|---------------------------------------------------------------------------------|--------------|
| `name` | Name of the experiment. | Required |
| `model_provider_parameters` | Parameters for the model provider, such as model type and configuration. | Required |
helenxie-bit (Contributor):

I think model_provider_parameters, dataset_provider_parameters, and trainer_parameters are optional and default to "None" if not specified. This is because we allow users the flexibility to define custom objective functions, images, and parameters for optimization. Users have two options:

  1. Import models and datasets from external platforms by specifying model_provider_parameters, dataset_provider_parameters, and trainer_parameters.

  2. Define a custom objective function by specifying objective, base_image, and parameters.

These parameters are all optional, but the API will check their existence to ensure consistency.
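
A minimal sketch of the second option (custom objective), following the pattern from the getting-started guide linked above; the experiment name, parameters, and metric are illustrative:

```python
import kubeflow.katib as katib

# The objective function runs inside each trial and reports metrics by
# printing them to stdout in "name=value" form.
def objective(parameters):
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")

katib.KatibClient(namespace="kubeflow").tune(
    name="custom-objective-example",  # illustrative name
    objective=objective,
    parameters={
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    max_trial_count=12,
)
```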

mahdikhashan (Author):

OK, I never noticed that in the code. I'll mention both scenarios. Thanks for bringing this to my attention 🙏.

mahdikhashan (Author):

Done.

| `metrics_collector_config` | Configuration for the metrics collector. | Optional |


### Example:
helenxie-bit (Contributor):

I suggest not adding this example here, as it would be very similar to the code provided in the "Example: Fine-Tuning Llama-3.2 for Binary Classification on the IMDB Dataset" section.

mahdikhashan (Author):

Done.

| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |

(Opening of the guide's Python example code block; the full listing is truncated in this view.)
helenxie-bit (Contributor):

Have you tested this code example? Does it work well? If not, I can run this example locally when I have time and let you know if I encounter any issues.

mahdikhashan (Author):

I have created an issue for myself to add a notebook for it. I'll try to do it myself (for learning purposes) and will contact you about any issues I encounter. Would it be possible for me to message you on Slack?

related issue: kubeflow/katib#2480
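
For reference while the notebook is pending, here is a minimal sketch of what a full `tune()` call for LLM hyperparameter optimization might look like, assuming the provider classes and `types.TrainerResources` discussed above (import paths, the model, and the dataset are illustrative):

```python
import transformers
from peft import LoraConfig

import kubeflow.katib as katib
# Assumed import paths; they may differ across SDK versions.
from kubeflow.katib import KatibClient, types
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)

cl = KatibClient(namespace="kubeflow")
cl.tune(
    name="llm-hp-optimization",
    # Option 1: import the model and dataset from Hugging Face Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:1000]",
    ),
    # Hyperparameter search space embedded in the trainer arguments.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
        ),
    ),
    # train_loss is currently the only supported objective metric.
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,  # matches the example table above
    resources_per_trial=types.TrainerResources(
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
    ),
)
```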
