[USERGUIDE] LLM Hyperparameter Optimization API #3952

Open · wants to merge 22 commits into base: master

Conversation

mahdikhashan (Author):

ref: #3951


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @mahdikhashan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mahdikhashan (Author) commented Jan 7, 2025:

Hi @andreyvelich, shall I keep it under user-guides/hp-tuning/?

andreyvelich (Member):

Sure, I think we can create a new page for this feature.
FYI, please follow the contribution guide to sign the commits: https://www.kubeflow.org/docs/about/contributing/#getting-started
cc @helenxie-bit

andreyvelich (Member):

Part of: kubeflow/katib#2339

Arhell (Member) left a comment:

/ok-to-test

mahdikhashan (Author):

I'm not sure why the Netlify tests are failing. I checked the logs and it seems my links are not appropriate; I'd appreciate a hint on how to fix them.

helenxie-bit (Contributor) left a comment:

Thank you so much for your contribution! It’s very detailed, and I’m sure it will greatly help users get started with this API. I’ve left my initial comments for your review.

@@ -0,0 +1,337 @@
+++
title = "How to Optimize Language Models Hyperparameters"
helenxie-bit (Contributor):

Since "Large Language Model" is a standard term, I suggest updating the title to: "How to Optimize Large Language Models (LLMs) Hyperparameters" for consistency and clarity.

mahdikhashan (Author):

I agree with you.

mahdikhashan (Author):

Done.

weight = 20
+++

This page describes the Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
helenxie-bit (Contributor):

Same as above.

mahdikhashan (Author):

Done.

## Sections
- [Prerequisites](#Prerequisites)
- [Load Model and Dataset](#Load-Model-and-Dataset)
- [Finetune](#Finetune-Language-Models)
helenxie-bit (Contributor):

In the user guide for the train API in the Training Operator, we titled it "How to Fine-Tune LLMs with Kubeflow". To differentiate between these two APIs and provide more clarity for users, I suggest replacing all instances of "fine-tune" in this user guide with "hyperparameter optimization" or "optimizing hyperparameters". WDYT @andreyvelich 👀


## Prerequisites

You need to install the following Katib components to run code in this guide:
helenxie-bit (Contributor):

Since the Kubeflow Training SDK (which includes Transformers and PEFT) is already integrated into the extras_require of the Katib Python SDK, users only need to install the following to use this API:

  1. Katib control plane;
  2. Katib Python SDK with LLM hyperparameter optimization support: pip install -U kubeflow-katib[huggingface]

Additionally, this API supports both non-distributed training and PyTorchJob distributed training. To enable PyTorchJob distributed training, in addition to the two prerequisites above, users also need to install the Training Operator control plane.
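
As a quick, illustrative sanity check (a sketch, assuming the `[huggingface]` extra bundles Transformers and PEFT as described above), these imports should succeed once the SDK is installed:

```python
# Illustrative check: all three imports are provided by
# `pip install -U kubeflow-katib[huggingface]`.
import kubeflow.katib as katib  # Katib Python SDK
import transformers             # pulled in via the [huggingface] extra
import peft                     # parameter-efficient fine-tuning (LoRA)
```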

mahdikhashan (Author):

Done.

| **Parameter**              | **Type**                            | **Description**                                                                                   |
|----------------------------|-------------------------------------|---------------------------------------------------------------------------------------------------|
| `training_parameters` | `transformers.TrainingArguments` | Contains the training arguments like learning rate, epochs, batch size, etc. |
| `lora_config` | `LoraConfig` | LoRA configuration to reduce the number of trainable parameters in the model. |
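
To make the table concrete, here is a minimal sketch of a `trainer_parameters` object, assuming the SDK's `HuggingFaceTrainerParams` wrapper and search primitives (the import path and all values are illustrative):

```python
import transformers
from peft import LoraConfig

import kubeflow.katib as katib
# Assumed import path for the wrapper; it may differ across SDK versions.
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

trainer_parameters = HuggingFaceTrainerParams(
    # Standard Hugging Face training arguments; a Katib search object can
    # stand in for any value that should be tuned.
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        num_train_epochs=1,
    ),
    # LoRA keeps the number of trainable parameters small; the rank `r`
    # is tuned here as well.
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
    ),
)
```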

helenxie-bit (Contributor):

I suggest adding a paragraph here to explain how to define the hyperparameter search space. Katib currently supports three hyperparameter search methods: float, int, and categorical. Their usage is documented here: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. And we could include example usage of these methods in the 'Example Usage' section below.
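
For illustration, the three search primitives might look like this (a sketch; exact signatures are in the `search.py` linked above):

```python
import kubeflow.katib as katib

# Continuous range, e.g. a learning rate.
lr = katib.search.double(min=1e-05, max=5e-05)

# Integer range, e.g. a LoRA rank.
rank = katib.search.int(min=8, max=32)

# Fixed set of choices, e.g. an optimizer name.
optimizer = katib.search.categorical(["adamw_torch", "sgd"])
```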

mahdikhashan (Author):

Done. Thanks for reminding me of this.

| **Parameter**              | **Description**                                     | **Required** |
|----------------------------|-----------------------------------------------------|--------------|
| `env_per_trial`            | Environment variables for each trial.               | Optional     |
| `algorithm_name` | Algorithm used for the hyperparameter search. | Required |
| `algorithm_settings` | Settings for the search algorithm. | Optional |
| `objective_metric_name` | Name of the objective metric for optimization. | Required |
helenxie-bit (Contributor):

I think objective_metric_name is optional and defaults to "None" if not specified. However, since setting an objective metric is essential for a meaningful experiment, it should actually be required. Alternatively, do you think we should update the default value to a meaningful metric name, such as 'train_loss' in the API? @andreyvelich

mahdikhashan (Author):

Changed to optional, done. Let's fix it in another issue: kubeflow/katib#2481.

| `packages_to_install` | List of additional Python packages to install. | Optional |
| `pip_index_url` | The PyPI URL from which to install Python packages. | Optional |
| `metrics_collector_config` | Configuration for the metrics collector. | Optional |

helenxie-bit (Contributor):

I suggest adding three notes here for clarity:

  1. For LLM hyperparameter optimization, we currently support only train_loss as the objective metric (this is because it is the default metric generated by the trainer.train() function in Hugging Face, which our trainer uses). We plan to add support for more metrics in the future.

  2. Users might feel confused about the presence of parameters like objective, base_image, and parameters, and what their relationships are with the model_provider_parameters, dataset_provider_parameters, and trainer_parameters we've explained.
    As mentioned, objective, base_image, and parameters are original parameters in this API which allow custom objective functions, images, and parameters for optimization, and we kept this option. Therefore, we could explain this by stating:
    If you want to define your own objective function, you need to specify objective, base_image, and parameters instead. See usage examples here: https://www.kubeflow.org/docs/components/katib/getting-started/.

  3. To enable PyTorchJob distributed training for hyperparameter optimization, users can specify a types.TrainerResources object for resources_per_trial. For example:

```python
resources_per_trial=types.TrainerResources(
    num_workers=4,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 5, "memory": "10G"},
),
```

mahdikhashan (Author):

Done. Thanks for your detailed response.

| **Parameter** | **Description** | **Required** |
|----------------------------------|---------------------------------------------------------------------------------|--------------|
| `name` | Name of the experiment. | Required |
| `model_provider_parameters` | Parameters for the model provider, such as model type and configuration. | Required |
helenxie-bit (Contributor):

I think model_provider_parameters, dataset_provider_parameters, and trainer_parameters are optional and default to "None" if not specified. This is because we allow users the flexibility to define custom objective functions, images, and parameters for optimization. Users have two options:

  1. Import models and datasets from external platforms by specifying model_provider_parameters, dataset_provider_parameters, and trainer_parameters.

  2. Define a custom objective function by specifying objective, base_image, and parameters.

These parameters are all optional, but the API will check their existence to ensure consistency.
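
A minimal sketch of the second option (custom objective), following the pattern from the getting-started guide linked above; the experiment name, parameters, and metric are illustrative:

```python
import kubeflow.katib as katib

# The objective function runs inside each trial and reports metrics by
# printing them to stdout in "name=value" form.
def objective(parameters):
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")

katib.KatibClient(namespace="kubeflow").tune(
    name="custom-objective-example",  # illustrative name
    objective=objective,
    parameters={
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    max_trial_count=12,
)
```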

mahdikhashan (Author):

OK, I never noticed that in the code. I'll mention both scenarios. Thanks for bringing this to my attention 🙏.

mahdikhashan (Author):

Done.

| `metrics_collector_config` | Configuration for the metrics collector. | Optional |


### Example:
helenxie-bit (Contributor):

I suggest not adding this example here, as it would be very similar to the code provided in the "Example: Fine-Tuning Llama-3.2 for Binary Classification on the IMDB Dataset" section.

mahdikhashan (Author):

Done.

| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |

(Opening of the guide's Python example code block; the full listing is truncated in this view.)
helenxie-bit (Contributor):

Have you tested this code example? Does it work well? If not, I can run this example locally when I have time and let you know if I encounter any issues.

mahdikhashan (Author):

I have created an issue for myself to add a notebook for it. I'll try to do it myself (for learning purposes) and will contact you about any issues I encounter. Would it be possible for me to message you on Slack?

related issue: kubeflow/katib#2480
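
For reference while the notebook is pending, here is a minimal sketch of what a full `tune()` call for LLM hyperparameter optimization might look like, assuming the provider classes and `types.TrainerResources` discussed above (import paths, the model, and the dataset are illustrative):

```python
import transformers
from peft import LoraConfig

import kubeflow.katib as katib
# Assumed import paths; they may differ across SDK versions.
from kubeflow.katib import KatibClient, types
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)

cl = KatibClient(namespace="kubeflow")
cl.tune(
    name="llm-hp-optimization",
    # Option 1: import the model and dataset from Hugging Face Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:1000]",
    ),
    # Hyperparameter search space embedded in the trainer arguments.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
        ),
    ),
    # train_loss is currently the only supported objective metric.
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,  # matches the example table above
    resources_per_trial=types.TrainerResources(
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
    ),
)
```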
