[USERGUIDE] LLM Hyperparameter Optimization API #3952
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Hi @mahdikhashan. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @andreyvelich, shall I keep it under …?

Sure, I think we can create a new page for this feature.

Part of: kubeflow/katib#2339
/ok-to-test
I'm not sure why the Netlify tests are failing. I checked the logs, and it seems my links are not appropriate; I'd appreciate a hint on how to fix them.
Thank you so much for your contribution! It’s very detailed, and I’m sure it will greatly help users get started with this API. I’ve left my initial comments for your review.
@@ -0,0 +1,337 @@
+++
title = "How to Optimize Language Models Hyperparameters"
Since "Large Language Model" is a standard term, I suggest updating the title to: "How to Optimize Large Language Models (LLMs) Hyperparameters" for consistency and clarity.
I agree with you.
done.
weight = 20
+++
This page describes Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure |
Same as above.
done.
## Sections
- [Prerequisites](#Prerequisites)
- [Load Model and Dataset](#Load-Model-and-Dataset)
- [Finetune](#Finetune-Language-Models)
In the user guide for the `train` API in the Training Operator, we titled it "How to Fine-Tune LLMs with Kubeflow". To differentiate between these two APIs and provide more clarity for users, I suggest replacing all instances of "fine-tune" in this user guide with "hyperparameter optimization" or "optimizing hyperparameters". WDYT @andreyvelich 👀
## Prerequisites

You need to install the following Katib components to run code in this guide:
Since the Kubeflow Training SDK (which includes Transformers and PEFT) is already integrated into the `extra_requires` of the Katib Python SDK, users only need to install the following to use this API:

- Katib control plane;
- Katib Python SDK with LLM hyperparameter optimization support: `pip install -U kubeflow-katib[huggingface]`

Additionally, this API supports both non-distributed training and PyTorchJob distributed training. To enable PyTorchJob distributed training, in addition to the two prerequisites above, users also need to install the Training Operator control plane.
done
|----------------------------|-------------------------------------|---------------------------------------------------------------------------------|
| `training_parameters` | `transformers.TrainingArguments` | Contains the training arguments like learning rate, epochs, batch size, etc. |
| `lora_config` | `LoraConfig` | LoRA configuration to reduce the number of trainable parameters in the model. |
I suggest adding a paragraph here to explain how to define the hyperparameter search space. Katib currently supports three hyperparameter search methods: float, int, and categorical. Their usage is documented here: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. And we could include example usage of these methods in the 'Example Usage' section below.
Done. Thanks for reminding me of this.
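For reference, a minimal sketch of what the suggested search-space explanation could illustrate, using the three `kubeflow.katib.search` methods linked above (the hyperparameter names, ranges, and the `HuggingFaceTrainerParams` wiring are illustrative assumptions, not the guide's final code):

```python
import kubeflow.katib as katib
import transformers
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams
from peft import LoraConfig

# Search values are embedded directly into the trainer parameters.
# katib.search offers three methods: double (float), int, and categorical.
trainer_parameters = HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        save_strategy="no",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),            # float range
        num_train_epochs=katib.search.int(min=1, max=5),                    # integer range
        per_device_train_batch_size=katib.search.categorical([8, 16, 32]),  # categorical
    ),
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),  # LoRA rank searched as an integer
        lora_alpha=8,
        lora_dropout=0.1,
        bias="none",
    ),
)
```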
| `env_per_trial` | Environment variables for each trial. | Optional |
| `algorithm_name` | Algorithm used for the hyperparameter search. | Required |
| `algorithm_settings` | Settings for the search algorithm. | Optional |
| `objective_metric_name` | Name of the objective metric for optimization. | Required |
I think `objective_metric_name` is optional and defaults to `None` if not specified. However, since setting an objective metric is essential for a meaningful experiment, it should actually be required. Alternatively, do you think we should update the default value to a meaningful metric name, such as `train_loss`, in the API? @andreyvelich
Changed to optional; done. Let's fix the default value in another issue: kubeflow/katib#2481
| `packages_to_install` | List of additional Python packages to install. | Optional |
| `pip_index_url` | The PyPI URL from which to install Python packages. | Optional |
| `metrics_collector_config` | Configuration for the metrics collector. | Optional |
I suggest adding three notes here for clarity:

- For LLM hyperparameter optimization, we currently support only `train_loss` as the objective metric (this is because it is the default metric generated by the `trainer.train()` function in Hugging Face, which our trainer uses). We plan to add support for more metrics in the future.
- Users might feel confused about the presence of parameters like `objective`, `base_image`, and `parameters`, and what their relationship is with the `model_provider_parameters`, `dataset_provider_parameters`, and `trainer_parameters` we've explained. As mentioned, `objective`, `base_image`, and `parameters` are the original parameters of this API, which allow custom objective functions, images, and parameters for optimization, and we kept this option. Therefore, we could explain this by stating: if you want to define your own objective function, you need to specify `objective`, `base_image`, and `parameters` instead. See usage examples here: https://www.kubeflow.org/docs/components/katib/getting-started/.
- To enable PyTorchJob distributed training for hyperparameter optimization, users can specify a `types.TrainerResources` object for `resources_per_trial`. For example:

```python
resources_per_trial=types.TrainerResources(
    num_workers=4,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 5, "memory": "10G"},
),
```
Done. Thanks for your detailed response.
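To make the three notes above concrete, here is a rough end-to-end sketch of a `tune()` call combining external model/dataset providers with PyTorchJob distributed training. The import paths follow the snippet above, and the experiment name, model, dataset, and resource values are illustrative assumptions that may need adjusting to your SDK version:

```python
import kubeflow.katib as katib
import transformers
from kubeflow.katib import KatibClient, types
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from peft import LoraConfig

client = KatibClient(namespace="kubeflow")

client.tune(
    name="llm-hpo-example",  # hypothetical experiment name
    # Option 1: import the model and dataset from external platforms.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:1000]",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            save_strategy="no",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=katib.search.int(min=1, max=3),
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    # Only train_loss is currently supported as the objective metric.
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,
    # PyTorchJob distributed training, per the note above.
    resources_per_trial=types.TrainerResources(
        num_workers=2,
        num_procs_per_worker=2,
        resources_per_worker={"gpu": 2, "cpu": 5, "memory": "10G"},
    ),
)
```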
| **Parameter** | **Description** | **Required** |
|----------------------------------|---------------------------------------------------------------------------|--------------|
| `name` | Name of the experiment. | Required |
| `model_provider_parameters` | Parameters for the model provider, such as model type and configuration. | Required |
I think `model_provider_parameters`, `dataset_provider_parameters`, and `trainer_parameters` are optional and default to `None` if not specified. This is because we allow users the flexibility to define custom objective functions, images, and parameters for optimization. Users have two options:

- Import models and datasets from external platforms by specifying `model_provider_parameters`, `dataset_provider_parameters`, and `trainer_parameters`.
- Define a custom objective function by specifying `objective`, `base_image`, and `parameters`.

These parameters are all optional, but the API will check their existence to ensure consistency.
OK, I never noticed that in the code. I'll mention both scenarios. Thanks for bringing this to my attention 🙏.
done.
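For the second option, a minimal sketch of a custom objective function, following the getting-started pattern linked earlier (the function body, parameter names, metric name, and experiment name are illustrative):

```python
import kubeflow.katib as katib

# A trial's objective function: Katib substitutes concrete values for the
# search parameters and parses metrics from lines printed as "name=value".
def objective(parameters):
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    print(f"result={result}")

katib.KatibClient(namespace="kubeflow").tune(
    name="custom-objective-example",
    objective=objective,  # custom objective instead of provider parameters
    parameters={
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2),
    },
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"cpu": "2"},
)
```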
| `metrics_collector_config` | Configuration for the metrics collector. | Optional |
### Example: |
I suggest not adding this example here, as it would be very similar to the code provided in the "Example: Fine-Tuning Llama-3.2 for Binary Classification on the IMDB Dataset" section.
done
| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |
Have you tested this code example? Does it work well? If not, I can run this example locally when I have time and let you know if I encounter any issues.
I have created an issue for myself to add a notebook for it. I'll try to do it myself (for learning purposes) and will contact you with any issues I may encounter. Would it be possible for me to message you on Slack?

Related issue: kubeflow/katib#2480
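As a starting point for that notebook, a small sketch of how the experiment's outcome could be checked with the Katib SDK (`is_experiment_succeeded` and `get_optimal_hyperparameters` are existing `KatibClient` methods; the experiment name is the hypothetical one from the earlier sketch):

```python
from kubeflow.katib import KatibClient

client = KatibClient(namespace="kubeflow")

# Once the experiment finishes, read back the best trial's
# hyperparameter assignments and objective value.
name = "llm-hpo-example"  # hypothetical name from the earlier sketch
if client.is_experiment_succeeded(name=name):
    best = client.get_optimal_hyperparameters(name=name)
    print(best)
```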
ref: #3951