- Fine-Tune LLMs with Ray and DeepSpeed on OpenShift AI
- Admin access to an OpenShift cluster (CRC is fine)
- OpenDataHub or RHOAI installed, with all Distributed Workload components enabled
- Go 1.21 installed
- CODEFLARE_TEST_OUTPUT_DIR - Output directory for test logs
- CODEFLARE_TEST_TIMEOUT_SHORT - Timeout duration for short tasks
- CODEFLARE_TEST_TIMEOUT_MEDIUM - Timeout duration for medium tasks
- CODEFLARE_TEST_TIMEOUT_LONG - Timeout duration for long tasks
- CODEFLARE_TEST_RAY_IMAGE (Optional) - Ray image used for the RayCluster configuration

  NOTE: quay.io/modh/ray:2.35.0-py39-cu121 is the default image used for creating a RayCluster resource. To use a custom Ray image instead, specify it in the CODEFLARE_TEST_RAY_IMAGE environment variable.
- FMS_HF_TUNING_IMAGE - Image tag used in the PyTorchJob CR for model training
- ODH_NAMESPACE - Namespace where ODH components are installed
- NOTEBOOK_USER_NAME - Username of the user running the Workbench
- NOTEBOOK_USER_TOKEN - Login token of the user running the Workbench
- NOTEBOOK_IMAGE - Image used for running the Workbench
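The variables above can be exported in the shell before running the tests. The following is a minimal sketch: every value is a placeholder chosen for illustration, and it assumes that the timeout variables accept Go-style duration strings such as 5m and that ODH is installed in a namespace named opendatahub. Neither assumption is confirmed here, so adjust them for your cluster.

```shell
# Placeholder values for illustration only; adjust for your environment.
export CODEFLARE_TEST_OUTPUT_DIR="${TMPDIR:-/tmp}/codeflare-test-output"

# Assumption: timeouts are given as Go-style duration strings.
export CODEFLARE_TEST_TIMEOUT_SHORT="1m"
export CODEFLARE_TEST_TIMEOUT_MEDIUM="5m"
export CODEFLARE_TEST_TIMEOUT_LONG="20m"

# Assumption: ODH components live in a namespace called "opendatahub".
export ODH_NAMESPACE="opendatahub"

# Optional: uncomment to override the default Ray image with your own.
# export CODEFLARE_TEST_RAY_IMAGE="quay.io/modh/ray:2.35.0-py39-cu121"

# Make sure the log directory exists before the tests write to it.
mkdir -p "$CODEFLARE_TEST_OUTPUT_DIR"
```

FMS_HF_TUNING_IMAGE and the NOTEBOOK_* variables are set the same way, using the image references and user credentials for your own environment.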
To download the datasets for the MNIST training script from S3-compatible storage, set the following environment variables:
- AWS_DEFAULT_ENDPOINT - Storage bucket endpoint from which to download MNIST datasets
- AWS_ACCESS_KEY_ID - Storage bucket access key
- AWS_SECRET_ACCESS_KEY - Storage bucket secret key
- AWS_STORAGE_BUCKET - Storage bucket name
- AWS_STORAGE_BUCKET_MNIST_DIR - Storage bucket directory from which to download MNIST datasets
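If you plan to use S3-compatible storage, a quick pre-flight check can catch a missing variable before a long test run. The sketch below is not part of the test suite; check_s3_env is a hypothetical helper name, and it only verifies that each variable listed above is non-empty.

```shell
# Hypothetical helper: reports which S3-related variables are unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_s3_env() {
  missing=0
  for name in AWS_DEFAULT_ENDPOINT AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
              AWS_STORAGE_BUCKET AWS_STORAGE_BUCKET_MNIST_DIR; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_s3_env after exporting the variables prints one line per missing name to standard error and returns a non-zero status if any are absent.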
Execute tests like standard Go unit tests.
go test -timeout 60m ./tests/kfto/
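Because these are ordinary Go tests, the standard go test flags apply. The wrapper below is a small sketch: run_kfto_tests is a hypothetical helper name, -v and -run are standard go test flags, and the pattern you pass is assumed to match a test function name in the package.

```shell
# Hypothetical wrapper around the command above.
# $1 (optional): regular expression passed to -run to select tests by name.
run_kfto_tests() {
  if [ -n "${1:-}" ]; then
    go test -v -timeout 60m -run "$1" ./tests/kfto/
  else
    go test -v -timeout 60m ./tests/kfto/
  fi
}
```

Invoking run_kfto_tests with no argument runs the whole package with verbose output, while passing a pattern restricts the run to the tests whose names match it.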