
[gpu] strict driver and cuda version assignment #1275

Open

wants to merge 111 commits into master

Conversation

@cjac cjac commented Dec 13, 2024

Resolves Issues

gpu/install_gpu_driver.sh

  • Driver version defaults to the version named in the driver .run file, if one is specified

  • CUDA version defaults to the version named in the CUDA .run file, if one is specified

  • Use the .run file installation method exclusively for CUDA and driver installation

    • Install the non-open driver from a .run file on rocky8
  • Build NCCL from source, since that is the only mechanism that supports all Dataproc OSes

  • Wrap expensive functions in completion checks to reduce re-run time when testing manually

  • Cache build results in GCS

  • Wait on the apt lock when it exists

  • Only install build dependencies if a build is necessary

  • Fix a problem with the Ops Agent not installing; use a venv

  • Install gcc-12 on ubuntu22 to fix a kernel driver FTBFS

  • Mark task completion by creating a file rather than setting a variable

  • Add functions to check and report secure-boot and OS version details

  • Install PyTorch when the correct metadata attributes are specified, but not by default
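
The completion-check and marker-file bullets above can be sketched as follows. This is a minimal illustration of the pattern, not the script's actual code; the function names and the state directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: skip expensive steps on re-run by recording completion in a file.
# ${workdir} and all function names here are illustrative, not the script's.
workdir="${workdir:-/tmp/gpu-install-state}"
mkdir -p "${workdir}"

is_complete()   { test -f "${workdir}/$1.complete" ; }
mark_complete() { touch "${workdir}/$1.complete" ; }

# Run a step only if it has not already completed successfully.
run_once() {
  local step="$1"; shift
  if is_complete "${step}"; then
    echo "skipping ${step}: already complete"
    return 0
  fi
  "$@" && mark_complete "${step}"
}

expensive_build() { echo "building nccl..." ; }

run_once build-nccl expensive_build   # performs the build
run_once build-nccl expensive_build   # skipped on the second invocation
```

Because completion is recorded in a file rather than a shell variable, the state survives across separate invocations of the script, which is what makes manual re-testing fast.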

gpu/manual-test-runner.sh

  • Order commands correctly
  • Point to origin rather than the staging repo

gpu/test_gpu.py

  • Clearer test-skipping logic

@cjac cjac self-assigned this Dec 13, 2024
@cjac commented Dec 13, 2024

/gcbrun

1 similar comment

@cjac commented Dec 13, 2024

/gcbrun

@cjac commented Dec 14, 2024

/gcbrun

1 similar comment

@cjac commented Dec 14, 2024

/gcbrun

@cjac commented Dec 14, 2024

Tests took 43 minutes to run this time. I think it's because the binary driver build cache was invalidated. Let's see if this one takes more like 14 minutes.

@cjac commented Dec 14, 2024

The intention is for this PR to resolve issue #1268.

@@ -70,6 +70,7 @@ determine_tests_to_run() {
changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue
Contributor Author

must remove this before merging

@cjac commented Dec 14, 2024

Slightly longer than the 14 minutes I predicted; this run completed in 16:29.

@cjac commented Dec 14, 2024

/gcbrun

10 similar comments

@cjac commented Dec 14–15, 2024 (ten times)

/gcbrun

@cjac commented Dec 15, 2024

okay, taking a little break...

@cjac commented Jan 23, 2025

/gcbrun

…out requirements for ramdisk; roll back spark.SQLPlugin change
@cjac commented Jan 24, 2025

/gcbrun

1 similar comment
@cjac commented Jan 24, 2025

/gcbrun

@cjac commented Jan 24, 2025

Capacity Scheduler does not support GPU resources on Dataproc 2.0, so to use Spark with YARN we fall back to the Fair Scheduler. I know there are concurrency concerns at high load, but that concern can be managed, and there is no other option on 2.0 clusters.
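
The fallback amounts to pointing the ResourceManager at the Fair Scheduler in yarn-site.xml. A rough sketch, assuming GNU sed and the usual Hadoop config location; the property name and class are standard Hadoop, but the file path and the naive edit are illustrative of the idea, not the script's actual mechanism:

```shell
# Sketch: switch YARN from the Capacity Scheduler to the Fair Scheduler by
# setting the scheduler class. The sed edit inserts the property just before
# the closing tag; real code would update an existing property in place.
yarn_site="${yarn_site:-/etc/hadoop/conf/yarn-site.xml}"
fair_class="org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"

set_scheduler() {
  local file="$1" class="$2"
  sed -i "s|</configuration>|  <property>\n    <name>yarn.resourcemanager.scheduler.class</name>\n    <value>${class}</value>\n  </property>\n</configuration>|" "${file}"
}

set_scheduler "${yarn_site}" "${fair_class}"
```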

* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config
@cjac commented Jan 24, 2025

/gcbrun

@cjac commented Jan 24, 2025

/gcbrun

@cjac commented Jan 24, 2025

  • If the CUDA version and GPU driver version are not specified with metadata values, but the customer supplies a CUDA runfile URL, extract the driver and CUDA versions from the filename.
  • There are inconsistencies between the driver version mentioned in the filename and the driver version that can actually be installed. I have mapped these out, so that if a customer uses the latest minor version of any CUDA release from 11.7 through 12.6.3, we have a known-good driver for that CUDA version without their having to think about it. We also have likely-to-work configurations from 10.0 through 11.6, if the customer is willing to run an older Dataproc subminor version with equally old kernels.
  • The customer can explicitly specify the driver version with either the gpu-driver-version or gpu-driver-url metadata value.
  • The default CUDA version is selected based on the Dataproc image version.
  • The YARN Capacity Scheduler from Hadoop 3.2.4 does not recognize the gpu resource type and so must be replaced with the Fair Scheduler. NB: the Fair Scheduler has problems at high concurrency when underprovisioned and will time out.
  • NCCL is now built from source, which makes the first deployment slow for each CUDA version and Dataproc image combination the customer deploys. Which reminds me:
  • Tasks that used to take a long time in previous versions of this script now have their results cached, so subsequent deployments of the same configuration take significantly less time.
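
Concretely, NVIDIA's CUDA runfiles are named cuda_<cuda-version>_<driver-version>_linux.run, so both versions can be read straight out of the URL. A small sketch; the URL below is just an example:

```shell
# Sketch: extract CUDA and driver versions from a CUDA runfile URL.
# Example filename: cuda_12.4.1_550.54.15_linux.run
runfile_url="https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run"

fn="${runfile_url##*/}"            # strip the path, keep the filename
IFS=_ read -r _ CUDA_VERSION DRIVER_VERSION _ <<<"${fn}"

echo "cuda=${CUDA_VERSION} driver=${DRIVER_VERSION}"
# → cuda=12.4.1 driver=550.54.15
```

Per the notes above, the driver version named in the filename is not always the one that should actually be installed, so the extracted value would still be checked against the known-good mapping.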


@kuldeepkk-dev kuldeepkk-dev left a comment


LGTM.

@cjac I would recommend getting one more review from the product team since there are a lot of changes.

@cjac cjac requested review from bcheena and cnauroth January 24, 2025 23:29
@@ -70,6 +70,7 @@ determine_tests_to_run() {
changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue # to be removed before merge
Collaborator

this can be removed

Contributor Author

Thanks, yes. I will remove it once bcheena or cnauroth have had an opportunity to suggest changes, and those changes, if any, are implemented and tested.

@cjac commented Jan 28, 2025

I met with the customer to exercise the cluster, and they asked for one additional change. If the correct metadata key/value pair is passed, we now install PyTorch and register a new kernel for use with JupyterLab.

@cjac commented Jan 28, 2025

/gcbrun

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 29, 2025
@cjac commented Jan 29, 2025

/gcbrun


@cnauroth cnauroth left a comment


Thanks for this, @cjac! I entered a few comments. In general, can we also review the curl calls to make sure they have the necessary retries?


@cjac cjac left a comment


Thank you for the review, Chris and Prince. I will make these changes and resolve each.

@cjac commented Feb 2, 2025

/gcbrun

@cjac commented Feb 3, 2025

@cnauroth - I believe that I addressed all of your concerns in the previous change, eacb99f

Please let me know if you see anything that needs further attention.

@@ -134,13 +134,13 @@ readonly ROLE

# Minimum supported version for open kernel driver is 515.43.04
# https://github.com/NVIDIA/open-gpu-kernel-modules/tags
latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
#latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
Member

Remove this line if latest is no longer being used?

while gsutil ls "${gcs_tarball}.building" 2>&1 | grep -q "${gcs_tarball}.building" ; do
sleep 5m
done
if gsutil ls -j "${gcs_tarball}.building" > "${local_tarball}.building.json" ; then
Member

Is -j a recent new feature? I couldn't find this option in my gsutil.

Contributor Author

It dumps the object metadata in JSON format. I thought it would be --format json to match other gcloud commands, but I guess it's slightly different because of gsutil. The argument is documented in gcloud storage ls --help, which I assumed used the same interface as gsutil, but I don't see the argument in the gsutil ls --help documentation.

Contributor Author

As for whether it's new, I don't think it is. I looked through the release history [1], and it seems that JSON support has been included from the beginning, but I couldn't find the argument-parsing code where the option is detected on a quick review of [2].

[1] https://github.com/GoogleCloudPlatform/gsutil/blob/master/CHANGES.md
[2] https://github.com/GoogleCloudPlatform/gsutil
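
The .building object discussed here doubles as a coarse build lock: a worker that sees the sentinel waits for whichever node is already building. The flow can be sketched with local files standing in for GCS objects (in the real script, the test/touch/rm calls would be gsutil operations against ${gcs_tarball}.building):

```shell
# Sketch of the .building lock protocol, using local files in place of GCS
# objects so the flow is easy to follow.
tarball="$(mktemp -d)/nccl-build.tgz"

wait_for_other_builder() {
  # The real script sleeps 5m between gsutil polls; 1s keeps the sketch quick.
  while test -f "${tarball}.building" ; do
    sleep 1
  done
}

build_and_publish() {
  touch "${tarball}.building"                 # announce that a build is underway
  echo "expensive build output" > "${tarball}"
  rm -f "${tarball}.building"                 # release the lock
}

wait_for_other_builder
test -f "${tarball}" || build_and_publish
```

One race remains in this naive form: two workers can both find no sentinel and both start building, which is presumably what the "protect against race condition on removing the .building files" commit earlier in this thread addresses.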

* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version
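
The "same retry arguments in all calls to curl" change suggests a single shared definition of the retry flags. A sketch of that shape; the wrapper name and the specific flag values are illustrative, not necessarily what the script settled on:

```shell
# Sketch: define curl's retry behavior once and reuse it everywhere.
CURL_RETRY_ARGS=( -fsSL --retry 10 --retry-delay 5 --connect-timeout 10 )

curl_with_retries() {
  curl "${CURL_RETRY_ARGS[@]}" "$@"
}

# Example usage (URL is illustrative):
#   curl_with_retries -o driver.run \
#     "https://us.download.nvidia.com/XFree86/Linux-x86_64/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run"
```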
@cjac commented Feb 3, 2025

/gcbrun

Successfully merging this pull request may close these issues.

[gpu] versions installed by gpu/install_gpu_driver.sh do not match requested versions
6 participants