[gpu] strict driver and cuda version assignment #1275
base: master
Conversation
/gcbrun
Tests took 43 minutes to run this time. I think it's because the binary driver build cache was invalidated. Let's see if this one takes more like 14 minutes.
The intention is for this PR to resolve issue #1268.
cloudbuild/presubmit.sh
@@ -70,6 +70,7 @@ determine_tests_to_run() {
      changed_dir="${changed_dir%%/*}/"
      # Run all tests if common directories modified
      if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
        continue
must remove this before merging
Slightly longer than the 14 minutes I predicted: this run completed in 16:29.
/gcbrun
okay, taking a little break...
/gcbrun |
…out requirements for ramdisk ; roll back spark.SQLPlugin change
/gcbrun
Capacity Scheduler does not support GPU resources on Dataproc 2.0, so in order to use Spark with YARN we fall back to the Fair Scheduler. I know there are concurrency concerns at high load, but that concern can be managed, and there is no other option on 2.0 clusters.
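A minimal sketch of that fallback, assuming the standard yarn-site.xml property name; the config path, and the sed-based property insert, are illustrative (a real init action would more likely use bdconfig or xmlstarlet):

```shell
#!/usr/bin/env bash
# Sketch: point YARN at the Fair Scheduler so GPU resources can be scheduled
# on Dataproc 2.0. YARN_SITE defaults to a temp file here so the sketch is
# runnable anywhere; on a cluster it would be /etc/hadoop/conf/yarn-site.xml.
set -euo pipefail

YARN_SITE="${YARN_SITE:-$(mktemp)}"
[[ -s "${YARN_SITE}" ]] || printf '<configuration>\n</configuration>\n' > "${YARN_SITE}"

FAIR_SCHEDULER='org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'

set_yarn_property() {
  local name="$1" value="$2"
  # Naive insert before the closing tag; illustrative only.
  sed -i "s|</configuration>|  <property>\n    <name>${name}</name>\n    <value>${value}</value>\n  </property>\n</configuration>|" "${YARN_SITE}"
}

set_yarn_property 'yarn.resourcemanager.scheduler.class' "${FAIR_SCHEDULER}"
```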
* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config
/gcbrun
LGTM.
@cjac I would recommend getting one more review from the product team since there are a lot of changes.
@@ -70,6 +70,7 @@ determine_tests_to_run() {
      changed_dir="${changed_dir%%/*}/"
      # Run all tests if common directories modified
      if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
        continue # to be removed before merge
this can be removed
Thanks, yes. I will remove it once bcheena or cnaurath have had an opportunity to suggest changes, and those changes, if any, are implemented and tested.
I met with the customer to exercise the cluster, and they asked for one additional change. If the correct metadata key/value pair is passed, we now install PyTorch and register a new kernel for use with JupyterLab.
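A hedged sketch of that metadata gate follows. The key name `install-pytorch`, the venv path, and the stubbed metadata lookup are all illustrative assumptions, not the PR's actual names; a real init action would query the GCE metadata server:

```shell
#!/usr/bin/env bash
# Sketch: install PyTorch and register a JupyterLab kernel only when a
# metadata attribute asks for it. The metadata lookup is stubbed with an
# environment variable so the sketch runs outside GCE.
set -euo pipefail

get_metadata_attribute() {
  local name="$1" default="${2:-}"
  # On a real node: /usr/share/google/get_metadata_value "attributes/${name}"
  local var="METADATA_${name//-/_}"
  echo "${!var:-${default}}"
}

if [[ "$(get_metadata_attribute install-pytorch false)" == "true" ]]; then
  # Commands echoed rather than run, to keep the sketch side-effect free.
  echo "would run: python3 -m venv /opt/pytorch"
  echo "would run: /opt/pytorch/bin/pip install torch ipykernel"
  echo "would run: /opt/pytorch/bin/python -m ipykernel install --name pytorch"
fi
```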
/gcbrun
Thanks for this, @cjac ! I entered a few comments. In general, can we also review the curl calls to make sure they have the necessary retries?
Thank you for the review, Chris and Prince. I will make these changes and resolve each.
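One way to address the retry question above is a single wrapper so every download shares one retry policy. This is a sketch, not the flag set the PR adopted; the flags are standard curl options but the specific counts and timeouts are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: shared retry policy for every curl call in the init action, so no
# call site repeats (or forgets) the flags. Values are illustrative.
set -euo pipefail

CURL_RETRY_ARGS=( -fsSL --retry 10 --retry-delay 5 --connect-timeout 10 )

fetch() {
  curl "${CURL_RETRY_ARGS[@]}" "$@"
}
```

Call sites then use `fetch "$url" -o out.file` instead of hand-rolled flag lists.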
/gcbrun |
@@ -134,13 +134,13 @@ readonly ROLE

# Minimum supported version for open kernel driver is 515.43.04
# https://github.com/NVIDIA/open-gpu-kernel-modules/tags
latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
#latest="$(curl -s https://us.download.nvidia.com/XFree86/Linux-x86_64/latest.txt | awk '{print $1}')"
Remove this line if latest is no longer being used?
while gsutil ls "${gcs_tarball}.building" 2>&1 | grep -q "${gcs_tarball}.building" ; do
  sleep 5m
done
if gsutil ls -j "${gcs_tarball}.building" > "${local_tarball}.building.json" ; then
Is -j a recent new feature? I couldn't find this option in my gsutil.
It dumps the object metadata in JSON format; I thought it would be --format json to match other gcloud commands, but I guess it's slightly different because of gsutil. The argument is documented in gcloud storage ls --help, which I assumed used the same ABI as gsutil, but I don't see the argument in the gcloud ls --help documentation.
As for whether it's new, I don't think it is. I looked through the release history [1], and it seems that JSON support has been included from the beginning. But I couldn't find the argument-parsing code where the option is detected on a quick review of [2].
[1] https://github.com/GoogleCloudPlatform/gsutil/blob/master/CHANGES.md
[2] https://github.com/GoogleCloudPlatform/gsutil
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to a line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version
/gcbrun |
Resolves Issues
gpu/install_gpu_driver.sh
Driver version defaults to version in driver .run file if specified
CUDA version defaults to version in cuda .run file if specified
exclusively using .run file installation method for cuda and driver installation
build nccl from source, since that is the only mechanism which supports all Dataproc OSs
wrap expensive functions in completion checks to reduce re-run time when testing manually
cache build results in GCS
waiting on apt lock when it exists
only install build dependencies if build is necessary
fix problem with ops agent not installing by using a venv
Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
mark task completion by creating a file rather than setting a variable
added functions to check and report secure-boot and os version details
if the correct metadata attributes are specified, pytorch will be installed, but not by default
gpu/manual-test-runner.sh
gpu/test_gpu.py
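The completion-check pattern described under gpu/install_gpu_driver.sh (wrap expensive functions in completion checks; mark completion by creating a file rather than setting a variable) can be sketched as follows. Function names and the marker directory are illustrative, not the PR's actual identifiers:

```shell
#!/usr/bin/env bash
# Sketch: expensive steps record success by touching a marker file, so a
# manual re-run of the script skips work that already finished.
set -euo pipefail

MARKER_DIR="${MARKER_DIR:-$(mktemp -d)}"

is_complete()   { [[ -f "${MARKER_DIR}/$1.complete" ]]; }
mark_complete() { touch "${MARKER_DIR}/$1.complete"; }

run_once() {
  local step="$1"; shift
  if is_complete "${step}"; then
    echo "skipping ${step}"
    return 0
  fi
  "$@" && mark_complete "${step}"
}

expensive_build() { echo "building driver"; }

run_once build_driver expensive_build   # runs the build
run_once build_driver expensive_build   # skipped on the second call
```

Because the marker lives on disk rather than in a shell variable, the skip survives across separate invocations of the script, which is what makes manual re-testing cheap.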