Bugfix in distributed GPU tests and Distributed set!
#3880
@@ -1,7 +1,7 @@
agents:
  queue: new-central
  slurm_mem: 8G
  modules: climacommon/2024_10_09
  modules: climacommon/2024_10_08

env:
  JULIA_LOAD_PATH: "${JULIA_LOAD_PATH}:${BUILDKITE_BUILD_CHECKOUT_PATH}/.buildkite/distributed"
@@ -16,60 +16,74 @@ steps:
    key: "init_central"
    env:
      TEST_GROUP: "init"
      GPU_TEST: "true"
    command:
      - echo "--- Instantiate project"
      - "julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
      - echo "--- Initialize tests"
      - "julia -O0 --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_gpus: 1
      slurm_cpus_per_task: 8
      slurm_mem: 8G
      slurm_ntasks: 1
      slurm_gpus_per_task: 1

  - wait

  - label: "🐉 cpu distributed unit tests"
    key: "distributed_cpu"
    env:
      TEST_GROUP: "distributed"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 8G
      slurm_ntasks: 4
    retry:
      automatic:
        - exit_status: 1
          limit: 1

  - label: "🐲 gpu distributed unit tests"
    key: "distributed_gpu"
    env:
      TEST_GROUP: "distributed"
      GPU_TEST: "true"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
Review thread on slurm_mem: 50G:
- we can probably reduce the memory usage of the tests, right? I think often a bigger grid is used than needed
- Right, unit tests do not require too much memory. I have seen that 32G was not enough for the regression tests on the GPU.
- they might be too big
      slurm_ntasks: 4
      slurm_gpus_per_task: 1
    retry:
      automatic:
        - exit_status: 1
          limit: 1


  - label: "🦾 cpu distributed solvers tests"
    key: "distributed_solvers_cpu"
    env:
      TEST_GROUP: "distributed_solvers"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
    retry:
      automatic:
        - exit_status: 1
          limit: 1

  - label: "🛸 gpu distributed solvers tests"
    key: "distributed_solvers_gpu"
    env:
      TEST_GROUP: "distributed_solvers"
      GPU_TEST: "true"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
      slurm_gpus_per_task: 1
    retry:
@@ -81,20 +95,27 @@ steps:
    key: "distributed_hydrostatic_model_cpu"
    env:
      TEST_GROUP: "distributed_hydrostatic_model"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
    retry:
      automatic:
        - exit_status: 1
          limit: 1

  - label: "🦏 gpu distributed hydrostatic model tests"
    key: "distributed_hydrostatic_model_gpu"
    env:
      TEST_GROUP: "distributed_hydrostatic_model"
      GPU_TEST: "true"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
      slurm_gpus_per_task: 1
    retry:
@@ -106,20 +127,27 @@ steps:
    key: "distributed_nonhydrostatic_regression_cpu"
    env:
      TEST_GROUP: "distributed_nonhydrostatic_regression"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
    retry:
      automatic:
        - exit_status: 1
          limit: 1

  - label: "🕺 gpu distributed nonhydrostatic regression"
    key: "distributed_nonhydrostatic_regression_gpu"
    env:
      TEST_GROUP: "distributed_nonhydrostatic_regression"
      GPU_TEST: "true"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      slurm_mem: 120G
      slurm_mem: 50G
      slurm_ntasks: 4
      slurm_gpus_per_task: 1
    retry:
Review thread on the reduced slurm_mem requests:
- why?
- 120G is much more than we need for those tests. After some frustration with tests that were extremely slow to start, I noticed that the agents started much more quickly when requesting a smaller memory amount. So I am deducing that the tests run on shared nodes instead of exclusive ones, and requesting lower resources allows us to squeeze in when the cluster is busy.
- good reason. might warrant a comment
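Following the suggestion above, one way to record this reasoning is an inline comment next to the memory request in the pipeline itself. The snippet below is a hypothetical sketch, not part of this PR's diff: the step layout and the 50G value are taken from the diff, but the comment wording is illustrative.

  # Hypothetical sketch: documenting the memory choice inline (not in this PR).
  - label: "🐲 gpu distributed unit tests"
    key: "distributed_gpu"
    env:
      TEST_GROUP: "distributed"
      GPU_TEST: "true"
      MPI_TEST: "true"
    commands:
      - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
    agents:
      # Keep this request modest: these jobs appear to run on shared nodes,
      # so smaller memory requests let agents start sooner when the cluster
      # is busy. Unit tests need little memory; GPU regression tests have
      # needed more than 32G, so size those groups separately.
      slurm_mem: 50G
      slurm_ntasks: 4
      slurm_gpus_per_task: 1
    retry:
      automatic:
        - exit_status: 1
          limit: 1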