
Random mass test failures in PR builds due to failure to load libcuda.so starting at least by 2024-11-20 #13730

Open
bartlettroscoe opened this issue Jan 16, 2025 · 3 comments
Labels
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Framework - Issues that fall under the Trilinos Framework Product Area
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

CC: @trilinos/framework, @sebrowne, @achauphan

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), 11144 tests failed, spanning 2839 unique test names, all in the single unique GenConfig build:

  • rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables

started failing on testing day 2024-11-20.

The specific set of CDash builds impacted were:

  • PR-13622-test-rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-859
  • PR-13622-test-rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-865
  • PR-13715-test-rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1039
  • PR-13715-test-rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1041

The most recent failures for the latter PR, #13715, were from 2025-01-10.

The failures looked like:

/scratch/trilinos/workspace/PR_cuda-uvm/pull_request_test/packages/adelus/test/vector_random_fs/Adelus_vector_random_fs.exe: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[52251,1],0]
  Exit code:    127
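
For reference, a quick way to check whether a given node can load the CUDA driver library at all (a general diagnostic sketch, not part of the PR testing scripts) is:

  # libcuda.so.1 is installed by the NVIDIA driver, so it is normally
  # missing on nodes that do not have a GPU driver installed:
  $ ldconfig -p | grep libcuda.so

  # Or inspect the failing executable's shared-library dependencies directly
  # (path taken from the log output above):
  $ ldd /scratch/trilinos/workspace/PR_cuda-uvm/pull_request_test/packages/adelus/test/vector_random_fs/Adelus_vector_random_fs.exe | grep 'not found'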

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.
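
As an illustration only (the exact GenConfig reproduction commands are not shown here), once a matching build has been configured with Trilinos_ENABLE_TESTS=ON, a single failing test can be rerun with ctest; the test name below is assumed from the executable name in the log above:

  # Rerun just the suspect test and print its output on failure:
  $ ctest -R Adelus_vector_random_fs --output-on-failure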

@bartlettroscoe added the impacting: tests, PA: Framework, and type: bug labels on Jan 16, 2025
@bartlettroscoe
Member Author

FYI: I created this issue for these failures since they got lumped in with the initial query in #13728. I don't know if this is a major problem because it has only impacted two PRs in all of the recorded PR history (which currently goes back to 2024-09-16).

NOTE: I created this issue using:

$ time ../Trilinos/commonTools/framework/github_issue_creator/create_trilinos_github_test_failure_issue_driver.sh -u "https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=status&compare2=61&value2=Failed&field3=buildstarttime&compare3=84&value3=now&field4=testoutput&compare4=97&value4=error%20while%20loading%20shared%20libraries.*libcuda.so"
tribitsDir = '/home/rabartl/Trilinos.base/Trilinos/cmake/tribits'

***
*** Getting data to create a new issue tracker
***

Downloading full list of nonpassing tests from CDash URL:

   https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=status&compare2=61&value2=Failed&field3=buildstarttime&compare3=84&value3=now&field4=testoutput&compare4=97&value4=error%20while%20loading%20shared%20libraries.*libcuda.so

  Downloading CDash data from:
    https://trilinos-cdash.sandia.gov/api/v1/queryTests.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=status&compare2=61&value2=Failed&field3=buildstarttime&compare3=84&value3=now&field4=testoutput&compare4=97&value4=error%20while%20loading%20shared%20libraries.*libcuda.so

Total number of nonpassing tests over all days = 11144

Total number of unique nonpassing test/build pairs over all days = 11144

Number of test names = 2839

Number of build names = 4

Writing out new issue tracker text to 'newGithubMarkdownIssueBody.md'

real    0m12.165s
user    0m0.125s
sys     0m0.171s

real    0m12.527s
user    0m0.140s
sys     0m0.217s

@bartlettroscoe
Member Author

FYI: This query shows that if you filter out cases where people tried to use CUDA where it was not supported, the only failing tests that mention libcuda.so are those same 11144 tests that failed to load libcuda.so, as shown in the above query.

@achauphan
Contributor

After looking through the two PRs (one of them mine), I do not think this is a random failure.

Both of those PRs contained work around the configuration rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables, in which Trilinos_ENABLE_TESTS=ON was accidentally enabled or overridden at some point in the configuration hierarchy (in the case of #13715, it was enabled on purpose, but that PR also implemented the ability to not run the tests). During PR testing, this specific configuration runs on non-GPU machines, which likely explains why all of the errors could not find libcuda.so.

These are the results of us making changes to the CI system and having them test themselves.
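
As a minimal sketch of the kind of guard that would avoid this (purely illustrative, not the actual change made in those PRs; EXTRA_CONFIGURE_ARGS is a hypothetical variable), a driver script could refuse to enable tests for this CUDA configuration unless the node actually has a usable GPU driver:

  # Hypothetical guard: only enable tests when the CUDA driver library is present.
  if ldconfig -p | grep -q 'libcuda\.so\.1'; then
    EXTRA_CONFIGURE_ARGS+=" -DTrilinos_ENABLE_TESTS=ON"
  else
    EXTRA_CONFIGURE_ARGS+=" -DTrilinos_ENABLE_TESTS=OFF"
  fi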
