
TeuchosCore_RCP_PerformanceTests_basic_MPI_1 randomly failing in Trilinos_pullrequest_gcc_7.2.0_debug builds #8648

Closed
bartlettroscoe opened this issue Jan 27, 2021 · 4 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Teuchos Issues primarily dealing with the Teuchos Package type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

CC: @trilinos/teuchos, @trilinos/framework

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the test:

  • TeuchosCore_RCP_PerformanceTests_basic_MPI_1

in the build:

  • Trilinos_pullrequest_gcc_7.2.0_debug

is randomly failing with history (and the failures shown in the "Matching Output" column):

| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors | Matching Output |
|---|---|---|---|---|---|---|---|---|---|
| ascic166 | PR-8644-test-Trilinos_pullrequest_gcc_7.2.0_debug-1488 | TeuchosCore_RCP_PerformanceTests_basic_MPI_1 | Failed | 300ms | 300ms | Completed (Failed) | 2021-01-27T02:32:06 MST | 1 | 1.506383e+01 1.374757e+00 finalRcpRawRatio = 13.75 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0_d |
| ascic158 | PR-8643-test-Trilinos_pullrequest_gcc_7.2.0_debug-1487 | TeuchosCore_RCP_PerformanceTests_basic_MPI_1 | Failed | 420ms | 420ms | Completed (Failed) | 2021-01-27T00:31:21 MST | 1 | 4.178000e+01 4.120316e+00 finalRcpRawRatio = 13.75 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic158/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0_d |
| ascic158 | PR-8287-test-Trilinos_pullrequest_gcc_7.2.0_debug-792 | TeuchosCore_RCP_PerformanceTests_basic_MPI_1 | Failed | 330ms | 330ms | Completed (Failed) | 2020-10-30T13:47:25 MDT | 1 | 08 1.444000e+01 1.354597e+00 finalRcpRawRatio = 14 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic158/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0_d |
| ascic166 | PR-8142-test-Trilinos_pullrequest_gcc_7.2.0_debug-602 | TeuchosCore_RCP_PerformanceTests_basic_MPI_1 | Failed | 320ms | 320ms | Completed (Failed) | 2020-10-08T01:18:54 MDT | 1 | 1.434000e+01 1.335196e+00 finalRcpRawRatio = 14.2308 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0_d |

As shown in the "Matching Output" column in the above table, when the test fails, it fails a check like:

 finalRcpRawRatio = 13.75 <= maxRcpRawObjAccessRatio = 13.5 : FAILED ==> /scratch/trilinos/jenkins/ascic166/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0_debug/Trilinos/packages/teuchos/core/test/MemoryManagement/RCP_Performance_UnitTests.cpp:705
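
For readers unfamiliar with this kind of check, here is a minimal sketch of an upper-bound comparison that produces a similar report line. It is not the actual code at RCP_Performance_UnitTests.cpp:705; the `RatioCheck` struct and `checkRatioAtMost` function are hypothetical names introduced only for illustration, and the real test may compute and report the ratio differently.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of Teuchos): result of an upper-bound check
// on a measured timing ratio, plus a report line resembling the log output.
struct RatioCheck {
  bool passed;
  std::string report;
};

// Passes when the measured ratio does not exceed the allowed maximum.
inline RatioCheck checkRatioAtMost(double finalRcpRawRatio,
                                   double maxRcpRawObjAccessRatio) {
  const bool passed = finalRcpRawRatio <= maxRcpRawObjAccessRatio;
  std::ostringstream os;
  os << "finalRcpRawRatio = " << finalRcpRawRatio
     << " <= maxRcpRawObjAccessRatio = " << maxRcpRawObjAccessRatio
     << " : " << (passed ? "passed" : "FAILED");
  return {passed, os.str()};
}
```

With the values from the failing run, `checkRatioAtMost(13.75, 13.5)` reports a failure because the measured ratio is slightly above the limit.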

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Teuchos Issues primarily dealing with the Teuchos Package impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Data Services Issues that fall under the Trilinos Data Services Product Area labels Jan 27, 2021
@bartlettroscoe
Member Author

It seems pretty clear how to fix this. For debug builds, if we increase maxRcpRawObjAccessRatio from 13.5 to, say, 20, this error should go away. Note that this occurs in just one build, Trilinos_pullrequest_gcc_7.2.0_debug, and as shown in this query it occurred only 4 times in 125 runs of this test going back to 2020-09-28. Still, that is roughly a 1 in 30 chance of failing (4/125 = 3.2%), which is too high. So this needs to be fixed.

Since I wrote this test, I will post a PR that will fix this.
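
The arithmetic behind the proposed fix can be sketched as follows. This is only an illustration under stated assumptions: the ratios are copied from the "Matching Output" column of the four failing runs above, and the helper names (`observedFailingRatios`, `countAbove`, `failureFraction`) are hypothetical and do not exist in Trilinos.

```cpp
#include <cstddef>
#include <vector>

// Ratios copied from the "Matching Output" column of the four failing runs.
inline std::vector<double> observedFailingRatios() {
  return {13.75, 13.75, 14.0, 14.2308};
}

// Number of measured ratios that would exceed a given limit (i.e. fail).
inline std::size_t countAbove(const std::vector<double>& ratios, double limit) {
  std::size_t n = 0;
  for (double r : ratios) {
    if (r > limit) ++n;
  }
  return n;
}

// Observed failure fraction: failures divided by total runs.
inline double failureFraction(std::size_t failures, std::size_t runs) {
  return static_cast<double>(failures) / static_cast<double>(runs);
}
```

All four observed ratios exceed 13.5 and none exceeds 20, so `countAbove(observedFailingRatios(), 20.0)` is zero. `failureFraction(4, 125)` gives 0.032, or about 1 failure in 31 runs. This only shows that a limit of 20 would have covered the failures observed so far; it says nothing about larger outliers on a more heavily loaded machine.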

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Mar 16, 2022
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Apr 16, 2022
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jan 16, 2025
…8648, trilinos#13728)

That should be high enough to avoid every random failure of this check ever
observed in Trilinos PR testing.

It is debatable if a test such as this should be run in all builds or in just
dedicated performance builds.  (The default timing ratios are very loose.)  We
just want to make sure these tests are not broken in every build so that this
test will be able to run in performance builds.

Signed-off-by: Roscoe A. Bartlett <[email protected]>
@bartlettroscoe
Member Author

See new issue #13728.

trilinos-autotester added a commit that referenced this issue Jan 16, 2025
…-default-maxRcpRawObjAccessRatio

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: b'Increase default maxRcpRawObjAccessRatio from 13.5 to 20.0 (#8648, #13728)'
PR Author: bartlettroscoe