Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #776 #777

Merged
merged 1 commit into from
Apr 16, 2024
Merged

Fixed #776 #777

merged 1 commit into from
Apr 16, 2024

Conversation

hfp
Copy link
Member

@hfp hfp commented Apr 4, 2024

  • Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
  • Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.

@hfp
Copy link
Member Author

hfp commented Apr 4, 2024

Issue #776 was discovered when testing with enabled assertions, i.e., DBCSR's CUDA tests may have assertions removed. Perhaps it is valuable to test with enabled assertions.

@alazzaro
Copy link
Member

alazzaro commented Apr 4, 2024

@hfp for my understanding:

  • We don't see the issue since we NDEBUG is defined? But we really don't define it in CP2K tests...
  • We really never call printf on the GPU, do we?
  • I don't understand the relation between NDEBUG and ACC_API_CALL

In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...

@alazzaro
Copy link
Member

alazzaro commented Apr 4, 2024

(BTW, trying to recover the Daint-CI output...)

@alazzaro
Copy link
Member

alazzaro commented Apr 4, 2024

CSCS CI seems broken on their side:

+ sbatch --wait --time=0:15:00 --account=g90 --job-name=DBCSR.gnu.build --output=sbatch.jenkins-g90-DBCSR-1116.gnu.build.out .ci/daint.cscs.ch/gnu.build.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
java.nio.file.NoSuchFileException: /users/jenkg90/workspace/g90/DBCSR/sbatch.jenkins-g90-DBCSR-1116.gnu.build.out

but we have budget... Please ignore it for the moment.

@mkrack
Copy link
Member

mkrack commented Apr 5, 2024

The cp2k regression tests on Piz Daint are also disable, because the project g90 has expired. sbatch returns

project "g90" expired on 2024-03-31

@juerghutter could you have a look?

@juerghutter
Copy link
Contributor

juerghutter commented Apr 5, 2024 via email

@hfp
Copy link
Member Author

hfp commented Apr 10, 2024

@hfp for my understanding:

* We don't see the issue since we NDEBUG is defined? But we really don't define it in CP2K tests...

Ok, I just assumed this because the issue came up when I removed NDEBUG. I will check grep through CP2K just to make sure there is nothing else.

* We really never call printf on the GPU, do we?

No, we don't. Perhaps someone did so during development and wanted to keep this setting.

* I don't understand the relation between NDEBUG and [ACC_API_CALL](https://github.com/cp2k/dbcsr/blob/6db5b28d236de28e7293f783a3c2cc672d202f6b/src/acc/cuda/acc_cuda.h#L29)

ACK; see above (me neither ;-).

In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...

OK, this is good to go in principle. However, I will move the call into the init function.

@hfp
Copy link
Member Author

hfp commented Apr 16, 2024

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

* Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
* Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
@hfp
Copy link
Member Author

hfp commented Apr 16, 2024

I rebased the PR and if it's green (let's hope for Daint-CI), I will merge it. Removing (or moving) the code in question might be another PR.

@alazzaro
Copy link
Member

ACK.

@hfp hfp merged commit 31ddf41 into cp2k:develop Apr 16, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants