Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BEAMS3D not producing correct results with PPPL GCC #109

Closed
zhucaoxiang opened this issue Mar 16, 2021 · 19 comments · Fixed by #79
Closed

BEAMS3D not producing correct results with PPPL GCC #109

zhucaoxiang opened this issue Mar 16, 2021 · 19 comments · Fixed by #79
Assignees
Labels
bug Something isn't working

Comments

@zhucaoxiang
Copy link
Collaborator

This is for tracking a known issue reported by @pomphrey. Using the GCC version (gcc/9.3) on PPPL, the BEAMS3D executable is not able to produce correct results.

Here is a simple example. Running the input.ORBITS case in BENCHMARKS/BEAMS3D with -N 1 -n 16. The GCC version somehow gives non-sense results. The Intel version gives close results (though not exactly the same).

@lazersos Do you have any idea what causes this issue? You can find the makefile at https://github.com/PrincetonUniversity/STELLOPT/blob/develop/SHARE/make_pppl_gcc.inc

image

@zhucaoxiang zhucaoxiang added the bug Something isn't working label Mar 16, 2021
@zhucaoxiang
Copy link
Collaborator Author

Looks like the GCC one is not calling "Trajectory Calculation".

Here is the screen output for the GCC executable.

(base) czhu@sunfire03:BEAMS3D_TEST$ debug -N 1 -n 16 xbeams3d -vmec ORBITS -plasma
BEAMS3D Version  2.70
-----  HDF5 Parameters  -----
   HDF5_version:   1.12 release: 00
-----  MPI Parameters  -----
   MPI_version:   3.01
   Open MPI v4.0.3, package: Open MPI [email protected] Distribution, ident: 4.0.3, repo rev: v4.0.3, Mar 03, 2020
   Nproc_total:        16
   Nproc_shared:       16
----- Input Parameters -----
   FILE: input.ORBITS
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   # of Particles to Start:       40
   MAGNETIC FIELD FROM PLASMA ONLY!
----- Profile Parameters -----
   Zeff = [  1.00000,  1.00000];  NZEFF:    6
   PLASMA_MASS =    1.00728 amu
   PLASMA_ZAVG =    1.00000 <Z>
   PLASMA_ZMEAN =    1.00000 [Z]
----- VMEC Information -----
   FILE: ORBITS
   R       = [  9.00000, 11.00000]
   Z       = [-0.95106, 0.95106]
   BETA    =   0.000;  I  =  -0.636 [MA]
   AMINOR  =   1.000 [m]
   PHIEDGE =  15.700 [Wb]
   VOLUME  = 197.392 [m^3]
 -----  Vessel Information  -----
   Wall Name :  HARMONICS
   Date      :  TODAY
   Faces     :   28800

----- Constructing Splines -----
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   HERMITE FORM: 1
----- FOLLOWING PARTICLE TRAJECTORIES -----
      Method: LSODE
   Particles:        40
       Steps:     32511   Delta-t:  0.3076E-07
      NPOINC:      1000    dt_out:  0.1000E-05
         Tol:  0.1000E-07  Type: 10

----- BEAM DIAGNOSTICS -----
 BEAMLINE     ENERGY [keV]   CHARGE [e]   MASS [Mp]   Particles [#]   Lost [%]  Shinethrough [%]  Port [%]
    1             0             0             0              40         40.0          0.0          0.0

And from the Intel executable.

(base) czhu@sunfire03:BEAMS3D_TEST$ debug -N 1 -n 16 xbeams3d -vmec ORBITS -plasma
BEAMS3D Version  2.70
-----  HDF5 Parameters  -----
   HDF5_version:   1.10 release: 05
-----  MPI Parameters  -----
   MPI_version:   3.01
   Open MPI v4.0.3, package: Open MPI [email protected] Distribution, ident: 4.0.3, repo rev: v4.0.3, Mar 03, 2020
   Nproc_total:        16
   Nproc_shared:       16
----- Input Parameters -----
   FILE: input.ORBITS
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   # of Particles to Start:       40
   MAGNETIC FIELD FROM PLASMA ONLY!
----- Profile Parameters -----
   Zeff = [  1.00000,  1.00000];  NZEFF:    6
   PLASMA_MASS =    1.00728 amu
   PLASMA_ZAVG =    1.00000 <Z>
   PLASMA_ZMEAN =    1.00000 [Z]
----- VMEC Information -----
   FILE: ORBITS
   R       = [  9.00000, 11.00000]
   Z       = [-0.95106, 0.95106]
   BETA    =   0.000;  I  =  -0.636 [MA]
   AMINOR  =   1.000 [m]
   PHIEDGE =  15.700 [Wb]
   VOLUME  = 197.392 [m^3]
 -----  Vessel Information  -----
   Wall Name :  HARMONICS
   Date      :  TODAY
   Faces     :   28800
     Plasma Field Calculation [100]%
----- Constructing Splines -----
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   HERMITE FORM: 1
----- FOLLOWING PARTICLE TRAJECTORIES -----
      Method: LSODE
   Particles:        40
       Steps:     32511   Delta-t:  0.3076E-07
      NPOINC:      1000    dt_out:  0.1000E-05
         Tol:  0.1000E-07  Type: 10
     Trajectory Calculation [  0]%
----- BEAM DIAGNOSTICS -----
 BEAMLINE     ENERGY [keV]   CHARGE [e]   MASS [Mp]   Particles [#]   Lost [%]  Shinethrough [%]  Port [%]
    1             0             0           ***              40          0.0          0.0          0.0
----- BEAMS3D DONE -----

@lazersos
Copy link
Collaborator

@zhucaoxiang This is an odd behavior. For reference I usually debug on my Mac with GCC and OpenMPI. For production runs I'm usually on the MPCDF machines (or Cineca) which compile with Intel. What branch was @pomphrey working on?

@zhucaoxiang
Copy link
Collaborator Author

@lazersos The develop branch.

@pomphrey
Copy link

@lazersos :
/p/stellopt/SOURCES/stellopt_develop_gcc/bin/xbeams3d -plasma -vmec RUN12B.00515
---trying to reproduce an alpha loss calculation that Matt had previously run, using identical input and wout files

@lazersos
Copy link
Collaborator

Run the BENCHMARKS with make test_beams3d in the BENCHMARKS directory and see if they pass. If they don't then there must be an issue with the compiler options. Do you get this for non-debug GCC runs?

@lazersos
Copy link
Collaborator

I can confirm this behavior for GCC on Mac when using the debugging options in the make_pppl_gnu.inc makefile. In fact here I don't even get as far.

lazerson@Lazersons-MacBook-Air BENCHMARKS % make test_beams3d
Beginning testing of BEAMS3D
/Applications/Xcode.app/Contents/Developer/usr/bin/make test_beams3d_basic
mpiexec -np 2 /Users/lazerson/bin/xbeams3d -vmec ORBITS -plasma
BEAMS3D Version  2.95
-----  HDF5 Parameters  -----
   HDF5_version:   1.12 release: 00
-----  MPI Parameters  -----
   MPI_version:   3.01
   Open MPI v4.1.0, package: Open MPI [email protected] Distribution, ident: 4.1.0, repo rev: v4.1.0, Dec 18, 2020
   Nproc_total:         2
   Nproc_shared:        2
----- Input Parameters -----
   FILE: input.ORBITS
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   # of Particles to Start:       40
   MAGNETIC FIELD FROM PLASMA ONLY!
----- Profile Parameters -----
   Zeff = [  1.00000,  1.00000];  NZEFF:    6
   PLASMA_MASS =    1.00728 amu
   PLASMA_ZAVG =    1.00000 <Z>
   PLASMA_ZMEAN =    1.00000 [Z]

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x107455d3e
#1  0x10745549a
#2  0x7fff7006a5fc
#3  0x1036e1feb
#4  0x10372f808
#5  0x10372fa34
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node Lazersons-MacBook-Air exited on signal 8 (Floating point exception: 8).
--------------------------------------------------------------------------
make[1]: *** [test_beams3d_basic] Error 136
make: *** [test_beams3d] Error 2

I can confirm that the issue seems to stem from the compiler option -ffpe-trap=zero. I suspect this is because there are times where division by zero can happen. When this can happen we explicitly deal with the situation after the fact. However, I need to look into this in more detail.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos Good find! So there might be a "division by zero" issue. The GCC compiler with extra debug flags catches it while the release flags just return unphysical results without errors. For Intel, there are some hidden default flags that make it work normally.

@lazersos lazersos linked a pull request Mar 18, 2021 that will close this issue
@lazersos
Copy link
Collaborator

The issue came from a grid helper variable that was declared to be n sized but should really have only been n-1 sized. As a result the nth position was never initialized. A second helper variable was also created with the same problem, only it's value was set to the inverse of the other variable. As a result in the nth position we got a divide by Nan. Now this never cause a direct issue in the code since we never referenced the array beyond the n-1 position. However the compiler flags for debug threw an error at that step.

For most situations this never caused an issue because the array element which was Nan was never reference by the code. Except in the beams3d_init_fusion.f90 routine. This has now been addressed. The fix is in pull request #79. @zhucaoxiang could you approve the merge it's been sitting there a while.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos The new commits fix the error, but still on the PPPL cluster, neither using GCC nor Intel will give correct numbers when running the input.ORBITS case.

@lazersos
Copy link
Collaborator

@zhucaoxiang I can't replicate the error on Macports, MPCDF, or Cincea, so I can't fix that. The Intel results you posted look pretty close to correct. The GCC results look totally wrong, but a cursory glance at the make_pppl_gcc.inc doesn't seem to indicate anything that should cause a problem with a release build.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos This doesn't make sense. I can reproduce your results on Eddy with the Intel compiler (all True). On PPPL, the Intel version only gives about half Trues. The GCC version just returns ridiculous numbers. I will test on my laptop and also over the GitHub server.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos @pomphrey test_beams3d_basic passed on Eddy and Stellar with the Intel compiler. On my laptop with GCC-10, it was half passed.

mpirun -np 2 /Users/caoxiang/bin/xbeams3d -vmec ORBITS -plasma
BEAMS3D Version  2.95
-----  HDF5 Parameters  -----
   HDF5_version:   1.12 release: 00
-----  MPI Parameters  -----
   MPI_version:   3.01
   Open MPI v4.0.5, package: Open MPI brew@Catalina Distribution, ident: 4.0.5, repo rev: v4.0.5, Aug 26, 2020
   Nproc_total:         2
   Nproc_shared:        2
----- Input Parameters -----
   FILE: input.ORBITS
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   # of Particles to Start:       40
   MAGNETIC FIELD FROM PLASMA ONLY!
----- Profile Parameters -----
   Zeff = [  1.00000,  1.00000];  NZEFF:    6
   PLASMA_MASS =    1.00728 amu
   PLASMA_ZAVG =    1.00000 <Z>
   PLASMA_ZMEAN =    1.00000 [Z]
----- VMEC Information -----
   FILE: ORBITS
   R       = [  9.00000, 11.00000]
   Z       = [-0.95106, 0.95106]
   BETA    =    0.00;  I  =   -0.64 [MA]
   AMINOR  =    1.00 [m]
   PHIEDGE =   15.70 [Wb]
   VOLUME  =  197.39 [m^3]

----- Constructing Splines -----
   R   = [  9.00000, 11.00000];  NR:    128
   PHI = [ 0.00000, 6.28319];  NPHI:   32
   Z   = [-1.00000, 1.00000];  NZ:    128
   HERMITE FORM: 1
 -----  Vessel Information  -----
   Wall Name :  HARMONICS
   Date      :  TODAY
   Faces     :   43200
   R_WALL   = [  9.00000, 11.00000]
   Z_WALL   = [ -1.00000,  1.00000]
----- FOLLOWING PARTICLE TRAJECTORIES -----
      Method: LSODE
   Particles:        40
       Steps:     33000   Delta-t:   30.303E-09
      NPOINC:      1000    dt_out:    1.000E-06
         Tol:   10.000E-09  Type: 10

----- BEAM DIAGNOSTICS -----
 BEAMLINE     ENERGY [keV]   CHARGE [e]   MASS [Mp]   Particles [#]   Lost [%]  Shinethrough [%]  Port [%]
    1             0             1             1              40          5.0          0.0          0.0
----- BEAMS3D DONE -----
BEAMS3D VERSION: 2.950000047683716
==== Vectors ====
  turning: 0.12277330387554908   0.12275448   105%
  delta: 0.009462204370692617   0.00946046   1778%
  R0: 10.5   10.5   0%
  R1: 10.499992712044104   10.499993   0%
  R100: 10.499753912830686   10.499754   68%
  R500: 10.503794778665622   10.503795   55%
=================
  STATUS: FAIL!!!!!

@lazersos
Copy link
Collaborator

@zhucaoxiang OK so this looks like the issue may just be a roundoff error. The code is showing that the particle positions agree well after 1 time step but for some particles (probably deeply trapped) after 100 timesteps the position isn't correct. There are a few things which could be the issue.

Can you upload the beams3d HDF5 file this run produced so I can take a look at it? This could be related to #110.

@lazersos
Copy link
Collaborator

lazersos commented Mar 23, 2021

@zhucaoxiang I suspect the issue is the -finit-real=snan option. It seems to screw up vmec_utils.f even when I run with the -O2 flag the code runs significantly slower when calling the routines from this code. It could be the magnetic field lookup is being messed up by said flag.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos But both on my laptop and the PPPL cluster, I was using the "release" mode. -finit-real=snan is not activated.
Attached you can find the two HDF5 files (one from my laptop with GCC-10 and the other from PPPL portal).

BTW, I am surprised that two 50-Mb-files can be compressed to 6 Mb.

beams3d.zip

@lazersos
Copy link
Collaborator

@zhucaoxiang Yeah it's what I suspected. The magnetic field grid isn't being initialized correctly:
Plot of B_R
Something odd if happening here. What I can confirm from my end is that the tests work fine for Macports with

FLAGS_R = -O2 -g -fexternal-blas -fbacktrace -fcheck=all

The code also runs fine on Draco, Cobra, Raven, and Marconi.

The file size is probably small because the grids have quite a bit of redundancy for this axisymmetric case.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos On the PPPL cluster, I was using FLAGS_R = -O2 -fexternal-blas and on my laptop with "homebrew", I am using FLAGS_R = -O2 -fallow-argument-mismatch.

@zhucaoxiang
Copy link
Collaborator Author

@lazersos Source of the discrepancy on my laptop is found. I changed the compiler flag from FLAGS_R = -O2 -fallow-argument-mismatch to FLAGS_R = -O2 -fallow-argument-mismatch -g -fbacktrace -fcheck=all. Now I got all the numbers matched.

I am really surprised that -g -fbacktrace -fcheck=all makes a difference. In my understanding, they are either for debugging or for checking.

----- BEAMS3D DONE -----
BEAMS3D VERSION: 2.950000047683716
==== Vectors ====
  turning: 0.12277346481644447   0.12275448   0%
  delta: 0.009462146399886628   0.00946046   0%
  R0: 10.5   10.5   0%
  R1: 10.499992711949272   10.499993   0%
  R100: 10.499753902139718   10.499754   0%
  R500: 10.503794697358195   10.503795   0%
=================
  STATUS: PASS

@zhucaoxiang
Copy link
Collaborator Author

@lazersos There might be something wrong with the sources. If you add one line before line 784 in vmec_utils.f to print the value of ui, you will get NaN values when checking the BENCHMARKS/BEAMS3D/input.ORBITS (probably you have to run in debugging mode). This is probably the reason why it doesn't work. Do you have any clues?

[stellar-i01n1:2290228:0:2290228] Caught signal 8 (Floating point exception: floating-point invalid operation)
 ui =                      NaN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants