
Nonlinear solver: deal with failure #2744

Merged: 1 commit merged on Jul 24, 2024

Conversation

@tjhei (Member) commented Dec 19, 2018

This introduces a new setting to decide what should happen if the
nonlinear solver fails to converge.

The old behavior was to "just continue", which is an unsafe choice.

@tjhei (Member Author) commented Dec 19, 2018

@jdannberg something like this?

@gassmoeller (Member) left a comment:

I like the change and the structure of the implementation, but I would like to keep the code out of Simulator::run, and I have some suggestions for the wording. What do you think?

if (! (parameters.skip_solvers_on_initial_refinement
       && pre_refinement_step < parameters.initial_adaptive_refinement))
  {
    start_timestep ();

    // then do the core work: assemble systems and solve
    solve_timestep ();
Comment (Member):

I somewhat dislike adding so many lines to the main run loop that are unrelated to it. Conceptually, it is the responsibility of the solve_timestep function to handle non-convergence. Could we move the whole try-catch block into solve_timestep and modify the code in run to the following:

const bool retry_timestep = solve_timestep();
if (retry_timestep)
  continue;

All of the exception handling can stay the same, but this keeps the loop readable.
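The control flow suggested above can be sketched in isolation. The following is a minimal, self-contained illustration, where MiniSimulator, failures_left, and attempts are made-up names for the sketch and not ASPECT's API: solve_timestep() absorbs the try/catch internally and merely reports whether the step should be redone.

```cpp
// Hypothetical sketch of the suggested control flow: solve_timestep()
// handles convergence failures internally and returns whether the
// timestep must be retried, so the main run loop stays short.
struct MiniSimulator
{
  int failures_left = 2;   // pretend the solver fails twice, then succeeds
  int attempts      = 0;

  // Returns true if the timestep failed and should be retried.
  bool solve_timestep()
  {
    ++attempts;
    if (failures_left > 0)
      {
        --failures_left;
        return true;   // a convergence failure was caught internally
      }
    return false;      // converged
  }

  void run()
  {
    while (true)
      {
        const bool retry_timestep = solve_timestep();
        if (retry_timestep)
          continue;    // redo the (shortened) timestep
        break;         // step accepted; a real loop would advance time here
      }
  }
};
```

With the failure handling hidden inside solve_timestep(), the run loop only ever sees "retry or proceed", which is the readability gain argued for above.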

Comment (Member):

Please check my comment from 2019: https://github.com/geodynamics/aspect/pull/2744/files#r244753861
Does this still apply?

"Select the strategy on what to do if the nonlinear solver scheme fails to "
"converge. The options are:\n"
"`continue with next timestep`: ignore error and continue to the next timestep\n"
"`cut timestep size`: reduce the current timestep size and redo the timestep\n"
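For illustration, the strategies named in this documentation string might be dispatched roughly as follows. The enum and function names here are assumptions made for the sketch, not the actual ASPECT declarations.

```cpp
#include <stdexcept>

// Hypothetical enum mirroring the documented parameter values.
enum class NonlinearSolverFailureStrategy
{
  continue_with_next_timestep,
  cut_timestep_size,
  abort_program
};

// Decide from the parameter whether the failed timestep should be
// retried with a smaller step, skipped, or the run aborted.
bool retry_with_smaller_timestep(const NonlinearSolverFailureStrategy strategy)
{
  switch (strategy)
    {
      case NonlinearSolverFailureStrategy::continue_with_next_timestep:
        return false;  // ignore the failure and move on to the next timestep
      case NonlinearSolverFailureStrategy::cut_timestep_size:
        return true;   // redo the timestep with a reduced dt
      case NonlinearSolverFailureStrategy::abort_program:
        throw std::runtime_error("nonlinear solver failed to converge");
    }
  return false;  // unreachable; silences compiler warnings
}
```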
Comment (Member):

Maybe by a factor of two?

}
catch (...)
{
pcout << "ERROR: the solver in the current timestep failed to converge." << std::endl;
Comment (Member):

What about

pcout << "WARNING: The nonlinear solver in the current timestep failed to converge." << std::endl
      << "Acting according to the parameter 'Nonlinear solver failure strategy'." << std::endl;

{
if (timestep_number == 0)
{
pcout << "Note: we can not cut the timestep in step 0, so we are aborting."
Comment (Member):

I would call this an ERROR:...


// reduce timestep size and update time to "go back":
const double new_time_step = 0.5*time_step;
pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl;
Comment (Contributor):

This output should be in years or seconds, depending on the parameter.

pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl;

time -= time_step - new_time_step;
time_step = new_time_step;
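As a minimal sketch (not ASPECT code), the rewind arithmetic above can be checked in isolation: at the point of failure, time has already been advanced by time_step, so subtracting time_step - new_time_step leaves the repeated, shorter step starting from the same old time. StepState and cut_timestep are illustrative names.

```cpp
// Illustrative state of the time loop at the moment a timestep fails:
// `time` is the end of the (just failed) step, `time_step` its length.
struct StepState
{
  double time;
  double time_step;

  // Reduce the step by `factor` and rewind `time` so that the repeated
  // step starts from the same old time, just with a shorter length.
  void cut_timestep(const double factor)
  {
    const double new_time_step = factor * time_step;
    time -= time_step - new_time_step;  // same start time, shorter step
    time_step = new_time_step;
  }
};
```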
Comment (Contributor):

There's one thing that I think doesn't work well with the current implementation: the time step size is reduced for the current time step until the solver converges, but the next time step bases its size on the CFL number (rather than on the time step size that was actually necessary to get the nonlinear solver to converge). So if the problem stays as hard to solve, we need to go through a number of failing nonlinear solver loops for every time step, rather than continuing with the new (smaller) time step that is adapted to the nonlinear problem.

Comment (Member Author):

That makes sense, but how do you want to decide when to allow a bigger timestep again? I don't think there is a good answer here. One thing that might help in an extreme case is not increasing it by more than a factor of, say, 2 in each step. But in general, this seems to be something that would require more user input.
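The factor-of-2 idea above could look roughly like the following; the function name and the default factor are illustrative assumptions, not part of the PR.

```cpp
#include <algorithm>

// Sketch of a growth limiter: after a step that only converged with a
// reduced dt, do not let the CFL-based choice jump straight back up.
double limit_timestep_growth(const double cfl_timestep,
                             const double last_timestep,
                             const double max_growth_factor = 2.0)
{
  // Take the CFL-based step, but never grow by more than the given factor.
  return std::min(cfl_timestep, max_growth_factor * last_timestep);
}
```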

@@ -0,0 +1 @@
#include "../benchmarks/nonlinear_channel_flow/simple_nonlinear.cc"
Comment (Contributor):

It looks like this test problem is some sort of benchmark. Would it make sense to do some postprocessing, so that we can see that our new method actually still gives us the correct solution?

@tjhei force-pushed the nonlinear_fail_strategy branch from 3adb0fd to c2ada01 on May 31, 2024 03:35
@tjhei (Member Author) commented May 31, 2024

I am resurrecting this old PR because @danieldouglas92 is interested in it. I rebased and I am curious to see if things still work, but I know I have to make some more changes.

@tjhei force-pushed the nonlinear_fail_strategy branch from 07b738d to ebfb354 on June 1, 2024 23:14
@tjhei changed the title from "[WIP] Nonlinear solver: deal with failure" to "Nonlinear solver: deal with failure" on Jun 1, 2024
@tjhei (Member Author) commented Jun 1, 2024

This has been updated to the current main branch and now uses the TimeStepping infrastructure. The test added here is working correctly.

I would be grateful if somebody can take a look at the current state (it is still a bit messy).

@tjhei (Member Author) commented Jun 1, 2024

@jdannberg? @gassmoeller?

@gassmoeller (Member):

I will take a look asap.

@gassmoeller gassmoeller self-assigned this Jun 1, 2024
@tjhei (Member Author) commented Jun 1, 2024

it helps to include all files. :-)

@tjhei force-pushed the nonlinear_fail_strategy branch from a4cb698 to c8ee1d0 on June 2, 2024 04:26
@tjhei tjhei mentioned this pull request Jun 2, 2024
@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from e6f5301 to 79bdf32 on June 2, 2024 21:00
@gassmoeller (Member) left a comment:

Very nice, this will be useful for a bunch of models. Just some smallish comments and a question about the best way to react in the statistics postprocessor.

Also please add a changelog entry.

Comment on lines +817 to 815
AssertThrow(nonlinear_solver_control.last_check() != SolverControl::failure, ExcNonlinearSolverNoConvergence());
signals.post_nonlinear_solver(nonlinear_solver_control);
Comment (Member):

So whether we throw the assert before or after the signal may actually have some impact on the behavior. post_nonlinear_solver e.g. informs the postprocessor about the number of nonlinear iterations, and the postprocessor probably keeps accumulating the number of Stokes iterations until it receives the signal that the nonlinear solver has finished. That being said, I don't know if the other order would be better; it could mean that one timestep appears twice in the statistics file. Have you checked what the output statistics file looks like for a repeated timestep?

# 12: Velocity iterations in Stokes preconditioner
# 13: Schur complement iterations in Stokes preconditioner
0 0.000000000000e+00 0.000000000000e+00 512 4851 2145 2145 2 0 0 58 60 59
1 6.625491997652e+03 6.625491997652e+03 512 4851 2145 2145 8 83 0 338 346 346
Comment (Member):

So this looks like the statistics file records all nonlinear iterations (2x4) and all Stokes iterations, etc. from all tries. I guess that is fair, since we actually performed all these iterations?

@bangerth (Contributor) left a comment:

Not getting to the rest of it fast enough, but here are a couple of comments.

@bangerth (Contributor) left a comment:

I can't seem to get the uninterrupted time to finish this review, so here's what I have so far. I'll get back to the rest later.

@tjhei (Member Author) commented Jun 4, 2024

Comments addressed. There are still many tests that need updating, though.

@tjhei force-pushed the nonlinear_fail_strategy branch 4 times, most recently from a7e36db to 50b7468 on June 5, 2024 01:37
@tjhei (Member Author) commented Jun 5, 2024

Something is still wrong with nonlinear problems:

WARNING: The nonlinear solver in the current timestep failed to converge.
Acting according to the parameter 'Nonlinear solver failure strategy'...
   Postprocessing:
     RMS, max velocity:                               7.49 m/s, 56.3 m/s
     Temperature min/avg/max:                         0 K, 0.5048 K, 1 K
     Heat fluxes through boundary parts:              0 W, 0 W, -3.127 W, 1 W
     Max, min, and rms velocity along boundary parts: 11.29 m/s, 0.001015 m/s, 4.795 m/s, 58.32 m/s, 0.001293 m/s, 23.75 m/s, 30.68 m/s, 0.02087 m/s, 11.42 m/s, 0.04261 m/s, 0.0008854 m/s, 0.03199 m/s

*** Timestep 2:  t=0.0021 seconds, dt=9.27806e-05 seconds
   Solving temperature system... 5 iterations.
   Initial Newton Stokes residual = 3.59864e+06, v = 3.59864e+06, p = 0

   Rebuilding Stokes preconditioner...
   Solving Stokes system... 14+0 iterations.
      Relative nonlinear residual (total Newton system) after nonlinear iteration 1: 1, norm of the rhs: 3.59864e+06

   Solving temperature system... 95 iterations.
[0b3d367e5bc0:34578] *** Process received signal ***
[0b3d367e5bc0:34578] Signal: Floating point exception (8)
[0b3d367e5bc0:34578] Signal code: Floating point divide-by-zero (3)
[0b3d367e5bc0:34578] Failing at address: 0x7fc1f8c5cb14
[0b3d367e5bc0:34578] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc2048bc420]
[0b3d367e5bc0:34578] [ 1] ./libtosi_benchmark_newton_solver.debug.so(_ZNK6aspect13TosiBenchmark12TosiMaterialILi2EE9viscosityEddRKSt6vectorIdSaIdEERKN6dealii15SymmetricTensorILi2ELi2EdEERKNS8_5PointILi2EdEE+0x114)[0x7fc1f8c5cb14]
[0b3d367e5bc0:34578] [ 2] ./libtosi_benchmark_newton_solver.debug.so(_ZNK6aspect13TosiBenchmark12TosiMaterialILi2EE8evaluateERKNS_13MaterialModel19MaterialModelInputsILi2EEERNS3_20MaterialModelOutputsILi2EEE+0x2ff)[0x7fc1f8c5e24f]
[0b3d367e5bc0:34578] [ 3] /__w/aspect/aspect/build/aspect(_ZNK6aspect9SimulatorILi2EE31compute_pressure_scaling_factorEv+0x222)[0x563654f78982]
[0b3d367e5bc0:34578] [ 4] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE22assemble_stokes_systemEv+0x101)[0x563655027a71]
[0b3d367e5bc0:34578] [ 5] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE36do_one_defect_correction_Stokes_stepERNS_25DefectCorrectionResidualsEb+0x1fa)[0x563654e8b02a]
[0b3d367e5bc0:34578] [ 6] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE42solve_iterated_advection_and_newton_stokesEv+0x195)[0x563654e8d835]
[0b3d367e5bc0:34578] [ 7] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE14solve_timestepEv+0x1a5)[0x563655a14b25]
[0b3d367e5bc0:34578] [ 8] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x4ef)[0x563655a21ccf]
[0b3d367e5bc0:34578] [ 9] /__w/aspect/aspect/build/aspect(_Z13run_simulatorILi2EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_bbbb+0x2cf)[0x5636559ef89f]
[0b3d367e5bc0:34578] [10] /__w/aspect/aspect/build/aspect(main+0x5ad)[0x5636549fb66d]
[0b3d367e5bc0:34578] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc2046da083]
[0b3d367e5bc0:34578] [12] /__w/aspect/aspect/build/aspect(_start+0x2e)[0x5636549fd11e]

@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from 4a47c48 to 4659297 on June 18, 2024 18:51
@tjhei (Member Author) commented Jun 18, 2024

@gassmoeller @jdannberg, can you remind me what the conclusion of our discussion regarding the default of the new setting was?

@gassmoeller (Member):

I think we agreed on continue_with_next_timestep, because it is closest to the old behavior and at least provides the user with a warning. I think we also talked about somehow tracking whether there are any non-converging timesteps and then issuing a warning at the end, because warnings for individual timesteps might get lost in the long output text.
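The tracking described here can be sketched as a simple counter with an end-of-run summary. Struct and member names are hypothetical for this sketch; the warning text mirrors the example output that appears later in the PR.

```cpp
#include <sstream>
#include <string>

// Hypothetical bookkeeping: count every nonlinear solver failure during
// the run and emit a single summary warning at the end, so failures are
// not lost in a long log of per-timestep output.
struct FailureTracker
{
  unsigned int nonlinear_solver_failures = 0;

  void on_nonlinear_solver_failure() { ++nonlinear_solver_failures; }

  // Empty string means nothing to warn about.
  std::string end_of_run_summary() const
  {
    if (nonlinear_solver_failures == 0)
      return "";
    std::ostringstream out;
    out << "WARNING: During this computation " << nonlinear_solver_failures
        << " nonlinear solver failures occurred!";
    return out.str();
  }
};
```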

@naliboff (Contributor):

@tjhei - I tested these changes on top of #4370, and so far using set Nonlinear solver failure strategy = cut timestep size is allowing a complex model to get through non-convergent time steps after repeating the time step anywhere from 1-3 times.

I will report back after additional testing, but so far good news!

@jdannberg (Contributor):

@tjhei From what I remember, this was almost finished, except for the default. Or was there still a problem?
@naliboff found that this helps with convergence in lithosphere-scale models (and that the results are actually different), so it would be great to get this merged soon!

@@ -1850,91 +1851,125 @@ namespace aspect
if (parameters.use_operator_splitting)
compute_reactions ();

switch (parameters.nonlinear_solver)
try
Comment (Member):

@naliboff tested this PR for a model, and we just realized that it looks like the mesh deformation above is not reset if the nonlinear solver fails. It would be nice to make the repeat work with mesh deformation as well, but at least we should probably document this. (Also, I haven't tested this extensively yet; maybe it is reset in a place I don't see.)

Comment (Member Author):

I thought this is done here:

if (mesh_deformation)
mesh_deformation->mesh_displacements = mesh_deformation->old_mesh_displacements;

Comment (Member):

Hmm, you are right, that looks correct. I had not seen that line. But @naliboff had a model with mesh deformation where, upon repeating a timestep, there was a runaway effect in the following nonlinear solver attempts. As far as I remember, the solver converged worse and worse at shorter timesteps and velocities increased. @naliboff, do I remember this correctly? Was that something that is simple for us to reproduce? (I think it was the continental extension cookbook or something similar, but with the branch of #4370.)

Comment (Contributor):

@gassmoeller @tjhei - Yes, the example in question was a version of the continental extension cookbook using #4370 and the commits from this PR. Indeed, in some cases where the time step was repeated 4-5 times and the time step size got quite small (10,000 years -> 500 years), the solver eventually converged, but the solution looked quite different even though the nonlinear solve eventually converged. I ran the same model without mesh deformation and the same issue persisted, so I think it is due to the dependence of the rheology on the time step size rather than an issue with this PR. In general, the functionality here helped quite a bit when the final time step size only ends up being a factor of 2 or 4 lower than previous time steps.

@tjhei (Member Author) commented Jul 17, 2024

@jdannberg Do you remember what we decided regarding the default? I vaguely remember some kind of statistic to be printed at the end of the run as well. Sorry, I don't remember any of the details.

@gassmoeller (Member):

I think we decided to continue on non-convergence, but print a warning (and possibly another warning at the end of the model run if at least one timestep didn't converge). I don't remember about the statistics; maybe it would be useful to report how many timesteps didn't reach the nonlinear solver tolerance?

@tjhei force-pushed the nonlinear_fail_strategy branch 4 times, most recently from a79d2c7 to 25b9f59 on July 18, 2024 21:02
@@ -95,3 +101,5 @@ Termination requested by criterion: end time




WARNING: During this computation 2 nonlinear solver failures occurred!
Comment (Member Author):

Here is an example of the new output.

@tjhei (Member Author) commented Jul 19, 2024

Ok, I changed the default to continue and I added printing of information. Any other comments?

@naliboff (Contributor):

> Ok, I changed the default to continue and I added printing of information. Any other comments?

Not for this PR, but based on my testing so far I think it would be good to implement options in the future where:

  1. There is a maximum number of repeated time steps (or minimum time step size) to prevent spurious behavior in models with viscoelasticity
  2. An option to repeat the time step and increase the maximum number of nonlinear iterations. This is based on a few cases where the model would have converged with a few more iterations if the maximum number of nonlinear iterations had been set to a higher value. I'll think about ways that this could be automated.
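The first suggestion above could be sketched as a simple guard; max_repeats and minimum_time_step are hypothetical parameters for illustration, not ones the merged PR provides.

```cpp
// Hypothetical guard: stop cutting the timestep once either a repeat
// limit or a minimum step size would be violated, instead of retrying
// indefinitely (which can cause spurious behavior, e.g. with
// viscoelasticity).
bool may_repeat_timestep(const unsigned int repeats_so_far,
                         const double       proposed_time_step,
                         const unsigned int max_repeats       = 5,
                         const double       minimum_time_step = 1e-3)
{
  return repeats_so_far < max_repeats
         && proposed_time_step >= minimum_time_step;
}
```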

Thanks again for this functionality @tjhei; after a bit of initial testing it is also helping in models with two-phase flow and reactions that can lead to rapid changes in porosity.

@tjhei tjhei requested a review from bangerth July 20, 2024 11:43
@gassmoeller (Member) left a comment:

Thank you! This is almost ready; could you look into my last remaining comments? I think all of @bangerth's comments have been addressed; unless we hear something more, I am willing to merge this.

}
case Parameters<dim>::NonlinearSolverFailureStrategy::abort_program:
{
// rethrow the current exception
Comment (Member):

The other two options have an additional line that says what is happening next (continue, cut), but this one doesn't; it just throws the error. Can you add another line here saying that we are now aborting the model run? This would explain to the user that we purposefully abort (they would see that from the error message as well, but this makes it more obvious).

@@ -2173,6 +2210,9 @@ namespace aspect
// throwing an exception. Therefore, we have to do this manually here:
computing_timer.print_summary ();

if (nonlinear_solver_failures>0)
Comment (Member):

Suggested change
if (nonlinear_solver_failures>0)
if (nonlinear_solver_failures > 0)

{
prm.enter_subsection("Repeat on nonlinear solver failure");

prm.declare_entry("Cut back amount", "0.5",
Comment (Member):

Maybe, and the same for the internal variable name:

Suggested change
prm.declare_entry("Cut back amount", "0.5",
prm.declare_entry("Cut back factor", "0.5",

@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from 889f03d to 6b86bee on July 23, 2024 19:26
This introduces a new setting to decide what should happen if the
nonlinear solver fails to converge.
@tjhei force-pushed the nonlinear_fail_strategy branch from 6b86bee to f50bd8b on July 23, 2024 20:30
@tjhei (Member Author) commented Jul 23, 2024

updated.

@gassmoeller (Member) left a comment:

👍 Thanks.

@gassmoeller dismissed bangerth's stale review on July 24, 2024 08:09

All comments have been addressed.

@gassmoeller merged commit 694655a into geodynamics:main on Jul 24, 2024 (8 checks passed)