
Nonlinear solver: deal with failure #2744

Merged: 1 commit merged on Jul 24, 2024

Conversation

@tjhei (Member) commented Dec 19, 2018

This introduces a new setting to decide what should happen if the
nonlinear solver fails to converge.

The old behavior was to "just continue", which is an unsafe choice.

@tjhei (Member Author) commented Dec 19, 2018

@jdannberg something like this?

@gassmoeller (Member) left a comment:

I like the change and the structure of the implementation, but I would like to keep the code out of Simulator::run, and I have some suggestions for the wording. What do you think?

if (! (parameters.skip_solvers_on_initial_refinement
       && pre_refinement_step < parameters.initial_adaptive_refinement))
  {
    start_timestep ();

    // then do the core work: assemble systems and solve
    solve_timestep ();
Comment (Member):

I somewhat dislike adding so many lines to the main run loop that are unrelated to it. Conceptually, it is the responsibility of the solve_timestep function to handle non-convergence. Could we move the whole try-catch block into solve_timestep and modify the code in run to the following:

const bool retry_timestep = solve_timestep();
if (retry_timestep)
  continue;

All of the exception handling can stay the same, but this keeps the loop readable.
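The control flow suggested above can be sketched in isolation. The following is a minimal, self-contained illustration, where MiniSimulator, failures_left, and attempts are made-up names for the sketch and not ASPECT's API: solve_timestep() absorbs the try/catch internally and merely reports whether the step should be redone.

```cpp
// Hypothetical sketch of the suggested control flow: solve_timestep()
// handles convergence failures internally and returns whether the
// timestep must be retried, so the main run loop stays short.
struct MiniSimulator
{
  int failures_left = 2;   // pretend the solver fails twice, then succeeds
  int attempts      = 0;

  // Returns true if the timestep failed and should be retried.
  bool solve_timestep()
  {
    ++attempts;
    if (failures_left > 0)
      {
        --failures_left;
        return true;   // a convergence failure was caught internally
      }
    return false;      // converged
  }

  void run()
  {
    while (true)
      {
        const bool retry_timestep = solve_timestep();
        if (retry_timestep)
          continue;    // redo the (shortened) timestep
        break;         // step accepted; a real loop would advance time here
      }
  }
};
```

With the failure handling hidden inside solve_timestep(), the run loop only ever sees "retry or proceed", which is the readability gain argued for above.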

Comment (Member):

Please check my comment from 2019: https://github.com/geodynamics/aspect/pull/2744/files#r244753861
Does this still apply?

"Select the strategy on what to do if the nonlinear solver scheme fails to "
"converge. The options are:\n"
"`continue with next timestep`: ignore error and continue to the next timestep\n"
"`cut timestep size`: reduce the current timestep size and redo the timestep\n"
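For illustration, the strategies named in this documentation string might be dispatched roughly as follows. The enum and function names here are assumptions made for the sketch, not the actual ASPECT declarations.

```cpp
#include <stdexcept>

// Hypothetical enum mirroring the documented parameter values.
enum class NonlinearSolverFailureStrategy
{
  continue_with_next_timestep,
  cut_timestep_size,
  abort_program
};

// Decide from the parameter whether the failed timestep should be
// retried with a smaller step, skipped, or the run aborted.
bool retry_with_smaller_timestep(const NonlinearSolverFailureStrategy strategy)
{
  switch (strategy)
    {
      case NonlinearSolverFailureStrategy::continue_with_next_timestep:
        return false;  // ignore the failure and move on to the next timestep
      case NonlinearSolverFailureStrategy::cut_timestep_size:
        return true;   // redo the timestep with a reduced dt
      case NonlinearSolverFailureStrategy::abort_program:
        throw std::runtime_error("nonlinear solver failed to converge");
    }
  return false;  // unreachable; silences compiler warnings
}
```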
Comment (Member):

Maybe by a factor of two?

}
catch (...)
{
pcout << "ERROR: the solver in the current timestep failed to converge." << std::endl;
Comment (Member):

What about

pcout << "WARNING: The nonlinear solver in the current timestep failed to converge." << std::endl
      << "Acting according to the parameter 'Nonlinear solver failure strategy'." << std::endl;

{
if (timestep_number == 0)
{
pcout << "Note: we can not cut the timestep in step 0, so we are aborting."
Comment (Member):

I would call this an ERROR:...


// reduce timestep size and update time to "go back":
const double new_time_step = 0.5*time_step;
pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl;
Comment (Contributor):

This output should be in years or seconds, depending on the parameter.

pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl;

time -= time_step - new_time_step;
time_step = new_time_step;
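As a minimal sketch (not ASPECT code), the rewind arithmetic above can be checked in isolation: at the point of failure, time has already been advanced by time_step, so subtracting time_step - new_time_step leaves the repeated, shorter step starting from the same old time. StepState and cut_timestep are illustrative names.

```cpp
// Illustrative state of the time loop at the moment a timestep fails:
// `time` is the end of the (just failed) step, `time_step` its length.
struct StepState
{
  double time;
  double time_step;

  // Reduce the step by `factor` and rewind `time` so that the repeated
  // step starts from the same old time, just with a shorter length.
  void cut_timestep(const double factor)
  {
    const double new_time_step = factor * time_step;
    time -= time_step - new_time_step;  // same start time, shorter step
    time_step = new_time_step;
  }
};
```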
Comment (Contributor):

There's one thing that I think doesn't work well with the current implementation: the time step size is reduced for the current time step until the solver converges, but the next time step bases its size on the CFL number (rather than on the time step size that was actually necessary to get the nonlinear solver to converge). So if the problem stays as hard to solve, we need to go through a number of failing nonlinear solver loops for every time step, rather than continuing with the new (smaller) time step that is adapted to the nonlinear problem.

Comment (Member Author):

That makes sense, but how do you want to decide when to allow a bigger timestep again? I don't think there is a good answer here. One thing that might help in an extreme case is not increasing it by more than a factor of, say, 2 in each step. But in general, this seems to be something that would require more user input.
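The factor-of-2 idea above could look roughly like the following; the function name and the default factor are illustrative assumptions, not part of the PR.

```cpp
#include <algorithm>

// Sketch of a growth limiter: after a step that only converged with a
// reduced dt, do not let the CFL-based choice jump straight back up.
double limit_timestep_growth(const double cfl_timestep,
                             const double last_timestep,
                             const double max_growth_factor = 2.0)
{
  // Take the CFL-based step, but never grow by more than the given factor.
  return std::min(cfl_timestep, max_growth_factor * last_timestep);
}
```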

@@ -0,0 +1 @@
#include "../benchmarks/nonlinear_channel_flow/simple_nonlinear.cc"
Comment (Contributor):

It looks like this test problem is some sort of benchmark. Would it make sense to do some postprocessing, so that we can see that our new method actually still gives us the correct solution?

@tjhei force-pushed the nonlinear_fail_strategy branch from 3adb0fd to c2ada01 on May 31, 2024 03:35
@tjhei (Member Author) commented May 31, 2024

I am resurrecting this old PR because @danieldouglas92 is interested in it. I rebased and I am curious to see if things still work, but I know I have to make some more changes.

@tjhei force-pushed the nonlinear_fail_strategy branch from 07b738d to ebfb354 on June 1, 2024 23:14
@tjhei changed the title from "[WIP] Nonlinear solver: deal with failure" to "Nonlinear solver: deal with failure" on Jun 1, 2024
@tjhei (Member Author) commented Jun 1, 2024

This has been updated to the current main branch and now uses the TimeStepping infrastructure. The test added here is working correctly.

I would be grateful if somebody can take a look at the current state (it is still a bit messy).

@tjhei (Member Author) commented Jun 1, 2024

@jdannberg? @gassmoeller?

@gassmoeller (Member):

I will take a look asap.

@gassmoeller gassmoeller self-assigned this Jun 1, 2024
@tjhei (Member Author) commented Jun 1, 2024

it helps to include all files. :-)

@tjhei force-pushed the nonlinear_fail_strategy branch from a4cb698 to c8ee1d0 on June 2, 2024 04:26
@tjhei tjhei mentioned this pull request Jun 2, 2024
@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from e6f5301 to 79bdf32 on June 2, 2024 21:00
@gassmoeller (Member) left a comment:

Very nice, this will be useful for a bunch of models. Just some smallish comments and a question about the best way to react in the statistics postprocessor.

Also please add a changelog entry.

Comment on lines +817 to 815
AssertThrow(nonlinear_solver_control.last_check() != SolverControl::failure, ExcNonlinearSolverNoConvergence());
signals.post_nonlinear_solver(nonlinear_solver_control);
Comment (Member):

So whether we throw the assert before or after the signal may actually have some impact on the behavior. post_nonlinear_solver e.g. informs the postprocessor about the number of nonlinear iterations, and the postprocessor probably keeps accumulating the number of Stokes iterations until it receives the signal that the nonlinear solver has finished. That being said, I don't know if the other order would be better; it could mean that one timestep appears twice in the statistics file. Have you checked what the output statistics file looks like for a repeated timestep?

# 12: Velocity iterations in Stokes preconditioner
# 13: Schur complement iterations in Stokes preconditioner
0 0.000000000000e+00 0.000000000000e+00 512 4851 2145 2145 2 0 0 58 60 59
1 6.625491997652e+03 6.625491997652e+03 512 4851 2145 2145 8 83 0 338 346 346
Comment (Member):

So this looks like the statistics file records all nonlinear iterations (2x4) and all Stokes iterations, etc. from all tries. I guess that is fair, since we actually performed all these iterations?

@bangerth (Contributor) left a comment:

Not getting to the rest of it fast enough, but here are a couple of comments.

@bangerth (Contributor) left a comment:

I can't seem to get the uninterrupted time to finish this review, so here's what I have so far. I'll get back to the rest later.

@tjhei (Member Author) commented Jun 4, 2024

Comments addressed. There are still many tests that need updating, though.

@tjhei force-pushed the nonlinear_fail_strategy branch 4 times, most recently from a7e36db to 50b7468 on June 5, 2024 01:37
@tjhei (Member Author) commented Jun 5, 2024

Something is still wrong with nonlinear problems:

WARNING: The nonlinear solver in the current timestep failed to converge.
Acting according to the parameter 'Nonlinear solver failure strategy'...
   Postprocessing:
     RMS, max velocity:                               7.49 m/s, 56.3 m/s
     Temperature min/avg/max:                         0 K, 0.5048 K, 1 K
     Heat fluxes through boundary parts:              0 W, 0 W, -3.127 W, 1 W
     Max, min, and rms velocity along boundary parts: 11.29 m/s, 0.001015 m/s, 4.795 m/s, 58.32 m/s, 0.001293 m/s, 23.75 m/s, 30.68 m/s, 0.02087 m/s, 11.42 m/s, 0.04261 m/s, 0.0008854 m/s, 0.03199 m/s

*** Timestep 2:  t=0.0021 seconds, dt=9.27806e-05 seconds
   Solving temperature system... 5 iterations.
   Initial Newton Stokes residual = 3.59864e+06, v = 3.59864e+06, p = 0

   Rebuilding Stokes preconditioner...
   Solving Stokes system... 14+0 iterations.
      Relative nonlinear residual (total Newton system) after nonlinear iteration 1: 1, norm of the rhs: 3.59864e+06

   Solving temperature system... 95 iterations.
[0b3d367e5bc0:34578] *** Process received signal ***
[0b3d367e5bc0:34578] Signal: Floating point exception (8)
[0b3d367e5bc0:34578] Signal code: Floating point divide-by-zero (3)
[0b3d367e5bc0:34578] Failing at address: 0x7fc1f8c5cb14
[0b3d367e5bc0:34578] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc2048bc420]
[0b3d367e5bc0:34578] [ 1] ./libtosi_benchmark_newton_solver.debug.so(_ZNK6aspect13TosiBenchmark12TosiMaterialILi2EE9viscosityEddRKSt6vectorIdSaIdEERKN6dealii15SymmetricTensorILi2ELi2EdEERKNS8_5PointILi2EdEE+0x114)[0x7fc1f8c5cb14]
[0b3d367e5bc0:34578] [ 2] ./libtosi_benchmark_newton_solver.debug.so(_ZNK6aspect13TosiBenchmark12TosiMaterialILi2EE8evaluateERKNS_13MaterialModel19MaterialModelInputsILi2EEERNS3_20MaterialModelOutputsILi2EEE+0x2ff)[0x7fc1f8c5e24f]
[0b3d367e5bc0:34578] [ 3] /__w/aspect/aspect/build/aspect(_ZNK6aspect9SimulatorILi2EE31compute_pressure_scaling_factorEv+0x222)[0x563654f78982]
[0b3d367e5bc0:34578] [ 4] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE22assemble_stokes_systemEv+0x101)[0x563655027a71]
[0b3d367e5bc0:34578] [ 5] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE36do_one_defect_correction_Stokes_stepERNS_25DefectCorrectionResidualsEb+0x1fa)[0x563654e8b02a]
[0b3d367e5bc0:34578] [ 6] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE42solve_iterated_advection_and_newton_stokesEv+0x195)[0x563654e8d835]
[0b3d367e5bc0:34578] [ 7] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE14solve_timestepEv+0x1a5)[0x563655a14b25]
[0b3d367e5bc0:34578] [ 8] /__w/aspect/aspect/build/aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x4ef)[0x563655a21ccf]
[0b3d367e5bc0:34578] [ 9] /__w/aspect/aspect/build/aspect(_Z13run_simulatorILi2EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_bbbb+0x2cf)[0x5636559ef89f]
[0b3d367e5bc0:34578] [10] /__w/aspect/aspect/build/aspect(main+0x5ad)[0x5636549fb66d]
[0b3d367e5bc0:34578] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc2046da083]
[0b3d367e5bc0:34578] [12] /__w/aspect/aspect/build/aspect(_start+0x2e)[0x5636549fd11e]

@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from 4a47c48 to 4659297 on June 18, 2024 18:51
@tjhei (Member Author) commented Jun 18, 2024

@gassmoeller @jdannberg, can you remind me what the conclusion of our discussion regarding the default of the new setting was?

@gassmoeller (Member):

I think we agreed on continue_with_next_timestep, because it is closest to the old behavior and at least provides the user with a warning. I think we also talked about somehow tracking whether there are any non-converging timesteps and then issuing a warning at the end, because warnings for individual timesteps might get lost in the long output text.
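The tracking described here can be sketched as a simple counter with an end-of-run summary. Struct and member names are hypothetical for this sketch; the warning text mirrors the example output that appears later in the PR.

```cpp
#include <sstream>
#include <string>

// Hypothetical bookkeeping: count every nonlinear solver failure during
// the run and emit a single summary warning at the end, so failures are
// not lost in a long log of per-timestep output.
struct FailureTracker
{
  unsigned int nonlinear_solver_failures = 0;

  void on_nonlinear_solver_failure() { ++nonlinear_solver_failures; }

  // Empty string means nothing to warn about.
  std::string end_of_run_summary() const
  {
    if (nonlinear_solver_failures == 0)
      return "";
    std::ostringstream out;
    out << "WARNING: During this computation " << nonlinear_solver_failures
        << " nonlinear solver failures occurred!";
    return out.str();
  }
};
```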

@naliboff (Contributor):

@tjhei - I tested these changes on top of #4370, and so far using set Nonlinear solver failure strategy = cut timestep size is allowing a complex model to get through non-convergent time steps after repeating the time step anywhere from 1-3 times.

I will report back after additional testing, but so far good news!

@jdannberg (Contributor):

@tjhei From what I remember, this was almost finished, except for the default. Or was there still a problem?
@naliboff found that this helps with convergence in lithosphere-scale models (and that the results are actually different), so it would be great to get this merged soon!

@@ -1850,91 +1851,125 @@ namespace aspect
if (parameters.use_operator_splitting)
compute_reactions ();

switch (parameters.nonlinear_solver)
try
Comment (Member):

@naliboff tested this PR for a model, and we just realized that it looks like the mesh deformation above is not reset if the nonlinear solver fails. It would be nice to make the repeat work with mesh deformation as well, but at least we should probably document this. (Also, I haven't tested this extensively yet; maybe it is reset in a place I don't see.)

Comment (Member Author):

I thought this is done here:

if (mesh_deformation)
mesh_deformation->mesh_displacements = mesh_deformation->old_mesh_displacements;

Comment (Member):

Hmm, you are right, that looks correct. I had not seen that line. But @naliboff had a model with mesh deformation where, upon repeating a timestep, there was a runaway effect in the following nonlinear solver attempts. As far as I remember, the solver converged worse and worse at shorter timesteps and velocities increased. @naliboff, do I remember this correctly? Was that something that is simple for us to reproduce? (I think it was the continental extension cookbook or something similar, but with the branch of #4370.)

Comment (Contributor):

@gassmoeller @tjhei - Yes, the example in question was a version of the continental extension cookbook using #4370 and the commits from this PR. Indeed, in some cases where the time step was repeated 4-5 times and the time step size got quite small (10,000 years -> 500 years), the solver eventually converged, but the solution looked quite different even though the nonlinear solve eventually converged. I ran the same model without mesh deformation and the same issue persisted, so I think it is due to the dependence of the rheology on the time step size rather than an issue with this PR. In general, the functionality here helped quite a bit when the final time step size only ends up being a factor of 2 or 4 lower than previous time steps.

@tjhei (Member Author) commented Jul 17, 2024

@jdannberg Do you remember what we decided regarding the default? I vaguely remember some kind of statistic to be printed at the end of the run as well. Sorry, I don't remember any of the details.

@gassmoeller (Member):

I think we decided to continue on non-convergence, but print a warning (and possibly another warning at the end of the model run if at least one timestep didn't converge). I don't remember about the statistics; maybe it would be useful to report how many timesteps didn't reach the nonlinear solver tolerance?

@tjhei force-pushed the nonlinear_fail_strategy branch 4 times, most recently from a79d2c7 to 25b9f59 on July 18, 2024 21:02
@@ -95,3 +101,5 @@ Termination requested by criterion: end time




WARNING: During this computation 2 nonlinear solver failures occurred!
Comment (Member Author):

Here is an example of the new output.

@tjhei (Member Author) commented Jul 19, 2024

Ok, I changed the default to continue and I added printing of information. Any other comments?

@naliboff (Contributor):

> Ok, I changed the default to continue and I added printing of information. Any other comments?

Not for this PR, but based on my testing so far I think it would be good to implement options in the future where:

  1. There is a maximum number of repeated time steps (or minimum time step size) to prevent spurious behavior in models with viscoelasticity
  2. An option to repeat the time step and increase the maximum number of nonlinear iterations. This is based on a few cases where the model would have converged with a few more iterations if the maximum number of nonlinear iterations had been set to a higher value. I'll think about ways that this could be automated.
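The first suggestion above could be sketched as a simple guard; max_repeats and minimum_time_step are hypothetical parameters for illustration, not ones the merged PR provides.

```cpp
// Hypothetical guard: stop cutting the timestep once either a repeat
// limit or a minimum step size would be violated, instead of retrying
// indefinitely (which can cause spurious behavior, e.g. with
// viscoelasticity).
bool may_repeat_timestep(const unsigned int repeats_so_far,
                         const double       proposed_time_step,
                         const unsigned int max_repeats       = 5,
                         const double       minimum_time_step = 1e-3)
{
  return repeats_so_far < max_repeats
         && proposed_time_step >= minimum_time_step;
}
```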

Thanks again for this functionality @tjhei; after a bit of initial testing it is also helping in models with two-phase flow and reactions that can lead to rapid changes in porosity.

@tjhei tjhei requested a review from bangerth July 20, 2024 11:43
@gassmoeller (Member) left a comment:

Thank you! This is almost ready; could you look into my last remaining comments? I think all of @bangerth's comments have been addressed; unless we hear something more, I am willing to merge this.

}
case Parameters<dim>::NonlinearSolverFailureStrategy::abort_program:
{
// rethrow the current exception
Comment (Member):

The other two options have an additional line that says what is happening next (continue, cut), but this one doesn't; it just throws the error. Can you add another line here saying that we are now aborting the model run? This would explain to the user that we purposefully abort (they would see that from the error message as well, but this makes it more obvious).

@@ -2173,6 +2210,9 @@ namespace aspect
// throwing an exception. Therefore, we have to do this manually here:
computing_timer.print_summary ();

if (nonlinear_solver_failures>0)
Comment (Member):

Suggested change
if (nonlinear_solver_failures>0)
if (nonlinear_solver_failures > 0)

{
prm.enter_subsection("Repeat on nonlinear solver failure");

prm.declare_entry("Cut back amount", "0.5",
Comment (Member):

Maybe, and the same for the internal variable name:

Suggested change
prm.declare_entry("Cut back amount", "0.5",
prm.declare_entry("Cut back factor", "0.5",

@tjhei force-pushed the nonlinear_fail_strategy branch 2 times, most recently from 889f03d to 6b86bee on July 23, 2024 19:26
This introduces a new setting to decide what should happen if the
nonlinear solver fails to converge.
@tjhei force-pushed the nonlinear_fail_strategy branch from 6b86bee to f50bd8b on July 23, 2024 20:30
@tjhei (Member Author) commented Jul 23, 2024

updated.

@gassmoeller (Member) left a comment:

👍 Thanks.

@gassmoeller dismissed bangerth's stale review on July 24, 2024 08:09

All comments have been addressed.

@gassmoeller merged commit 694655a into geodynamics:main on Jul 24, 2024 (8 checks passed)