-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nonlinear solver: deal with failure #2744
Conversation
@jdannberg something like this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the change and structure of the implementation, but I would like to keep the code out of Simulator::run
and have some suggestions for the wording. What do you think about it?
source/simulator/core.cc
Outdated
if (! (parameters.skip_solvers_on_initial_refinement | ||
&& pre_refinement_step < parameters.initial_adaptive_refinement)) | ||
{ | ||
start_timestep (); | ||
|
||
// then do the core work: assemble systems and solve | ||
solve_timestep (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somewhat dislike adding so many lines in the main run loop that are unrelated to it. Conceptionally it is the responsibility of the solve_timestep function to handle non-convergence. Could we move the whole try-catch block into solve_timestep, and modify the code in run
to the following:
const bool retry_timestep = solve_timestep();
if (retry_timestep)
continue;
All of the exception handling can stay the same, but it keep this loop readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please check my comment from 2019: https://github.com/geodynamics/aspect/pull/2744/files#r244753861
does this still apply?
source/simulator/parameters.cc
Outdated
"Select the strategy on what to do if the nonlinear solver scheme fails to " | ||
"converge. The options are:\n" | ||
"`continue with next timestep`: ignore error and continue to the next timestep\n" | ||
"`cut timestep size`: reduce the current timestep size and redo the timestep\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe by a factor of two
?
source/simulator/core.cc
Outdated
} | ||
catch (...) | ||
{ | ||
pcout << "ERROR: the solver in the current timestep failed to converge." << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
pcout << "WARNING: The nonlinear solver in the current timestep failed to converge." << std::endl
<< "Acting according to the parameter 'Nonlinear solver failure strategy'." << std::endl;
source/simulator/core.cc
Outdated
{ | ||
if (timestep_number == 0) | ||
{ | ||
pcout << "Note: we can not cut the timestep in step 0, so we are aborting." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call this an ERROR:...
source/simulator/core.cc
Outdated
|
||
// reduce timestep size and update time to "go back": | ||
const double new_time_step = 0.5*time_step; | ||
pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output should be in years or seconds, depending on the parameter.
source/simulator/core.cc
Outdated
pcout << "\tReducing timestep to " << new_time_step << '\n' << std::endl; | ||
|
||
time -= time_step - new_time_step; | ||
time_step = new_time_step; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's one thing that I think doesn't work well with the current implementation: The time step size is reduced for the current time step, until the solver converges, but the next time step bases the time step size on the CFL number (rather than on what time step size was necessary to actually get the nonlinear solver to converge). So if the problem stays as hard to solve, we need to go though a number of failing nonlinear solver loops for every time step, rather than continuing with the new (smaller) time step that is adapted to the nonlinear problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, but how do you want to decide when to make bigger timestep again? I don't think there is a good answer here. One thing that might help in an extreme case is not increasing it by more than a factor of say 2 in each step. But in general, this seems to be something that would require more user input.
@@ -0,0 +1 @@ | |||
#include "../benchmarks/nonlinear_channel_flow/simple_nonlinear.cc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this test problem is some sort of benchmark. So would it make sense to do some postprocessing, so that we see that our new method actually still gives us the correct solution?
3adb0fd
to
c2ada01
Compare
I am resurrecting this old PR because @danieldouglas92 is interested in it. I rebased and I am curious to see if things still work, but I know I have to make some more changes. |
07b738d
to
ebfb354
Compare
This has been updated to the current main branch and now uses the TimeStepping infrastructure. The test added here is working correctly. I would be grateful if somebody can take a look at the current state (it is still a bit messy). |
I will take a look asap. |
it helps to include all files. :-) |
a4cb698
to
c8ee1d0
Compare
e6f5301
to
79bdf32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, this will be useful for a bunch of models. Just some smallish comments and a question about the best way to react in the statistics postprocessor.
Also please add a changelog entry.
AssertThrow(nonlinear_solver_control.last_check() != SolverControl::failure, ExcNonlinearSolverNoConvergence()); | ||
signals.post_nonlinear_solver(nonlinear_solver_control); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So whether we throw the assert before or after the signal may actually have some impact on the behavior. post_nonlinear_solver
e.g. informs the postprocessor about the number of nonlinear iterations, and the postprocessor probably keeps accumulating the number of Stokes iterations until it receives the signal that the nonlinear solver has finished. That being said I dont know if the other order would be better, it could mean that one timestep appears twice in the statistics file. Have you check what the output statistics file looks like for a repeated timestep?
# 12: Velocity iterations in Stokes preconditioner | ||
# 13: Schur complement iterations in Stokes preconditioner | ||
0 0.000000000000e+00 0.000000000000e+00 512 4851 2145 2145 2 0 0 58 60 59 | ||
1 6.625491997652e+03 6.625491997652e+03 512 4851 2145 2145 8 83 0 338 346 346 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this looks like the statistics file records all nonlinear iterations (2x4) and all Stokes iterations, etc. from all tries. I guess that is fair since we actually made all these iterations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not getting to the rest of it fast enough, but here's a couple comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to get the uninterrupted time to finish this review, so here's what I have so far. I'll get back to the rest later.
Comments addressed. There are still many tests that need updating, though. |
a7e36db
to
50b7468
Compare
Something is still wrong with nonlinear problems:
|
4a47c48
to
4659297
Compare
@gassmoeller @jdannberg can you remind me what the conclusion of our discussion regarding default of the new setting is? |
I think we agreed on |
@tjhei - I tested out these changes on top of #4370, and so far using I will report back after additional testing, but so far good news! |
@@ -1850,91 +1851,125 @@ namespace aspect | |||
if (parameters.use_operator_splitting) | |||
compute_reactions (); | |||
|
|||
switch (parameters.nonlinear_solver) | |||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@naliboff tested this PR for a model, and we just realized that it looks like the mesh deformation above is not reset if the nonlinear solver fails. it would be nice to make the repeat work with mesh deformation as well, but at least we should probably document this. (also I havent tested this extensively yet, maybe it is reset in a place I dont see).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is done here:
aspect/source/simulator/core.cc
Lines 2133 to 2134 in 994f686
if (mesh_deformation) | |
mesh_deformation->mesh_displacements = mesh_deformation->old_mesh_displacements; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, you are right that looks correct. I had not seen that line. But @naliboff had a model with mesh deformation where upon repeating a timestep there was a runaway effect at the following nonlinear solver attempts. As far as I remember the solver converged worse and worse at shorter timesteps and velocities increased. @naliboff do I remember this correctly? Was that something that is simple for us to reproduce? (I think it was the continental extension cookbook or something similar, but with the branch of #4370)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gassmoeller @tjhei - Yes, the example in question was a version of the continental extension cookbook using #4370 and the commits from this PR. Indeed, in some time steps where the time step was repeated 4-5 times and the time size got quite small (10,000 years -> 500 years) the solver eventually converged, but the solution looked quite different even the nonlinear solve eventually converged. I ran the same model without mesh deformation and the same issue persisted, and I think it is due to the dependence of the rheology on the time step size rather than an issue with this PR. In general, the functionality here helped quite a bit when the final time step size only ends up being a factor of 2 or 4 lower than previous time steps.
@jdannberg Do you remember what we decided regarding the default? I vaguely remember some kind of statistic to be printed at the end of the run as well? Sorry, I don't remember any of the details. |
I think we decided to continue on non-convergence, but print a warning (and possibly another warning at the end of the model run if at least one timestep didnt converge). I dont remember about the statistics, maybe it would be useful to report how many timesteps didnt reach the nonlinear solver tolerance? |
a79d2c7
to
25b9f59
Compare
@@ -95,3 +101,5 @@ Termination requested by criterion: end time | |||
|
|||
|
|||
|
|||
|
|||
WARNING: During this computation 2 nonlinear solver failures occurred! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example of the new output.
Ok, I changed the default to continue and I added printing of information. Any other comments? |
Not for this PR, but based on my testing so far I think it would be good to implement options in the future where:
Thanks again for this functionality @tjhei, after a bit of initial testing it also helping in models with two-phase flow and reactions that can lead to rapid changes in porosity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is almost ready, could you look into my last remaining comments? I think all of @bangerth's comments have been addressed, unless we hear something more I am willing to merge this.
} | ||
case Parameters<dim>::NonlinearSolverFailureStrategy::abort_program: | ||
{ | ||
// rethrow the current exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other two options have an additional line that says what is happening next (continue, cut), but this one doesnt, it just throws the error. Can you add another line here, saying that we are now aborting the model run? This would explain to the user we purposefully abort (they would see that from the error message as well, but this makes it more obvious).
source/simulator/core.cc
Outdated
@@ -2173,6 +2210,9 @@ namespace aspect | |||
// throwing an exception. Therefore, we have to do this manually here: | |||
computing_timer.print_summary (); | |||
|
|||
if (nonlinear_solver_failures>0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (nonlinear_solver_failures>0) | |
if (nonlinear_solver_failures > 0) |
{ | ||
prm.enter_subsection("Repeat on nonlinear solver failure"); | ||
|
||
prm.declare_entry("Cut back amount", "0.5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe and the same for the internal variable name:
prm.declare_entry("Cut back amount", "0.5", | |
prm.declare_entry("Cut back factor", "0.5", |
889f03d
to
6b86bee
Compare
This introduces a new setting to decide what should happen if the nonlinear solver fails to converge.
6b86bee
to
f50bd8b
Compare
updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks.
This introduces a new setting to decide what should happen if the
nonlinear solver fails to converge.
The old behavior was to "just continue", which is an unsafe choice.