Need restart #139

Open

MTCam opened this issue Nov 5, 2020 · 3 comments

Comments

@MTCam
Member

MTCam commented Nov 5, 2020

A restart capability will be required in order to run long enough for meaningful flow simulations. We will need these capabilities (a minimal write/read sketch follows the list):

  • recording the conserved quantities, time, step number, and some user-defined dependent variables (e.g. temperature) for every point on the discretization for the purpose of restart
  • reading the data created in the previous step into simulation data structures
  • restarting the simulation with the data read from the previous step
  • verification that the same advanced state is reached at step N + M regardless of intermediate restarts
  • optionally restarting with a different partitioning (e.g. different number of MPI ranks)
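A minimal sketch of what the write/read pair might look like, assuming one pickle file per MPI rank and field data already converted to host (numpy) arrays; the function names and payload layout are illustrative, not an existing API:

```python
import pickle


def write_restart(filename, *, cv, t, step, temperature=None):
    """Dump this rank's restart payload: conserved quantities, time,
    step number, and (optionally) a dependent variable like temperature."""
    with open(filename, "wb") as f:
        pickle.dump({"cv": cv, "t": t, "step": step,
                     "temperature": temperature}, f)


def read_restart(filename):
    """Load a payload previously written by write_restart."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# One file per rank keeps the write embarrassingly parallel, but it ties the
# restart file set to the partitioning (cf. the last bullet above), e.g.:
#   write_restart(f"restart-{step:06d}-rank{rank:04d}.pkl",
#                 cv=cv, t=t, step=step, temperature=temperature)
```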

cc: @inducer @anderson2981

@inducer
Contributor

inducer commented Nov 5, 2020

Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

verification that the same advanced state is reached at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?
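To make the trade-off concrete, here is a rough sketch (assumed payload layout, not an existing interface) of what exact restart would require for a multi-step scheme versus a single-step RK scheme:

```python
def restart_payload(state, t, step, rhs_history=None):
    """What needs to be saved for exact restart.

    Single-step schemes (e.g. RK4) restart exactly from (state, t, step)
    alone.  A k-step Adams-Bashforth scheme additionally needs the last
    k-1 RHS evaluations (rhs_history); without them the stepper has to
    re-bootstrap, and the post-restart trajectory only matches to
    truncation-error level rather than bitwise.
    """
    payload = {"state": state, "t": t, "step": step}
    if rhs_history is not None:
        payload["rhs_history"] = rhs_history
    return payload
```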

@inducer
Contributor

inducer commented Nov 6, 2020

#140

@MTCam
Member Author

MTCam commented Nov 12, 2020

Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

Totally agree. It would be good to keep this on the radar, however. We need to handle changing resource availability for production runs, and even for lead-up science runs: consider the situation where a big resource is used to run several flow-throughs, and then a much smaller resource is used to run many "shots" or ignition instances.

recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

That's a great question. We can discuss it with JBF, Anderson, and Esteban - perhaps we can do this better (or we already have) - but here is the issue stated in a meta sort of way:

Currently the temperature (T) is calculated as a function of the state and the previous temperature (i.e. T = temperature(state, Tguess)).

For Cantera, the user cannot specify Tguess! Cantera just uses the internal state it retained from its last call. Because we use a single instance of Cantera to calculate many points, the answers we get from Cantera depend on the partitioning (i.e. partitioning affects the point ordering, and each call to Cantera starts its iterations from Tguess = Tlastpoint).

Prometheus does one better by providing an API to specify Tguess. For us, Tguess = the temperature the given point had at the previous step. Because we store T (i.e. at runtime and at I/O time), we have T available to use for Tguess, but if we don't store it, then our Tguess is lost.

  • We could just give up and set Tguess = 300 (or whatever is appropriate for the user-chosen units) and be done; this solution has some pretty hefty performance implications.
  • Or we can accept that we get a slightly different answer when we restart (this verges on unacceptable).
  • Or we can define a function that computes a deterministic value for Tguess (i.e. Tguess = approximate_temp(state)) and use that as Tguess [my preferred solution]; see the sketch after this list.
  • Or we can write out temperature as a restart quantity and restart it just like the state [current practice].
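For the approximate_temp option, a minimal sketch assuming an ideal-gas-like mixture and a constant reference specific heat; the conserved-state field names (mass, energy, momentum) and the constant cv_ref are assumptions for illustration only:

```python
import numpy as np


def approximate_temp(state, cv_ref=718.0):
    """Deterministic temperature guess from the conserved state only.

    Assumes internal energy measured from absolute zero with a constant
    specific heat, so Tguess = e_internal / cv_ref.  The guess only needs
    to be close enough for temperature(state, Tguess) to converge, and it
    is identical regardless of partitioning or intermediate restarts.
    """
    rho = state.mass
    kinetic = 0.5 * np.dot(state.momentum, state.momentum) / rho**2
    e_internal = state.energy / rho - kinetic
    return e_internal / cv_ref
```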

verification that the same advanced state is reached at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?

Experience tells me that deterministic restart is quite important, but I can also imagine some cases in which that would not be a show-stopper. We should bring this up with the physics guys.
