Need restart #139

Open

MTCam opened this issue Nov 5, 2020 · 3 comments

Comments

@MTCam
Member

MTCam commented Nov 5, 2020

A restart capability will be required in order to run long enough for meaningful flow simulations. We will need these capabilities (a minimal write/read sketch follows the list):

  • recording the conserved quantities, time, step number, and some user-defined dependent variables (e.g. temperature) for every point on the discretization for the purpose of restart
  • reading the data created in the previous step into simulation data structures
  • restarting the simulation with the data read from the previous step
  • verification that the same advanced state is reached at step N + M regardless of intermediate restarts
  • optionally restarting with a different partitioning (e.g. different number of MPI ranks)
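A minimal sketch of what the write/read pair might look like, assuming one pickle file per MPI rank and field data already converted to host (numpy) arrays; the function names and payload layout are illustrative, not an existing API:

```python
import pickle


def write_restart(filename, *, cv, t, step, temperature=None):
    """Dump this rank's restart payload: conserved quantities, time,
    step number, and (optionally) a dependent variable like temperature."""
    with open(filename, "wb") as f:
        pickle.dump({"cv": cv, "t": t, "step": step,
                     "temperature": temperature}, f)


def read_restart(filename):
    """Load a payload previously written by write_restart."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# One file per rank keeps the write embarrassingly parallel, but it ties the
# restart file set to the partitioning (cf. the last bullet above), e.g.:
#   write_restart(f"restart-{step:06d}-rank{rank:04d}.pkl",
#                 cv=cv, t=t, step=step, temperature=temperature)
```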

cc: @inducer @anderson2981

@inducer
Contributor

inducer commented Nov 5, 2020

Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

verification that the same advanced state is reached at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?
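To make the trade-off concrete, here is a rough sketch (assumed payload layout, not an existing interface) of what exact restart would require for a multi-step scheme versus a single-step RK scheme:

```python
def restart_payload(state, t, step, rhs_history=None):
    """What needs to be saved for exact restart.

    Single-step schemes (e.g. RK4) restart exactly from (state, t, step)
    alone.  A k-step Adams-Bashforth scheme additionally needs the last
    k-1 RHS evaluations (rhs_history); without them the stepper has to
    re-bootstrap, and the post-restart trajectory only matches to
    truncation-error level rather than bitwise.
    """
    payload = {"state": state, "t": t, "step": step}
    if rhs_history is not None:
        payload["rhs_history"] = rhs_history
    return payload
```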

@inducer
Contributor

inducer commented Nov 6, 2020

#140

@MTCam
Member Author

MTCam commented Nov 12, 2020

Optionally restarting with a different partitioning (e.g. different number of MPI ranks)

I'd vote that this is out of scope for round one. It would take a while to implement, especially if we're not willing to centralize the mesh and DOF data on a single rank.

Totally agree. It would be good to keep this on the radar, however. We need to handle changing resource availability for production runs, and even for lead-up science runs: consider the situation where a big resource is used to run several flow-throughs, and then a much smaller resource is used to run many "shots" or ignition instances.

recording [...] some user-defined dependent variables (e.g. temperature)

Why do we need to save these?

That's a great question. We can discuss it with JBF, Anderson, and Esteban - perhaps we can do this better (or we already have) - but here is the issue stated in a meta sort of way:

Currently the temperature (T) is calculated as a function of the state and the previous temperature (i.e. T = temperature(state, Tguess)).

For Cantera, the user cannot specify Tguess! Cantera just uses the internal state it retained from its last call. Because we use a single instance of Cantera to calculate many points, the answers we get from Cantera depend on the partitioning (i.e. partitioning affects the point ordering, and each call to Cantera starts its iterations from Tguess = Tlastpoint).

Prometheus does one better by providing an API to specify Tguess. For us, Tguess = the temperature the given point had at the previous step. Because we store T (i.e. at runtime and at I/O time), we have T available to use for Tguess, but if we don't store it, then our Tguess is lost.

  • We could just give up and set Tguess = 300 (or whatever is appropriate for the user-chosen units) and be done; this solution has some pretty hefty performance implications.
  • Or we can accept that we get a slightly different answer when we restart (this verges on unacceptable).
  • Or we can define a function that computes a deterministic value for Tguess (i.e. Tguess = approximate_temp(state)) and use that as Tguess [my preferred solution]; see the sketch after this list.
  • Or we can write out temperature as a restart quantity and restart it just like the state [current practice].
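For the approximate_temp option, a minimal sketch assuming an ideal-gas-like mixture and a constant reference specific heat; the conserved-state field names (mass, energy, momentum) and the constant cv_ref are assumptions for illustration only:

```python
import numpy as np


def approximate_temp(state, cv_ref=718.0):
    """Deterministic temperature guess from the conserved state only.

    Assumes internal energy measured from absolute zero with a constant
    specific heat, so Tguess = e_internal / cv_ref.  The guess only needs
    to be close enough for temperature(state, Tguess) to converge, and it
    is identical regardless of partitioning or intermediate restarts.
    """
    rho = state.mass
    kinetic = 0.5 * np.dot(state.momentum, state.momentum) / rho**2
    e_internal = state.energy / rho - kinetic
    return e_internal / cv_ref
```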

verification that the same advanced state is reached at step N + M regardless of intermediate restarts

For multi-step time integration (i.e. not us, yet), this would entail also saving a good chunk of time stepper state (vs. just re-bootstrapping the time stepper). How important is this "exact restart"?

Experience tells me that deterministic restart is quite important, but I can also imagine some cases in which that would not be a show-stopper. We should bring this up with the physics guys.
