Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final solution dump after exception can run into "file exists" error #167

Open
inducer opened this issue Dec 3, 2020 · 6 comments
Open

Comments

@inducer
Copy link
Contributor

inducer commented Dec 3, 2020

This should not happen IMO.

@inducer
Copy link
Contributor Author

inducer commented Dec 3, 2020

To reproduce, run vortex-mpi.py from b2c9a41 with this diff:

diff --git a/examples/vortex-mpi.py b/examples/vortex-mpi.py
index b61148f..3e265cc 100644
--- a/examples/vortex-mpi.py
+++ b/examples/vortex-mpi.py
@@ -61,8 +61,8 @@ def main(ctx_factory=cl.create_some_context):
                 allocator=cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue)))
 
     dim = 2
-    nel_1d = 16
-    order = 3
+    nel_1d = 160
+    order = 5
     exittol = .09
     t_final = 0.1
     current_cfl = 1.0

(Observe how it's not adjusting the time step, hence the solution disagrees.)

Status: step=20 t=0.02000000000000001
------- P(-1.71, 1.42)
------- T(-0.00817, 0.00622)
------- dt=0.001 cfl=1.0
------- errors=0.427, 4.97, 0.449, 0.411
Checkpointing final state ...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/andreask_work/src/env-3.8/lib/python3.8/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "/home/andreask_work/src/env-3.8/lib/python3.8/site-packages/mpi4py/run.py", line 196, in main
    run_command_line(args)
  File "/home/andreask_work/src/env-3.8/lib/python3.8/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/usr/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "vortex-mpi.py", line 177, in <module>
    main()
  File "vortex-mpi.py", line 167, in main
    my_checkpoint(current_step, t=current_t,
  File "vortex-mpi.py", line 148, in my_checkpoint
    return exact_sim_checkpoint(discr, initializer, visualizer, eos,
  File "/home/andreask_work/src/mirgecom/mirgecom/simutil.py", line 129, in exact_sim_checkpoint
    visualizer.write_parallel_vtk_file(
  File "/home/andreask_work/src/meshmode/meshmode/discretization/visualization.py", line 468, in write_parallel_vtk_file
    self.write_vtk_file(
  File "/home/andreask_work/src/meshmode/meshmode/discretization/visualization.py", line 626, in write_vtk_file
    raise FileExistsError("output file '%s' already exists"
FileExistsError: output file 'vortex-000020-0000.vtu' already exists

@MTCam
Copy link
Member

MTCam commented Dec 3, 2020

Correct, that's why we need this check:

    if current_t != checkpoint_t:
       if rank == 0:
          logger.info("Checkpointing final state ...")
       my_checkpoint(current_step, t=current_t,
                  dt=(current_t - checkpoint_t),
                  state=current_state)

This check was complained about and commented by someone. But it served a purpose, and that was to make sure we don't try to overwrite the soln file that was just written in the last step in case the time stepping stops right on an output step.

Edit:
We can't always count on the code stopping at an output step. When it doesn't, we want to make sure to write the last dump after stepping exits.

@inducer
Copy link
Contributor Author

inducer commented Dec 3, 2020

IMO we should find a solution that doesn't require repeating this logic in every driver. And without a comment, I (that someone you refer to IIRC) simply didn't get the logic, so it's not impossible that someone else might not either.

@MTCam
Copy link
Member

MTCam commented Dec 4, 2020

IMO we should find a solution that doesn't require repeating this logic in every driver. And without a comment, I (that someone you refer to IIRC) simply didn't get the logic, so it's not impossible that someone else might not either.

Agreed. FWIW, I didn't think it was you personally that commented it. I think it very likely was me responding to complaints about it - but then I never properly fixed it up. Ideally, there would have been a test surrounding this to expose and address the need. Can we unit test the drivers?

I couldn't agree more about not leaving this to the drivers to get right or wrong. I'm also sensitive to exporting too much to the back-end, however.

@majosm
Copy link
Collaborator

majosm commented Dec 4, 2020

I'm a little unclear on what the desired behavior actually should be in this case.

If the intent is to write a final dump after a mismatch has occurred, isn't this what should happen if the file already exists? The alternative would be to force it to be overwritten, which seems wrong if we're not allowing that behavior in other circumstances. I guess you could do a final dump to a different file on error (casename + "_error" or something) and allow that to be overwritten. Maybe not a great idea either since that would make it harder to load all the files into vis software.

@majosm
Copy link
Collaborator

majosm commented Dec 4, 2020

I get it now. 🙂 The issue is that it gives that error even if the file didn't exist before the run started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants