
MPI vs pytest #69

Open
inducer opened this issue Aug 13, 2020 · 7 comments

inducer commented Aug 13, 2020

  • We'll need to run (pytest) tests under MPI, to test the distributed-memory functionality.
  • These need to run during CI.
  • We will also want to be able to run this on our target platforms, i.e. the big DOE machines.
  • We need to decide between "pytest inside MPI" (i.e. mpiexec python -m pytest) and "MPI inside pytest" (as meshmode currently does).

If we choose "pytest inside MPI", then pytest-mpi might come in handy.

cc @lukeolson @MTCam
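
For concreteness, here is a minimal sketch of the "MPI inside pytest" pattern, where the pytest test re-launches the containing file under mpiexec. This is only illustrative (not meshmode's actual code); the RUN_WITHIN_MPI environment variable is a made-up marker for the child ranks, and mpi4py is assumed to be available.

    # Minimal sketch of "MPI inside pytest": the pytest process spawns mpiexec,
    # and the spawned ranks re-run this file to execute the distributed check.
    import os
    import subprocess
    import sys

    def _distributed_check():
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        assert comm.size == 2
        print(f"rank {comm.rank} of {comm.size} is alive")

    def test_runs_under_mpi():
        # Parent side: launch two ranks; RUN_WITHIN_MPI (hypothetical) marks the children.
        env = dict(os.environ, RUN_WITHIN_MPI="1")
        subprocess.check_call(
            ["mpiexec", "-n", "2", sys.executable, __file__], env=env)

    if __name__ == "__main__":
        if os.environ.get("RUN_WITHIN_MPI"):
            _distributed_check()

Under "pytest inside MPI", by contrast, the whole pytest session is launched as mpiexec -n 2 python -m pytest, which is where pytest-mpi would come in.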

majosm commented Aug 24, 2020

Update on this: I tried out pytest-mpi as an alternative to the current mpiexec-inside-test approach, and I'm not super happy with it. It seems to produce separate pytest output for each rank (similar to what you would see if you ran multiple separate pytest instances simultaneously), and in some cases it comes out a bit garbled. It also doesn't provide much in the way of options for specifying how many ranks/nodes/etc. to use.

So instead I took a stab at generalizing the current approach so that we can customize the behavior for different platforms. The proof-of-concept code can be seen here and here. Essentially what it does now is check for an environment variable set by the user (MPI_EXECUTOR_TYPE) that specifies which MPI execution method to use ('basic' for mpiexec, 'slurm' for srun, etc.) and then set up the launching command accordingly. (If the environment variable is not set, the tests are skipped.) I also moved the test function call out of main and into little on-the-fly scripts that get passed to python via the -c flag so that multiple MPI tests can be placed in the same file. Seems a little easier to understand what's happening that way too.

I'm not entirely satisfied with this yet (I suspect there's a way to further simplify the test_script stuff; looking for suggestions), but it seems like this could work. I would of course eventually move the executor definitions somewhere else so that they could be used by other packages (is there a good place to put these?).

Edit: I set up the slurm and LC-LSF executors to be used from inside an interactive job submitted by the user with salloc/lalloc.
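
To make the above concrete, here is a rough sketch of the dispatch idea; the MPI_EXECUTOR_TYPE name and the 'basic'/'slurm' values come from the comment above, but the helper functions themselves are hypothetical, not the actual proof-of-concept code.

    # Rough sketch: pick a launcher based on MPI_EXECUTOR_TYPE and run a test
    # function through a small on-the-fly script passed to python -c.
    import os
    import subprocess
    import sys

    import pytest

    def _mpi_launch_command(num_ranks):
        executor = os.environ.get("MPI_EXECUTOR_TYPE")
        if executor == "basic":
            return ["mpiexec", "-n", str(num_ranks)]
        elif executor == "slurm":
            # Assumes an interactive allocation obtained beforehand with salloc.
            return ["srun", "-n", str(num_ranks)]
        else:
            pytest.skip("MPI_EXECUTOR_TYPE not set")

    def run_test_with_mpi(num_ranks, module, func):
        # A tiny generated script lets several MPI tests live in the same file.
        script = f"from {module} import {func}; {func}()"
        subprocess.check_call(
            _mpi_launch_command(num_ranks) + [sys.executable, "-c", script])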

MTCam commented Aug 25, 2020

@majosm Here is what I'm doing currently in TEESD to handle batching, platform-dependent spawn commands, etc.

I would be pretty excited about improving that if you find a better way!

majosm commented Aug 25, 2020

Parsl actually has some infrastructure for this too (Execution Providers and Launchers). I wonder if we could nudge them into splitting it off into a standalone package at some point.
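
For reference, a hedged sketch of what leaning on Parsl's provider/launcher machinery might look like; the partition name and node counts are placeholders, and this is not code from any of the repositories discussed here.

    # Sketch of a Parsl config pairing a Slurm execution provider with an srun launcher.
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.launchers import SrunLauncher
    from parsl.providers import SlurmProvider

    config = Config(executors=[
        HighThroughputExecutor(
            provider=SlurmProvider(
                partition="pbatch",      # placeholder partition name
                nodes_per_block=2,
                launcher=SrunLauncher(),
            ),
        ),
    ])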

majosm commented Aug 31, 2020

@inducer Re: Different behavior for subprocess.call vs. os.system on lassen: I set up an example (source here). The script inside that gets executed via MPI prints its rank and also creates some empty files (to check whether it's just a stdout capturing issue or not). I tried two sets of tests; in the first I just print out the script source (command in print_script) in order to try to rule out any formatting issues (since the formatting is a little bit nasty at the moment), and in the second I actually run the script (command in run_script).

Results:

MPI + print_script:

Version 1: Works
Version 2: Works
Version 3: Doesn't work (prints lrun help message; error code 1)

MPI + run_script:

Version 1: Works
Version 2: Doesn't work (no stdout, doesn't create files, no error)
Version 3: Doesn't work (prints lrun help message; error code 1)

Seems like version 2 is having trouble running python from the subprocess for some reason. Not sure what's going on with version 3. Any ideas?

Edit: Version 3 works for both cases if I do " ".join(command). I guess I'm supposed to pass a single string if using shell=True?
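
That matches the subprocess documentation: with shell=True the command should be a single string; if a sequence is passed on POSIX, only the first element is used as the command and the remaining elements become arguments to the shell itself. A minimal illustration, using mpiexec and hostname as stand-ins for the lrun command above:

    import subprocess

    command = ["mpiexec", "-n", "2", "hostname"]

    # Without shell=True, pass the argument list directly.
    subprocess.call(command)

    # With shell=True, pass a single command string instead of a list.
    subprocess.call(" ".join(command), shell=True)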

majosm commented Sep 1, 2020

Also, where would be a good place to stash these executor definitions?

majosm commented Sep 1, 2020

Looks like mpi4py uses unittest, not pytest, and they launch MPI from outside the test scripts. I don't see any launcher-handling code that we could borrow (just some CI configuration scripts for a few different platforms).

I'll see if I can find any other Python codebases that use MPI.

majosm commented Sep 1, 2020

Dang. Well, the pickle version was looking pretty nice until I set up a test that used an array context and ran into this:

        pickled_test = pickle.dumps(test).hex()
>       pickled_args = pickle.dumps(args).hex()
E       AttributeError: Can't pickle local object 'pytest_generate_tests_for_pyopencl_array_context.<locals>.ArrayContextFactory'

test_partition.py:131: AttributeError

Unless someone happens to know a workaround, I think that spells doom for this approach...
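
For context: the stock pickle module serializes classes by reference (an importable module-level name), so anything defined inside a function body, like the ArrayContextFactory above, cannot go through it. Below is a minimal illustration with a made-up local class, plus one possible workaround (untested here) via cloudpickle, which serializes locally defined classes by value.

    import pickle

    import cloudpickle  # third-party; not currently a dependency here

    def make_factory():
        # Stands in for a factory class defined inside another function.
        class LocalFactory:
            pass
        return LocalFactory()

    obj = make_factory()

    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError) as e:
        print("stdlib pickle fails:", e)

    # cloudpickle embeds the class definition in the payload instead of a reference.
    data = cloudpickle.dumps(obj)
    print(type(cloudpickle.loads(data)))

Alternatively, defining the factory at module scope, or parametrizing on something picklable (say, a string key that is resolved inside the spawned process), would keep the plain-pickle approach workable.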
