parallel function mapping #1493
Conversation
__all__ = ['process_chunks']

def process_chunks(function, array, chunks=None, depth=None, mode=None):
Suggestions for a better name?
Agree that we could improve upon `process`.
This could also probably use a docstring saying that it splits apart an array and executes a normal function in parallel on overlapping chunks of the array.
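A docstring along these lines might work (the wording is only a sketch assembled from this thread's description of the behaviour):

```python
def process_chunks(function, array, chunks=None, depth=None, mode=None):
    """Map a function in parallel across overlapping chunks of an array.

    The array is split into chunks, each chunk is extended by ``depth``
    elements on every side (with boundary handling selected by ``mode``),
    ``function`` is applied to each extended chunk in parallel, and the
    results are trimmed and reassembled into an array of the original
    shape.
    """
```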
This PR came out of a conversation with @stefanv
@jni for purposes of discussion I believe it's possible to specify

I really like what I see here, pending a little API/naming discussion.
We need to know if this is the right API to include in scikit-image. It would be great to get feedback from the scikit-image community about what would be most useful. Would it be possible to have someone play-test this? (Also, to give credit, this is @cowlicks' work, not mine :))
As of 0.4.0
Since toolz is pure-python, I would favour just adding the dependency. I adore toolz. =) I can playtest as soon as it's stable. Thoughts about the API:
Thank you!
I was avoiding pass-through kwargs because I was worried about cluttering the interface. Also some cases like
I suggest
Dask will have to support numpy 1.6. It wraps
Seems doable.
__all__ = ['process_chunks']

def _get_chunks(shape, ncpu):
@jni here is my naive chunk size selection based on `multiprocessing.cpu_count()`.
@cowlicks doesn't this overchunk? (that's a fun word.) e.g. if I have 4 processors, it's going to give me back 16 chunks, right? I think
- the division should be by the ceiling of the ndim-th root of the number of cpus, and
- In cases of non-even division, the initial chunk size should be 1 + the division result, so that you get a decent-sized remainder.
Here's what a doctest should look like imho:
>>> _get_chunks((4, 4), 4)
(2, 2)
>>> _get_chunks((4, 4), 2)
(2, 2)
>>> _get_chunks((5, 5), 2)
(3, 3)
If we want to get more clever, this'd be a fun one:
>>> _get_chunks((2, 4), 2)
(2, 2)
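A quick sketch of the arithmetic behind those two bullets (illustrative only, not the PR's code):

```python
from math import ceil

def splits_per_axis(shape, ncpu):
    # First bullet: divide each axis into ceil(ncpu ** (1/ndim)) pieces,
    # so the total chunk count is about ncpu, not ncpu per axis.
    return ceil(ncpu ** (1.0 / len(shape)))

print(splits_per_axis((4, 4), 4))  # 2 -> 2 x 2 = 4 chunks for 4 CPUs
# Second bullet, for shape (5, 5) and 2 CPUs: 5 does not divide evenly
# into 2 pieces, so the initial chunk size becomes 1 + 5 // 2 = 3,
# giving the (3, 3) in the doctest above.
```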
@jni You are correct, I was thinking about this in one dimension. This currently gives you `ncpu * len(shape)` chunks. I'll fix this.
Also, `_get_chunks` should return a tuple of tuples: one tuple for each dimension, listing the size of each chunk in that dimension. So:
>>> _get_chunks((4, 4), 4)
((2, 2), (2, 2))
>>> _get_chunks((4, 4), 2)
((2, 2), (4,))
>>> _get_chunks((5, 5), 2)
((2, 3), (5,))
>>> _get_chunks((2, 4), 2)
((1, 1), (4,)) # ? or it might be better to have ((2,), (2, 2))
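Here is a sketch that reproduces those doctests (my reading of the proposal, not necessarily the PR's actual code; it assumes each dimension is at least as long as the per-axis split count):

```python
from math import ceil

def _get_chunks(shape, ncpu):
    # Split each dimension into roughly ceil(ncpu ** (1/ndim)) chunks,
    # folding any remainder into the last chunk so it never shrinks.
    # Once enough chunks exist to occupy all CPUs, later dimensions are
    # left whole.
    chunks = []
    nchunks_per_dim = int(ceil(ncpu ** (1.0 / len(shape))))
    used_chunks = 1
    for dim in shape:
        if used_chunks < ncpu:
            regular = dim // nchunks_per_dim
            last = regular + (dim % nchunks_per_dim)  # absorbs the remainder
            chunk_lens = (regular,) * (nchunks_per_dim - 1) + (last,)
        else:
            chunk_lens = (dim,)  # no further splitting needed
        chunks.append(chunk_lens)
        used_chunks *= nchunks_per_dim
    return tuple(chunks)
```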
We could also have the `get_chunks` function in skimage just return a single tuple of block sizes. This is probably more intuitive for users, and dask.array functions will convert to the more explicit tuple-of-tuples form automatically.
@jni About your second point: this would make the last remainder smaller. I want to avoid this because you could accidentally have your boundary depth be larger than the last chunk. If we just extend the last chunk by the remainder, we know the safe upper bound for boundary depth would be min(shape) // ceil(ncpu ** (1./len(shape))).
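Plugging numbers into that bound (just arithmetic, for illustration):

```python
from math import ceil

shape, ncpu = (5, 5), 2
max_depth = min(shape) // ceil(ncpu ** (1.0 / len(shape)))
print(max_depth)  # 2: with chunks ((2, 3), (5,)), a boundary depth above
                  # 2 would exceed the smallest chunk along axis 0
```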
@jni Try this. It could definitely be optimized, but I think that could get complicated quickly.
Thanks for this! I am realising that I didn't understand the chunk specification format... And I still don't! I was indeed thinking about block sizes, as @mrocklin suggests. (h5py, for example, specifies chunking in this way.) Can you explain the format a bit more? Maybe a few examples of how a chunk tuple gets unpacked to array indices? This could go in a Notes section for the `process_chunks` function docstring. I don't imagine I'll be the first to give something completely different to what's expected!
Storage formats like h5py have a bit of an easier time with chunk sizes: they can rely on them being uniform. When you start actually computing on your arrays, your chunk sizes can change, and these changes can be heterogeneous.
Chunks are defined here: http://dask.pydata.org/en/latest/array-design.html. Generally, "chunks" means the sizes of each block along each dimension. These can vary along each axis but must be Cartesian within the array.
@jni the docs @mrocklin linked to explain it better than I can.
Currently `_get_chunks` returns a sequence of chunk sizes (chunk and block are synonymous) along each dimension. This way we can specify what happens at the boundaries to make sure they are not smaller than the ghosting depth.
If `_get_chunks` just returned one chunk's dimensions, so that this chunk was tiled across the array (this is what you are suggesting, I think), the chunks on the boundaries could be truncated so that the ghosting depth might end up larger than a chunk, which causes issues.
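To make the format concrete, here is a small example of how a chunks tuple maps to array blocks (a sketch using dask's public API; the block boundaries follow from the definition linked above):

```python
import numpy as np
import dask.array as da

x = np.arange(25).reshape(5, 5)
# chunks=((2, 3), (5,)): axis 0 is split into blocks of 2 and 3 rows,
# axis 1 is kept as a single block of 5 columns.
d = da.from_array(x, chunks=((2, 3), (5,)))
# Block (0, 0) covers x[0:2, 0:5]; block (1, 0) covers x[2:5, 0:5].
print(d.chunks)      # ((2, 3), (5,))
print(d.numblocks)   # (2, 1)
```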
@cowlicks those docs are indeed great! Could you add a link in this docstring under "References"?
@cowlicks I would add the extra args and kwargs to the end of the argument list, calling them
>>> _get_chunks((5, 5), 2)
((2, 3), (5,))
>>> _get_chunks((2, 4), 2)
((1, 1), (4,))
Ideally we would split the larger dimensions more aggressively than the smaller. I agree with your sentiment though that we should probably wait before optimizing this much further; this is a potential rabbit hole.
In dask.array I'm honestly fine requiring users to think about chunk sizes. Automatic solutions in n-dimensions are likely to fail decently often and users should probably be made somewhat aware of what's going on. So far people also seem willing to think about chunk sizes; it doesn't seem to be that alien to people yet.
For reference, Distarray does something similar here.
Similar application, but it tries a lot harder to make a good guess.
@jni, extra keyword args in addition to the ones already in there? Or should I move these to the end of the signature?
@cowlicks I was thinking,
The last bit is my opinion only and I'd consider it optional, since there are examples of different nomenclature elsewhere. But I still would strongly recommend it, perhaps pending feedback from others in this discussion.
return tuple(chunks)

def process_chunks(function, array, chunks=None, depth=0,
Maybe "apply_chunks" is more intuitive, given that a function is the first parameter?
@stefanv I like it!
OK, I'm changing the module name from `process.py` to `apply.py` too, likewise for the tests.
Renamed them to `extra_arguments` and `extra_keywords`, as suggested by @jni.
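With those names, a call might look like the following (the `apply_chunks` name and the exact signature are assumptions pieced together from this thread, not confirmed API):

```python
import numpy as np
from scipy import ndimage as ndi

def smooth(image, sigma=1):
    return ndi.gaussian_filter(image, sigma)

image = np.random.random((1024, 1024))
# apply_chunks is the function from this PR (per @stefanv's suggestion);
# sigma is forwarded through the renamed extra_keywords argument.
result = apply_chunks(smooth, image, chunks=(256, 256), depth=8,
                      extra_keywords={'sigma': 4})
```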
It looks like @cowlicks has been doing some trial-and-error testing here against the scikit-image build/dependencies system with little luck. This now seems fairly scikit-image focused rather than dask focused; perhaps scikit-image folk could take over from here?
Yes, I pinged @blink1073 above. He knows the most about our current testing infrastructure.
What's this about optional dependencies? I thought we decided to make dask[array] and toolz actual dependencies? At any rate, optional dependencies should fail gracefully:

    import dask.array as da

becomes:

    try:
        import dask.array as da
    except ImportError:
        raise NotImplementedError("apply_parallel requires dask[array]")

or:

    try:
        import dask.array as da
    except ImportError:
        import warnings
        warnings.warn("dask array is not installed; apply_parallel will only use a single process")
        # [code to do this gracefully]

But, as I mentioned, I would actually prefer an actual dependency.
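As a minimal sketch of what the second option's graceful fallback could look like (the `_HAVE_DASK` flag, the `map_overlap` call, and the single-process behaviour are all my assumptions, not code from this PR):

```python
try:
    import dask.array as da
    _HAVE_DASK = True
except ImportError:
    import warnings
    warnings.warn("dask array is not installed; apply_parallel will "
                  "only use a single process")
    _HAVE_DASK = False


def apply_parallel(function, array, chunks, depth=0):
    if not _HAVE_DASK:
        # Degraded mode: apply the function to the whole array in one go.
        return function(array)
    darr = da.from_array(array, chunks=chunks)
    # map_overlap (in recent dask) handles the ghosting bookkeeping:
    # extend each chunk by `depth`, apply, then trim the overlap.
    return darr.map_overlap(function, depth=depth).compute()
```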
I also thought this would be an actual dependency specified in
I think that @cowlicks did too, but when he added it to requirements.txt there was an issue with .travis.yml. Apparently you weren't pulling from the standard PyPI repositories but only accepted wheels from some other location.
@@ -67,6 +67,8 @@ fi

retry pip install -q tifffile

pip install 'dask[array]>=0.5.0'
Try moving this to this line
@@ -52,6 +52,9 @@ functionality is only available with the following installed:

* `imread <http://pythonhosted.org/imread/>`__
  Optional io plugin providing most standard `formats <http://pythonhosted.org//imread/formats.html>`__.

* `dask array<dask.pydata.org/en/latest/>`__
This probably needs an http://
Hrm, the failure on Travis CI doesn't appear related to changes in this pull request.
This is the spiritual successor to PR #723.
It allows you to break an image into chunks and map a function, in parallel, over those chunks. The chunks' boundary conditions and boundary depth can be specified to match what the function needs.
This adds dask as a dependency, and is based on dask master.