Consider the following code snippet (which can only run in an interactive IPython/Jupyter session, since it uses the %time magic):

>>> import ibis
... import numpy as np
...
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 15 s, sys: 1.81 s, total: 16.8 s
Wall time: 7.71 s
(0.49996058048305153, 0.5000561795455769)

I know that Dask is able to schedule a graph with several output expressions. Is this the case for the Ibis API? I have tried to play with caching. I suspect that joint execution of multi-output expressions might be challenging to implement for SQL backends, but maybe I missed something obvious.
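As a point of comparison, one workaround is to fold both aggregates into a single expression so that only one query runs. This is a minimal sketch that builds on the t defined in the snippet above; it is not true multi-output scheduling, just a way to avoid executing the expensive sampling/isin work twice:

# Compute both means in one query by grouping on the is_test flag;
# the result has one row per group (False = train, True = test).
means = (
    t.group_by("is_test")
     .aggregate(mean_a=t.a.mean())
     .execute()
)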
Replies: 1 comment
Thanks for opening this discussion!
Are you familiar with any papers or research that discuss methods for determining when to cache?
I think for now it's probably going to be up to the user to determine where to cache, as it's probably going to be application specific.
Choosing, for example, a strategy that always caches cheap-to-store-but-expensive-to-compute results (like a complex aggregation) wouldn't work in your scenario, since a sampled dataset might be much smaller than the full dataset, but still far too large to store.
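As a concrete illustration of a user-chosen cache point, the sketch below (assuming a backend such as DuckDB that supports Table.cache(), and continuing from the snippet in the question) materializes the intermediate table that carries the is_test flag, so both aggregations reuse it instead of recomputing the sampling:

# Cache the table carrying the row ids and the is_test flag once,
# then run both aggregations against the cached intermediate.
cached = t.cache()

train_mean = cached.filter(~cached.is_test).a.mean().execute()
test_mean = cached.filter(cached.is_test).a.mean().execute()

Whether this pays off is exactly the application-specific trade-off described above: the cached table here is roughly the size of the original data, so it only helps if that fits comfortably in the backend's storage.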