Consider the following code snippet (which can only run in an interactive IPython/Jupyter session, since it uses the %time magic):

>>> import ibis
... import numpy as np
...
... test_frac = 0.25
... table = ibis.memtable({"a": np.random.default_rng(0).uniform(size=int(1e8))})
... t = table.mutate(_id=ibis.row_number())
... test = t.sample(fraction=test_frac, seed=0)
... t = t.mutate(is_test=t._id.isin(test._id))
... train = t.filter(~t.is_test)
... %time train.a.mean().execute(), test.a.mean().execute()
100% ▕████████████████████████████████████████████████████████████▏
CPU times: user 15 s, sys: 1.81 s, total: 16.8 s
Wall time: 7.71 s
(0.49996058048305153, 0.5000561795455769)

I know that Dask is able to schedule a graph with several output expressions. Is this the case for the Ibis API? I have tried to play with caching. I suspect that joint execution of multi-output expressions might be challenging to implement for SQL backends, but maybe I missed something obvious.
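As a point of comparison, one workaround is to fold both aggregates into a single expression so that only one query runs. This is a minimal sketch that builds on the t defined in the snippet above; it is not true multi-output scheduling, just a way to avoid executing the expensive sampling/isin work twice:

# Compute both means in one query by grouping on the is_test flag;
# the result has one row per group (False = train, True = test).
means = (
    t.group_by("is_test")
     .aggregate(mean_a=t.a.mean())
     .execute()
)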
Replies: 1 comment
Thanks for opening this discussion!
Are you familiar with any papers or research that discuss methods for determining when to cache?
I think for now it's probably going to be up to the user to determine where to cache, as it's probably going to be application specific.
Choosing, for example, a strategy that always caches cheap-to-store-but-expensive-to-compute results (like a complex aggregation) wouldn't work in your scenario, since a sampled dataset might be much smaller than the full dataset, but still far too large to store.
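As a concrete illustration of a user-chosen cache point, the sketch below (assuming a backend such as DuckDB that supports Table.cache(), and continuing from the snippet in the question) materializes the intermediate table that carries the is_test flag, so both aggregations reuse it instead of recomputing the sampling:

# Cache the table carrying the row ids and the is_test flag once,
# then run both aggregations against the cached intermediate.
cached = t.cache()

train_mean = cached.filter(~cached.is_test).a.mean().execute()
test_mean = cached.filter(cached.is_test).a.mean().execute()

Whether this pays off is exactly the application-specific trade-off described above: the cached table here is roughly the size of the original data, so it only helps if that fits comfortably in the backend's storage.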