
Read-in Performance #35

Open
regDaniel opened this issue Nov 17, 2022 · 2 comments


regDaniel commented Nov 17, 2022

This issue is mostly documentation for ourselves. @clairemerker and I are trying to optimize the read-in.

Some timings:

  • cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore'}, encode_cf=("time", "geography", "vertical")) ~ 270 s
  • cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer'}}, encode_cf=("time", "geography", "vertical")) ~ 40 s
  • da = xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer'}}, encode_cf=("time", "geography", "vertical")) ~ 4-5 s (lazy loading; a subsequent da.load() takes ~80 s)
  • da = xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer', 'shortName': 'T'}}, encode_cf=("time", "geography", "vertical")) ~ 8 s
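The fastest pattern above (lazy open with filter_by_keys, then loading only what is needed) can be sketched as follows. This is a sketch, not the project's code: the helper name `icon_backend_kwargs` is made up here, and the file path and variable name in the commented usage are placeholders.

```python
# Sketch of the lazy-open pattern timed above. The helper only builds the
# backend_kwargs dict; the actual cfgrib call is shown in the comments.

def icon_backend_kwargs(type_of_level, short_name=None):
    """Build backend_kwargs for xr.open_dataset(..., engine="cfgrib")."""
    filters = {"typeOfLevel": type_of_level}
    if short_name is not None:
        filters["shortName"] = short_name
    return {
        "indexpath": "",   # do not write .idx files next to the data
        "errors": "ignore",
        "filter_by_keys": filters,
    }

# Usage (placeholder path and variable; requires xarray and cfgrib):
#   import xarray as xr
#   ds = xr.open_dataset(
#       "forecast.grb",
#       engine="cfgrib",
#       backend_kwargs=icon_backend_kwargs("generalVerticalLayer", "T"),
#       encode_cf=("time", "geography", "vertical"),
#   )
#   ds["t"].load()  # materialize only the variable you need
```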

More timings with Dask (opening 10 ICON forecast files and extracting one variable):

  • single core: ~80-90 s
  • chunking vertical layers (Chunksize=1, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 30 Workers): 45 s
  • Interestingly, a list comprehension with xr.open_dataset followed by xr.concat is ~5-10% faster than xr.open_mfdataset.
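The "open each file individually, then concatenate" pattern from the last bullet can be sketched like this. Synthetic in-memory datasets stand in for the GRIB files so the example is self-contained; `open_and_concat` and `fake_open` are illustrative names, not part of iconarray.

```python
# Sketch of the list-comprehension + xr.concat pattern that was ~5-10%
# faster than xr.open_mfdataset in the timings above.
import numpy as np
import xarray as xr

def open_and_concat(sources, opener, dim="time"):
    """Open every source individually, then concatenate along `dim`."""
    datasets = [opener(src) for src in sources]  # list comprehension, as timed
    return xr.concat(datasets, dim=dim)

def fake_open(seed):
    """Stand-in for xr.open_dataset(path, engine="cfgrib", ...)."""
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {"T": (("time", "level"), rng.random((1, 4)))},
        coords={"time": [seed], "level": range(4)},
    )

ds = open_and_concat(range(10), fake_open)
# With real data:
#   open_and_concat(filelist, lambda p: xr.open_dataset(p, engine="cfgrib", ...))
```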

When first merging the files with cat:

  • chunking vertical layers (Chunksize=1, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 20 Workers): 40 s
  • --> the cfgrib overhead doesn't really vanish here.
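The cat-merge step above relies on GRIB files being plain concatenable byte streams. A minimal sketch, with made-up file names standing in for real ICON output:

```shell
# Sketch: merge several GRIB files into one with cat, then open the merged
# file once with cfgrib. The tiny files below are stand-ins for real data.
set -eu
workdir=$(mktemp -d)
printf 'GRIB-A' > "$workdir/lfff00010000"   # placeholder "GRIB" files
printf 'GRIB-B' > "$workdir/lfff00020000"
cat "$workdir"/lfff0* > "$workdir/merged.grb"
wc -c < "$workdir/merged.grb"               # merged size = sum of the parts
rm -r "$workdir"
```

As the timings show, merging files this way does not remove the per-message cfgrib decoding overhead; it only saves the per-file open cost.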

All timings were tested on Tsa reading from /store, reading from /scratch reduces read-in times by approximately 10%.

regDaniel (Author) commented

I think we gained some more experience with this during the development of icon_timeseries. Can we close this one, @clairemerker, or do you think it is still relevant for iconarray? If yes, I should probably update the timings.

clairemerker (Collaborator) commented

In a sense the issue is still relevant: @victoria-cherkas and I will write a new version of open_dataset() for iconarray based on what we learned in icon-timeseries. No need to update the timings in my opinion, but let's keep the issue open; we can close it after the new implementation.
