
Read-in Performance #35

Open
regDaniel opened this issue Nov 17, 2022 · 2 comments


regDaniel commented Nov 17, 2022

This issue is mostly documentation for ourselves. @clairemerker and I are trying to optimize the read-in.

Some timings:

  • cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore'}, encode_cf=("time", "geography", "vertical")) ~ 270 s
  • cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer'}}, encode_cf=("time", "geography", "vertical")) ~ 40 s
  • da = xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer'}}, encode_cf=("time", "geography", "vertical")) ~ 4-5 s (lazy loading; a subsequent da.load() takes ~80 s)
  • da = xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', 'filter_by_keys': {'typeOfLevel': 'generalVerticalLayer', 'shortName': 'T'}}, encode_cf=("time", "geography", "vertical")) ~ 8 s
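The fastest pattern above (lazy open with filter_by_keys, then loading only what is needed) can be sketched as follows. This is a sketch, not the project's code: the helper name `icon_backend_kwargs` is made up here, and the file path and variable name in the commented usage are placeholders.

```python
# Sketch of the lazy-open pattern timed above. The helper only builds the
# backend_kwargs dict; the actual cfgrib call is shown in the comments.

def icon_backend_kwargs(type_of_level, short_name=None):
    """Build backend_kwargs for xr.open_dataset(..., engine="cfgrib")."""
    filters = {"typeOfLevel": type_of_level}
    if short_name is not None:
        filters["shortName"] = short_name
    return {
        "indexpath": "",   # do not write .idx files next to the data
        "errors": "ignore",
        "filter_by_keys": filters,
    }

# Usage (placeholder path and variable; requires xarray and cfgrib):
#   import xarray as xr
#   ds = xr.open_dataset(
#       "forecast.grb",
#       engine="cfgrib",
#       backend_kwargs=icon_backend_kwargs("generalVerticalLayer", "T"),
#       encode_cf=("time", "geography", "vertical"),
#   )
#   ds["t"].load()  # materialize only the variable you need
```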

More timings with Dask (opening 10 ICON forecast files and extracting one variable):

  • single core: ~80-90 s
  • chunking vertical layers (Chunksize=1, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 30 Workers): 45 s
  • Interestingly, a list comprehension with xr.open_dataset followed by xr.concat is ~5-10% faster than xr.open_mfdataset.
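The "open each file individually, then concatenate" pattern from the last bullet can be sketched like this. Synthetic in-memory datasets stand in for the GRIB files so the example is self-contained; `open_and_concat` and `fake_open` are illustrative names, not part of iconarray.

```python
# Sketch of the list-comprehension + xr.concat pattern that was ~5-10%
# faster than xr.open_mfdataset in the timings above.
import numpy as np
import xarray as xr

def open_and_concat(sources, opener, dim="time"):
    """Open every source individually, then concatenate along `dim`."""
    datasets = [opener(src) for src in sources]  # list comprehension, as timed
    return xr.concat(datasets, dim=dim)

def fake_open(seed):
    """Stand-in for xr.open_dataset(path, engine="cfgrib", ...)."""
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {"T": (("time", "level"), rng.random((1, 4)))},
        coords={"time": [seed], "level": range(4)},
    )

ds = open_and_concat(range(10), fake_open)
# With real data:
#   open_and_concat(filelist, lambda p: xr.open_dataset(p, engine="cfgrib", ...))
```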

When first merging the files with cat:

  • chunking vertical layers (Chunksize=1, 20 Workers): 40 s
  • chunking vertical layers (Chunksize=2, 20 Workers): 40 s
  • --> the cfgrib overhead doesn't really vanish here.
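The cat-merge step above relies on GRIB files being plain concatenable byte streams. A minimal sketch, with made-up file names standing in for real ICON output:

```shell
# Sketch: merge several GRIB files into one with cat, then open the merged
# file once with cfgrib. The tiny files below are stand-ins for real data.
set -eu
workdir=$(mktemp -d)
printf 'GRIB-A' > "$workdir/lfff00010000"   # placeholder "GRIB" files
printf 'GRIB-B' > "$workdir/lfff00020000"
cat "$workdir"/lfff0* > "$workdir/merged.grb"
wc -c < "$workdir/merged.grb"               # merged size = sum of the parts
rm -r "$workdir"
```

As the timings show, merging files this way does not remove the per-message cfgrib decoding overhead; it only saves the per-file open cost.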

All timings were tested on Tsa reading from /store, reading from /scratch reduces read-in times by approximately 10%.

regDaniel (Author) commented

I think we gained some more experience with this during the development of icon_timeseries. Can we close this one, @clairemerker, or do you think it is still relevant for iconarray? If yes, I should probably update the timings.

clairemerker (Collaborator) commented

In a sense the issue is still relevant: @victoria-cherkas and I will write a new version of open_dataset() for iconarray based on what we learned in icon-timeseries. No need to update the timings in my opinion, but let's keep the issue open; we can close it after the new implementation.
