Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataTree.to_zarr with append_dim fails on empty groups #9858

Open
5 tasks done
aladinor opened this issue Dec 5, 2024 · 7 comments
Open
5 tasks done

DataTree.to_zarr with append_dim fails on empty groups #9858

aladinor opened this issue Dec 5, 2024 · 7 comments
Labels
bug topic-combine combine/concat/merge topic-DataTree Related to the implementation of a DataTree class topic-zarr Related to zarr storage library

Comments

@aladinor
Copy link
Contributor

aladinor commented Dec 5, 2024

What happened?

I found out that when dealing with nested DataTree and storing/saving in zarr (dt.to_zarr()), it creates an empty Dataset at the root level. I was trying to append an isomorphic datatree to an existing datatree stored in Zarr. It fails when passing the mode and dim parameters since the Dataset at the root level has no dimension or coordinates.

What did you expect to happen?

I wondered why we need an empty Dataset at the root level without data or coordinates. However, I am not sure if this is the best way to merge/concat/append two DataTrees.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import pandas as pd


ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

# Split dataset in two
ds_first_half = ds.isel(time=range(0, int(len(ds.time)/2)))
ds_second_half = ds.isel(time=range(int(len(ds.time)/2), int(len(ds.time))))

# creating a pressure datarray
pressure_data = 1000.0 + 5 * np.random.randn(100, 3, 4)
lons = np.linspace(-120, -90, 4)
lats = np.linspace(25, 55, 3)
times = pd.date_range("2018-01-01", periods=100)

pressure = xr.DataArray(
    pressure_data,
    coords=[times, lats, lons],
    dims=["time", "lat", "lon"]
).to_dataset(name="pressure")

# splitting pressure in two
press_fh = pressure.isel(time=range(0, int(len(pressure.time)/2)))
press_sh = pressure.isel(time=range(int(len(pressure.time)/2), int(len(pressure.time))))

# first half dtree
dt_fh = xr.DataTree.from_dict(
    {
        "/temp": ds_first_half,
        "/pressure": press_fh,
    }
)

store = "dtree.zarr"
dt_fh.to_zarr(
    store,
    consolidated=True,
)

# second half dtree
dt_sh = xr.DataTree.from_dict(
    {
        "/temp": ds_second_half,
        "/pressure": press_sh,
    }
)

dt_sh.to_zarr(
    store,
    mode="a-",
    consolidated=True,
    append_dim="time",
)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

this is the error triggered by the empty `Dataset` at the root level


Traceback (most recent call last):
  File "/snap/pycharm-community/425/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/snap/pycharm-community/425/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 65, in <module>
    dt_fh.to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree.py", line 1699, in to_zarr
    _datatree_to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree_io.py", line 123, in _datatree_to_zarr
    ds.to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 2184, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 1920, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/zarr.py", line 889, in store
    raise ValueError(
ValueError: append_dim='time' does not match any existing dataset dimensions {}

Environment

python: 3.12 \n xarray version: '2024.10.1.dev59+g700191b9'
@aladinor aladinor added bug needs triage Issue that has not been reviewed by xarray team member labels Dec 5, 2024
@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 5, 2024
@TomNicholas
Copy link
Member

Thanks for reporting - this seems like a bug with DataTree.to_zarr(append_dim=...) (/ it was just never actually implemented / tested...)

I am not sure if this is the best way to merge/concat/append two DataTrees.

I would expect the solution to have to consider #9106 and #9778.

I wondered why we need an empty Dataset at the root level without data or coordinates.

I'm not sure exactly what you mean. A DataTree node without any data or coordinate variables would become an empty Dataset if you called .to_dataset, but that's not avoidable.

@TomNicholas
Copy link
Member

As for the actual error, I think first we should figure out what the expected behaviour would be for

ds1 = xr.Dataset()
ds2 = xr.Dataset()

ds1.to_zarr(store)
ds2.to_zarr(store, append_dim='time')

because a datatree with a single empty group should behave the same as that.

Concatenating two empty datasets along a new dimension does not raise:

In [6]: ds1 = xr.Dataset()

In [7]: ds2 = xr.Dataset()

In [8]: xr.concat([ds1, ds2], dim='time')
Out[8]: 
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*

@TomNicholas TomNicholas added topic-zarr Related to zarr storage library topic-combine combine/concat/merge labels Dec 5, 2024
@TomNicholas TomNicholas changed the title Empty Dataset at the root level when saving DataTree in zarr DataTree.to_zarr with append_dim fails on empty groups Dec 5, 2024
@aladinor
Copy link
Contributor Author

aladinor commented Dec 6, 2024

Following up on this,

As for the actual error, I think first we should figure out what the expected behaviour would be for

ds1 = xr.Dataset()
ds2 = xr.Dataset()

ds1.to_zarr(store)
ds2.to_zarr(store, append_dim='time')

I got same error as in DataTree

Traceback (most recent call last):
  File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 84, in <module>
    tom_save()
  File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 11, in tom_save
    ds2.to_zarr(store, append_dim='time')
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 2184, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 1920, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/zarr.py", line 889, in store
    raise ValueError(
ValueError: append_dim='time' does not match any existing dataset dimensions {}

@TomNicholas
Copy link
Member

I got same error as in DataTree

That's interesting. I would have thought that if "append" is supposed to be synonymous with "concat" here then the behaviour of .to_zarr(append_dim=...) should be made more consistent with xr.concat(dim=...).

tagging @jhamman as zarr-in-xarray representative (and FYI @abarciauskas-bgse)

@aladinor
Copy link
Contributor Author

aladinor commented Dec 7, 2024

Another thing I noticed @TomNicholas, is that let's suppose we can avoid the issue with the empty Dataset at the root level, which can be achieved by adding

    for node in dt.subtree:
        at_root = node is dt
        if node.is_empty & node.is_root:  # New lines
            continue                                        # to avoid empty datasets

here,

https://github.com/pydata/xarray/blob/eac5105470ec7fec767e5897fefdec9320689184/xarray/core/datatree_io.py#L117C4-L120C69

Even if we avoid the empty Dataset issue at the root level, we will still encounter another challenge related to inheritance as at the root only inherited coordinates are retained. Thus, the append_dim wont be present at this level.

ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

# Split dataset in two
ds_first_half = ds.isel(time=range(0, int(len(ds.time)/2)))
ds_second_half = ds.isel(time=range(int(len(ds.time)/2), len(ds.time)))

ds_daily_fh = ds_first_half .resample(time="D").mean("time")
ds_weekly_fh = ds_first_half .resample(time="W").mean("time")
ds_monthly_fh = ds_first_half .resample(time="ME").mean("time")

# second half dtree
ds_daily_sh = ds_second_half.resample(time="D").mean("time")
ds_weekly_sh = ds_second_half.resample(time="W").mean("time")
ds_monthly_sh = ds_second_half.resample(time="ME").mean("time")

# first half dtree
dt_fh = xr.DataTree.from_dict(
  {
      "/temp": ds.drop_dims("time"),
      "/temp/daily": ds_daily_fh.drop_vars(["lat", "lon"]),
      "/temp/weekly": ds_weekly_fh.drop_vars(["lat", "lon"]),
      "/temp/monthly": ds_monthly_fh.drop_vars(["lat", "lon"]),
  }
)

print(dt_fh)

<xarray.DataTree>
Group: /
└── Group: /tempDimensions:  (lat: 25, lon: 53)
    │   Coordinates:
    │     * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0* lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
    ├── Group: /temp/dailyDimensions:  (time: 365, lat: 25, lon: 53)
    │       Coordinates:
    │         * time     (time) datetime64[ns] 3kB 2013-01-01 2013-01-02 ... 2013-12-31Data variables:
    │           air      (time, lat, lon) float64 4MB 241.9 242.3 242.7 ... 295.5 294.8
    ├── Group: /temp/weeklyDimensions:  (time: 53, lat: 25, lon: 53)
    │       Coordinates:
    │         * time     (time) datetime64[ns] 424B 2013-01-06 2013-01-13 ... 2014-01-05Data variables:
    │           air      (time, lat, lon) float64 562kB 245.3 245.2 245.0 ... 296.2 295.6
    └── Group: /temp/monthly
            Dimensions:  (time: 12, lat: 25, lon: 53)
            Coordinates:
              * time     (time) datetime64[ns] 96B 2013-01-31 2013-02-28 ... 2013-12-31
            Data variables:
                air      (time, lat, lon) float64 127kB 244.5 244.7 244.7 ... 297.4 297.4

Now, we can save the first half

store = "dtree_inherited.zarr"
dt_fh.to_zarr(
  store,
  consolidated=True,
  write_inherited_coords=True,
)

Then, when appending new information, it will generates the following error

dt_sh = xr.DataTree.from_dict(
    {
        "/temp": ds.drop_dims("time"),
        "/temp/daily": ds_daily_sh.drop_vars(["lat", "lon"]),
        "/temp/weekly": ds_weekly_sh.drop_vars(["lat", "lon"]),
        "/temp/monthly": ds_monthly_sh.drop_vars(["lat", "lon"]),
    }
)

dt_sh.to_zarr(
    store,
    mode="a-",
    consolidated=True,
    append_dim="time",
    write_inherited_coords=True,
)

Traceback (most recent call last):
  File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/snap/pycharm-community/436/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 85, in <module>
    main()
  File "/media/alfonso/drive/Alfonso/python/raw2zarr/issue-delete.py", line 75, in main
    dt_sh.to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree.py", line 1699, in to_zarr
    _datatree_to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/datatree_io.py", line 123, in _datatree_to_zarr
    ds.to_zarr(
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 2184, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/api.py", line 1920, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/alfonso/mambaforge/envs/raw2zarr/lib/python3.12/site-packages/xarray/backends/zarr.py", line 889, in store
    raise ValueError(
ValueError: append_dim='time' does not match any existing dataset dimensions {'lat': 25, 'lon': 53}

Let me know your thoughts.

@owenlittlejohns
Copy link
Contributor

The newly created issue (#9892) is to capture specifically the Dataset piece of this, which should be done prior to DataTree changes.

@eni-awowale
Copy link
Collaborator

eni-awowale commented Dec 14, 2024

From talking with @TomNicholas and @jhamman zarr currently does not support appending a new dimension with to_zarr for a xr.Dataset but it should for xr.DataTree in this case.
If you want to append a dimension that exists in one node but not another this should work by adding the existing dimension to the nodes that the dimension exists on and then skipping it on the nodes the dimension does not exist on.

Borrowing sample data from our unit tests:

time = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
stations = xr.DataArray(data=list("abcdef"), dims="station")
lon = [-100, -80, -60]
lat = [10, 20, 30]
# Set up fake data
wind_speed = xr.DataArray(np.ones((2, 6)) * 2, dims=("time", "station"))
pressure = xr.DataArray(np.ones((2, 6)) * 3, dims=("time", "station"))
air_temperature = xr.DataArray(np.ones((2, 6)) * 4, dims=("time", "station"))
dewpoint = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
infrared = xr.DataArray(np.ones((2, 3, 3)) * 6, dims=("time", "lon", "lat"))
true_color = xr.DataArray(np.ones((2, 3, 3)) * 7, dims=("time", "lon", "lat"))
sample_tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(
            coords={"time": time},
        ),
        "/weather": xr.Dataset(
            coords={"station": stations},
            data_vars={
                "wind_speed": wind_speed,
                "pressure": pressure,
            },
        ),
        "/weather/temperature": xr.Dataset(
            data_vars={
                "air_temperature": air_temperature,
                "dewpoint": dewpoint,
            },
        ),
        "/satellite": xr.Dataset(
            coords={"lat": lat, "lon": lon},
            data_vars={
                "infrared": infrared,
                "true_color": true_color,
            },
        ),
    },)
store = 'test_data_append_works.zarr'
sample_tree.to_zarr(store)
sample_tree.to_zarr(store, mode='a', append_dim="time")

This works

But then when you drop the "time" dimension it does not:

sample_tree_drop = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(
            coords={"time": time},
        ),
        "/weather": xr.Dataset(
            coords={"station": stations},
            data_vars={
                "wind_speed": wind_speed,
                "pressure": pressure,
            },
        ).drop_dims('time'),
        "/weather/temperature": xr.Dataset(
            data_vars={
                "air_temperature": air_temperature,
                "dewpoint": dewpoint,
            },
        ),
        "/satellite": xr.Dataset(
            coords={"lat": lat, "lon": lon},
            data_vars={
                "infrared": infrared,
                "true_color": true_color,
            },
        ),
    },)
store = 'test_data_append_works.zarr'
sample_tree.to_zarr(store)
sample_tree.to_zarr(store, mode='a', append_dim="time")

You get a ValueError

ValueError: append_dim='time' does not match any existing dataset dimensions {'lat': 2, 'lon': 3}

But the 'time' dimension does exist for other data variables in different nested nodes.

sample_tree_drop['/weather/temperature'].air_temperature.dims

Returns

('time', 'station')

Instead of raising a ValueError we should be skipping appending along that dimension for that node.

EDIT:

TLDR #9892 might be on hold but this issue can be fixed before that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-combine combine/concat/merge topic-DataTree Related to the implementation of a DataTree class topic-zarr Related to zarr storage library
Projects
None yet
Development

No branches or pull requests

4 participants