Severe slicing bug when slicing a cube with a boolean ndarray #6251

Closed

valeriupredoi opened this issue Dec 10, 2024 · 17 comments

@valeriupredoi (Author) commented Dec 10, 2024

🐛 Bug Report

Hi folks, buggy-bug here, please see the following:

MRE

import netCDF4
import iris
import dask
print(dask.__version__)
import numpy as np


fil = "tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"

def do_lazy_iris(c):
    bo = np.zeros(c.core_data().shape[0], dtype=bool)
    bo[1560:1860] = True
    d = c[bo, ...]
    print("Lazy Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_realized_iris(c):
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1560:1860] = True
    d = c[bo, ...]
    print("Realized Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_netCDF4():
    ds = netCDF4.Dataset(fil)
    print(ds["tos"][1660, 30, 50])
    print(ds["tos"][1616, 30, 50])
    print(ds["tos"][1704, 30, 50])


cub = iris.load_cube(fil)
do_lazy_iris(cub)
print("\n")
do_realized_iris(cub)
print("\n")
do_netCDF4()

file can be downloaded directly from https://esgf.ceda.ac.uk/thredds/catalog/esg_cmip6/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/tos/gn/v20190308/catalog.html?dataset=esg_cmip6/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Omon/tos/gn/v20190308/tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc

Environment

Things that matter:

  • iris==3.11.0
  • dask: 2024.8 or 2024.11
  • netCDF4 (latest version, doesn't matter)
  • all else should be the same

Results

Run the above Python code for the two versions of Dask (2024.8 and 2024.11) and you'll see the differences 😁

Fixing the issue

Hints:

  • There is some very deterministic rotation of the data between lazy and realized: the data points are the same, but they are offset by exactly 44 indices (in this case), so the data, despite containing the same values, is rotated/shifted. I'm sure that if you look at the code you will find the extra indexing that's not needed 😃
  • no such issues when slicing with int ndarrays, it has to be boolean
  • it'd be good to have a test that checks against netCDF4 values, like in my MRE (see the sketch just below this list)
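
A minimal sketch of such a check (not part of the original report, and assuming the CESM2 file from the MRE is available locally) could compare the lazily sliced cube directly against netCDF4 reads:

import netCDF4
import numpy as np
import iris

FIL = "tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"


def test_boolean_slice_matches_netcdf4():
    cube = iris.load_cube(FIL)
    bo = np.zeros(cube.shape[0], dtype=bool)
    bo[1560:1860] = True
    lazy_slice = cube[bo, ...]

    ds = netCDF4.Dataset(FIL)
    # Index 100 of the sliced cube corresponds to absolute time index 1660, etc.
    for rel, absolute in [(100, 1660), (56, 1616), (144, 1704)]:
        expected = ds["tos"][absolute, 30, 50]
        actual = lazy_slice[rel, 30, 50].data
        np.testing.assert_allclose(actual, expected)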

Good luck 🍺

valeriupredoi changed the title from "Sever slicing bug when slicing a cube with a boolean ndarray" to "Severe slicing bug when slicing a cube with a boolean ndarray" on Dec 10, 2024
@valeriupredoi (Author)

Whoops, forgot to tag poor @schlunma, who worked heaps on isolating this behaviour out of the stack 🍺

@valeriupredoi (Author)

Just as an extra clue: moving the slice start earlier in time (1460 instead of 1560), and asking for elements shifted correspondingly, gives consistent results across lazy Iris, realized Iris, and netCDF4:

import netCDF4
import iris
import dask
print(dask.__version__)
import numpy as np


fil = "tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"

def do_lazy_iris(c):
    bo = np.zeros(c.core_data().shape[0], dtype=bool)
    bo[1460:1860] = True
    d = c[bo, ...]
    print("Lazy Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_realized_iris(c):
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1460:1860] = True
    d = c[bo, ...]
    print("Realized Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_netCDF4():
    ds = netCDF4.Dataset(fil)
    print(ds["tos"][1560, 30, 50])
    print(ds["tos"][1516, 30, 50])
    print(ds["tos"][1604, 30, 50])


cub = iris.load_cube(fil)
do_lazy_iris(cub)
print("\n")
do_realized_iris(cub)
print("\n")
do_netCDF4()

and in the same vein, trying to retrieve the data elements from the example above (the MRE), with a slice shifted in time, gives consistent results too:

import netCDF4
import iris
import dask
print(dask.__version__)
import numpy as np


fil = "tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"

def do_lazy_iris(c):
    bo = np.zeros(c.core_data().shape[0], dtype=bool)
    bo[1460:1860] = True
    d = c[bo, ...]
    print("Lazy Iris")
    print(d[200, 30, 50].data)
    print(d[156, 30, 50].data)
    print(d[244, 30, 50].data)


def do_realized_iris(c):
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1460:1860] = True
    d = c[bo, ...]
    print("Realized Iris")
    print(d[200, 30, 50].data)
    print(d[156, 30, 50].data)
    print(d[244, 30, 50].data)


def do_netCDF4():
    ds = netCDF4.Dataset(fil)
    print(ds["tos"][1660, 30, 50])
    print(ds["tos"][1616, 30, 50])
    print(ds["tos"][1704, 30, 50])


cub = iris.load_cube(fil)
do_lazy_iris(cub)
print("\n")
do_realized_iris(cub)
print("\n")
do_netCDF4()

so there is something very odd about starting the slice at time index 1560 out of an index length of 1980
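
(For anyone digging further: a hypothetical diagnostic, not from this report and assuming the MRE file, could locate exactly where the lazy and realized slices start to disagree:)

import numpy as np
import iris

cube = iris.load_cube("tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc")
bo = np.zeros(cube.shape[0], dtype=bool)
bo[1560:1860] = True

# Lazy path: slice while the data is still a Dask array, then compute.
lazy_vals = cube[bo, ...].core_data()[:, 30, 50].compute()

# Realized path: touch cube.data to realize, then slice the NumPy data.
_ = cube.data
real_vals = cube[bo, ...].data[:, 30, 50]

mismatch = np.flatnonzero(lazy_vals != real_vals)
print("first mismatching time index within the slice:",
      mismatch[0] if mismatch.size else "none")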

@valeriupredoi (Author)

...and shifting the slice start forward in time (1660 instead of 1560) also gives consistent results:

import netCDF4
import iris
import dask
print(dask.__version__)
import numpy as np


fil = "tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc"

def do_lazy_iris(c):
    bo = np.zeros(c.core_data().shape[0], dtype=bool)
    bo[1660:1860] = True
    d = c[bo, ...]
    print("Lazy Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_realized_iris(c):
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1660:1860] = True
    d = c[bo, ...]
    print("Realized Iris")
    print(d[100, 30, 50].data)
    print(d[56, 30, 50].data)
    print(d[144, 30, 50].data)


def do_netCDF4():
    ds = netCDF4.Dataset(fil)
    print(ds["tos"][1760, 30, 50])
    print(ds["tos"][1716, 30, 50])
    print(ds["tos"][1804, 30, 50])


cub = iris.load_cube(fil)
do_lazy_iris(cub)
print("\n")
do_realized_iris(cub)
print("\n")
do_netCDF4()

anything special about 1560 out of 1980? Apart from 1560 / 1980 = 0.78787878... which is a beautiful number 😁

@bouweandela (Member)

@valeriupredoi To clarify if a bug was solved or introduced: which of the two versions of Dask produces the correct numbers?

@valeriupredoi (Author)

ah, a good point @bouweandela 🍺:

  • dask <=2024.8.2: no difference between any of the sliced cubes, i.e. lazy slice == realized slice == netCDF4-python data
  • dask >2024.8.2: the lazy slice has its data rotated (only for the slice above, starting at index 1560), but realized slice == netCDF4-python data

@bouweandela (Member)

Thanks, that is very useful information. Have you been able to reproduce this using only Dask arrays? Or does the issue only show up when the data is loaded through Iris?

@valeriupredoi (Author)

let me try to get it to go through netCDF4+Dask only; I can't really load the data safely in any other way (so that I preserve some Iris-like structure)

@valeriupredoi (Author) commented Dec 11, 2024

OK, netCDF4-python+Dask with MRE:

import dask.array as da  # plus netCDF4, numpy and `fil` from the MRE above

def do_netCDF4_Dask(c):
    ds = netCDF4.Dataset(fil)
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1560:1860] = True
    dat = da.array(ds["tos"])[bo, ...]
    print(dat[100, 30, 50].compute())
    print(dat[56, 30, 50].compute())
    print(dat[144, 30, 50].compute())

gives correct, i.e. unrotated, results for both dask 2024.8 and 2024.11 (2024.12 too)

Here's another clue: I switched to iris==3.9 and this buggy behaviour disappears! Let me try going to 3.10 too. Scratch that, sorry: dask had gone down to 2024.8; iris==3.9 with dask==2024.11 is still showing the buggy behaviour!
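
(For reference, a check using only Dask arrays, i.e. with neither netCDF4 nor Iris, as asked above, could be sketched roughly as below; this snippet is hypothetical and was not run as part of this report. It compares chunked boolean indexing against plain NumPy on an in-memory array with the same time length, an assumed time chunking of 300, and shrunken spatial dimensions.)

import numpy as np
import dask.array as da

# Hypothetical pure-Dask check: same time length (1980) and an assumed time
# chunking of 300, spatial dims reduced to keep the example light.
arr = np.arange(1980 * 4 * 4, dtype=np.float32).reshape(1980, 4, 4)
darr = da.from_array(arr, chunks=(300, 4, 4))

bo = np.zeros(1980, dtype=bool)
bo[1560:1860] = True

# If boolean indexing is healthy, the chunked Dask result matches NumPy.
np.testing.assert_array_equal(darr[bo, ...].compute(), arr[bo, ...])
print("pure-Dask boolean indexing matches NumPy for this array")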

@stephenworsley (Contributor)

Curiously, I can't seem to replicate this behaviour directly from the `NetCDFDataProxy` object.

import numpy as np
import dask.array as da
from iris.fileformats.netcdf import NetCDFDataProxy

def do_netcdfproxy(c):
    bo = np.zeros(c.data.shape[0], dtype=bool)
    bo[1560:1860] = True
    print("NetCDF Proxy")
    proxy = NetCDFDataProxy(
        shape=(1980, 384, 320),
        dtype=np.float32(),
        path=fil,
        variable_name="tos",
        fill_value=999,
    )

    dask_proxy = da.from_array(proxy, chunks=(300, 384, 320))
    dask_slice = dask_proxy[bo, ...]

    print(dask_slice[100, 30, 50].compute())
    print(dask_slice[56, 30, 50].compute())
    print(dask_slice[144, 30, 50].compute())

This seems like a pretty tricky one to debug. I think for the time being a dask pin is probably sensible.

@stephenworsley (Contributor)

It's also worth noting that this problem doesn't seem to be due to Iris-specific indexing, since the core data Dask array shows it as well.

def do_lazy_iris_core(c):
    bo = np.zeros(c.core_data().shape[0], dtype=bool)
    bo[1560:1860] = True
    print("Lazy Iris core data")
    dc = c.core_data()[bo, ...]
    print(dc[100, 30, 50].compute())
    print(dc[56, 30, 50].compute())
    print(dc[144, 30, 50].compute())

@valeriupredoi (Author)

@stephenworsley thanks for looking into this! I'm afraid this is an Iris issue, though only for the lazy data array. Please see #6251 (comment): that uses pure netCDF4+Dask, and all is fine with both the lazy and the realized array.

A pin on Dask is really unwelcome, since the version we'd pin to is already 4+ months old. An alternative, which I would propose to my colleagues, would be to use an int-indexed ndarray, literally converting the boolean slice to an int-dtyped array (sketched below). Let me know if you need me to run any more fixes.

Also pinging @trexfeathers in case he's seen such behaviour in the past 🍺
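
(A minimal sketch of that boolean-to-int workaround, not from the thread itself, assuming the cube and mask from the MRE:)

import numpy as np
import iris

cube = iris.load_cube("tos_Omon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc")

bo = np.zeros(cube.shape[0], dtype=bool)
bo[1560:1860] = True

int_idx = np.flatnonzero(bo)     # array([1560, 1561, ..., 1859])
sliced = cube[int_idx, ...]      # int indexing instead of boolean indexing
print(sliced[100, 30, 50].data)  # should match ds["tos"][1660, 30, 50]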

@trexfeathers (Contributor) commented Dec 18, 2024

> A pin on Dask is really unwelcome, since the version we'd pin to is already 4+ months old. An alternative, which I would propose to my colleagues, would be to use an int-indexed ndarray, literally converting the boolean slice to an int-dtyped array. Let me know if you need me to run any more fixes.

@valeriupredoi we've discussed this at length, and we think there's a significant risk of others also encountering this problem, and potentially in ways where a workaround is not possible. We don't know how long it will take to develop a full understanding of the problem, and also how long it might take to fix, especially with the Christmas period coming up. To avoid leaving our users exposed to this problem, a pin is unfortunately necessary.

We're going to implement the pin in a bugfix release, which should capture most 'naive' users, including those on PyPI. If the pin is a problem for you then you can specify a different Iris version in your requirements.

@valeriupredoi (Author) commented Dec 18, 2024

many thanks for the heads up @trexfeathers 🍺 Sounds reasonable to me, completely understand the Xmas bit - I myself have just about an hour left to work this year, and am not gonna ask others nor give myself more work 😀 Have a great Xmas break @stephenworsley and @trexfeathers - see you folks in the new year 🎄

@valeriupredoi (Author)

@HGWright this is not completed; the pin in Iris is but a workaround to avoid it 🍺

@valeriupredoi (Author)

It is indeed solved upstream in Dask (see dask/dask#11614), but I suggest keeping this issue open until Dask releases a version with the fix and Iris unpins Dask; I'd also pin Dask to that (fixed) version in your next release 🍺🎄

@ESadek-MO (Contributor) commented Dec 20, 2024

Hey @valeriupredoi, the work to unpin dask has been captured in #6264, so I think this is good to remain closed (assuming there are no issues with yesterday's v3.11.1 release)?

@valeriupredoi (Author)

well, it's up to you guys to decide how to handle an issue; my take is to keep it open as long as the bug is still out there. The workaround for the bug (i.e. pinning Dask in Iris) is done, but the actual bug is still lurking in Dask, as we've seen, until Dask releases a new version after the Xmas hols. At any rate, you folks have a good break and happy Xmas! 🎄
